Power Law Distributions in Information Retrieval

Published: 16 February 2016

Abstract

Several properties of information retrieval (IR) data, such as query frequency or document length, are widely considered to be approximately distributed as a power law. This common assumption is meant to capture specific characteristics of the empirical probability distribution of such data (e.g., its scale-free nature or its long/fat tail). The assumption, however, may not always hold. Motivated by recent work on the statistical treatment of power law claims, we investigate two research questions: (i) to what extent do power law approximations hold for term frequency, document length, query frequency, query length, citation frequency, and syntactic unigram frequency, and (ii) what is the computational cost of replacing ad hoc power law approximations with more accurate distribution fitting? We study 23 TREC and 5 non-TREC datasets and compare the fit of power laws to that of 15 other standard probability distributions. We find that query frequency and 5 out of 24 term frequency distributions are best approximated by a power law. All remaining properties are better approximated by the Inverse Gaussian, Generalized Extreme Value, Negative Binomial, or Yule distribution. We also find the overhead of replacing power law approximations with more informed distribution fitting to be negligible, with potential gains for IR tasks such as index compression or test collection generation for IR evaluation.
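As a rough illustration of what comparing a power law fit against an alternative distribution can involve, the sketch below fits two one-parameter models by maximum likelihood and ranks them by the Akaike information criterion (AIC). It is not the authors' procedure. It assumes a continuous power law with a known lower cutoff `xmin`, and it uses a shifted exponential as the single competitor, which is an arbitrary choice for illustration and not one of the distributions named in the abstract. All function names are hypothetical, and the data are synthetic.

```python
import math
import random


def fit_power_law(xs, xmin):
    """MLE for a continuous power law p(x) ~ x^(-alpha) on x >= xmin.

    Assumption: xmin is known in advance. Returns (alpha, log-likelihood).
    """
    tail = [x for x in xs if x >= xmin]
    n = len(tail)
    log_sum = sum(math.log(x / xmin) for x in tail)
    if n == 0 or log_sum <= 0.0:
        raise ValueError("need observations strictly above xmin")
    alpha = 1.0 + n / log_sum  # closed-form maximum likelihood estimate
    loglik = n * math.log((alpha - 1.0) / xmin) - alpha * log_sum
    return alpha, loglik


def fit_shifted_exponential(xs, xmin):
    """MLE for p(x) = rate * exp(-rate * (x - xmin)) on x >= xmin.

    Illustrative competitor only. Returns (rate, log-likelihood).
    """
    tail = [x for x in xs if x >= xmin]
    n = len(tail)
    excess = sum(x - xmin for x in tail)
    if n == 0 or excess <= 0.0:
        raise ValueError("need observations strictly above xmin")
    rate = n / excess
    loglik = n * math.log(rate) - rate * excess
    return rate, loglik


def aic(loglik, n_params):
    """Akaike information criterion; lower values indicate a better trade-off."""
    return 2.0 * n_params - 2.0 * loglik


def sample_power_law(n, alpha, xmin, rng):
    """Draw n synthetic power law samples by inverse transform sampling."""
    exponent = -1.0 / (alpha - 1.0)
    return [xmin * (1.0 - rng.random()) ** exponent for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    data = sample_power_law(5000, alpha=2.5, xmin=1.0, rng=rng)
    alpha_hat, ll_pl = fit_power_law(data, xmin=1.0)
    rate_hat, ll_exp = fit_shifted_exponential(data, xmin=1.0)
    print("power law:   alpha =", round(alpha_hat, 3), "AIC =", round(aic(ll_pl, 1), 1))
    print("exponential: rate  =", round(rate_hat, 3), "AIC =", round(aic(ll_exp, 1), 1))
```

Both fits here have closed-form estimators, which is consistent with the abstract's point that fitting alternatives need not be expensive. Distributions without closed-form estimators would require numerical optimization, and a study with many candidate distributions would normally also estimate `xmin` and apply a goodness-of-fit test rather than rely on AIC alone.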

Supplementary Material

petersen (petersen.zip)
Supplemental movie, appendix, image, and software files for Power Law Distributions in Information Retrieval

    Published In

    ACM Transactions on Information Systems  Volume 34, Issue 2
    April 2016
    220 pages
    ISSN:1046-8188
    EISSN:1558-2868
    DOI:10.1145/2891107
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 16 February 2016
    Accepted: 01 August 2015
    Revised: 01 June 2015
    Received: 01 October 2014
    Published in TOIS Volume 34, Issue 2

    Author Tags

    1. Statistical model selection
    2. power laws

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Article Metrics

    • Downloads (last 12 months): 34
    • Downloads (last 6 weeks): 5

    Reflects downloads up to 29 Jan 2025.

    Cited By

    • (2024) Navigating Labels and Vectors: A Unified Approach to Filtered Approximate Nearest Neighbor Search. Proceedings of the ACM on Management of Data 2:6 (1-27). DOI: 10.1145/3698822. Online publication date: 20-Dec-2024
    • (2024) Embedding Optimization for Training Large-scale Deep Learning Recommendation Systems with EMBark. Proceedings of the 18th ACM Conference on Recommender Systems (622-632). DOI: 10.1145/3640457.3688111. Online publication date: 8-Oct-2024
    • (2024) What Matters in a Measure? A Perspective from Large-Scale Search Evaluation. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (282-292). DOI: 10.1145/3626772.3657845. Online publication date: 10-Jul-2024
    • (2024) Current engagement with unreliable sites from web search driven by navigational search. Science Advances 10:44. DOI: 10.1126/sciadv.adn3750. Online publication date: Nov-2024
    • (2024) HyGate-GCN: Hybrid-Gate-Based Graph Convolutional Networks with dynamical ratings estimation for personalised POI recommendation. Expert Systems with Applications (125217). DOI: 10.1016/j.eswa.2024.125217. Online publication date: Aug-2024
    • (2023) A Two-Level Signature Scheme for Stable Set Similarity Joins. Proceedings of the VLDB Endowment 16:11 (2686-2698). DOI: 10.14778/3611479.3611480. Online publication date: 24-Aug-2023
    • (2023) Measuring Service-Level Learning Effects in Search Via Query-Randomized Experiments. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (2169-2173). DOI: 10.1145/3539618.3592020. Online publication date: 19-Jul-2023
    • (2023) Oblique Logistic Function for the Rank-Frequency Distribution of Letters. 2023 4th International Informatics and Software Engineering Conference (IISEC) (1-5). DOI: 10.1109/IISEC59749.2023.10390992. Online publication date: 21-Dec-2023
    • (2023) Trust-aware location recommendation in location-based social networks. Expert Systems with Applications: An International Journal 213:PB. DOI: 10.1016/j.eswa.2022.119048. Online publication date: 1-Mar-2023
    • (2023) Index-Based Batch Query Processing Revisited. Advances in Information Retrieval (86-100). DOI: 10.1007/978-3-031-28241-6_6. Online publication date: 16-Mar-2023
