Abstract
Data is available like never before. We believed that back in the 1990s, but corpora are even larger today than they were then, and they will continue to grow for some time to come. Thus far, corpus sizes have been limited by our ability to collect data, but we are rapidly approaching a fundamental limit on the supply of written and spoken language. There are only so many people in the world, and they have only so much time to communicate with one another. It is becoming feasible to digitize a non-trivial fraction of the world’s communication. This ability is creating new opportunities for new audiences to join in on the fun. Google Ngrams makes it easy for anyone to apply corpus-based methods to half a trillion words (4% of all books ever printed). The popular press is referring to corpus methods and Google Ngrams as “addictive.” Computer scientists are talking about “digital immortality” (recording much of human communication and storing it forever). Digital immortality may not be a reality just yet, but psychologists are currently recording most of what children say and hear between 2 months and 2 years of age in order to better understand language acquisition. As the world becomes digitized, there will be many applications of corpus-based methods, including lexicography (and so much more).
Notes
- 2. Storage on phones tends to use more expensive solid state disk. Those prices are also falling, though not as rapidly.
- 8. Two examples of speech companies in the medical business are: https://www.nuance.com and https://mmodal.com.
- 33. It is suggested in https://www.wired.com/2015/10/pitfalls-of-studying-language-with-google-ngram/ that the f-word appears to be more common in older books than it really is because of a common OCR error involving “f” and “s” discussed in [12]. While that might explain why the f-word appears to be so much more common in the 1700s than in the 1800s, it doesn’t explain why so many other taboo 4-letter words are also more common in the 1700s than in the 1800s.
- 35. It is reported in [10] that the collection contains over 5 million books and 500 billion words, but we find that the collection is about 10% smaller than that.
- 42. http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-totalcounts-20120701.txt reports the number of words in the corpus by year between 1505 and 2008. Based on those numbers, the corpus is growing about 3% per year, or 35% per decade.
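The conversion between the annual and per-decade growth figures in the last note is compound-growth arithmetic. The following is a minimal sketch of that conversion, not the author's procedure; the function names `decade_growth` and `annual_growth` are hypothetical, and the code does not read the total-counts file itself.

```python
def decade_growth(annual_rate: float, years: int = 10) -> float:
    """Compound an annual growth rate over `years` years (hypothetical helper)."""
    return (1.0 + annual_rate) ** years - 1.0


def annual_growth(count_start: float, count_end: float, years: int) -> float:
    """Average compound annual growth implied by two word counts `years` apart.

    Assumption: the two counts come from per-year totals such as those in the
    total-counts file; no particular file format is assumed here.
    """
    if count_start <= 0 or count_end <= 0 or years <= 0:
        raise ValueError("counts and years must be positive")
    return (count_end / count_start) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    # 3% per year compounds to roughly 34% per decade.
    print(f"{decade_growth(0.03):.1%}")
```

Compounding 3% per year gives about 34% per decade, which agrees with the rounded figure of 35% quoted in the note.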
References
1. Gemmell, J., Bell, G., Lueder, R., Drucker, S., Wong, C.: MyLifeBits: fulfilling the memex vision. In: Proceedings of the Tenth ACM International Conference on Multimedia, pp. 235–238 (2002)
2. Bell, G., Gray, J.: Digital immortality. CACM 44(3), 28–31 (2001)
3. Barclay, T., Gray, J., Slutz, D.: Microsoft TerraServer: a spatial data warehouse. ACM SIGMOD Record 29(2), 307–318 (2000)
4. Szalay, A., Gray, J.: The World-wide Telescope. Science 293(5537), 2037–2040 (2001)
5. Cieri, C., Graff, D., Kimball, O., Miller, D., Walker, K.: Fisher English Training Speech. Linguistic Data Consortium, Philadelphia (2004)
6. Godfrey, J., Holliman, E., McDaniel, J.: SWITCHBOARD: telephone speech corpus for research and development. In: ICASSP, pp. 517–520 (1992)
7. Canavan, A., Graff, D., Zipperlen, G.: Callhome American English Speech. Linguistic Data Consortium, Philadelphia (1997)
8. Fausey, C., Jayaraman, S., Smith, L.: From faces to hands: changing visual input in the first two years. Cognition 152, 101–107 (2016)
9. Bush, V.: As we may think. Atl. Monthly 176(1), 101–108 (1945)
10. Michel, J., Shen, Y., et al.: Quantitative analysis of culture using millions of digitized books. Science 331(6014), 176–182 (2011)
11. Davies, M.: The Corpus of Contemporary American English as the first reliable monitor corpus of English. Lit. Linguist. Comput. 24(4), 447–464 (2010)
12. Pechenick, E., Danforth, C., Dodds, P.: Characterizing the Google Books corpus: strong limits to inferences of socio-cultural and linguistic evolution. PLoS ONE 10(10), e0137041 (2015)
13. Church, K., Hanks, P.: Word association norms, mutual information, and lexicography. Comput. Linguist. 16(1), 22–29 (1990)
14. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS, pp. 3111–3119 (2013)
15. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: EMNLP, pp. 1532–1543 (2014)
16. Levy, O., Goldberg, Y.: Neural word embedding as implicit matrix factorization. In: NIPS, pp. 2177–2185 (2014)
17. Firth, J.: A Synopsis of Linguistic Theory, 1930–1955. Studies in Linguistic Analysis. Basil Blackwell, Oxford (1957)
18. Hamilton, W., Leskovec, J., Jurafsky, D.: Diachronic word embeddings reveal statistical laws of semantic change. In: ACL, pp. 1489–1501 (2016)
19. Francis, N., Kucera, H.: Frequency Analysis of English Usage. Houghton Mifflin Company, Boston (1982)
20. Sinclair, J.: Looking Up: An Account of the COBUILD Project in Lexical Computing and the Development of the Collins COBUILD English Language Dictionary. Collins, London (1987)
21. Aijmer, K., Altenberg, B.: English Corpus Linguistics. Routledge, London (2014)
22. Kilgarriff, A.: Googleology is bad science. Comput. Linguist. 33(1), 147–151 (2007)
23. Chapman, R.: Roget’s International Thesaurus, 4th edn. Harper and Row, New York (1977)
24. Chapman, R.: Roget’s International Thesaurus, 5th edn. Harper and Row, New York (1992)
25. Fillmore, C., Atkins, B.: Toward a frame-based lexicon: the semantics of RISK and its neighbors. In: Frames, Fields, and Contrasts: New Essays in Semantic and Lexical Organization, pp. 75–102. Lawrence Erlbaum Associates, Hillsdale (1992)
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Church, K.W. (2017). Corpus Methods in a Digitized World. In: Mitkov, R. (eds) Computational and Corpus-Based Phraseology. EUROPHRAS 2017. Lecture Notes in Computer Science, vol 10596. Springer, Cham. https://doi.org/10.1007/978-3-319-69805-2_1
DOI: https://doi.org/10.1007/978-3-319-69805-2_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69804-5
Online ISBN: 978-3-319-69805-2
eBook Packages: Computer Science, Computer Science (R0)