DOI: 10.1145/3437963.3441809
Research article · Open access

Generative Models are Unsupervised Predictors of Page Quality: A Colossal-Scale Study

Published: 08 March 2021

Abstract

Large generative language models such as GPT-2 are well known for their ability to generate text, as well as for their utility in supervised downstream tasks via fine-tuning. The prevalence of such machine-generated text on the web, however, is still not well understood: if we run GPT-2 detectors across the web, what will we find? Our work is twofold. First, we demonstrate via human evaluation that classifiers trained to discriminate between human- and machine-generated text emerge as unsupervised predictors of "page quality", able to detect low-quality content without any training. This enables fast bootstrapping of quality indicators in a low-resource setting. Second, curious to understand the prevalence and nature of low-quality pages in the wild, we conduct extensive qualitative and quantitative analysis over 500 million web articles, making this the largest-scale study ever conducted on the topic.
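
The first finding suggests a recipe simple enough to sketch in a few lines: score a page with an off-the-shelf human-vs-machine discriminator and read its "human" probability as a proxy for page quality. The snippet below is a minimal illustration of that idea under stated assumptions, not the authors' pipeline: the checkpoint (the RoBERTa-based GPT-2 output detector published on the Hugging Face Hub) and its Real/Fake label names are taken from that model's public card, and the 512-token truncation is a choice made for the example.

from transformers import pipeline

# Off-the-shelf GPT-2 output detector (RoBERTa fine-tuned on WebText vs.
# GPT-2 samples). Assumption: the paper trains its own detectors; this
# public checkpoint merely stands in for them.
detector = pipeline(
    "text-classification",
    model="openai-community/roberta-base-openai-detector",
)

def quality_score(page_text: str) -> float:
    """Return P(human-written), used here as an unsupervised quality proxy."""
    # Truncate long pages to the model's 512-token input limit.
    result = detector(page_text, truncation=True, max_length=512)[0]
    # The checkpoint labels outputs "Real" (human) or "Fake" (machine).
    if result["label"] == "Real":
        return result["score"]
    return 1.0 - result["score"]

# Pages scoring low are likely machine-generated and, per the paper's
# human evaluation, disproportionately low quality.
print(quality_score("win big prizes click here best essay writing service"))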

Published In

WSDM '21: Proceedings of the 14th ACM International Conference on Web Search and Data Mining
March 2021
1192 pages
ISBN: 9781450382977
DOI: 10.1145/3437963
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. GPT-2
  2. language quality
  3. large language models
  4. neural text generation
  5. text deepfakes

Qualifiers

  • Research-article

Conference

WSDM '21

Acceptance Rates

Overall acceptance rate: 498 of 2,863 submissions (17%)

Article Metrics

  • Downloads (last 12 months): 242
  • Downloads (last 6 weeks): 16
Reflects downloads up to 14 Jan 2025

Cited By

  • (2024) CoAT: Corpus of artificial texts. Natural Language Processing, 1-26. https://doi.org/10.1017/nlp.2024.38. Online publication date: 6-Sep-2024.
  • (2024) Can AI Provide Useful Holistic Essay Scoring? Computers and Education: Artificial Intelligence, 100255. https://doi.org/10.1016/j.caeai.2024.100255. Online publication date: Jun-2024.
  • (2023) Bidirectional English-Marathi Translation using Pretrained Models: A Comparative Study of Different Pre-Trained Models. 2023 2nd International Conference on Futuristic Technologies (INCOFT), 1-8. https://doi.org/10.1109/INCOFT60753.2023.10425770. Online publication date: 24-Nov-2023.
  • (2023) Algorithms, Users. Keywords In and Out of Context, 141-154. https://doi.org/10.1007/978-3-031-32530-4_10. Online publication date: 4-Jun-2023.
  • (2022) Don’t Judge by Looks: Search User Interface to Make Searchers Reflect on Their Relevance Criteria and Promote Content-Quality-Oriented Web Searches. Proceedings of the 2022 ACM Conference on Information Technology for Social Good, 1-8. https://doi.org/10.1145/3524458.3547222. Online publication date: 7-Sep-2022.
