DOI: 10.1145/3437963.3441809
Research article · Open access

Generative Models are Unsupervised Predictors of Page Quality: A Colossal-Scale Study

Published: 08 March 2021

Abstract

Large generative language models such as GPT-2 are well known for their ability to generate text, as well as for their utility in supervised downstream tasks via fine-tuning. The prevalence of such machine-generated text on the web, however, is still not well understood: if we run GPT-2 detectors across the web, what will we find? Our work is twofold. First, we demonstrate via human evaluation that classifiers trained to discriminate between human- and machine-generated text emerge as unsupervised predictors of "page quality", able to detect low-quality content without any training. This enables fast bootstrapping of quality indicators in a low-resource setting. Second, curious to understand the prevalence and nature of low-quality pages in the wild, we conduct extensive qualitative and quantitative analysis over 500 million web articles, making this the largest-scale study ever conducted on the topic.
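
The first finding suggests a recipe simple enough to sketch in a few lines: score a page with an off-the-shelf human-vs-machine discriminator and read its "human" probability as a proxy for page quality. The snippet below is a minimal illustration of that idea under stated assumptions, not the authors' pipeline: the checkpoint (the RoBERTa-based GPT-2 output detector published on the Hugging Face Hub) and its Real/Fake label names are taken from that model's public card, and the 512-token truncation is a choice made for the example.

from transformers import pipeline

# Off-the-shelf GPT-2 output detector (RoBERTa fine-tuned on WebText vs.
# GPT-2 samples). Assumption: the paper trains its own detectors; this
# public checkpoint merely stands in for them.
detector = pipeline(
    "text-classification",
    model="openai-community/roberta-base-openai-detector",
)

def quality_score(page_text: str) -> float:
    """Return P(human-written), used here as an unsupervised quality proxy."""
    # Truncate long pages to the model's 512-token input limit.
    result = detector(page_text, truncation=True, max_length=512)[0]
    # The checkpoint labels outputs "Real" (human) or "Fake" (machine).
    if result["label"] == "Real":
        return result["score"]
    return 1.0 - result["score"]

# Pages scoring low are likely machine-generated and, per the paper's
# human evaluation, disproportionately low quality.
print(quality_score("win big prizes click here best essay writing service"))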

Published In

WSDM '21: Proceedings of the 14th ACM International Conference on Web Search and Data Mining
March 2021
1192 pages
ISBN: 9781450382977
DOI: 10.1145/3437963
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. GPT-2
  2. language quality
  3. large language models
  4. neural text generation
  5. text deepfakes

Qualifiers

  • Research-article

Conference

WSDM '21

Acceptance Rates

Overall acceptance rate: 498 of 2,863 submissions (17%)

Article Metrics

  • Downloads (last 12 months): 242
  • Downloads (last 6 weeks): 16
Reflects downloads up to 14 Jan 2025

Cited By

  • (2024) CoAT: Corpus of artificial texts. Natural Language Processing, 1-26. https://doi.org/10.1017/nlp.2024.38. Online publication date: 6-Sep-2024.
  • (2024) Can AI Provide Useful Holistic Essay Scoring? Computers and Education: Artificial Intelligence, 100255. https://doi.org/10.1016/j.caeai.2024.100255. Online publication date: Jun-2024.
  • (2023) Bidirectional English-Marathi Translation using Pretrained Models: A Comparative Study of Different Pre-Trained Models. 2023 2nd International Conference on Futuristic Technologies (INCOFT), 1-8. https://doi.org/10.1109/INCOFT60753.2023.10425770. Online publication date: 24-Nov-2023.
  • (2023) Algorithms, Users. Keywords In and Out of Context, 141-154. https://doi.org/10.1007/978-3-031-32530-4_10. Online publication date: 4-Jun-2023.
  • (2022) Don’t Judge by Looks: Search User Interface to Make Searchers Reflect on Their Relevance Criteria and Promote Content-Quality-Oriented Web Searches. Proceedings of the 2022 ACM Conference on Information Technology for Social Good, 1-8. https://doi.org/10.1145/3524458.3547222. Online publication date: 7-Sep-2022.
