Going In-Depth: Finding Longform on the Web

Authors:

Virginia Smith,

Isabelle Stanton

KDD '15: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Pages 2109 - 2118

https://doi.org/10.1145/2783258.2788599

Published: 10 August 2015

Abstract

tl;dr: Longform articles are extended, in-depth pieces that often serve as feature stories in newspapers and magazines. In this work, we develop a system to automatically identify longform content across the web. Our novel classifier is highly accurate despite huge variation within longform in terms of topic, voice, and editorial taste. It is also scalable and interpretable, requiring a surprisingly small set of features based only on language and parse structures, length, and document interest. We implement our system at scale and use it to identify a corpus of several million longform documents. Using this corpus, we provide the first web-scale study with quantifiable and measurable information on longform, giving new insight into questions posed by the media on the past and current state of this famed literary medium.

Index Terms

Going In-Depth: Finding Longform on the Web
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
    2. Machine learning approaches
      1. Classification and regression trees

Published In

KDD '15: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 2015

2378 pages

ISBN:9781450336642

DOI:10.1145/2783258

General Chairs: Longbing Cao (University of Technology, Sydney), Chengqi Zhang (University of Technology, Sydney)

Program Chairs: Thorsten Joachims (Cornell University), Geoff Webb (Monash University), Dragos D. Margineantu (Boeing Research), Graham Williams (Australian Taxation Office)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

KDD '15

KDD '15: The 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 10 - 13, 2015

Sydney, NSW, Australia

Acceptance Rates

KDD '15 Paper Acceptance Rate 160 of 819 submissions, 20%;

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
