DOI: 10.1145/2783258.2788599
research-article
Open access

Going In-Depth: Finding Longform on the Web

Published: 10 August 2015

Abstract

tl;dr: Longform articles are extended, in-depth pieces that often serve as feature stories in newspapers and magazines. In this work, we develop a system to automatically identify longform content across the web. Our novel classifier is highly accurate despite huge variation within longform in terms of topic, voice, and editorial taste. It is also scalable and interpretable, requiring a surprisingly small set of features based only on language and parse structures, length, and document interest. We implement our system at scale and use it to identify a corpus of several million longform documents. Using this corpus, we provide the first web-scale study with quantifiable and measurable information on longform, giving new insight into questions posed by the media on the past and current state of this famed literary medium.

Supplementary Material

MP4 File (p2109.mp4)

Published In

KDD '15: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2015
2378 pages
ISBN:9781450336642
DOI:10.1145/2783258
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. feature engineering
  2. machine learning
  3. natural language processing
  4. web mining

Qualifiers

  • Research-article

Conference

KDD '15

Acceptance Rates

KDD '15 Paper Acceptance Rate 160 of 819 submissions, 20%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Article Metrics

  • Total Citations: 0
  • Total Downloads: 1,281
  • Downloads (Last 12 months): 119
  • Downloads (Last 6 weeks): 10

Reflects downloads up to 01 Sep 2024
