Abstract
The document aboutness problem asks for a succinct representation of a document’s subject matter via keywords, sentences, or entities drawn from a Knowledge Base. In this paper we propose an approach to this problem that improves on the known solutions over all known datasets [4, 19]. It is based on a wide and detailed experimental study of syntactic and semantic features extracted from the input document by means of IR/NLP tools. To encourage and support reproducible experimental results on this task, we will make our system accessible via a public API: this is the first, and best-performing, publicly available tool for the document aboutness problem.
This work has been supported in part by the EU H2020 Program under the scheme “INFRAIA-1-2014-2015: Research Infrastructures” grant agreement #654024 “SoBigData: Social Mining & Big Data Ecosystem”.
Notes
- 1.
- 2. Thanks to the authors of [4] for pointing this out to us.
- 3. Recall that the implementation of SEL is not available.
References
Anick, P.: Using terminological feedback for web search refinement: a log-based study. In: SIGIR, pp. 88–95 (2003)
Boldi, P., Vigna, S.: Axioms for centrality. Internet Math. 10, 222–262 (2014)
Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: SIGKDD, pp. 785–794 (2016)
Dunietz, J., Gillick, D.: A new entity salience task with millions of training examples. In: EACL, p. 205 (2014)
Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: IJCAI, pp. 1606–1611 (2007)
Gamon, M., Yano, T., Song, X., Apacible, J., Pantel, P.: Identifying salient entities in web pages. In: CIKM, pp. 2375–2380 (2013)
Hasan, K.S., Ng, V.: Automatic keyphrase extraction: a survey of the (state of the) art. In: ACL, pp. 1262–1273 (2014)
Liu, Z., Huang, W., Zheng, Y., Sun, M.: Automatic keyphrase extraction via topic decomposition. In: EMNLP, pp. 366–376 (2010)
Manning, C.D., et al.: The Stanford CoreNLP toolkit. In: ACL, pp. 55–60 (2014)
Mihalcea, R., Tarau, P.: TextRank: bringing order into texts. In: EMNLP (2004)
Ni, Y., et al.: Semantic documents relatedness using concept graph representation. In: WSDM, pp. 635–644 (2016)
Bruza, P.D., Huibers, T.W.C.: A study of aboutness in information retrieval. Artif. Intell. Rev. 10, 381–407 (1996)
Paranjpe, D.: Learning document aboutness from implicit user feedback and document structure. In: CIKM, pp. 365–374 (2009)
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Piccinno, F., Ferragina, P.: From TagMe to WAT: a new entity annotator. In: ERD Workshop, Hosted by SIGIR, pp. 55–62 (2014)
Radlinski, F., et al.: Optimizing relevance and revenue in ad search: a query substitution approach. In: SIGIR, pp. 403–410 (2008)
Sandhaus, E.: The New York Times Annotated Corpus. LDC, Philadelphia (2008)
Scaiella, U., Ferragina, P., Marino, A., Ciaramita, M.: Topical clustering of search results. In: WSDM, pp. 223–232 (2012)
Trani, S., et al.: SEL: a unified algorithm for entity linking and saliency detection. In: DocEng, pp. 85–94 (2016)
Turney, P.D.: Learning algorithms for keyphrase extraction. Inf. Retriev. 2, 303–336 (2000)
Usbeck, R., et al.: GERBIL: general entity annotator benchmarking framework. In: WWW, pp. 1133–1143 (2015)
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Ponza, M., Ferragina, P., Piccinno, F. (2017). Document Aboutness via Sophisticated Syntactic and Semantic Features. In: Frasincar, F., Ittoo, A., Nguyen, L., Métais, E. (eds) Natural Language Processing and Information Systems. NLDB 2017. Lecture Notes in Computer Science, vol 10260. Springer, Cham. https://doi.org/10.1007/978-3-319-59569-6_53
DOI: https://doi.org/10.1007/978-3-319-59569-6_53
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59568-9
Online ISBN: 978-3-319-59569-6
eBook Packages: Computer Science, Computer Science (R0)