Exploiting neighborhood knowledge for single document summarization and keyphrase extraction

Published: 10 June 2010

Abstract

Document summarization and keyphrase extraction are two related tasks in the IR and NLP fields; both aim to extract a condensed representation from a single text document. Existing methods for single document summarization and keyphrase extraction usually use only the information contained in the specified document. This article proposes using a small number of nearest neighbor documents to improve both tasks for the specified document, under the assumption that the neighbor documents can provide additional knowledge and more clues. The specified document is expanded to a small document set by adding a few neighbor documents close to it, and a graph-based ranking algorithm is then applied to the expanded document set to exploit both the local information in the specified document and the global information in the neighbor documents. Experimental results on the Document Understanding Conference (DUC) benchmark datasets demonstrate the effectiveness and robustness of the proposed approaches. The cross-document sentence relationships in the expanded document set are shown to benefit single document summarization, and the word co-occurrence relationships in the neighbor documents are shown to be very helpful for single document keyphrase extraction.
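
The idea described in the abstract (expand the document with a few neighbors, then rank sentences on a graph built over the expanded set) can be illustrated with a minimal Python sketch. This is not the article's implementation: the abstract does not specify the similarity measure, the ranking formula, or any parameter values, so the bag-of-words cosine similarity, the PageRank-style iteration, and the names `nearest_neighbors`, `rank_sentences`, `k`, `damping`, and `iters` are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphabetic tokens (assumed preprocessing)."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def nearest_neighbors(target, corpus, k):
    """Return the k corpus documents most similar to the target document."""
    target_vec = Counter(tokenize(target))
    ranked = sorted(
        corpus,
        key=lambda doc: cosine(target_vec, Counter(tokenize(doc))),
        reverse=True,
    )
    return ranked[:k]


def split_sentences(doc):
    """Naive sentence splitter on ., ! and ? (assumption for the sketch)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]


def rank_sentences(target, corpus, k=2, damping=0.85, iters=50):
    """Rank the target's sentences using a graph over the expanded document set."""
    # Expanded set: the target document (index 0) plus its k nearest neighbors.
    docs = [target] + nearest_neighbors(target, corpus, k)
    sents = [(d, s) for d, doc in enumerate(docs) for s in split_sentences(doc)]
    n = len(sents)
    if n == 0:
        return []
    vecs = [Counter(tokenize(s)) for _, s in sents]
    # Sentence affinity graph; edges link sentences within and across documents.
    w = [
        [cosine(vecs[i], vecs[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    # Row-normalize so each sentence distributes its score over its neighbors.
    for row in w:
        total = sum(row)
        if total:
            row[:] = [x / total for x in row]
    # PageRank-style power iteration (assumed form of the graph-based ranking).
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[j] * w[j][i] for j in range(n))
            for i in range(n)
        ]
    # Only sentences of the target document are summary candidates.
    ranked = sorted(
        ((scores[i], s) for i, (d, s) in enumerate(sents) if d == 0),
        reverse=True,
    )
    return [s for _, s in ranked]


target = "Cats sleep a lot. Cats chase mice."
corpus = ["Cats chase mice at night.", "Stocks fell sharply today."]
summary_order = rank_sentences(target, corpus, k=1)
```

In this toy example the neighbor document reinforces the sentence it overlaps with, so that sentence is ranked first. A keyphrase variant would apply the same kind of ranking to a word co-occurrence graph built over the expanded set instead of a sentence graph.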

Published In

ACM Transactions on Information Systems  Volume 28, Issue 2
May 2010
165 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/1740592
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 June 2010
Accepted: 01 April 2009
Revised: 01 November 2008
Received: 01 May 2008
Published in TOIS Volume 28, Issue 2

Author Tags

  1. Document summarization
  2. graph-based ranking
  3. keyphrase extraction
  4. neighborhood knowledge

Qualifiers

  • Research-article
  • Research
  • Refereed

Cited By

  • (2023) Multi-granularity adaptive extractive document summarization with heterogeneous graph neural networks. PeerJ Computer Science 9 (e1737). DOI: 10.7717/peerj-cs.1737. Online publication date: 13-Dec-2023
  • (2023) An Extensive Survey on Investigation Methodologies for Text Summarization. Indian Journal of Signal Processing 3:4 (1-6). DOI: 10.54105/ijsp.D1016.113423. Online publication date: 30-Nov-2023
  • (2023) A New Method for Extractive Text Summarization Using Neural Networks. SN Computer Science 4:4. DOI: 10.1007/s42979-023-01806-0. Online publication date: 9-May-2023
  • (2023) Automatic Document Summarization of Unilingual Documents: A Review. Intelligent Computing and Optimization (345-358). DOI: 10.1007/978-3-031-50327-6_36. Online publication date: 16-Dec-2023
  • (2021) Single document summarization using the information from documents with the same topic. Knowledge-Based Systems 228 (107265). DOI: 10.1016/j.knosys.2021.107265. Online publication date: Sep-2021
  • (2020) An unsupervised semantic sentence ranking scheme for text documents. Integrated Computer-Aided Engineering 28:1 (17-33). DOI: 10.3233/ICA-200626. Online publication date: 21-Dec-2020
  • (2020) Estimating Knowledge Category Coverage by Courses Based on Centrality in Taxonomy. IEICE Transactions on Information and Systems E103.D:5 (928-938). DOI: 10.1587/transinf.2019DAP0002. Online publication date: 1-May-2020
  • (2020) A Novel Sentence Scoring Method for Extractive Text Summarization. Proceedings of International Conference on Frontiers in Computing and Systems (169-179). DOI: 10.1007/978-981-15-7834-2_16. Online publication date: 24-Nov-2020
  • (2019) A Novel Summarization-based Approach for Feature Reduction Enhancing Text Classification Accuracy. Engineering, Technology & Applied Science Research 9:6 (5001-5005). DOI: 10.48084/etasr.3173. Online publication date: 1-Dec-2019
  • (2019) CSMDSE-Cuckoo Search Based Multi Document Summary Extractor. International Journal of Cognitive Informatics and Natural Intelligence 13:4 (56-70). DOI: 10.4018/IJCINI.2019100103. Online publication date: 1-Oct-2019
