Research article, DOI: 10.1145/2487575.2487672

Text-based measures of document diversity

Published: 11 August 2013

Abstract

Quantitative notions of diversity have been explored across a variety of disciplines ranging from conservation biology to economics. However, there has been relatively little work on measuring the diversity of text documents via their content. In this paper we present a text-based framework for quantifying how diverse a document is in terms of its content. The proposed approach learns a topic model over a corpus of documents, and computes a distance matrix between pairs of topics using measures such as topic co-occurrence. These pairwise distance measures are then combined with the distribution of topics within a document to estimate each document's diversity relative to the rest of the corpus. The method provides several advantages over existing methods: it is fully data-driven, requiring only the text from a corpus of documents as input; it produces human-readable explanations; and it can be generalized to score the diversity of other entities such as authors, academic departments, or journals. We describe experimental results on several large data sets that suggest the approach is effective and accurate in quantifying how diverse a document is relative to other documents in a corpus.
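
The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so two choices here are assumptions: topic distance is taken as one minus the cosine similarity of topic co-occurrence across documents, and the per-document score uses Rao's quadratic form, the sum over topic pairs of p(i|d) p(j|d) dist(i, j). The per-document topic proportions in `doc_topics` are hypothetical values standing in for the output of a trained topic model, and the function names are illustrative only.

```python
import math


def topic_distances(doc_topics):
    """Build a topic-by-topic distance matrix from co-occurrence.

    Assumption: co-occurrence is the cosine similarity between two topics'
    weight vectors across all documents; distance = 1 - similarity.
    """
    n_topics = len(doc_topics[0])
    # Column t holds the weight of topic t in every document.
    cols = [[row[t] for row in doc_topics] for t in range(n_topics)]
    norms = [math.sqrt(sum(v * v for v in col)) for col in cols]
    dist = [[0.0] * n_topics for _ in range(n_topics)]
    for i in range(n_topics):
        for j in range(i + 1, n_topics):
            dot = sum(a * b for a, b in zip(cols[i], cols[j]))
            denom = norms[i] * norms[j]
            sim = dot / denom if denom > 0 else 0.0
            dist[i][j] = dist[j][i] = 1.0 - sim
    return dist


def diversity(p, dist):
    """Distance-weighted diversity of one document (Rao's quadratic form)."""
    n = len(p)
    return sum(p[i] * p[j] * dist[i][j] for i in range(n) for j in range(n))


# Hypothetical topic proportions for four documents over three topics.
doc_topics = [
    [0.5, 0.5, 0.0],  # mixes topics 0 and 1, which often co-occur
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0],  # single-topic document
    [0.5, 0.0, 0.5],  # mixes topics 0 and 2, which rarely co-occur
]

dist = topic_distances(doc_topics)
scores = [diversity(p, dist) for p in doc_topics]
for idx, score in enumerate(scores):
    print(f"document {idx}: diversity = {score:.4f}")
```

Under these assumptions, a document confined to one topic scores zero, and a document that mixes rarely co-occurring topics scores higher than one that mixes topics frequently seen together. The per-pair terms of the sum also indicate which topic pairs contribute most to a score, which is one way such a measure could support human-readable explanations.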

Published In

KDD '13: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2013
1534 pages
ISBN:9781450321747
DOI:10.1145/2487575
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 August 2013

Author Tags

  1. diversity
  2. interdisciplinarity

Qualifiers

  • Research-article

Conference

KDD '13

Acceptance Rates

KDD '13 Paper Acceptance Rate 125 of 726 submissions, 17%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Article Metrics

  • Downloads (Last 12 months): 42
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 11 Jan 2025

Cited By

  • (2024) A Transformer-based Approach for Augmenting Software Engineering Chatbots Datasets. Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (359-370). DOI: 10.1145/3674805.3686695. Online publication date: 24-Oct-2024
  • (2024) Post diversity: A new lens of social media WOM. Journal of Business Research, 170 (114329). DOI: 10.1016/j.jbusres.2023.114329. Online publication date: Jan-2024
  • (2024) Analyzing research diversity of scholars based on multi-dimensional calculation of knowledge entities. Scientometrics, 129:11 (7329-7358). DOI: 10.1007/s11192-023-04821-3. Online publication date: 1-Nov-2024
  • (2023) DATM: A Novel Data Agnostic Topic Modeling Technique With Improved Effectiveness for Both Short and Long Text. IEEE Access, 11 (32826-32841). DOI: 10.1109/ACCESS.2023.3262653. Online publication date: 2023
  • (2022) Evolutionary stages and multidisciplinary nature of artificial intelligence research. Scientometrics, 127:9 (5139-5158). DOI: 10.1007/s11192-022-04477-5. Online publication date: 16-Aug-2022
  • (2022) Impact of model settings on the text-based Rao diversity index. Scientometrics, 127:12 (7751-7768). DOI: 10.1007/s11192-022-04312-x. Online publication date: 12-Mar-2022
  • (2020) A model of concept hierarchy-based diverse patterns with applications to recommender system. International Journal of Data Science and Analytics. DOI: 10.1007/s41060-019-00203-2. Online publication date: 4-Jan-2020
  • (2020) Finding Attribute Diversified Communities in Complex Networks. Database Systems for Advanced Applications (19-35). DOI: 10.1007/978-3-030-59419-0_2. Online publication date: 22-Sep-2020
  • (2020) Capability interactions and adaptation to demand‐side change. Strategic Management Journal, 41:9 (1595-1627). DOI: 10.1002/smj.3137. Online publication date: 12-May-2020
  • (2019) Interdisciplinarity as diversity in citation patterns among journals: Rao-Stirling diversity, relative variety, and the Gini coefficient. Journal of Informetrics, 13:1 (255-269). DOI: 10.1016/j.joi.2018.12.006. Online publication date: Feb-2019
