research-article

Open access

Measuring Similarity Similarly: LDA and Human Perception

Authors:

Carolyn P. Rosé,

James D. Herbsleb

ACM Transactions on Intelligent Systems and Technology (TIST), Volume 8, Issue 1

Article No.: 7, Pages 1 - 28

https://doi.org/10.1145/2890510

Published: 26 September 2016

Abstract

Several intelligent technologies designed to improve the navigability and digestibility of text corpora use topic modeling, such as the state-of-the-art Latent Dirichlet Allocation (LDA). This model and its variants provide lower-dimensional document representations used in visualizations and in computing similarity between documents. This article contributes a method for validating such algorithms against human perceptions of similarity, especially applicable to contexts in which the algorithm is intended to support navigability between similar documents via dynamically generated hyperlinks. Such validation enables researchers to ground their methods in the context of intended use instead of relying on assumptions of fit. In addition to the methodology, this article presents the results of an evaluation using a corpus of short documents and the LDA algorithm. We also analyze potential causes of the differences between cases in which this model matches human perceptions of similarity more or less well.
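
The kind of validation described in the abstract can be illustrated with a minimal sketch. The abstract does not name the similarity measure or the agreement statistic, so the choices below are assumptions: Jensen-Shannon divergence between per-document topic distributions as the model's notion of (dis)similarity, and Spearman rank correlation against human ratings. The topic vectors, document pairs, and ratings are made-up placeholders, not data from the article.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

# Hypothetical topic distributions, e.g. as inferred by an LDA model.
doc_topics = {
    "a": [0.8, 0.1, 0.1],
    "b": [0.7, 0.2, 0.1],
    "c": [0.2, 0.7, 0.1],
    "d": [0.1, 0.1, 0.8],
}
# Hypothetical document pairs and mean human similarity ratings (1-7 scale).
pairs = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
human = [6.5, 2.5, 1.5, 4.0]

# Turn divergence into a similarity score: higher means more similar.
model_sim = [1 - js_divergence(doc_topics[x], doc_topics[y]) for x, y in pairs]
rho = spearman(model_sim, human)
print(f"Spearman correlation between model and human similarity: {rho:.3f}")
```

A rank correlation is used here because human ratings and divergence-based scores sit on different scales; only their ordering of document pairs is compared. Any other measure over topic distributions could be substituted for `js_divergence` without changing the rest of the sketch.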

Published In

ACM Transactions on Intelligent Systems and Technology Volume 8, Issue 1

January 2017

363 pages

ISSN:2157-6904

EISSN:2157-6912

DOI:10.1145/2973184

Editor:
Yu Zheng
Microsoft Research, China

Copyright © 2016 Owner/Author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 26 September 2016

Accepted: 01 February 2016

Received: 01 December 2015

Published in TIST Volume 8, Issue 1

Qualifiers

Research-article
Research
Refereed

Funding Sources

National Science Foundation's

Article Metrics

Total Citations: 22
Total Downloads: 1,346
Downloads (Last 12 months): 220
Downloads (Last 6 weeks): 35

Reflects downloads up to 08 Feb 2025

Cited By

Leka S. 2024. The Role of Artificial Intelligence in Idea Management Systems and Innovation Processes: An Integrative Review. Proceedings of the Cognitive Models and Artificial Intelligence Conference, 160-164. Online publication date: 25-May-2024.
https://dl.acm.org/doi/10.1145/3660853.3660890

Cabitza F, Famiglini L, Campagner A, Sconfienza L, Fusco S, Caccavella V, Gallazzi E. 2024. Dissimilar Similarities: Comparing Human and Statistical Similarity Evaluation in Medical AI. Modeling Decisions for Artificial Intelligence, 187-198. Online publication date: 27-Aug-2024.
https://dl.acm.org/doi/10.1007/978-3-031-68208-7_16

Bender M, Braun T, Möller R, Gehrke M. 2023. Unsupervised Estimation of Subjective Content Descriptions. 2023 IEEE 17th International Conference on Semantic Computing (ICSC), 266-273. Online publication date: Feb-2023.
https://doi.org/10.1109/ICSC56153.2023.00052

Hayashi Y. 2023. Modeling Synchronization for Detecting Collaborative Learning Process Using a Pedagogical Conversational Agent: Investigation Using Recurrent Indicators of Gaze, Language, and Facial Expression. International Journal of Artificial Intelligence in Education 34:3, 1206-1247. Online publication date: 7-Dec-2023.
https://doi.org/10.1007/s40593-023-00381-y

Sagadevan S, Malim N, Husin M. 2022. A Seed-Guided Latent Dirichlet Allocation Approach to Predict the Personality of Online Users Using the PEN Model. Algorithms 15:3, 87. Online publication date: 8-Mar-2022.
https://doi.org/10.3390/a15030087

Nandy A, Goucher-Lambert K. 2022. Do Human and Computational Evaluations of Similarity Align? An Empirical Study of Product Function. Journal of Mechanical Design 144:4. Online publication date: 1-Mar-2022.
https://doi.org/10.1115/1.4053858

Potts C, Savaliya A, Jhala A. 2022. Leveraging Multiple Representations of Topic Models for Knowledge Discovery. IEEE Access 10, 104696-104705. Online publication date: 2022.
https://doi.org/10.1109/ACCESS.2022.3210529

Ma X, Meng X, Ma H. 2021. Interactive Evolution of Multidimensional Information in Social Media for Public Emergency: A Perspective from Optics Scattering. Data and Information Management 5:4, 389-411. Online publication date: Oct-2021.
https://doi.org/10.2478/dim-2021-0008

Oppermann M, Kincaid R, Munzner T. 2021. VizCommender: Computing Text-Based Similarity in Visualization Repositories for Content-Based Recommendations. IEEE Transactions on Visualization and Computer Graphics 27:2, 495-505. Online publication date: Feb-2021.
https://doi.org/10.1109/TVCG.2020.3030387

Bodendorf F, Wytopil B, Franke J. 2021. Business Analytics in Strategic Purchasing: Identifying and Evaluating Similarities in Supplier Documents. Applied Artificial Intelligence 35:12, 857-875. Online publication date: 19-Jul-2021.
https://doi.org/10.1080/08839514.2021.1936423
