research-article
Open access

Measuring Similarity Similarly: LDA and Human Perception

Published: 26 September 2016

Abstract

Several intelligent technologies designed to improve the navigability and digestibility of text corpora use topic models such as the state-of-the-art Latent Dirichlet Allocation (LDA). This model and its variants provide lower-dimensional document representations that are used in visualizations and in computing similarity between documents. This article contributes a method for validating such algorithms against human perceptions of similarity, especially applicable to contexts in which the algorithm is intended to support navigation between similar documents via dynamically generated hyperlinks. Such validation enables researchers to ground their methods in the context of intended use instead of relying on assumptions of fit. In addition to the methodology, this article presents the results of an evaluation using a corpus of short documents and the LDA algorithm. We also analyze potential causes of the differences between cases in which this model matches human perceptions of similarity more closely and cases in which it matches them less closely.
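As an illustration only, the kind of comparison the abstract describes can be sketched as follows: score document pairs from their topic mixtures, then check how well those scores track human similarity ratings. The abstract does not say which similarity measure or agreement statistic the article uses, so the choice of Jensen-Shannon divergence and Spearman rank correlation here is an assumption, and the topic mixtures, document pairs, and ratings are made-up placeholder values, not data from the study.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; lies in [0, 1] for distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def topic_similarity(p, q):
    """Assumed similarity score: 1 minus the Jensen-Shannon divergence."""
    return 1.0 - js_divergence(p, q)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-document topic mixtures (each sums to 1), standing in
# for the output of a fitted topic model.
doc_topics = {
    "a": [0.70, 0.20, 0.10],
    "b": [0.65, 0.25, 0.10],
    "c": [0.10, 0.10, 0.80],
    "d": [0.05, 0.15, 0.80],
}
# Hypothetical document pairs and mean human similarity ratings for them.
pairs = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
human_ratings = [6.5, 1.5, 2.0, 6.0]

model_scores = [topic_similarity(doc_topics[x], doc_topics[y]) for x, y in pairs]
rho = spearman(model_scores, human_ratings)
print(f"Spearman correlation between model and human ratings: {rho:.2f}")
```

A rank correlation is used in this sketch because human ratings and model scores live on different scales; only their ordering of document pairs is compared.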


Information & Contributors

Information

Published In

ACM Transactions on Intelligent Systems and Technology  Volume 8, Issue 1
January 2017
363 pages
ISSN:2157-6904
EISSN:2157-6912
DOI:10.1145/2973184
Editor: Yu Zheng
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 26 September 2016
Accepted: 01 February 2016
Received: 01 December 2015
Published in TIST Volume 8, Issue 1


Author Tags

  1. Perceived similarity
  2. algorithm validation
  3. similarity metrics

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • National Science Foundation's

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 220
  • Downloads (Last 6 weeks): 35
Reflects downloads up to 08 Feb 2025

Citations

Cited By

  • (2024) The Role of Artificial Intelligence in Idea Management Systems and Innovation Processes: An Integrative Review. Proceedings of the Cognitive Models and Artificial Intelligence Conference (160-164). DOI: 10.1145/3660853.3660890. Online publication date: 25-May-2024
  • (2024) Dissimilar Similarities: Comparing Human and Statistical Similarity Evaluation in Medical AI. Modeling Decisions for Artificial Intelligence (187-198). DOI: 10.1007/978-3-031-68208-7_16. Online publication date: 27-Aug-2024
  • (2023) Unsupervised Estimation of Subjective Content Descriptions. 2023 IEEE 17th International Conference on Semantic Computing (ICSC) (266-273). DOI: 10.1109/ICSC56153.2023.00052. Online publication date: Feb-2023
  • (2023) Modeling Synchronization for Detecting Collaborative Learning Process Using a Pedagogical Conversational Agent: Investigation Using Recurrent Indicators of Gaze, Language, and Facial Expression. International Journal of Artificial Intelligence in Education 34:3 (1206-1247). DOI: 10.1007/s40593-023-00381-y. Online publication date: 7-Dec-2023
  • (2022) A Seed-Guided Latent Dirichlet Allocation Approach to Predict the Personality of Online Users Using the PEN Model. Algorithms 15:3 (87). DOI: 10.3390/a15030087. Online publication date: 8-Mar-2022
  • (2022) Do Human and Computational Evaluations of Similarity Align? An Empirical Study of Product Function. Journal of Mechanical Design 144:4. DOI: 10.1115/1.4053858. Online publication date: 1-Mar-2022
  • (2022) Leveraging Multiple Representations of Topic Models for Knowledge Discovery. IEEE Access 10 (104696-104705). DOI: 10.1109/ACCESS.2022.3210529. Online publication date: 2022
  • (2021) Interactive Evolution of Multidimensional Information in Social Media for Public Emergency: A Perspective from Optics Scattering. Data and Information Management 5:4 (389-411). DOI: 10.2478/dim-2021-0008. Online publication date: Oct-2021
  • (2021) VizCommender: Computing Text-Based Similarity in Visualization Repositories for Content-Based Recommendations. IEEE Transactions on Visualization and Computer Graphics 27:2 (495-505). DOI: 10.1109/TVCG.2020.3030387. Online publication date: Feb-2021
  • (2021) Business Analytics in Strategic Purchasing: Identifying and Evaluating Similarities in Supplier Documents. Applied Artificial Intelligence 35:12 (857-875). DOI: 10.1080/08839514.2021.1936423. Online publication date: 19-Jul-2021
