What are developers talking about? An analysis of topics and trends in Stack Overflow

Published: 01 June 2014

Abstract

Programming question and answer (Q&A) websites, such as Stack Overflow, leverage the knowledge and expertise of users to provide answers to technical questions. Over time, these websites turn into repositories of software engineering knowledge. Such knowledge repositories can be invaluable for gaining insight into the use of specific technologies and the trends of developer discussions. Previous work has focused on analyzing user activities or social interactions in Q&A websites. However, analyzing the actual textual content of these websites can help the software engineering community better understand the thoughts and needs of developers. In this article, we present a methodology to analyze the textual content of Stack Overflow discussions. We use latent Dirichlet allocation (LDA), a statistical topic modeling technique, to automatically discover the main topics present in developer discussions. We analyze these discovered topics, as well as their relationships and trends over time, to gain insights into the development community. Our analysis allows us to make a number of interesting observations, including: the topics of interest to developers range widely, from jobs to version control systems to C# syntax; questions in some topics lead to discussions in other topics; and the topics gaining the most popularity over time are web development (especially jQuery), mobile applications (especially Android), Git, and MySQL.
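To make the idea of discovering topics with LDA more concrete, the following is a minimal sketch of LDA fitted by collapsed Gibbs sampling on a handful of tokenized posts. It is an illustration only: the abstract does not say how the model was fitted, so the sampler, the function name `lda_gibbs`, the hyperparameter values, and the toy `posts` are all assumptions, not the authors' setup.

```python
import random
from collections import defaultdict


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0, top_n=3):
    """Fit LDA by collapsed Gibbs sampling (illustrative, hypothetical helper).

    docs: list of token lists. Returns (theta, top_words), where theta[d][k]
    is the estimated share of topic k in document d.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]                # tokens of doc d in topic k
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                              # tokens assigned to topic k

    # Start from a random topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed per-document topic shares; each row sums to 1.
    theta = [
        [(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    # Most frequent words per topic, ignoring words with a zero count.
    top_words = [
        sorted((w for w, c in tw.items() if c > 0), key=tw.get, reverse=True)[:top_n]
        for tw in topic_word
    ]
    return theta, top_words


# Toy, made-up posts (assumption: already tokenized and lowercased).
posts = [
    ["git", "branch", "merge", "commit"],
    ["git", "commit", "push", "remote"],
    ["jquery", "selector", "ajax", "event"],
    ["jquery", "ajax", "callback", "event"],
]
theta, top_words = lda_gibbs(posts, num_topics=2)
for k, words in enumerate(top_words):
    print(f"topic {k}: {words}")
```

Averaging the per-document topic shares in `theta` over the posts written in a given time window is one simple way to follow a topic's popularity over time; the measure used in the article itself may differ.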

Information & Contributors

Information

Published In

Empirical Software Engineering, Volume 19, Issue 3
June 2014
355 pages

Publisher

Kluwer Academic Publishers

United States

Author Tags

  1. Knowledge repository
  2. Latent Dirichlet allocation
  3. Mining software repositories
  4. Q&A websites
  5. Topic models
  6. Trend analysis

Qualifiers

  • Article
