
Representation Learning for Stack Overflow Posts: How Far Are We?

Published: 15 March 2024

Abstract

The tremendous success of Stack Overflow has accumulated an extensive corpus of software engineering knowledge, thus motivating researchers to propose various solutions for analyzing its content. The performance of such solutions hinges significantly on the selection of representation models for Stack Overflow posts. As the volume of literature on Stack Overflow continues to burgeon, it highlights the need for a powerful Stack Overflow post representation model and drives researchers’ interest in developing specialized representation models that can adeptly capture the intricacies of Stack Overflow posts. The state-of-the-art (SOTA) Stack Overflow post representation models are Post2Vec and BERTOverflow, which are built upon neural networks such as convolutional neural network and transformer architecture (e.g., BERT). Despite their promising results, these representation methods have not been evaluated in the same experimental setting. To fill the research gap, we first empirically compare the performance of the representation models designed specifically for Stack Overflow posts (Post2Vec and BERTOverflow) in a wide range of related tasks (i.e., tag recommendation, relatedness prediction, and API recommendation). The results show that Post2Vec cannot further improve the SOTA techniques of the considered downstream tasks, and BERTOverflow shows surprisingly poor performance. To find more suitable representation models for the posts, we further explore a diverse set of transformer-based models, including (1) general domain language models (RoBERTa, Longformer, and GPT2) and (2) language models built with software engineering related textual artifacts (CodeBERT, GraphCodeBERT, seBERT, CodeT5, PLBart, and CodeGen). This exploration shows that models like CodeBERT and RoBERTa are suitable for representing Stack Overflow posts. However, it also illustrates the “No Silver Bullet” concept, as none of the models consistently wins against all the others. Inspired by the findings, we propose SOBERT, which employs a simple yet effective strategy to improve the representation models of Stack Overflow posts by continuing the pre-training phase with the textual artifact from Stack Overflow. The overall experimental results demonstrate that SOBERT can consistently outperform the considered models and increase the SOTA performance significantly for all the downstream tasks.

1 Introduction

Serving as the most popular Software Question and Answer (SQA) forum, Stack Overflow has dramatically influenced modern software development practice. As of August 2023, the forum has accumulated more than 23 million questions and 35 million answers.1 Stack Overflow is broadly recognized as an invaluable knowledge base and supplemental resource for the Software Engineering (SE) domain [6, 18, 52, 55], which has triggered increasing interest from researchers and software developers in a wide range of Stack Overflow post-related tasks, such as recommendation of post tags (a.k.a. tag recommendation) [18], recommendation of APIs according to a Natural Language (NL) query (a.k.a. API recommendation) [6], and the identification of related posts (a.k.a. relatedness prediction) [55].
An essential step in yielding promising results for these Stack Overflow-related tasks is to obtain suitable representations of the posts. An effective Stack Overflow representation model can capture the semantics of the posts and reveal explanatory features in the hidden dimensions. As the volume of SE literature on Stack Overflow-related tasks [18, 52, 55] continues to grow, the demand for a quality Stack Overflow representation has become increasingly evident.
Over the years, numerous representation models have been specifically proposed for modeling Stack Overflow posts. Xu et al. [54] proposed Post2Vec, a Convolutional Neural Network (CNN)-based [24] representation model that leverages the tags of a post to guide the learning process and models the post as the combination of three complementary components (i.e., title, description, and code snippet). Their experimental results demonstrate that it can substantially boost the performance for a wide range of Stack Overflow post-related tasks [1, 3, 6]. Tabassum et al. [44] leveraged the more advanced transformer architecture and pre-trained BERTOverflow based on 152 million sentences from Stack Overflow. The results demonstrate that the embeddings generated by BERTOverflow have led to a significant improvement over other off-the-shelf models (e.g., ELMo [34] and BERT [13]) in the software Named Entity Recognition (NER) task.
Although these existing Stack Overflow-specific methods have proven beneficial, the effectiveness of Post2Vec has only been evaluated with limited solutions (i.e., Support Vector Machine (SVM) [55] and Random Forest [5]), and BERTOverflow has only been evaluated on the NER task. These limitations motivate us to further study the performance of existing Stack Overflow-specific representation models on a diverse set of tasks. Unexpectedly, we found that both Post2Vec and BERTOverflow perform poorly. Such findings motivate us to explore the effectiveness of a larger array of representation techniques in modeling Stack Overflow posts.
In addition to the aforementioned Stack Overflow-specific representation models, we further consider nine transformer-based language models that could be potentially suitable for post representation learning. These models can be classified into two types: SE domain-specific models and general domain models. SE domain-specific models are trained with SE-related contents (i.e., Github repositories) and are suitable for capturing the semantics of technical jargon of the SE domain. We consider six SE domain-specific models: CodeBERT [14], GraphCodeBERT [16], seBERT [49], CodeGen [32], CodeT5 [50], and PLBart [2]. We also include models from the general domain as they are usually trained with a more diverse amount of data than domain-specific models. For general domain models, we consider RoBERTa [28], Longformer [4], and GPT2 [38].
We evaluate the performance of the aforementioned representation models on multiple Stack Overflow related downstream tasks (i.e., tag recommendation [54], API recommendation [6], and relatedness prediction [55]). Furthermore, we build SOBERT, a stronger transformer-based language model for modeling Stack Overflow posts. Our experimental results reveal several interesting findings:
(1)
Existing Stack Overflow post representation techniques fail to improve the SOTA performance of the considered tasks. Xu et al. [54] demonstrated that appending the feature vectors generated by Post2Vec is beneficial for improving the post representation used by traditional machine learning techniques. However, we discover that appending the feature vectors from Post2Vec [54] does not benefit the considered deep neural network-based techniques. Furthermore, we reveal that the embedding generated by BERTOverflow achieves only reasonable performance in the API recommendation task and performs surprisingly poorly in the tag recommendation task.
(2)
Among all the considered models, none of them can always perform the best. According to our experiment results, although several representation models can outperform the SOTA approaches, none can always perform the best. As a result, this motivates us to propose a new model for representing Stack Overflow posts.
(3)
Continued pre-training based on Stack Overflow textual artifact develops a consistently better representation model. We propose SOBERT by further pre-training with Stack Overflow data. The overall results show that SOBERT consistently boosts the performance in all three considered tasks, implying a better representation.
Overall, we summarize the contributions of our empirical study as follows:
(1)
We comprehensively evaluate the effectiveness of 11 representation models for Stack Overflow posts in three downstream tasks.
(2)
We propose SOBERT by pre-training based on posts from Stack Overflow and show that SOBERT consistently outperforms other representation models in multiple downstream tasks.
(3)
We derive several insightful lessons from the experimental results for the SE community.
The rest of the article is organized as follows. Section 2 categorizes representation learning models into three groups and briefly describes them. We formulate the downstream tasks (i.e., tag recommendation, API recommendation, relatedness prediction) and their corresponding State-of-the-Art (SOTA) method in Section 3. Section 4 introduces our research questions and the experiment settings. In Section 5, we answer the research questions and report the experiment results. Section 6 further analyzes the results and elaborates the insights with evidence. Section 7 describes related work, and Section 8 summarizes this study.

2 Representation Learning Models

In this section, we summarize the considered representation models in this article. We explore a wide range of techniques across the spectrum of representing Stack Overflow posts, including two Stack Overflow-specific post representation models (Post2Vec [54] and BERTOverflow [44]), six SE domain-specific transformer-based Pre-Trained Representation Models (PTMs) (CodeBERT [14], GraphCodeBERT [16], seBERT [49], CodeT5 [50], PLBart [2], and CodeGen [32]), and three transformer-based PTMs from the general domain (RoBERTa [28], Longformer [4], and GPT2 [38]).

2.1 Transformer-Based Language Models

Transformer-based language models have revolutionized the landscape of representation learning in Natural Language Processing (NLP) [13, 28, 38]. Their efficacy in capturing text semantics has led to unparalleled performance in various applications, such as sentiment analysis [43], POS tagging [45], and question answering [36]. The vanilla transformer architecture [48] is composed of the encoder and decoder components. Based on the usage of these components, transformer-based language models can be categorized into three types: encoder-only, decoder-only, and encoder-decoder models.
Encoder-only models exclusively leverage the encoder stacks of the vanilla transformer [48] architecture. BERT [13] stands as a prominent encoder-only representation model, which learns a bi-directional contextual representation of text. BERT proposes the Masked Language Modeling (MLM) task at the pre-training phase. In MLM, the input data is corrupted by randomly masking 15% of the tokens, and the model then learns to reconstruct the original data by predicting the masked words. BERT is extensively pre-trained on large-scale datasets and learns meaningful representations that are reusable across various tasks, thus eliminating the need to train language models from scratch and saving time and resources.
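To make the MLM objective concrete, the following minimal sketch (assuming the HuggingFace transformers library and the public bert-base-uncased checkpoint; the sample sentence is illustrative) masks 15% of the input tokens and computes the reconstruction loss:

```python
# A minimal sketch of the MLM objective with HuggingFace transformers.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# The collator randomly masks 15% of the tokens; labels of unmasked positions are set to -100
# so that they are ignored by the cross-entropy loss.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

encoding = tokenizer("How do I compare strings in Java?", return_tensors="pt")
batch = collator([{"input_ids": encoding["input_ids"][0]}])

outputs = model(input_ids=batch["input_ids"], labels=batch["labels"])
print(outputs.loss)  # reconstruction loss over the masked positions only
```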
In contrast, decoder-only models consist solely of the decoder components of the original transformer architecture. A notable instance of such models is the GPT [37], which operates under a causal language modeling framework during its training phase. Causal language modeling is a strategy where the model predicts the next token in a sequence while only considering preceding tokens. In other words, this design restricts the model from accessing future tokens in the sequence.
Bridging the preceding approaches, encoder-decoder models integrate both the encoder and decoder components of the transformer architecture. Popular encoder-decoder models involve T5 [39] and BART [25]. The T5 model [39] advocates a unified text-to-text framework that converts various language tasks into a consistent text-to-text format. T5 is pre-trained on the Colossal Clean Crawled Corpus [39], along with a mixture of unsupervised and supervised pre-training tasks. BART [25] introduces a variety of noising functions to corrupt the initial input sequence (i.e., token deletion, document rotation, and sentence shuffling) during the pre-training phase. By corrupting the original sequence through these mechanisms, BART is trained to restore the original input.

2.2 Existing Representation Models for Stack Overflow Posts

Post2Vec [54] is the latest approach proposed specifically for Stack Overflow post representation learning. Post2Vec is designed with a triplet architecture to process three components of a Stack Overflow post (i.e., title, text, and code snippets). It leverages CNNs as feature extractors to encode the three components separately. The corresponding three output feature vectors are then fed to a feature fusion layer to represent the post. In the end, Post2Vec uses the tag information of the post, regarded as a summary of the post’s general semantic meaning, to supervise the representation learning process. Xu et al. [54] demonstrated that the representation learned by Post2Vec can enhance the feature vectors for Stack Overflow-related downstream tasks (e.g., relatedness prediction and API recommendation). For each downstream task, the vector representation learned by Post2Vec is combined with the feature vector produced by the corresponding SOTA approach to form a new feature vector, which is used to boost the performance of the corresponding model for the task. Following the experiment settings of Xu et al., we use Post2Vec as a complementary feature vector to the SOTA approach in this work. Specifically, we concatenate the post representation generated by Post2Vec to the original feature vector of the SOTA approach, and the combined feature vector is employed in further training.
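The combination step described above amounts to a simple vector concatenation before the downstream classifier. The sketch below illustrates the idea; the module, dimensions, and label count are hypothetical and chosen only for illustration, not taken from Post2Vec or the SOTA implementations:

```python
# A minimal sketch of appending a Post2Vec embedding to a downstream model's feature vector.
import torch
import torch.nn as nn

class FusedClassifier(nn.Module):
    def __init__(self, task_feature_dim=768, post2vec_dim=300, num_labels=4):
        super().__init__()
        # The classification layer operates on the concatenated feature vector.
        self.classifier = nn.Linear(task_feature_dim + post2vec_dim, num_labels)

    def forward(self, task_features, post2vec_features):
        # task_features: output of the downstream model's own feature extractor
        # post2vec_features: embedding produced by Post2Vec for the same post
        fused = torch.cat([task_features, post2vec_features], dim=-1)
        return self.classifier(fused)

# Example with a batch of two posts.
logits = FusedClassifier()(torch.randn(2, 768), torch.randn(2, 300))
print(logits.shape)  # torch.Size([2, 4])
```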
BERTOverflow [44] keeps the original BERT architecture and leverages 152 million sentences and 2.3 billion tokens from Stack Overflow to pre-train Stack Overflow-specific word embeddings. The authors leveraged the embeddings generated by BERTOverflow to implement a software-related named entity recognizer (SoftNER). SoftNER is evaluated on the NER task for the software engineering domain, focusing on identifying code tokens or programming-related named entities that appear within SQA sites like Stack Overflow. The results show that BERTOverflow outperforms all other models in the proposed task.

2.3 Representation Models from the SE Domain

CodeBERT [14] is an SE knowledge-enriched bi-modal pre-trained model, which is capable of modeling both NLs and Programming Languages (PLs). The CodeBERT model has shown great effectiveness in a diverse range of SE domain-specific activities, such as code search [14], traceability prediction [27], and code translation [8]. CodeBERT inherits the architecture of BERT [13], and it continues pre-training based on the checkpoint of RoBERTa [28] with the NL-PL data pairs obtained from the CodeSearchNet dataset [22]. It has two popular pre-training objectives: MLM and Replaced Token Detection (RTD) [10]. Rather than masking the input like MLM, RTD corrupts the input by replacing certain tokens with plausible substitutes. RTD then predicts if each token in the altered input was replaced or remained unchanged. The eventual loss function for CodeBERT at the pre-training stage is the combination of both MLM and RTD objectives, where \(\theta\) denotes the model parameters:
\begin{equation} \min_{\theta }\left(\mathcal {L}_{RTD}(\theta)+\mathcal {L}_{MLM}(\theta)\right). \tag{1} \end{equation}
GraphCodeBERT [16] incorporates a hybrid representation in source code modeling. Apart from addressing the pre-training process over NL and PL, GraphCodeBERT utilizes the dataflow graph of source code as additional inputs and proposes two structure-aware pre-training tasks (i.e., Edge Prediction and Node Alignment) aside from the MLM prediction task. GraphCodeBERT is evaluated in code search [14], clone detection [51], code translation [8], and code refinement [46], respectively. It outperforms CodeBERT and all the other baselines, including RoBERTa (code version) [16], Transformer [48], and LSTM [19].
seBERT [49] aims to advance previous PTMs in the SE context with a larger model architecture and more diverse pre-training data. The authors pre-trained seBERT using the BERT \(_{LARGE}\) architecture (i.e., with 24 layers, a hidden layer size of 1,024, and 16 self-attention heads) with a total of 340 million parameters. seBERT is pre-trained with more than 119 GB of data from four data sources: Stack Overflow posts, GitHub issues, Jira issues, and GitHub commit messages. The model’s effectiveness is verified in three classification tasks: issue type prediction, commit intent prediction, and sentiment mining. The experimental results show that seBERT is significantly better than BERTOverflow in these tasks.
CodeGen [32] is a decoder-only transformer-based PTM for program synthesis, and it undergoes pre-training on an extensive dataset comprising both NLs and PLs. This model introduces a multi-turn program synthesis paradigm and creates a comprehensive benchmark for multi-turn programming tasks. The multi-turn program synthesis approach involves users and the model together in multiple steps. The user communicates with the model by progressively providing specifications in NL while receiving responses from the model in the form of synthesized sub-programs. CodeGen demonstrates SOTA performance on Python code generation on HumanEval [7] and a set of tasks in the multi-turn programming benchmark. The model is pre-trained on the BigQuery dataset.2 This dataset contains GitHub repositories with multiple PLs, including C, C++, Go, Java, JavaScript, and Python.
CodeT5 [50] is an encoder-decoder PTM that is designed to better consider the code semantics conveyed from the identifiers from code. It has the same architecture as T5 [39] and employs a multi-task training process to support both code understanding and generation tasks. CodeT5 utilizes a novel identifier-aware pre-training task that enables the model to distinguish which code tokens are identifiers and to recover them when they are masked. To improve the NL-PL alignment, CodeT5 further incorporates a bi-modal dual-learning objective for a bi-directional conversion between NLs and PLs.
PLBart [2] is also an encoder-decoder PTM, which is capable of performing a broad spectrum of program-related understanding and generation tasks. PLBart employs the same architecture as BART [25] and is pre-trained on a large collection of Java and Python functions and associated NL documentation from GitHub repositories and Stack Overflow posts. Experiments showed that PLBart can achieve promising performance in code summarization, code generation, and code translation.

2.4 Representation Models from the General Domain

RoBERTa [28] originates from a replication study of BERT pre-training [13] that analyzed the impact of key hyper-parameters. The insights from the replication study led to the development of RoBERTa, an improved version of BERT. In comparison with BERT, RoBERTa makes several modifications to the pre-training stage: (1) training with a larger batch size, more data, and longer training time; (2) abandoning the next sentence prediction task of BERT, after showing that its removal slightly improves the model efficiency; (3) training with longer sequences; and (4) masking the training data dynamically rather than statically.
Longformer [4] aims to alleviate the limitation of transformer-based models in processing long sequences. The self-attention mechanism of the transformer suffers from the \(O(n^2)\) quadratic computational complexity problem, which restricts the ability of transformer-based models to model long sequences. Pre-trained models like BERT [13] and RoBERTa [28] only accept a maximum input of 512 tokens. Longformer leverages a combination of sliding window attention and global attention mechanism such that the computational memory consumption scales linearly as the sequence becomes longer. In contrast to models like RoBERTa and CodeBERT, which could only accept a maximum of 512 tokens as input, Longformer supports sequences of length up to 4,096. Similar to CNN [24], Longformer lets each input token only attend to surrounding neighbors that are within a fixed window size. Denoting the window size as \(w\) , each token could only attend to \(\frac{1}{2}w\) tokens on both sides, thus decreasing the computation complexity to \(O(n \times w)\) . However, the sliding window may compromise the performance, as it cannot capture the whole context. To compensate for the side effect, global tokens are selected. Such tokens are implemented with global attention, which attends to all other tokens, and other tokens also attend to the global tokens.
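The sketch below illustrates how such a sliding-window-plus-global-attention model can encode a long post, assuming the HuggingFace allenai/longformer-base-4096 checkpoint; the input text and the choice of marking only the first token as global are illustrative:

```python
# A minimal sketch of encoding a long post with Longformer's local attention plus a global token.
import torch
from transformers import LongformerTokenizer, LongformerModel

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

inputs = tokenizer("Title and body of a long Stack Overflow post ...", return_tensors="pt",
                   truncation=True, max_length=1024)

# All tokens use local (sliding-window) attention by default; mark the first token as a
# global token so that it attends to, and is attended by, every other token.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

outputs = model(**inputs, global_attention_mask=global_attention_mask)
post_embedding = outputs.last_hidden_state[:, 0]  # representation taken from the first token
print(post_embedding.shape)
```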
GPT2 [38] is a decoder-only model. It is trained with a simple objective: predicting the next word, given all of the previous words within the text. GPT-2 is trained on a dataset of 8 million web pages. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more parameters and trained on more than 10 times the amount of data.

3 Downstream Tasks

In this section, we formulate the target problems that are used to measure the effectiveness of the representation models and then describe the corresponding SOTA solutions. We select multiple Stack Overflow-related downstream tasks, which have been popular research topics for Stack Overflow posts. To be more specific, we consider Tag Recommendation [18, 54], API Recommendation [6, 52], and Relatedness Prediction [33, 55], covering a multi-label classification problem, a ranking problem, and a multi-class classification problem. All selected tasks operate on the abstraction of a post and can therefore benefit from a high-quality Stack Overflow post representation.

3.1 Tag Recommendation

The user-annotated tags of a Stack Overflow post serve as helpful metadata and have a critical role in organizing the contents of Stack Overflow posts across different topics. Suitable tags precisely summarize the message of a post, whereas redundant tags and synonym tags make it more difficult to maintain the content of the site. A tag recommendation system could effectively simplify the tagging process and minimize the effect of manual errors, therefore avoiding problems like tag synonyms and tag redundancy.

3.1.1 Task Formulation.

We formulate the tag recommendation task as a multi-label classification problem. Given \(\mathcal {X}\) as the corpus of Stack Overflow posts and \(\mathcal {Y}\) as the total collection of tags, we represent each post as \(x_i\), where \(0 \le i \le |\mathcal {X}|\) and \(i \in \mathbb {N}\), and the tags of each post as \(y_i \subset \mathcal {Y}\). The goal is to recommend the most relevant set of tags \(y_i\) for \(x_i\).
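A minimal sketch of this multi-label formulation is shown below: a sigmoid classification head over a post embedding, trained with binary cross-entropy and queried with top-k at inference time. The encoder is abstracted away, and the tag indices are illustrative; only the tag vocabulary size follows the dataset described in Section 4.

```python
# A minimal sketch of tag recommendation as multi-label classification.
import torch
import torch.nn as nn

NUM_TAGS = 3207          # size of the tag vocabulary Y in our dataset
EMBED_DIM = 768          # dimension of the post representation x_i (illustrative)

head = nn.Linear(EMBED_DIM, NUM_TAGS)
criterion = nn.BCEWithLogitsLoss()

post_embedding = torch.randn(1, EMBED_DIM)    # produced by a representation model
target = torch.zeros(1, NUM_TAGS)
target[0, [10, 42, 99]] = 1.0                 # ground-truth tags y_i (indices are illustrative)

logits = head(post_embedding)
loss = criterion(logits, target)              # training signal

top5 = torch.topk(torch.sigmoid(logits), k=5).indices  # recommended tags at inference time
print(loss.item(), top5)
```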

3.1.2 SOTA Technique.

PTM4Tag [18] leverages three pre-trained models to solve the tag recommendation problem. The three pre-trained models are responsible for modeling the title, description, and code snippet, independently.

3.2 API Recommendation

The modern software development process heavily relies on third-party APIs, which has led to research on automated API recommendation approaches intended to simplify the process of API search [53]. Questions related to APIs are among the most viewed topics on Stack Overflow [20], and the site contains an enormous amount of discussion about API usage. Rather than checking API documentation, developers are more inclined to search for relevant Stack Overflow posts and pick out the APIs that seem useful in the discussions [21]. This makes Stack Overflow the primary source for building a dataset for the API recommendation task.

3.2.1 Task Formulation.

We follow the same task definition as previous literature [15, 20, 52]. Given an NL query that describes programming requirements, the goal is to recommend relevant APIs that implement the function for the query. Thus, the task aims to inform developers which API to use for a programming task. Formally speaking, given the corpus of NL queries \(\mathcal {Q}\), we denote each query as \(q_i\). The goal of the API recommendation system is to find a set of relevant APIs \(y_i \subset \mathcal {Y}\) for \(q_i\), where \(\mathcal {Y}\) is the total set of available APIs.

3.2.2 SOTA Technique.

Wei et al. [52] proposed CLEAR, an automated approach that recommends APIs by embedding queries and Stack Overflow posts with a PTM (a distilled version of RoBERTa3). Given an NL query, CLEAR first picks a subset of candidate Stack Overflow posts based on embedding similarity to reduce the search space. Then, CLEAR ranks the candidate Stack Overflow posts and recommends the APIs from the top-ranked Stack Overflow posts.
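The sketch below illustrates the retrieval idea behind the first stage, that is, shortlisting candidate posts by embedding similarity. It is not CLEAR’s actual implementation; the mean-pooling strategy, model checkpoint, and candidate titles are placeholders chosen for illustration:

```python
# A minimal sketch of shortlisting candidate posts by embedding similarity to an NL query.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModel.from_pretrained("distilroberta-base")

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    # Mean-pool over non-padding tokens to obtain one vector per text.
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)

query = "how to convert a string to an integer in java"
candidates = ["Converting String to int in Java?", "How do I compare strings in Java?"]

sims = torch.nn.functional.cosine_similarity(embed([query]), embed(candidates))
ranked = sims.argsort(descending=True)
print([candidates[i] for i in ranked])  # the top-ranked posts supply the recommended APIs
```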

3.3 Relatedness Prediction

The notion of a Knowledge Unit (KU) is defined as a set containing a question along with all of its answers [3, 33, 55]. To find a comprehensive technical solution for a given problem, developers usually need to summarize the information from multiple related KUs. However, searching for related KUs can be time consuming, as the same question can be rephrased in many different ways. Thus, researchers have proposed several techniques to automate the process of identifying the related KUs [3, 33, 55], which could significantly improve the efficiency of the software development cycle.

3.3.1 Task Formulation.

The task is commonly formulated as a multi-class classification problem [3, 33, 55]. The relatedness between questions is classified into four classes, from the most relevant to irrelevant:
Duplicate: Two KUs correspond to a pair of semantically equivalent questions. The answer of one KU can also be used to answer another KU.
Direct: One KU is beneficial in answering the question in another KU, for example, by explaining certain concepts and giving examples.
Indirect: One KU provides relevant information but does not directly answer the questions of another KU.
Isolated: The two KUs are semantically uncorrelated.
Given the set \(K\) of all KUs, the goal of relatedness prediction is to predict the degree of relatedness between any two KUs: \(k_i\) and \(k_j\) . The relatedness class is denoted as \(C\) , where \(C=\lbrace {\it Duplicate}, {\it Direct}, {\it Indirect}, {\it Isolated} \rbrace\) . Formally, the task of relatedness prediction is defined as obtaining the function \(R\) such that \(R(k_i, k_j) = c\) , where \(c \in C\) .
In Table 1, we demonstrate examples of pairs of KUs with different relatedness:
Table 1. Examples of Duplicate, Direct, and Indirect KU Pairs for the Relatedness Prediction Task

Relatedness  | Post ID  | Title
Original KU  | 513832   | How do I compare strings in Java?
Duplicate KU | 3281448  | Strings in Java : equals vs ==
Direct KU    | 34509566 | “==” in case of string concatenation in Java
Indirect KU  | 11989261 | Does concatenating strings in Java always lead to new strings being created in memory?
Original KU: This KU addresses the topic of string comparison in Java.
Duplicate KU: This KU introduces the difference of “equals” and “==”. Essentially, it offers a varied perspective on the same topic of string comparison in Java.
Direct KU: This KU focuses on the behavior of “==” during string concatenation in Java. As it provides insights directly beneficial to understanding the original KU’s topic without being a duplicate, it is categorized as directly related.
Indirect KU: This KU discusses the memory allocation during string concatenation in Java. While it does not directly address the main topic of string comparison, its relevance to the direct KU concerning string operations classifies it as an indirectly related unit.

3.3.2 SOTA Technique.

Recently, Pei et al. [33] introduced ASIM, which yielded SOTA performance in the relatedness prediction task. Pei et al. pre-trained word embeddings specialized to model Stack Overflow posts with a corpus collected from the Stack Overflow data dump. Then ASIM uses BiLSTM [42] to extract features from Stack Overflow posts and implements the attention mechanism to capture the semantic interaction among the KUs.

4 Research Questions and Experimental Settings

In this section, we first introduce our research questions and then describe the corresponding experiment settings.

4.1 Research Questions

RQ1: How Effective Are the Existing Stack Overflow Post Representation Models?.

Various methods have been proposed for modeling Stack Overflow posts. However, there is still a lack of analysis of the existing Stack Overflow-specific representation methods. For instance, Xu et al. [54] demonstrated that Post2Vec is effective in boosting the performance of traditional machine learning algorithms (i.e., SVM and Random Forest). However, the efficacy of Post2Vec in facilitating deep learning-based models has not yet been investigated. Moreover, Tabassum et al. [44] only leveraged the embeddings from BERTOverflow in the software-related NER task, but not for other popular Stack Overflow-related tasks. In light of this research gap, in this research question we evaluate the current Stack Overflow-specific representation methods on popular Stack Overflow-related tasks under the same setting.

RQ2: How Effective Are the Popular Transformer-Based Language Models for the Targeted Downstream Tasks?.

In addition to the existing Stack Overflow representation models, we explore the effectiveness of a wider spectrum of representation models. Transformer-based language models have shown great performance and generalizability in representation learning. Representations generated by such models have demonstrated promising performance in a broad range of tasks with datasets of varying sizes and origins. Borrowing the best-performing representation models from various domains and investigating their performance can derive interesting results, as recent literature [57, 58] has revealed that they are potentially great candidates for representing posts as well. This motivates us to employ RoBERTa [28] and Longformer [4] from the general domain and CodeBERT [14], GraphCodeBERT [16], and seBERT [49] from the SE domain. We set up the exact same experimental settings for each model.

RQ3: Is Further Pre-Training on Stack Overflow Data Helpful in Building a Better Model?.

Further pre-training models on a domain-specific corpus has become common practice in the NLP domain; however, its effectiveness has not been verified for representing Stack Overflow posts. In this research question, we introduce SOBERT, which is obtained by continuing the pre-training process of CodeBERT on Stack Overflow data, and we investigate whether further pre-training with Stack Overflow data improves the performance.

4.2 Experimental Settings

4.2.1 Tag Recommendation.

Dataset.

The dataset used by He et al. [18] in the training of PTM4Tag only includes the Stack Overflow posts dated before September 5, 2018. To address this limitation, we use the Stack Overflow data dump released in August 2022 to construct a new dataset for our experiment. Ideally, a tag recommendation approach should only learn from high-quality questions. Therefore, we remove the low-quality questions when constructing the dataset. According to the classification criteria of question quality defined by Ponzanelli et al. [35], we first filter out the questions that do not have an accepted answer and further remove the questions with a score of less than 10. Additionally, we exclude the rare tags and rare posts. Previous literature in tag recommendation [18, 54] has defined a tag as rare if it occurs less than 50 times within the dataset, and a post is considered rare if all of its tags are rare tags. The usage of rare tags is discouraged since it implies the unawareness of the tag among developers. We follow the same definition as the previous literature and set the frequency threshold for rare tags as 50. In the end, the resultant dataset consists of 527,717 posts and 3,207 tags. We split the dataset into a training set, a validation set, and a test set according to the 8:1:1 ratio, which corresponds to 422,173, 52,772, and 52,772 posts, respectively.
During the training process, we only consider the question posts from Stack Overflow and ignore the answer posts. We check the post IDs of the question posts in our dataset to ensure that each post has a unique post ID. Each question post consists of two components: the title and the body. The code snippets within the body of a post are enclosed in the HTML tags <pre><code> and </code></pre>; we clean the redundant HTML tags using the regular expression <pre><code>([\s\S]*?)</code></pre>. After that, we concatenate the title and body together to form the final input data. Thus, our input data integrates the title, the NL text in the body, and the code snippets in the body. This input data is then fed into the tokenizer of the representation model to be tokenized.
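A minimal sketch of this pre-processing step is shown below, using the regular expression given above; the sample post is illustrative:

```python
# A minimal sketch of cleaning a Stack Overflow question: keep code snippet contents,
# strip residual HTML tags, and concatenate the title with the cleaned body.
import re

CODE_PATTERN = re.compile(r"<pre><code>([\s\S]*?)</code></pre>")

def clean_post(title: str, body: str) -> str:
    # Keep the content of code snippets but drop the surrounding <pre><code> markers.
    body = CODE_PATTERN.sub(lambda m: " " + m.group(1) + " ", body)
    # Remove any other residual HTML tags from the body text.
    body = re.sub(r"<[^>]+>", " ", body)
    return (title + " " + body).strip()

sample_body = "<p>It fails at runtime.</p><pre><code>int x = Integer.parseInt(s);</code></pre>"
print(clean_post("How do I parse an int in Java?", sample_body))
```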

Evaluation Metrics.

We report the performance for this task using Precision@k, Recall@k, and F1-score@k, where k indicates the top-k recommendations. Such metrics are extensively used in previous works [18, 26, 54, 59], and we calculate the average score for each of them. Mathematically speaking, the evaluation metrics are computed as follows.
\begin{equation*} Precision@k = \frac{ \vert \text{Tag}_{\text{True} } \cap \text{Tag}_{ \text{Predict} } \vert }{k} \end{equation*}
\begin{equation*} Recall@k = {\left\lbrace \begin{array}{ll} \frac{| \text{Tag}_{\text{True}} \cap \text{Tag}_{\text{Predict}} |}{k} & \text{if } | \text{Tag}_{\text{True}}| \gt k\\ \frac{| \text{Tag}_{\text{True}} \cap \text{Tag}_{\text{Predict}} | }{|\text{Tag}_{\text{True}}|} & \text{if } |\text{Tag}_{\text{True}}| \le k\\ \end{array}\right.} \end{equation*}
\begin{equation*} F1\text{-}score@k = 2 \times \frac{ Precision@k \times Recall@k }{ Precision@k + Recall@k } \end{equation*}
In the preceding formulas, \(\text{Tag}_{\text{True}}\) refers to the ground truth tags and \(\text{Tag}_{\text{Predict}}\) refers to the predicted tags. Notice that Recall@k is defined piecewise because the conventional Recall@k naturally disfavors small k. This revised Recall@k has been widely adopted in previous experiments on tag recommendation [18, 54, 59]. Since Stack Overflow posts cannot have more than five tags, we report the results by setting k to 1, 3, and 5.
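A minimal sketch of these metrics, including the revised Recall@k, is shown below; the example tags are illustrative:

```python
# A minimal sketch of Precision@k, the revised Recall@k, and F1-score@k defined above.
def precision_at_k(true_tags, predicted_tags, k):
    hits = len(set(true_tags) & set(predicted_tags[:k]))
    return hits / k

def recall_at_k(true_tags, predicted_tags, k):
    hits = len(set(true_tags) & set(predicted_tags[:k]))
    # The denominator switches to k when a post has more than k ground-truth tags.
    denominator = k if len(true_tags) > k else len(true_tags)
    return hits / denominator

def f1_at_k(true_tags, predicted_tags, k):
    p = precision_at_k(true_tags, predicted_tags, k)
    r = recall_at_k(true_tags, predicted_tags, k)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

# Example: a post tagged ["java", "string"] with five recommended tags.
print(f1_at_k(["java", "string"], ["java", "android", "string", "c#", "python"], k=5))
```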

Implementation Details.

For Longformer, we set the maximum accepted input sequence as 1,024, and for other transformer-based language models, the maximum input sequence is set as 512. This setting of the input sequence is kept the same for the other two tasks (API recommendation and relatedness prediction). For encoder-decoder and decoder-only representation models, we select the last decoder hidden state as the representation following previous literature [25, 50].
We set the learning rate as 5e-5, batch size as 512, and epoch number as 30, and use the Adam optimizer to update the parameters. We save the model at the end of each epoch and select the model with the smallest validation loss to run the evaluation.
4.2.2 API Recommendation.

Dataset.

We use the BIKER dataset crafted by Huang et al. [21], which is the same dataset used by Wei et al. [52]. The training set contains 33K questions with the corresponding relevant APIs from their accepted answers. The test set contains manually labeled Stack Overflow questions that look for an API to solve a programming problem, with the ground truth APIs labeled based on their accepted answers.
When creating the BIKER dataset, Huang et al. [21] first select question posts from Stack Overflow satisfying the following three criteria: (1) the question has a positive score, (2) at least one answer to the question contains API entities, and (3) the answer has a positive score. Huang et al. then manually inspected the collected questions and removed the questions that were not about searching APIs for programming tasks.
Huang et al. [21] aimed to create concise queries; thus, only the titles of the Stack Overflow posts are used as queries. The ground truth APIs were extracted from the code snippets in the accepted answers of the filtered posts. Again, the extracted APIs were manually checked to ensure their correctness. Eventually, the test dataset contains 413 questions along with their ground truth APIs after the manual labeling process. The titles of these questions are used as the queries for API searching.

Evaluation Metrics.

We use the same evaluation metrics as previous literature [20, 52] for the API recommendation task. The metrics are Mean Reciprocal Rank (MRR), Mean Average Precision (MAP), Precision@k, and Recall@k. Mathematically, MRR is defined as
\begin{equation*} MRR = \frac{1}{|Q|} \sum _{i=1}^{|Q|} \frac{1}{rank_i}, \end{equation*}
where \(Q\) refers to all queries and \(rank_i\) refers to the rank position of the first relevant API in the recommended API list for query \(q_i\). For a given query, MRR gives a score of \(\frac{1}{rank}\), where \(rank\) is the ranking of the first correct API in the recommended list. In other words, the score of MRR is inversely proportional to the rank of the first correct API. MAP is the mean of average precision scores (\(AveP\)) for each query. Whereas MRR gives the score based on the ranking of the first correct answer, MAP considers the ranks of all correct answers to measure the quality of the recommended list. Mathematically, MAP is defined as
\begin{equation*} MAP = \frac{1}{|Q|} \sum _{i=1}^{|Q|} AveP(i). \end{equation*}
\(AveP(i)\) itself is defined as
\begin{equation*} AveP (i) = \frac{1}{|K|} \sum _{k \in K} \frac{num(k)}{k}, \end{equation*}
where \(K\) is the set of ranking positions of the relevant APIs in the recommended list for query \(q_i\), and \(num(k)\) represents the number of relevant APIs in the top-k.
Different from tag recommendation, the Recall@k metric used in this task follows the conventional definition, which is
\begin{equation*} Recall@k = \frac{|\text{API}_{\text{True}} \cap \text{API}_{\text{Predict}}|}{|\text{API}_{\text{True}}|}. \end{equation*}
To be consistent with Wei et al. [52], we use \(k \in \lbrace 1, 3, 5\rbrace\).
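A minimal sketch of MRR and MAP as defined above is shown below; the example queries and API names are illustrative:

```python
# A minimal sketch of MRR and MAP computed over ranked API recommendation lists.
def reciprocal_rank(relevant, ranked):
    for position, api in enumerate(ranked, start=1):
        if api in relevant:
            return 1.0 / position
    return 0.0

def average_precision(relevant, ranked):
    hits, precisions = 0, []
    for position, api in enumerate(ranked, start=1):
        if api in relevant:
            hits += 1
            precisions.append(hits / position)  # num(k) / k at each relevant rank k
    return sum(precisions) / len(precisions) if precisions else 0.0

queries = [
    ({"java.lang.Integer.parseInt"}, ["java.lang.Integer.parseInt", "java.lang.String.valueOf"]),
    ({"java.util.Arrays.sort"}, ["java.util.Collections.sort", "java.util.Arrays.sort"]),
]
mrr = sum(reciprocal_rank(rel, ranked) for rel, ranked in queries) / len(queries)
map_score = sum(average_precision(rel, ranked) for rel, ranked in queries) / len(queries)
print(mrr, map_score)
```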

Implementation Details.

CLEAR shows SOTA performance in the API recommendation task by leveraging BERT sentence embedding and contrastive learning. The original architecture of CLEAR is implemented based on DistilRoBERTa4 during the training process. In this study, we also explore the effectiveness of other representation methods by replacing the embedding of DistilRoBERTa in CLEAR. For Post2Vec, we concatenate the post representation from Post2Vec to the original implementation of CLEAR.
For this task, we set the batch size as 256 and the epoch number as 30. As described in Section 4.2.1, we select the model with the smallest validation loss to run on the test set.
4.2.3 Relatedness Prediction.

Dataset.

The experiments are conducted based on the KU dataset provided by Shirani et al. [3]. This dataset5 contains 34,737 pairs of KUs. To ensure a fair comparison with the prior work [33], we use the same data for training, validation, and testing, containing 208,423, 34,737, and 104,211 pairs of KU, respectively. Our input data is the concatenation of two KUs. Specifically, each KU contains one question post and three corresponding answer posts. For the question post, we include the title, body, and code snippets. For the answer post, we consider the body and code.

Evaluation Metrics.

Following prior work [33], we adopt the micro-averaging method to calculate Micro-precision, Micro-recall, and Micro-F1 as evaluation metrics.

Implementation Details.

We concatenate a pair of posts as the input to train a multi-class classifier. We fine-tuned Longformer on a sequence length of 1,024 and fine-tuned other pre-trained models on a sequence length of 512. For all experiments, we set the batch size as 32 and the epoch number as 5. We select the model with the smallest validation loss to run the evaluation.
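A minimal sketch of this fine-tuning setup is shown below: a pre-trained encoder with a four-class classification head over a concatenated KU pair, assuming the HuggingFace transformers library and the microsoft/codebert-base checkpoint; the KU texts and label are illustrative, and the actual training loop and hyper-parameters follow the settings described above:

```python
# A minimal sketch of fine-tuning an encoder as a four-class relatedness classifier.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["Duplicate", "Direct", "Indirect", "Isolated"]

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained("microsoft/codebert-base",
                                                           num_labels=len(LABELS))

ku_i = "How do I compare strings in Java? ... (question body, answers, and code)"
ku_j = "Strings in Java : equals vs == ... (question body, answers, and code)"

# Encode the KU pair as a single sequence, truncated to the 512-token limit.
inputs = tokenizer(ku_i, ku_j, truncation=True, max_length=512, return_tensors="pt")
labels = torch.tensor([LABELS.index("Duplicate")])

outputs = model(**inputs, labels=labels)
outputs.loss.backward()  # one optimization step would follow in the actual training loop
print(LABELS[outputs.logits.argmax(-1).item()])
```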

5 Experimental Results

This section describes the experiment results and answers our research questions. The experimental results are summarized in Tables 2, 3, and 4, respectively.
Table 2. Experiment Results for the Tag Recommendation Task

Group | Representation | P@1 | R@1 | F1@1 | P@3 | R@3 | F1@3 | P@5 | R@5 | F1@5
SOTA | PTM4Tag | 0.875 | 0.875 | 0.875 | 0.586 | 0.756 | 0.641 | 0.417 | 0.805 | 0.526
Stack Overflow-specific | Post2Vec | 0.875 | 0.875 | 0.875 | 0.585 | 0.754 | 0.639 | 0.416 | 0.804 | 0.525
Stack Overflow-specific | BERTOverflow | 0.088 | 0.088 | 0.088 | 0.089 | 0.094 | 0.095 | 0.083 | 0.163 | 0.105
General domain | RoBERTa | 0.878 | 0.878 | 0.878 | 0.591 | 0.761 | 0.646 | 0.418 | 0.804 | 0.527
General domain | Longformer | 0.852 | 0.852 | 0.852 | 0.559 | 0.721 | 0.612 | 0.397 | 0.769 | 0.502
General domain | GPT2 | 0.884 | 0.884 | 0.884 | 0.593 | 0.763 | 0.648 | 0.418 | 0.805 | 0.528
SE domain | CodeBERT | 0.876 | 0.876 | 0.876 | 0.588 | 0.758 | 0.642 | 0.418 | 0.805 | 0.527
SE domain | GraphCodeBERT | 0.874 | 0.875 | 0.875 | 0.582 | 0.751 | 0.636 | 0.410 | 0.791 | 0.517
SE domain | seBERT | 0.088 | 0.088 | 0.088 | 0.089 | 0.094 | 0.095 | 0.083 | 0.163 | 0.105
SE domain | CodeT5 | 0.887 | 0.887 | 0.887 | 0.599 | 0.770 | 0.653 | 0.420 | 0.809 | 0.530
SE domain | PLBart | 0.883 | 0.883 | 0.883 | 0.600 | 0.773 | 0.656 | 0.422 | 0.811 | 0.532
SE domain | CodeGen | 0.872 | 0.872 | 0.872 | 0.584 | 0.751 | 0.638 | 0.411 | 0.792 | 0.519
Our model | SOBERT | 0.905 | 0.905 | 0.905 | 0.615 | 0.790 | 0.671 | 0.437 (+3.4%) | 0.836 (+3.0%) | 0.551 (+3.4%)
Table 3. Experimental Results for the API Recommendation Task

Group | Representation | MRR | MAP | P@1 | P@3 | P@5 | R@1 | R@3 | R@5
SOTA | CLEAR | 0.739 | 0.753 | 0.482 | 0.560 | 0.562 | 0.629 | 0.766 | 0.793
Stack Overflow-specific | Post2Vec | 0.735 | 0.745 | 0.471 | 0.560 | 0.556 | 0.625 | 0.774 | 0.801
Stack Overflow-specific | BERTOverflow | 0.753 | 0.778 | 0.521 | 0.639 | 0.651 | 0.681 | 0.774 | 0.762
General domain | RoBERTa | 0.777 | 0.790 | 0.537 | 0.640 | 0.653 | 0.689 | 0.782 | 0.815
General domain | Longformer | 0.767 | 0.782 | 0.525 | 0.623 | 0.646 | 0.683 | 0.772 | 0.793
General domain | GPT2 | 0.766 | 0.782 | 0.528 | 0.641 | 0.650 | 0.683 | 0.772 | 0.795
SE domain | CodeBERT | 0.781 | 0.800 | 0.564 | 0.641 | 0.659 | 0.712 | 0.772 | 0.793
SE domain | GraphCodeBERT | 0.784 | 0.804 | 0.537 | 0.652 | 0.663 | 0.693 | 0.803 | 0.829
SE domain | seBERT | 0.754 | 0.777 | 0.525 | 0.624 | 0.635 | 0.678 | 0.749 | 0.772
SE domain | CodeT5 | 0.779 | 0.796 | 0.544 | 0.643 | 0.651 | 0.693 | 0.786 | 0.809
SE domain | PLBart | 0.762 | 0.782 | 0.521 | 0.619 | 0.633 | 0.679 | 0.768 | 0.795
SE domain | CodeGen | 0.721 | 0.735 | 0.556 | 0.627 | 0.636 | 0.660 | 0.705 | 0.718
Our model | SOBERT | 0.807 (+2.9%) | 0.826 (+2.7%) | 0.579 | 0.678 | 0.684 | 0.732 | 0.801 | 0.832
Table 4. Experiment Results for the Relatedness Prediction Task

Group | Representation | F1-Score | Precision | Recall
SOTA | ASIM | 0.785 | 0.785 | 0.785
Stack Overflow-specific | Post2Vec | 0.768 | 0.768 | 0.768
Stack Overflow-specific | BERTOverflow | 0.697 | 0.697 | 0.697
General domain | RoBERTa | 0.787 | 0.787 | 0.787
General domain | Longformer | 0.786 | 0.786 | 0.786
General domain | GPT2 | 0.765 | 0.765 | 0.765
SE domain | CodeBERT | 0.803 | 0.803 | 0.803
SE domain | GraphCodeBERT | 0.801 | 0.801 | 0.801
SE domain | seBERT | 0.799 | 0.799 | 0.799
SE domain | CodeT5 | 0.784 | 0.784 | 0.784
SE domain | PLBart | 0.770 | 0.770 | 0.770
SE domain | CodeGen | 0.765 | 0.765 | 0.765
Our model | SOBERT | 0.825 (+2.7%) | 0.825 (+2.7%) | 0.825 (+2.7%)

RQ1: How Effective Are the Existing Stack Overflow Post Representation Models?

The experimental results of our tag recommendation experiments are summarized in Table 2. PTM4Tag achieves a performance of 0.417, 0.805, and 0.526 in terms of Precision@5, Recall@5, and F1-score@5, respectively. However, the extra inclusion of Post2Vec lowers the scores to 0.416, 0.804, and 0.525, respectively. In contrast, BERTOverflow struggles in the task with surprisingly low scores of 0.083, 0.163, and 0.105.
For API recommendation, as shown in Table 3, combining Post2Vec with the SOTA approach CLEAR also fails to boost the performance. CLEAR itself obtains an MRR score of 0.739 and MAP score of 0.753. Yet, with the integration of Post2Vec, these values diminish slightly to 0.735 and 0.745, respectively. Notably, BERTOverflow achieves scores of 0.753 in MRR and 0.778 in MAP.
In the relatedness prediction task, as detailed in Table 4, integrating Post2Vec with ASIM leads to a minor decrease in the F1-score, moving from 0.785 to 0.768. BERTOverflow, however, lags behind ASIM with an F1-score of 0.697.
Overall, Post2Vec cannot enhance the performance of the SOTA solutions across the evaluated downstream tasks. Furthermore, BERTOverflow demonstrates poor results in classification tasks and only achieves comparable performance with the SOTA solution in API recommendation.
Answer to RQ1: The existing Stack Overflow representation methods fail to improve SOTA performance in the three evaluated downstream tasks.

RQ2: How Effective Are the Popular Transformer-Based Language Models for the Targeted Downstream Tasks?

In the tag recommendation task, as demonstrated in Table 2, the SOTA approach PTM4Tag is outperformed by numerous transformer-based PTMs. Whereas the F1-score@5 of PTM4Tag is 0.526, PLBart achieves an F1-score@5 of 0.532, which makes it the best transformer-based pre-trained model. In contrast, seBERT significantly underperformed in this task, with an F1-score@5 of only 0.105.
Table 3 shows that CLEAR is no longer the best-performing method in API recommendation. Replacing the embedding of Distilled RoBERTa in the original design of CLEAR with other transformer-based language models increases the performance. Particularly, GraphCodeBERT boosts the performance of CLEAR by 3.8% and 5.0% in terms of MRR and MAP. For Precision@1,3,5 and Recall@1,3,5, GraphCodeBERT outperforms CLEAR by 6.7% to 22.0%. The worst representation model is CodeGen, which achieves MRR and MAP of 0.721 and 0.735. Meanwhile, both CodeBERT and GraphCodeBERT can surpass an MRR of 0.78.
The results of the relatedness prediction task are presented in Table 4. We observe that ASIM, the SOTA technique in relatedness prediction, is outperformed by other transformer-based language models. Whereas ASIM achieves a score of 0.785 in the F1-score, CodeBERT drives forward the SOTA performance by 2.3% with an F1-score of 0.803. RoBERTa, GraphCodeBERT, Longformer, and seBERT have an F1-score of 0.787, 0.801, 0.786, and 0.799, all outperforming ASIM.
Overall, models like CodeBERT, RoBERTa, and GraphCodeBERT can consistently give promising representations in all three tasks, proving their generalizability and effectiveness in a wide range of SE-related tasks.
Answer to RQ2: Representations generated by CodeBERT, RoBERTa, and GraphCodeBERT consistently outperform each SOTA technique from the targeted downstream tasks. However, none of the models can always be the best performer.

RQ3: Is Further Pre-Training on Stack Overflow Data Helpful in Building a Better Model?

Our experimental results show that there is no “one-size-fits-all” model for representing Stack Overflow posts that consistently outperforms the others in the considered tasks. This phenomenon suggests that there is room for improvement in representation techniques for Stack Overflow posts. Based on the common practice that a second phase of in-domain pre-training leads to performance gains [17], we conduct additional pre-training of a transformer-based model (i.e., CodeBERT) with the Stack Overflow dataset. We name the resulting model SOBERT.
Pre-Training Details. We leverage the Stack Overflow dump dated June 2023, which contains 23 million question posts, as the training corpus. The raw dataset has a size of approximately 70 GB. We also make sure that there is no overlap between the test/validation datasets of the three downstream tasks and the pre-training data of SOBERT. Many previous works have removed the code snippets of a Stack Overflow post during the pre-processing stage [26, 59]. However, according to the statistics reported by Xu et al. [54], more than 70% of Stack Overflow posts contain at least one code snippet. As a result, removing code snippets would lose a significant amount of information, so they should be considered to learn an effective post representation. As the code snippets within the body of a post are enclosed in the HTML tags <pre><code> and </code></pre>, we clean the redundant HTML tags with the regular expression <pre><code>([\s\S]*?)</code></pre>. As a result, the pre-training data of SOBERT contains the title, body, and code snippets of each Stack Overflow question post. We initialize SOBERT from the checkpoint of the CodeBERT model and pre-train SOBERT using the MLM objective with a standard masking rate of 15%. The batch size is set as 256, and the learning rate is 1e-4. The training process takes around 100 hours on eight NVIDIA V100 GPUs with 16 GB of memory to complete. The detailed code is included in the provided replication package.
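A minimal sketch of this continued pre-training recipe is shown below, assuming the HuggingFace transformers and datasets libraries; the corpus path, per-device batch size, and epoch count are placeholders, whereas the checkpoint, masking rate, and learning rate follow the settings reported above:

```python
# A minimal sketch of continued MLM pre-training from the CodeBERT checkpoint on a cleaned
# Stack Overflow corpus (one post per line in a plain-text file; the path is a placeholder).
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForMaskedLM.from_pretrained("microsoft/codebert-base")

corpus = load_dataset("text", data_files={"train": "stackoverflow_posts.txt"})["train"]
corpus = corpus.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512),
                    batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# The effective batch size across devices should total 256 as reported above;
# the per-device value and epoch count here are placeholders.
args = TrainingArguments(output_dir="sobert", per_device_train_batch_size=32,
                         learning_rate=1e-4, num_train_epochs=1, fp16=True)

Trainer(model=model, args=args, train_dataset=corpus, data_collator=collator).train()
```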
The experimental results show that SOBERT achieves the best performance for every downstream task. For tag recommendation, SOBERT achieves an F1-score@5 of 0.551 and beats PLBart by 3.4%; for API recommendation, SOBERT achieves an MRR of 0.807 and outperforms GraphCodeBERT by 2.9%; and for relatedness prediction, it accomplishes an F1-score of 0.825 and outperforms CodeBERT by 2.7%.
We conduct the Wilcoxon signed-rank test [12] at a 95% significance level (i.e., p-value \(\lt\) 0.05) and calculate Cliff’s delta [11] on the paired data corresponding to SOBERT and the best-performing competing representation model in each task (i.e., PLBart in tag recommendation, CodeBERT in relatedness prediction, and GraphCodeBERT in API recommendation). The significance test is conducted on the values of the evaluation metrics (F1-score@5 in tag recommendation, F1-score in relatedness prediction, and MRR in API recommendation). For Cliff’s delta, we consider a delta of less than 0.147, between 0.147 and 0.33, between 0.33 and 0.474, and above 0.474 as a Negligible (N), Small (S), Medium (M), and Large (L) effect size, respectively, following previous literature [11]. We observe that SOBERT significantly (p-value \(\lt\) 0.05) and substantially (Cliff’s delta is 0.31–0.55) outperforms the competing models.
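A minimal sketch of this significance analysis is shown below; the score arrays are random placeholders standing in for the real per-sample metric values:

```python
# A minimal sketch of the paired Wilcoxon signed-rank test plus Cliff's delta.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
sobert_scores = rng.uniform(0.5, 1.0, size=100)                 # e.g., per-sample F1-score@5
baseline_scores = sobert_scores - rng.uniform(0.0, 0.1, size=100)

statistic, p_value = wilcoxon(sobert_scores, baseline_scores)   # paired, non-parametric test

def cliffs_delta(xs, ys):
    # Proportion of pairs where x > y minus proportion where x < y.
    greater = sum(x > y for x in xs for y in ys)
    less = sum(x < y for x in xs for y in ys)
    return (greater - less) / (len(xs) * len(ys))

delta = cliffs_delta(sobert_scores, baseline_scores)
print(f"p-value={p_value:.4f}, Cliff's delta={delta:.2f}")
```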
Answer to RQ3: Further pre-training with the Stack Overflow data yields better representation in modelling Stack Overflow posts. SOBERT consistently achieves SOTA performance in all targeted downstream tasks.

6 Discussion

6.1 Lessons Learned

Lesson #1: Incorporating post embeddings from an external approach does not boost the performance of neural network models. Xu et al. [54] demonstrated that appending the distributed post representation learned by Post2Vec to a manually crafted feature vector can increase the performance of traditional machine learning algorithms, such as SVM [55] and Random Forest [5], in a set of Stack Overflow-related tasks. However, these benefits are not observed for the SOTA techniques that are based on deep neural networks. This is potentially because neural networks automatically extract feature vectors and continuously optimize their representations during training; the benefit of external embeddings may thus be diluted as the network optimizes the parameters of its own feature extractor.
Lesson #2: Models with broader background knowledge derive better results than those with specific knowledge.
Intuitively, BERTOverflow is expected to produce the desired Stack Overflow post representation, as it is specifically designed for Stack Overflow data. A major difference between BERTOverflow and others is the vocabulary. As BERTOverflow is pre-trained from scratch with the Stack Overflow data, its vocabulary should be more suitable for modeling Stack Overflow posts than general domain models.
Surprisingly, our experiment results show that other transformer-based language models outperform BERTOverflow by a substantial margin across all three tasks, and BERTOverflow gives extremely poor performance in the tag recommendation task. By inspecting the prediction results of BERTOverflow in the tag recommendation task, we notice that its top-5 predictions are always the most frequent tags (‘python’, ‘java’, ‘c#’, ‘java-script’, and ‘android’) in the dataset. We observe that seBERT performs similarly to BERTOverflow in the tag recommendation task.
We hypothesize that the poor performance of BERTOverflow and seBERT is because these models lack a sufficient amount of pre-training. Since seBERT and BERTOverflow are trained from scratch, they require much more pre-training effort than models that continue pre-training from existing checkpoints. To verify this hypothesis, we perform additional pre-training on BERTOverflow with the same dataset as SOBERT. The further pre-training uses the same hyper-parameters as SOBERT, and it took 23 hours on four 16-GB NVIDIA V100 GPUs to complete. We denote this new model as BERTOverflow \(_{\text{NEW}}\).6
To independently evaluate the effect of domain-specific vocabulary, we train two additional transformer-based models from scratch. Both models contain 12 layers of encoder modules, but one uses the vocabulary of BERTOverflow and the other the vocabulary of RoBERTa. The pre-training process is the same as for BERTOverflow \(_{\text{NEW}}\). We refer to these two models as BERTOverflow \(_{\text{vocab}}\) and RoBERTa \(_{\text{vocab}}\). From Table 8 (presented later), we observe significant performance improvements of BERTOverflow \(_{\text{NEW}}\) compared to BERTOverflow. We also observe that the performance of BERTOverflow \(_{\text{vocab}}\) and RoBERTa \(_{\text{vocab}}\) is quite similar. Overall, our experiments show that domain-specific vocabulary has a negligible effect on all three tasks; the more important factor tends to be the amount of pre-training. Moreover, pre-training from scratch is commonly considered an expensive process, and initializing new representation models from the checkpoint of a popular model reduces the risk. The tag recommendation task is a good indicator of the generalizability and the sufficiency of pre-training of transformer-based representation models.
For a more comprehensive insight, we analyze the lengths of the tokenized text produced by different representation models on the datasets used in this article. Tables 5 through 7 show the statistics for the lengths produced by different tokenizers. Given that representation models like CodeBERT, GraphCodeBERT, Longformer, GPT2, and CodeGen share the same tokenizer as RoBERTa, we omit them from this analysis. We notice that PLBart’s tokenizer generates the shortest tokenizations across all datasets. Interestingly, even though BERTOverflow and seBERT were developed with SE-specific vocabularies, the lengths of their tokenized text are almost the same as RoBERTa’s. For example, BERTOverflow’s average tokenization length in the tag recommendation task is 250, whereas RoBERTa’s is slightly higher at 254.7.
Lesson #3: Despite considering a longer input length, Longformer does not produce better representations for posts.
Conventional transformer-based models like CodeBERT and RoBERTa cannot handle long sequences due to the quadratic complexity of the self-attention mechanism [48] and accept a maximum of 512 tokens as the input. From Tables 5 through 7, we can observe that the ratios of data that are longer than 512 tokens are approximately 9%, 0%, and 94% in tag recommendation, API recommendation, and relatedness prediction, respectively.
In Table 6, we can see that the dataset of API recommendation has a short length, where the longest tokenized text has a length of 53. This is because only the title of a post is considered in this task. As Longformer is implemented with a simplified attention mechanism (introduced in Section 2), which only gives its advantage in handling long text, this explains why CodeBERT and RoBERTa outperform Longformer in API recommendation.
From Tables 5 and 7, we can see that both datasets contain data samples that are longer than the 512-token limit. Especially in relatedness prediction (see Table 7), the average length of each KU is more than 1,800 tokens. The lengthy text is because each KU consists of a question post and a set of corresponding answer posts. Surprisingly, Longformer fails to perform better than the other model from the general domain (i.e., RoBERTa), as well as models from the SE domain, even though it takes a much longer input in this task.
Table 5. Comparison of the Results from Different Tokenizers on the Dataset of Tag Recommendation

Model | Mean | Min | 25% | 50% | 75% | Max | Longer than 512
CodeT5 | 253.3 | 8 | 93 | 158 | 278 | 27449 | 9.3%
RoBERTa | 254.7 | 9 | 92 | 157 | 279 | 27437 | 9.4%
PLBart | 239 | 8 | 90 | 152 | 266 | 19467 | 8.4%
BERTOverflow | 250 | 8 | 90 | 156 | 278 | 27596 | 9.3%
seBERT | 255.4 | 8 | 92 | 158 | 283 | 27600 | 9.7%
Table 6. Comparison of the Results from Different Tokenizers on the Dataset of API Recommendation

Model | Mean | Min | 25% | 50% | 75% | Max
CodeT5 | 11.0 | 4 | 8 | 10 | 14 | 53
RoBERTa | 11.3 | 4 | 8 | 11 | 14 | 49
PLBart | 10.5 | 3 | 7 | 10 | 13 | 51
BERTOverflow | 10.3 | 4 | 7 | 10 | 13 | 48
seBERT | 10.6 | 4 | 7 | 10 | 13 | 48
Table 7. Comparison of the Results from Different Tokenizers on the Dataset of Relatedness Prediction

Model | Mean | Min | 25% | 50% | 75% | Max | Longer than 512
CodeT5 | 1826.6 | 102 | 919 | 1407 | 2213 | 33989 | 94.6%
RoBERTa | 1897.7 | 99 | 944 | 1454 | 2301 | 34540 | 94.8%
PLBart | 1717.9 | 96 | 877 | 1340 | 2090 | 32995 | 93.7%
BERTOverflow | 1862.7 | 98 | 937 | 1444 | 2264 | 41346 | 94.7%
seBERT | 1917.4 | 98 | 961 | 1485 | 2331 | 41750 | 95.0%
We further compare the performance of Longformer when varying the input size, considering the first 512 and the first 1,024 tokens of each input. The additional experimental results are shown in Table 8 (a minimal sketch of the truncation setup follows the table). The two settings do not differ in performance, which indicates that enlarging the input size does not affect Longformer's performance on post representation. A potential interpretation is that the important features for representing Stack Overflow posts lie in the first part of each post (e.g., the title serves as a succinct summary of the post). Hence, it is not worth using Longformer unless one strictly needs the entire content of Stack Overflow posts.
| Model | Tag Recommendation | | | API Recommendation | | Relatedness Prediction | | |
|---|---|---|---|---|---|---|---|---|
| | P@5 | R@5 | F1@5 | MRR | MAP | P | R | F1 |
| BERTOverflow | 0.083 | 0.163 | 0.105 | 0.753 | 0.778 | 0.697 | 0.697 | 0.697 |
| BERTOverflow \(_{\text{NEW}}\) | 0.411 | 0.791 | 0.519 | 0.779 | 0.793 | 0.789 | 0.789 | 0.789 |
| BERTOverflow \(_{\text{vocab}}\) | 0.411 | 0.790 | 0.518 | 0.771 | 0.785 | 0.788 | 0.788 | 0.788 |
| RoBERTa \(_{\text{vocab}}\) | 0.412 | 0.794 | 0.520 | 0.778 | 0.792 | 0.793 | 0.793 | 0.793 |
| Longformer-512 | 0.397 | 0.768 | 0.502 | 0.768 | 0.783 | 0.785 | 0.785 | 0.785 |
| Longformer-1024 | 0.397 | 0.769 | 0.502 | 0.767 | 0.782 | 0.786 | 0.786 | 0.786 |

Table 8. Results for Variants of BERTOverflow, RoBERTa, and Longformer
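The truncation setup used for the two Longformer variants in Table 8 can be sketched as follows. This is a minimal sketch, not the exact experimental script: the checkpoint identifier and the `post` string are placeholders, and the embedding of the start token is used as the post representation for illustration.

```python
# Minimal sketch: embedding a post with Longformer under two input-size limits.
from transformers import AutoTokenizer, AutoModel

checkpoint = "allenai/longformer-base-4096"      # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

post = "Title of the question followed by its body and answers ..."  # placeholder
for max_len in (512, 1024):
    inputs = tokenizer(post, truncation=True, max_length=max_len,
                       return_tensors="pt")
    outputs = model(**inputs)
    post_embedding = outputs.last_hidden_state[:, 0, :]  # <s> token embedding
```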
Lesson #4: We advocate that future studies related to Stack Overflow consider SOBERT as the underlying baseline.
Our experimental results demonstrate that further pre-training on in-domain data leads to better Stack Overflow post representations. By initializing SOBERT with the CodeBERT checkpoint and performing further pre-training on Stack Overflow data, we observe that SOBERT consistently outperforms the original CodeBERT and achieves new SOTA performance on all three tasks.
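The continued pre-training strategy can be sketched as follows. This is a minimal sketch rather than the exact SOBERT training script: the Stack Overflow corpus (`so_posts`), the training hyper-parameters, and the output directory are placeholders, and a masked language modeling objective is assumed; only the starting checkpoint (microsoft/codebert-base) and the use of Stack Overflow text reflect the setup described above.

```python
# Minimal sketch of continued (further) pre-training on Stack Overflow text,
# starting from the CodeBERT checkpoint and assuming an MLM objective.
from datasets import Dataset
from transformers import (AutoTokenizer, RobertaForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base")

so_posts = ["How do I fix a NullPointerException in Java? ..."]  # placeholder corpus
dataset = Dataset.from_dict({"text": so_posts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sobert-checkpoint", num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```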
In Table 9, we present three examples of the predictions made by CodeBERT and SOBERT for the tag recommendation task. We observe that CodeBERT makes wrong predictions like ".net" and "c#" when the question is about "haskell," whereas SOBERT makes the correct predictions. CodeBERT may lack knowledge of PLs like Haskell and Lua since it is pre-trained on artifacts from Python, Java, JavaScript, PHP, Ruby, and Go. Taking the Stack Overflow post with ID 13202867 as another example, the question is about Flexslider, a jQuery slider plugin. In this example, SOBERT successfully makes connections to tags like "jquery" and "css," whereas CodeBERT struggles to give meaningful predictions.
| Post ID | Post Title | CodeBERT Tag Prediction | SOBERT Tag Prediction | True Tag |
|---|---|---|---|---|
| 13202867 | Fixed size of Flexslider | apache-flex, frameworks, ios, swift, xcode | css, html, image, javascript, jquery | css, html, javascript |
| 30434343 | What is the right way to typecheck dependent lambda abstraction using 'bound'? | .net, binding, c#, lambda, type-inference | functional-programming, haskell, lambda, type-inference, types | haskell |
| 17849870 | Closed type classes | .net, c++, d, f#, performance | applicative, ghc, haskell, typeclass, types | haskell, static-analysis, typeclass, types |

Table 9. Examples of Predictions Made by CodeBERT and SOBERT in the Tag Recommendation Task
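For readers who want to reproduce examples such as those in Table 9, the following is a minimal sketch (the fine-tuned checkpoint path and the tag vocabulary are hypothetical placeholders, not artifacts of this study) of how top-5 tags can be obtained from a representation model fine-tuned with a multi-label classification head.

```python
# Minimal sketch: obtaining top-5 tag predictions from a fine-tuned
# multi-label classification head (checkpoint and tag list are placeholders).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "path/to/fine-tuned-tag-model"                # hypothetical checkpoint
tags = ["css", "html", "javascript", "jquery", "haskell"]  # hypothetical label set

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=len(tags), problem_type="multi_label_classification")

inputs = tokenizer("Fixed size of Flexslider", return_tensors="pt")
with torch.no_grad():
    probs = torch.sigmoid(model(**inputs).logits)[0]   # one score per tag
top5 = [tags[i] for i in probs.topk(min(5, len(tags))).indices]
print(top5)
```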
Overall, by continuing the pre-training process on Stack Overflow data, SOBERT outperforms CodeBERT in three popular Stack Overflow-related tasks. We advocate that future studies consider SOBERT as their underlying baseline. To facilitate the usage of SOBERT, we plan to release it on HuggingFace (see footnote 7) so that it can be used by simply calling the library interface.

6.2 Threats to Validity

Threats to Internal Validity. To ensure the correct implementation of the baseline methods (i.e., Post2Vec, PTM4Tag, CLEAR, and ASIM), we reused the replication packages released by the original authors (see footnotes 8–11). When investigating the effectiveness of the various pre-trained models, we used the implementation of each model from the popular open source community HuggingFace. Another threat to internal validity is the hyper-parameter settings used to pre-train SOBERT and fine-tune the representation models. To mitigate this threat, the hyper-parameters used in both the pre-training and fine-tuning phases follow the recommended or optimal settings reported in prior reputable literature [14, 28, 52].
Threats to External Validity. One threat to external validity is that our results may not generalize to newly emerging topics or to other Stack Overflow-related downstream tasks. We have minimized this threat by considering multiple downstream tasks.
Threats to Construct Validity. We reuse the same evaluation metrics as our baseline methods [18, 33, 52]. To further reduce the risk, we conduct the Wilcoxon signed-rank statistical hypothesis test and compute Cliff's delta to check whether the difference between two competing approaches is statistically significant and substantial.
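As an illustration (the score lists below are placeholders, not results from this study), the two checks can be sketched as follows: the Wilcoxon signed-rank test via SciPy and a straightforward implementation of Cliff's delta.

```python
# Minimal sketch: Wilcoxon signed-rank test and Cliff's delta on paired scores.
from scipy.stats import wilcoxon

def cliffs_delta(xs, ys):
    """Fraction of (x, y) pairs with x > y minus fraction with x < y."""
    greater = sum(x > y for x in xs for y in ys)
    smaller = sum(x < y for x in xs for y in ys)
    return (greater - smaller) / (len(xs) * len(ys))

scores_a = [0.79, 0.81, 0.80, 0.78, 0.82]  # placeholder scores of approach A
scores_b = [0.75, 0.76, 0.77, 0.74, 0.78]  # placeholder scores of approach B

statistic, p_value = wilcoxon(scores_a, scores_b)
print(f"p-value={p_value:.4f}, Cliff's delta={cliffs_delta(scores_a, scores_b):.2f}")
```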

7 Related Work

In this section, we review two lines of research that most relate to our work: pre-trained models for SE and mining Stack Overflow posts.

7.1 Pre-Trained Models for SE

Inspired by the success of transformer-based pre-trained models in NLP, there is an increasing research interest in exploring pre-training tasks and applying pre-trained models for SE [9, 23, 27, 29, 30, 31, 47, 50, 58].
One set of research focuses on learning semantic and contextual representations of source code; after pre-training, these models can be fine-tuned to solve SE downstream tasks. ContraCode [23] is an encoder-only model that uses a contrastive pre-training task to learn code functionality. It treats JavaScript programs as positive pairs (i.e., functionally similar) and negative pairs (i.e., functionally dissimilar): during contrastive pre-training, query programs are used to retrieve positive programs, and the representations of positive pairs are pulled together, whereas those of negative pairs are pushed apart. CodeGPT [29] is a pure decoder model trained on PLs; it leverages the Python and Java corpora from the CodeSearchNet dataset [22].
Another set of research focuses on leveraging transformer-based models to automate SE tasks [9, 27, 30, 31, 47, 58]. Zhang et al. [58] conduct a comparative study of transformer-based pre-trained models against prior SE-specific tools for sentiment analysis in SE; the experimental results show that transformer-based pre-trained models are more ready for real use than the prior tools. Lin et al. [27] find that BERT can boost the performance of traceability tasks in open source projects. They investigate three BERT architectures: Single-BERT, Siamese-BERT, and Twin-BERT. The results indicate that Single-BERT generates the most accurate links, whereas the Siamese-BERT architecture produces comparable effectiveness with significantly better efficiency. Ciniselli et al. [9] investigate the potential of transformer-based models in code completion, ranging from single-token prediction to the prediction of entire code blocks. Experiments are conducted on several variants of two popular transformer-based models, namely RoBERTa and T5, and the results demonstrate that T5 is the most effective in supporting code completion: T5 variants achieve prediction accuracies of up to 29% for whole-block prediction and 69% for token-level prediction. Mastropaolo et al. [31] further explore the effectiveness of T5 for bug fixing, injecting code mutants, generating assert statements, and code summarization. The study finds that T5 can outperform the previous SOTA deep learning-based approaches for those tasks. Tufano et al. [47] perform an empirical evaluation of the T5 model in automating the code review process. The experiments are performed on a much larger and more realistic code review dataset, and the T5-based model outperforms previous deep learning models in this task. Mastropaolo et al. [30] present LANCE, a log statement recommendation system built on the transformer architecture. LANCE determines the correct position for a log statement 65.9% of the time, selects the correct log level 66.2% of the time, and generates a fully correct log statement with a relevant message in 15.2% of instances.
Different from these works, we focus on a comprehensive set of Stack Overflow-related tasks in this article. In addition to fine-tuning the transformer-based PTMs, we also further pre-trained SOBERT on Stack Overflow data.

7.2 Mining Stack Overflow Posts

We address tag recommendation [18, 54], API recommendation [6, 52], and relatedness prediction [33, 55] in this work. Other researchers have explored further tasks for mining Stack Overflow posts to support software developers, such as post recommendation [41], multi-answer summarization [56], and the analysis of controversial discussions [40].
Rubei et al. [41] propose PostFinder, an approach that retrieves Stack Overflow posts relevant to the API function calls being invoked. They make use of Apache Lucene to index the textual content and code in Stack Overflow to improve efficiency. In both the data collection and the query phase, they make use of the data available at hand to optimize the search process. Specifically, they retrieve and augment posts with additional data to make them more exposed to queries, and they leverage the surrounding context code to construct a query that contains the essential information needed to match the stored indexes.
Xu et al. [56] investigate the task of summarizing multiple answer posts for a given input question, which aims to help developers grasp the key points of several answer posts before diving into their details. They propose an approach named AnswerBot, which contains three main steps: relevant question retrieval, useful answer paragraph selection, and diverse answer summary generation.
Ren et al. [40] investigate controversial discussions in Stack Overflow. They find that controversies are widespread in Stack Overflow, which indicates that many answers are wrong, less than optimal, or out-of-date. Their work and ours are complementary, as both aim to boost automation in understanding and utilizing Stack Overflow content.

8 Conclusion and Future Work

In this article, we empirically studied the effectiveness of various techniques for modeling Stack Overflow posts, including approaches specially designed for Stack Overflow posts (i.e., Post2Vec and BERTOverflow), SE domain representation models (i.e., CodeBERT, GraphCodeBERT, seBERT, CodeT5, PLBart, and CodeGen), and general domain representation models (i.e., RoBERTa, Longformer, and GPT2). We evaluated these representation models on three popular and representative Stack Overflow-related tasks, namely tag recommendation, API recommendation, and relatedness prediction.
Our experimental results showed that Post2Vec is unable to enhance the representations automatically extracted by deep learning-based methods and that BERTOverflow performs surprisingly worse than other transformer-based language models. Furthermore, no single representation technique consistently outperforms all the others. These findings reveal a research gap in representing Stack Overflow posts. Thus, we proposed SOBERT, which follows a simple yet effective strategy: continuing the pre-training process on Stack Overflow posts. As a result, SOBERT improves upon the original CodeBERT and consistently outperforms the other models on all three tasks, confirming that further pre-training on Stack Overflow data helps build better Stack Overflow post representations.
In the future, we plan to extend our research to other SQA sites, such as AskUbuntu, and to take additional Stack Overflow-related downstream tasks into account.

9 Data Availability

The replication package of the data and code used in this work is available at https://figshare.com/s/7f80db836305607b89f3.

Footnotes

6
Please note that we could not apply further pre-training to seBERT due to limited computational resources for handling the BERT \(_{\text{Large}}\) architecture.
8
https://github.com/maxxbw54/Post2Vec
9
https://github.com/soarsmu/PTM4Tag
10
https://github.com/Moshiii/CLEAR-replication
11
https://github.com/Anonymousmsr/ASIM

References

[1]
Md. Ahasanuzzaman, Muhammad Asaduzzaman, Chanchal K. Roy, and Kevin A. Schneider. 2018. Classifying Stack Overflow posts on API issues. In Proceedings of the 2018 IEEE 25th International Conference on Software Analysis, Evolution, and Reengineering (SANER’18). IEEE, Los Alamitos, CA, 244–254.
[2]
Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified pre-training for program understanding and generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tür, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (Eds.). Association for Computational Linguistics, 2655–2668. DOI:
[3]
Amirreza Shirani, Bowen Xu, David Lo, Thamar Solorio, and Amin Alipour. 2019. Question relatedness on Stack Overflow: The task, dataset, and corpus-inspired models. In Proceedings of the AAAI Reasoning for Complex Question Answering Workshop.
[4]
Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. CoRR abs/2004.05150 (2020). https://arxiv.org/abs/2004.05150
[5]
Stefanie Beyer, Christian Macho, Massimiliano Di Penta, and Martin Pinzger. 2018. Automatically classifying posts into question categories on Stack Overflow. In Proceedings of the 2018 IEEE/ACM 26th International Conference on Program Comprehension (ICPC’18). IEEE, Los Alamitos, CA, 211–221.
[6]
Liang Cai, Haoye Wang, Qiao Huang, Xin Xia, Zhenchang Xing, and David Lo. 2019. BIKER: A tool for bi-information source based API method recommendation. In Proceedings of the ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Marlon Dumas, Dietmar Pfahl, Sven Apel, and Alessandra Russo (Eds.). ACM, New York, NY, 1075–1079. DOI:
[7]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating large language models trained on code. CoRR abs/2107.03374 (2021). https://arxiv.org/abs/2107.03374
[8]
Xinyun Chen, Chang Liu, and Dawn Song. 2018. Tree-to-tree neural networks for program translation. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett (Eds.). Curran Associates, 2552–2562. https://proceedings.neurips.cc/paper/2018/hash/d759175de8ea5b1d9a2660e45554894f-Abstract.html
[9]
Matteo Ciniselli, Nathan Cooper, Luca Pascarella, Antonio Mastropaolo, Emad Aghajani, Denys Poshyvanyk, Massimiliano Di Penta, and Gabriele Bavota. 2022. An empirical study on the usage of transformer models for code completion. IEEE Transactions on Software Engineering 48, 12 (2022), 4818–4837. DOI:
[10]
Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In Proceedings of the 8th International Conference on Learning Representations (ICLR’20). https://openreview.net/forum?id=r1xMH1BtvB
[11]
Norman Cliff. 2014. Ordinal Methods for Behavioral Data Analysis. Psychology Press.
[12]
William Jay Conover. 1999. Practical Nonparametric Statistics. Vol. 350. John Wiley & Sons.
[13]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, 4171–4186. DOI:
[14]
Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A pre-trained model for programming and natural languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Vol. EMNLP 2020. Association for Computational Linguistics, 1536–1547. DOI:
[15]
Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, and Sunghun Kim. 2016. Deep API learning. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering.
[16]
Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin B. Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou. 2021. GraphCodeBERT: Pre-training code representations with data flow. In Proceedings of the 9th International Conference on Learning Representations (ICLR’21). https://openreview.net/forum?id=jLoC4ez43PZ
[17]
Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don’t stop pretraining: Adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964 (2020).
[18]
Junda He, Bowen Xu, Zhou Yang, DongGyun Han, Chengran Yang, and David Lo. 2022. PTM4Tag: Sharpening tag recommendation of stack overflow posts with pre-trained models. CoRR abs/2203.10965 (2022). DOI:
[19]
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[20]
Qiao Huang, Xin Xia, Zhenchang Xing, David Lo, and Xinyu Wang. 2018. API method recommendation without worrying about the task-API knowledge gap. In Proceedings of the 2018 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE’18). 293–304.
[21]
Qiao Huang, Xin Xia, Zhenchang Xing, David Lo, and Xinyu Wang. 2018. API method recommendation without worrying about the task-API knowledge gap. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, Marianne Huchard, Christian Kästner, and Gordon Fraser (Eds.). ACM, New York, NY, 293–304. DOI:
[22]
Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. CodeSearchNet challenge: Evaluating the state of semantic code search. CoRR abs/1909.09436 (2019). http://arxiv.org/abs/1909.09436
[23]
Paras Jain, Ajay Jain, Tianjun Zhang, Pieter Abbeel, Joseph Gonzalez, and Ion Stoica. 2021. Contrastive code representation learning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). Association for Computational Linguistics, 5954–5971. DOI:
[24]
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436–444.
[25]
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault (Eds.). Association for Computational Linguistics, 7871–7880. DOI:
[26]
Can Li, Ling Xu, Meng Yan, and Yan Lei. 2020. TagDC: A tag recommendation method for software information sites with a combination of deep learning and collaborative filtering. Journal of Systems and Software 170 (2020), 110783. DOI:
[27]
Jinfeng Lin, Yalin Liu, Qingkai Zeng, Meng Jiang, and Jane Cleland-Huang. 2021. Traceability transformed: Generating more accurate links with pre-trained BERT models. In Proceedings of the 43rd IEEE/ACM International Conference on Software Engineering (ICSE’21). IEEE, Los Alamitos, CA, 324–335. DOI:
[28]
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692 (2019). http://arxiv.org/abs/1907.11692
[29]
Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. 2021. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664 (2021).
[30]
Antonio Mastropaolo, Luca Pascarella, and Gabriele Bavota. 2022. Using deep learning to generate complete log statements. In Proceedings of the 44th IEEE/ACM 44th International Conference on Software Engineering (ICSE’22). ACM, New York, NY, 2279–2290. DOI:
[31]
Antonio Mastropaolo, Simone Scalabrino, Nathan Cooper, David Nader-Palacio, Denys Poshyvanyk, Rocco Oliveto, and Gabriele Bavota. 2021. Studying the usage of text-to-text transfer transformer to support code-related tasks. In Proceedings of the 43rd IEEE/ACM International Conference on Software Engineering (ICSE’21). IEEE, Los Alamitos, CA, 336–347. DOI:
[32]
Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2023. CodeGen: An open large language model for code with multi-turn program synthesis. In Proceedings of the 11th International Conference on Learning Representations (ICLR’23).
[33]
Jiayan Pei, Yimin Wu, Zishan Qin, Yao Cong, and Jingtao Guan. 2021. Attention-based model for predicting question relatedness on Stack Overflow. In Proceedings of the 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR’21). 97–107.
[34]
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Marilyn A. Walker, Heng Ji, and Amanda Stent (Eds.). Association for Computational Linguistics, 2227–2237. DOI:
[35]
Luca Ponzanelli, Andrea Mocci, Alberto Bacchelli, Michele Lanza, and David Fullerton. 2014. Improving low quality stack overflow post detection. In Proceedings of the 30th IEEE International Conference on Software Maintenance and Evolution. IEEE, Los Alamitos, CA, 541–544. DOI:
[36]
Chen Qu, Liu Yang, Minghui Qiu, W. Bruce Croft, Yongfeng Zhang, and Mohit Iyyer. 2019. BERT with history answer embedding for conversational question answering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Benjamin Piwowarski, Max Chevalier, Éric Gaussier, Yoelle Maarek, Jian-Yun Nie, and Falk Scholer (Eds.). ACM, New York, NY, 1133–1136. DOI:
[37]
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Preprint.
[38]
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019), 9.
[39]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21, 1 (2020), 5485–5551.
[40]
Xiaoxue Ren, Zhenchang Xing, Xin Xia, Guoqiang Li, and Jianling Sun. 2019. Discovering, explaining and summarizing controversial discussions in community Q&A sites. In Proceedings of the 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE’19). IEEE, Los Alamitos, CA, 151–162.
[41]
Riccardo Rubei, Claudio Di Sipio, Phuong Thanh Nguyen, Juri Di Rocco, and Davide Di Ruscio. 2020. PostFinder: Mining stack overflow posts to support software developers. Information and Software Technology 127 (2020), 106367. DOI:
[42]
Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45, 11 (1997), 2673–2681.
[43]
Chi Sun, Luyao Huang, and Xipeng Qiu. 2019. Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary sentence. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, 380–385. DOI:
[44]
Jeniya Tabassum, Mounica Maddela, Wei Xu, and Alan Ritter. 2020. Code and named entity recognition in StackOverflow. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 4913–4926. DOI:
[45]
Henry Tsai, Jason Riesa, Melvin Johnson, Naveen Arivazhagan, Xin Li, and Amelia Archer. 2019. Small and practical BERT models for sequence labeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (Eds.). Association for Computational Linguistics, 3630–3634. DOI:
[46]
Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2019. An empirical study on learning bug-fixing patches in the wild via neural machine translation. ACM Transactions on Software Engineering and Methodology 28, 4 (2019), Article 19, 29 pages. DOI:
[47]
Rosalia Tufano, Simone Masiero, Antonio Mastropaolo, Luca Pascarella, Denys Poshyvanyk, and Gabriele Bavota. 2022. Using pre-trained models to boost code review automation. In Proceedings of the 44th IEEE/ACM 44th International Conference on Software Engineering (ICSE’22). ACM, New York, NY, 2291–2302. DOI:
[48]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (Eds.). Curran Associates, 5998–6008. https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
[49]
Julian Von der Mosel, Alexander Trautsch, and Steffen Herbold. 2023. On the validity of pre-trained transformers for natural language processing in the software engineering domain. IEEE Transactions on Software Engineering 49, 4 (2023), 1487–1507. DOI:
[50]
Yue Wang, Weishi Wang, Shafiq R. Joty, and Steven C. H. Hoi. 2021. CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-Tau Yih (Eds.). Association for Computational Linguistics, 8696–8708. DOI:
[51]
Huihui Wei and Ming Li. 2017. Supervised deep features for software functional clone detection by exploiting lexical and syntactical information in source code. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI’17). 3034–3040. DOI:
[52]
Moshi Wei, Nima Shiri Harzevili, Junjie Wang, Yuchao Huang, and Song Wang. 2022. CLEAR: Contrastive learning for API recommendation. In Proceedings of the 44th International Conference on Software Engineering (ICSE’22).
[53]
Xin Xia, Lingfeng Bao, David Lo, Pavneet Singh Kochhar, Ahmed E. Hassan, and Zhenchang Xing. 2017. What do developers search for on the web? Empirical Software Engineering 22 (2017), 3149–3185.
[54]
Bowen Xu, Thong Hoang, Abhishek Sharma, Chengran Yang, Xin Xia, and David Lo. 2022. Post2Vec: Learning distributed representations of Stack Overflow posts. IEEE Transactions on Software Engineering 48, 9 (2022), 3423–3441.
[55]
Bowen Xu, Amirreza Shirani, David Lo, and Mohammad Amin Alipour. 2018. Prediction of relatedness in Stack Overflow: Deep learning vs. SVM: A reproducibility study. In Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, Markku Oivo, Daniel Méndez Fernández, and Audris Mockus (Eds.). ACM, New York, NY, Article 21, 10 pages. DOI:
[56]
Bowen Xu, Zhenchang Xing, Xin Xia, and David Lo. 2017. AnswerBot: Automated generation of answer summary to developers’ technical questions. In Proceedings of the 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE’17). IEEE, Los Alamitos, CA, 706–716.
[57]
Chengran Yang, Bowen Xu, Junaed Younus Khan, Gias Uddin, Donggyun Han, Zhou Yang, and David Lo. 2022. Aspect-based API review classification: How far can pre-trained transformer model go? In Proceedings of the 2022 IEEE International Conference on Software Analysis, Evolution, and Reengineering(SANER’22). IEEE, Los Alamitos, CA.
[58]
Ting Zhang, Bowen Xu, Ferdian Thung, Stefanus Agus Haryono, David Lo, and Lingxiao Jiang. 2020. Sentiment analysis for software engineering: How far can pre-trained transformer models go? In Proceedings of the 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME’20). IEEE, Los Alamitos, CA, 70–80.
[59]
Pingyi Zhou, Jin Liu, Xiao Liu, Zijiang Yang, and John C. Grundy. 2019. Is deep learning better than traditional approaches in tag recommendation for software information sites? Information and Software Technology 109 (2019), 1–13. DOI:
