SE3M

SE3M: A Model for Software Effort Estimation Using Pre-trained
Embedding Models
Eliane M. De Bortoli Fávero Dalcimar Casanova Andrey Ricardo Pimentel
elianedb@utfpr.edu.br dalcimar@utfpr.edu.br andrey@inf.ufpr.br
UTFPR - Federal University of UTFPR - Federal University of UFPR - Federal University of Paraná
Technology Paraná Technology Paraná Curitiba, PR, Brazil
Pato Branco, PR, Brazil Pato Branco, PR, Brazil
ABSTRACT points per use cases. These models make use of Machine Learning
techniques (e.g. linear regression, neural networks) and are also
arXiv:2006.16831v1 [cs.SE] 30 Jun 2020
Estimating effort based on requirement texts presents many chal-

lenges, especially in obtaining viable features to infer effort. Aiming called models by analogy [65]. According to the authors, in some
to explore a more effective technique for representing textual re- ways, this method is a systematic way of judging specialists, who
quirements to infer effort estimates by analogy, this paper proposes are already experts in seeking similar situations to inform their
to evaluate the effectiveness of pre-trained embeddings models. For opinions. Manikavelan et al. [47] says that these models are built
this, two embeddings approach, context-less and contextualized from historical data from projects to generate custom models.
models are used. Generic pre-trained models for both approaches According to Idri and Abran [34], the accuracy of the estimate is
went through a fine-tuning process. The generated models were improved when the analogy is combined with Artificial Intelligence
used as input in the applied deep learning architecture, with linear (AI) techniques to generate estimates. In this way, fuzzy systems,
output. The results were very promising, realizing that pre-trained decision trees, neural networks, and collaborative filtering are are
incorporation models can be used to estimate software effort based some examples of techniques that improve Analogy-based Software
only on requirements texts. We highlight the results obtained to Effort Estimation (ABSEE). Idri et al. [35] reinforces that software
apply the pre-trained BERT model with fine-tuning in a single effort estimation models by analogy are reproducible and closely re-
project repository, whose value is the Mean Absolute Error (MAE) semble human reasoning because they based on experience gained
is 4.25 and the standard deviation of only 0.17, which represents from past projects.
a result very positive when compared to similar works. The main ABSEE can fit either agile or traditional models as long as the
advantages of the proposed estimation method are reliability, the estimation approach is based on data and previous team experi-
possibility of generalization, speed, and low computational cost ences to estimate software projects. One challenge that has been
provided by the fine-tuning process, and the possibility to infer presented for using these techniques, especially in agile models, is
new or existing requirements. the lack of project data and their requirements in the early stages
of the development process. The basic specification of software
KEYWORDS requirements used in these models is the user story, which is user
needs, usually written informally [20].
Software effort estimation, pre-trained model, context-less embed-
Assigning effort estimates to software requirements, especially in
ding, contextualized embedding, domain-specific model, BERT
the early stages of development, becomes quite critical as it depends
on the empirical expertise of the experts involved (e.g. project
1 INTRODUCTION managers and systems analysts), as there are not always complete
Estimating software effort is a challenging and important activity records of historical data on projects and requirements. But this
in the software development process. This activity depends on the is still a limitation for most companies, as in most cases there are
success of other crucial aspects of a project, mainly related to the only textual requirements that briefly describe user needs. This fact
achievement of time and budget constraints, directly impacting on makes this task very complex and sometimes even unfeasible.
the quality of the software product developed. The success of any The difficulty with textual requirements (e.g. use cases, user
particular software project depends heavily on how accurate its stories) is related to intrinsic informality in many software devel-
effort estimates are [36]. An accurate estimate assists in contract opment processes, which makes it difficult to use as a basis for
negotiations, scheduling, and synchronization of project activities predicting software costs. This limitation occurs because these
and efficient allocation of resources. texts include a diversity of domain-specific information such as
The importance of the accuracy of estimates has explored by natural language text containing source code, numeric data (e.g.
studies in the field of software engineering (SE) published in re- IP addresses, artifact identification codes), among others. A very
cent years, such as: [39], [5], [6], [28], [3], [52] and [37]. These common aspect is the occurrence of different words, but they are
studies continually seek to explore computational techniques indi- used in the same context; in this case, they should be considered
vidually or in combination, always seeking to achieve better levels similars (polysemy) because their context is similar. Or, equal words
of precision for effort estimation techniques. (ambiguous), but applied in different contexts, therefore, should not
Among the existing classifications for software effort estimation be represented in the same way.
techniques, we highlight in this article non-algorithmic models - On the other hand, in the AI world, especially in the Natural
which do not use predefined metrics, such as points per function and Language Process (NLP) field, word embedding methods mainly
2020-07-01 00:53. Page 1 of 1–17.
Eliane M. De Bortoli Fávero, Dalcimar Casanova, and Andrey Ricardo Pimentel
aim to capture the semantics of a given word in a specific context. Although there has been a lot of research on the application
This method allows words to be represented densely and with of word embedding in various areas, so far, only [37] and [17]
low dimensionality, facilitating machine learning tasks that use have sought to apply embeddings to estimate or run software. No
textual characteristics. Breakthroughs in training word embedding research has explored the application of contextualized pre-trained
models for a variety of purposes began in years recent with the embedding models following the task of ABSEE.
emergence of Word2Vec [51] and GloVe [56], enabling models to In this way, this article aims to present a model for the in-
learn from a very large corpus. Thus, contextual representations ference of effort estimates by analogy, both for existing and
through embeddings models have been very useful in identifying new software requirements, using contextualized pre-trained
context-sensitive similarity [32], disambiguation of the meaning of embeddings models, having as input the exclusive use of tex-
the word [12], [15], the induction of the meaning of the word [38], tual requirements (e.g. user stories), generated in the initial
lexical substitution by the creation of a generic embeddings model stage of development. This approach was named Software Effort
[50], sentence complementation [45], among others. Estimation Embedding Model (SE 3 M). With this model, greater pre-
Some studies have been conducted specifically in the field of SE, cision is sought for this activity, since it has currently been carried
such as recommending domain-specific topics from Stack Overflow out based on human empiric experience. This characteristic makes
question tags [14], recommending similar bugs [71], sentiment these estimates quite subjective.
analysis in SE [13], embedding model using Word2Vec for the SE As noted by Howard and Ruder [31], even though deep learning
domain [24], ambiguity detection in requirements engineering [26], models have achieved state-of-the-art in many NLP tasks, these
among others. models are trained from scratch, requiring large datasets, and days
More specifically applied in the generation of software effort to converge. Thus, pre-trained embeddings models make it easy
estimates, the use of embeddings is highlighted in the studies by to perform NLP tasks related to SE, without the need for training
[37] and [17]. In the first case, Ionescu (2017) explores the use of from scratch and with a low computational cost. According to
word embeddings generated by a context-less approach (Word2Vec), the authors, this is possible through the fine-tuning technique,
aggregated with design attributes and textual metrics (e.g. TF-IDF), eliminating the need for a representative corpus.The fine-tuning
from which positive results were obtained. Choetkiertikul et al. [17] approach consists of changing minimal task-specific parameters
it seeks to infer estimates from the text of user stories, which were and is trained on the downstream tasks by simply fine-tuning all
given as input to a deep learning architecture, with an embedding pre-trained parameters [23]. Most language representation models
layer as input. However, these initiatives face two main limitations, (e.g. Word2Vec, ELMo, OpenAI GPT) are classified as context-less,
which make it difficult to solve in the specific field of SE. Are they: i.e. each token considers only words on the left or right as part of
its context [69]. This makes fine-tuning approaches difficult, where
(1) Domain-specific terms present their meanings changed
it is relevant to incorporate the context bidirectionally, the reason
according to the context in which they are used: in this
for using BERT models. BERT is a contextual representation model
article, domain-specific terms are a set of words common to
that solves the one-way constraint mentioned earlier, which will
a specific area (e.g. software engineering, medicine), among
be further explained in section 2.3.
which there are strong semantic relationships. Some studies
Therefore, the approach also aims to infer software effort esti-
were carried out, seeking to develop resources to facilitate
mation using pre-trained embeddings models, with and without
textual representation in software engineering [24, 68], but
fine-tuning on a SE specific corpus. Thus, with the results of this
no complete solution for the presentation of contexts. Still, re-
article, we intend to answer the following research questions (RQ):
garding the context representation, the textual requirements
are usually short, bringing an inherent difficulty in identi- • RQ1. Does a generically pre-trained word embedding model
fying the context to which they refer. This reality makes it show similar results with a software engineering pre-trained
difficult to extract significant characteristics from the ana- model?
lyzed texts, making the inference process difficult. • RQ2. Would embedding models generated by context-less
(2) Lack of domain-specific SE data to train smart models: methods (i.e. Word2Vec) be effective as models generated by
this aspect makes it very difficult to train deep neural net- contextualized methods (i.e. BERT)?
works, which tend to overfiting themselves in these small • RQ3. Are pre-trained embeddings models useful to a text-
training data sets, not reaching generalization. This reality is based software effort estimation?
no different for textual software requirements and becomes • RQ4. Are pre-trained embeddings models useful to a text-
more critical when we need these texts to be accompanied based software effort estimation, both on new and existing
by their labels, which should represent the effort required to projects?
implement them. • RQ5. Are the results found generalizable, aiming to generate
estimates of effort between projects or companies?
To solve both problems, this work explores the use of contextu-
alized pre-training embedding models (e.g. BERT [23]) to infer an It is worth mentioning that, unlike the more similar approaches
estimate of software effort from textual requirements. Pre-trained ([37], [17]), the proposed approach proposes to be generalizable,
models present the concept of transfer learning [31], or allow to that is, it should allow the generation of estimates between projects
solve these problems, because we can train or model on a generic and/or between companies, both for new requirements, as for ex-
dataset (solve a lack of data problems) and adjust it (solver o problem isting ones (e.g. maintenance). In addition, similar approaches that
of the specific meaning of words in different contexts). perform the estimation process through text representation, apply
2020-07-01 00:53. Page 2 of 1–17.
SE3 M: A Model for Software Effort Estimation Using Pre-trained Embedding Models
textual representation methods by embeddings without context, Thus, the estimation of software effort by analogy is perceived as
which disregard the actual context of each word. a very appropriate technique when the input resources are require-
The structure of this article to organize as follows. Session 2 ments specifications in the unstructured text format, as is the case
presents the background of software effort estimation and word em- of this research. Since it is possible to submit textual characteris-
beddings (context-less and contextualized), in sequence to present tics, drawn from these specifications, as input to machine learning
the related works. Session 4 presents the theoretical aspects nec- models (e.g. neural networks).
essary for the proposed approach. Then, in session 5 the results
of each step of the proposed method are presented, ending with
session 6, where the initial research questions are to answer before 2.2 Word Embeddings
concluding and presenting the future works. Unlike lexical dictionaries (e.g. WordNet), which basically consist
of a thesaurus, grouping words based on their meanings [25] and
2 BACKGROUND which are usually built with human support, a word embedding
model is made up of word representation vectors. Through these
In this section, we first introduce aspects related to software effort
vectors, it is possible to identify a semantic relationship between
estimation (Section 2.1). The following are the concepts of word
words in a given domain, based on the properties observed in a
embedding and the context-less and contextualized paradigms (Sec-
training body and their automatically created generation [51].
tions 2.2 and 2.3).
Word embeddings are currently being a strong trend in NLP.
Word embedding models use neural networks arquitectures to rep-
2.1 Software Effort Estimation resent each word as a dense vector with low dimensionality and
Various classifications for software effort estimation models have focusing on the relationship between words [51]. Such vectors are
been applied in the last decades, with small differences, according used independently to calculate similarities between terms and as
to each author’s point of view. According to Shivhare [66], software a basis of representation for NLP tasks (e.g. text classification, en-
effort estimation models can subdivide into algorithmic/parametric tity recognition, sentiment analysis). The word embedding models
and non-algorithmic. The first ones are those that use algorithmic emerged to solve some limitations imposed by the bag-of-words
models, applied to project attributes and/or requirements to calcu- (BOW) model, which usually present sparse and high dimensional-
late their estimate, presenting themselves as reproducible methods ity matrices.
in substitution to non-algorithmic human expert methods [41]. Ex- Word2Vec, one of the most popular methods for generating
amples of algorithmic models are COCOMO II [10] and Function word embeddings from a text corpus, is an unsupervised learn-
Point Analysis [18]. Non-algorithmic ones are those based on Ma- ing algorithm that automatically attempts to learn the relationship
chine Learning techniques (e.g. linear regression, neural networks) between words by grouping words that have similar meanings
and are also called models by analogy, which rely on historical data into similar clusters [61]. In the Word2Vec model [51], a neural
to generate custom models that can learn from this data [47]. network is trained to represent each word in the vocabulary as
Chiu and Huang [16] point out that the ABSEE estimate deals an n-dimensional vector. The general idea is that the distance be-
with the process of identifying one or more historical projects tween the vectors representing the word embedding is smaller if
similar to the target project, and from them infer the estimate. In the corresponding words are semantically more similar (or related).
other words, but using the same line of reasoning, Shepperd’s [65], Pennington et al. [56] adds that for the generation of these vectors,
says that the basis for the ABSEE technique is to describe (in terms the method captures the distributional semantics and co-occurrence
of several variables) the project that must be estimated, and then, statistics for each word in the presented training corpus.
to use this description to find other similar projects that have been The Word2Vec model internally implements two neural network-
completed. In this way, the known effort values for these completed based approaches: Common Bag of words (CBOW) and skip-gram.
projects can be used to create an estimate for the new project. Both are models for word embedding widely used in information
Therefore, the ABSEE is classified as non-algorithmic. Idri and retrieval tasks. The goal of the CBOW model is to predict the current
Abran [36] also classify a technique by analogy as a machine learn- word based on its context, while the skip-gram model is intended
ing technique. These authors further point out that machine learn- to predict the current surrounding words for the given word. For
ing models have also gained significant attention for effort estima- both cases, the model consists only of a single weight matrix (in
tion purposes, as they can model the complex relationship between addition to the word analyzed), which results in training capable of
effort and software attributes (cost factors), especially when this capturing semantic information [51]. After training, each word is to
relationship is not linear, and it does not appear to have any prede- map to a low dimension vector. Results presented by their authors
termined form. Analog-based reasoning approaches have proven [51] show that words with similar meanings are much closer to
to be promising in the field of software effort estimation, and their those with different meanings. Generally speaking, the key concept
use has increased among software researchers [34]. of Word2Vec is to find words that share common contexts in the
Idri and Abran [36] conducted a systematic review of the litera- training corpus, close to the vector space compared to others.
ture on ABSEE and found that these techniques outperform other Word2Vec, like other models (e.g. Glove, FastText) is considered
prediction techniques. This conclusion to support by most of the a context-less (static) method for generating pre-trained textual
works selected in their mapping. Among the main advantages of representations. This feature means that these models have con-
ABSEE is the similarity with human reasoning by analogy and, straints regarding the representation of the context of words in a
therefore, they are easier to understand. text, making sentence-level or even fine-tuning token tasks difficult.
2020-07-01 00:53. Page 3 of 1–17.
Also, according to their authors [51], these models are to consider With this, BERT combines the pre-training tasks of both tasks
too shallow, as they represent each word by only one layer, and (MLM and NSP), making it a task-independent model. For this,
there is a limit to how much information they can capture. And their authors provided pre-trained models in a generic corpus but
finally, these models do not consider word polysemy, that is, the allowing fine-tuning. It means that instead of taking days to pre-
same word to use in different contexts can have different meanings, workout, it only takes a few hours. According to the authors of
which is not dealt with by these models, BERT [23], a new state of the art has been achieved in all NLP tasks
The Figure 1 presents a non-exhaustive differentiation between they have attempted (e.g. Question Answering (QA) and Natural
contextualized and context-less models. As we can see, Word2Vec Language Inference (NLI)).
is a form of static word embeddings such as Glove [56], Fast Text
[11], among others.
3 RELATED WORKS
Performing a systematic mapping focusing on ABSEE, we found
2.3 BERT few studies, as ([1], [33], [52], [17], [73], [37], [53], [4]), that obtain
The BERT is an innovative method, considered the state of the the effort estimate from text using texts.
art in pre-trained language representation [23]. BERT models are It was observed in most of the studies presented the bag of
considered contextualized or dynamic models, and have shown words approaches are applied, considering word-level features (e.g.
much-improved results in several NLP tasks [21], [57], [60], [31] tf, tf-idf, part-of-speech tag), which to treated individually, that, is
as sentiment classification, calculation of semantic tasks of textual based on quantitative and qualitative data, not employing specific
similarity and recognition of tasks of textual linking. knowledge about the text structure of the requirements, ignoring
This model originated from various ideas and initiatives aimed aspects of context. Only two studies ([37], [17]) differ from these
at textual representation that have emerged in the area of NLP in attributes, as they apply word embedding models, but none of them
recent years, such as: coVe [48], ELMo [57], ULMFiT [31], CVT [19], use pre-trained embeddings models.
context2Vec [49], the OpenAI transformer (GPT and GPT-2) [60] Ionescu [37] proposed a machine learning-based method for
and the Transformer [69]. estimating effort for software development, using as input the text
BERT is characterized as a dynamic method, mainly because it of project management requirements and metrics. The authors
has an attention mechanism, also called Transformer [23], which applied an original statistical preprocessing method to try out better
allows analyzing the context of each word in a text individually, results. First, a custom vocabulary was made. It is done using the
including checking if each word has been previously used in a text standard deviation of the effort of those requirements where each
with the same context. This allows the method to learn contextual word appears in the training set. For each requirement, a percentage
relationships between words (or subwords) in a text. of your words is maintained based on this statistic. The resulting
BERT consists of several Transformer models [69] whose pa- requirements are concatenated with available project metrics. A
rameters are pre-trained on an unlabeled corpus like Wikipedia modified TF-IDF calculation was also used, and numerical data
and BooksCorpus [74]. It can say that for a given input sentence, were produced to form a bag-of-words, which is used as input to a
BERT “looks left and right several times” and outputs a dense vector linear regression algorithm.
representation for each word. For this reason, BERT is classified as Choetkiertikul et al. [17] proposed the use of deep learning. Two
a profoundly two-way model because it learns two representations neural network models were to combine into the proposed deep
of each word, one on the right and one on the left, and this learning learning architecture: The Long Short Term Memory (LSTM), which
to repeat n times. These representations are concatenated to obtain are long term memories and the recurrent highway network. The
a final representation to use in future tasks. model is trainable from start to finish with raw input data that has
The preprocessing model adopted by BERT accomplishes two only gone through a preprocessing step. The model learns from the
main tasks: masked language modeling (MLM) and next sentence story point estimated by previous projects to predict the effort of
prediction (NSP). In the MLM task, the authors argue [23] that it is new stories. This proposal [17] uses context-less word embeddings
possible to predict a particular masked word from the context. For as input to the LSTM layer. As input data, the title and description
example, let’s say we have a phrase: "I love reading data science of the requirements report were combined into a single sentence,
articles." We want to train a contextualized language model. In this where the description follows the title.
case, you need to replace "data" with "[MASK]". It is a token to The embeddings vectors generated in this first layer serve as
indicate that it is missing. We will then train the model so that it input to the LSTM layer, which then generates a representation
can predict "date" as the missing token: "I love reading articles from vector for the full sentence. It should be to note that this process
[MASK] science". of training the embedding layer and then the LSTM layer, to then
This technique aims to make the model learn the relationship generate the embedding vectors for each sentence, becomes com-
between words, improving the level of learning, avoiding a possible putationally expensive. For this reason, the authors pre-train these
“vicious cycle”, in which the prediction of a word to base on the layers, and only then make these models available for use. This
word itself. Devlin et al. [23] used 15-20% of words as masked words. sentence vector is then to feed into the recurrent highway network,
The task of NSP is to learn the relationship between sentences. which transforms the document vector several times before pro-
As with MLM, given two sentences (A and B), we want to know if ducing a final vector that represents each sentence. Finally, the
B is the next sentence after A in the corpus or if it would be any sentence vectors undergo simple linear regression, predicting their
sentence. effort.
2020-07-01 00:53. Page 4 of 1–17.
Figure 1: Differentiation between contextualized and context-less (static) embedding models (adapted of Haj-Yahia et al., 2019)
The possible bottleneck of this approach is the difficulty to feed- It is important to note the nature of the software’s requirement
ing the model with new data. This feedback would fine-tune mod- texts, which are usually presented informally, that is, they do not
els, making them increasingly accurate. This difficulty occurs be- have a standard format. In addition, these texts have a series of very
cause with each new insertion into the dataset, the pre-training specific elements (e.g. links, numeric addresses, method names).
process, and consequently, its cost needs to be to repeat. Besides, Considering these attributes, we propose that software automati-
Choetkirtkul’s [17] method realizes inter-project prediction, which cally learn the characteristics of the original text.
is not repository independent. For the purpose of a comparison, pre-trained models generated
Another aspect to point is that, because requirement texts do from two approaches for modeling language will be applied: context-
not usually have a structured form, more words may not represent less and contextualized models. Figure 2 presents the steps that
more complexity [73] and [33] and therefore greater effort. Actually, make up the overall architecture of the proposed model, which will
it is the context that influences the effort most. be described below.
As a differential, our visa method for making effort estimation
(1) Data collection and pre-processing: in this step, the data
in a generic way, that is, using a single requirements repository,
collection and preparation procedures are performed for
independent of projects or repositories. A method of incorporating
later use during the feature learning step, which will gen-
contextualized words (i.e. BERT) allows you to deeply learn the
erate a context vector (i.e. numerical representation) for a
context of each selected word, solving problems of polysemy and
given requirement text. For the proposed model, two corpus
ambiguity. When feeding the model with new training data, the ad-
of texts containing software requirements will be required:
justment can be made a few hours, that is, cheaper computationally
corp_SE e corpPret_SE (as shown in Table 2). One of them
than the generation of a pre-trained model from zero.
will be the fine-tuning corpus, in which the texts are not
labeled. The other corpus will be used during the training
and testing stages of the inference model, in which each text
4 CONSTRUCTION APPROACH will be labeled with their respective efforts. The texts for
The main objective of this research is to evaluate the efficacy of both corpus go through basic pre-processing procedures (e.
pre-trained embedding models contextualized according to effort g. removal of special characters and stopwords).
estimation and based on requirement texts. (2) Textual representation model: this step consists in apply-
A requirement in this context can be a case of use, a user’s ing methods of deep learning to the textual characteristics
story, or any software requirement, provided that the data is in text [8], aiming to generate the vector representation for each
format, and aligned with the target effort. For this paper, user’s of the texts. Therefore, these methods do not use manual
stories serve as the input to the proposed model, and are composed activity for the generation of characteristics. To achieve
of their description and their effort, provided in points per story. this goal, methods to generate context-less embeddings (e.g.
This data’s textual format requires some basic preprocessing (e.g. Word2Vec) and contextualized embeddings (e.g. BERT) are
removal of special characters and stop words) before their use in used. Therefore, this step comprises the fine-tuning proce-
the models. dure of the generic pre-trained models for both approaches,
2020-07-01 00:53. Page 5 of 1–17.
that sequence, generating a representation for the sentence

(that is, for each text). Subsequently, two dense layers with
nonlinear activation functions are used (the dimensions of
the hidden layers are 50 and 10 respectively), ending with a
linear regression layer. This is an even smaller architecture,
when compared to the one used in the work of [17], exclud-
ing the role of recurring networks [27], [62] and applying a
single sequence feedforward after the representation layer.
This is possible due to the feature learning methods applied
in the previous step. A simplified architecture also aims to
not mask the results generated from the expected inputs to
the network.
After performing an inference of the estimates for each textual
requirement for the training and testing sets, metrics for perfor-
mance evaluation were applied, identified in the learning model
used (according to the section 4.1), in order to analyze the feasibility
of the application of the model developed. This evaluation process
was carried out using applicable statistical and graphic techniques,
always related to the real development environment.
4.1 Evaluation Metrics

Figure 2: General architecture of the proposed model.
The evaluation metrics used to evaluate model performance, which
refers to the distance between the test set values and the predicted
values. For this, some metrics were selected, which have been rec-
which are given as input to the inference model. For fine- ommended for the evaluation of regression-based software effort
tuning, an unlabeled corpus (corpPret_SE - according to Ta- prediction models [63],[17], [61], They are: Mean Absolute Error
ble 2) is applied together with generic pre-trained models for (MAE), Median Absolute Error (MdAE) and Mean Square Error
each incorporation approach (word2vec_base and BERT_base (MSE).
- according to Table 3). As an output, two pre-trained adjusted
models are generated: word2vec_SE and BERT_SE (according N
1 Õ
to Table6. Then, the pre-trained and adjusted models are MAE = |actual_e f f − estimated_e f f i | (1)
N i=1
used to extract the textual representations for each require-
ment that makes up the training and testing corpus, using Where N is the number of textual requirements (e.g. user stories)
appropriate pooling techniques (e.g. mean, sum) applied to that make up the test suite used to evaluate model performance,
embeddings for each word. This textual representation is actual_eff is the current effort measure, and estimated_effi is the
given by a matrix containing the number of samples from estimated effort measure for a given textual requirement i. We
the training and test corpuses in relation to the number of also used the Median Absolute Error (MdAE), suggested as a more
dimensions of the respective embeddings (context-less and robust metric for large outliers [17]. MdAE is defined as:
contextualized). The vector representation of each require-
ment is given as an input to the ABSEE inference model, MdAE = median|actual_e f f i − estimated_e f f i | (2)
aiming to learn and infer new estimates. The Mean Squared Error (MSE) metric, represented by Equation
(3) Inference model for ABSEE: the textual representation x, was also applied:
for the set of training and testing requirements is submitted v
u
as input to the inference model. This model is composed of t N
1 Õ
a deep learning architecture, which is considered to be quite MSE = (actual_e f f i − estimated_e f f i )2 (3)
simplified, when compared to VGG-type models [67], Resnet N i=1
[29], among others. It is important to note that the concept
of deep learning is not related to the number of layers of 4.2 Data Collection and Pre-Processing
neural networks that make up the architecture, but rather In order to obtain and prepare the data that makes up the training
to the fact that this architecture executes a deep learning and testing corpus, and the fine-tuning corpus, the data collection
of the text’s characteristics, through the process of learn- and pre-processing step was performed using API’s for the NLP,
ing features, which begins by representing the texts using based on models previously established by the literature. These
an embedding model. Therefore, the characteristics learned steps precede the process of representing textual characteristics,
during this process are applied through the embedding layer that is, the generation of the context vector for each requirement.
of the deep learning architecture used. Since the network Thus, in order to create the training and testing data set, a corpus
entry is a sequence of words (100 words are considered for specific to the software engineering context (corp_SE) was used,
each text), an LSTM layer has the function of processing composed of textual requirements, more specifically user’s stories
2020-07-01 00:53. Page 6 of 1–17.
Corpus Specification Aplication Labeled

[17], labeled according to their respective development efforts. It corp_SE Contains 23.313 user stories and con- Train and test YES
is important to highlight that, despite being referred to as user’s sists of 16 large open source projects
stories, the text with requirements does not have a standard struc- in 9 repositories (Apache, Appceler-
ator, DuraSpace, Atlassian, Moodle,
ture. Regarding the effort attributed to each requirement, it is worth Lsstcorp, Mulesoft, Spring e Talend-
mentioning that no single measurement scale was adopted (ex. Fi- forge).
bonacci). The corp_SE (Table 2) is considered by the authors to be corpPret_SE It consists of more than 290 thousand Fine-tuning NO
texts of software requirements of dif-
the first data set where the focus is on the level of requirements ferent projects (ex. Apache, Moodle,
(e.g. user’s stories) and not just on the project level, as in most data Mesos).
sets available for SE research. Table 2: Corpus used in the experiments carried out.
The requirements texts, as well as the effort given to each of them,
were obtained from large open sources from project management
systems (e.g. Jira), totaling 23.313 requirements (Table 1), which
were initially made available by [58]. Subsequently, [17] used the
same database to carry out his research, aiming to estimate software (shown in the Table (2), is used. It is not labeled, therefore the train-
effort by analogy. Despite the difficulty in obtaining the real effort ing carried out is unsupervised, and is composed of texts with
to implement a software requirement, the authors claim to have specifications of requirements, obtained from open source reposi-
been able to obtain the implementation time based on the situation tories, according to the procedure described by [17].
(status) of the requirement. Thus, the effort was obtained beginning While exploring the data available in corp_SE, some relevant
from the moment when the situation was defined as "in progress" aspects that interfere with the inference model’s settings were
until the moment when it was changed to "resolved". Thus, [17] observed. The histogram of Figure 3 allows for one to evaluate the
applied two statistical tests (Spearman’s and Pearson correlation) maximum number of words to be considered per text. The average
[75], which suggested a correlation between the points throughout number of words per text, accompanied by its standard deviation,
the history and their real effort. Therefore, this same database was is 53 ± 108.6.
applied to the research proposed for this paper. It is known that
these story points were estimated by human teams and, therefore,
may contain biases and, in some cases, may not be accurate, which
may cause some level of inaccuracy in the models.
Projct ID Description Requirements/project
0 Mesos 1680
1 Usergrid 482
2 Appcelerator Studio 2919
3 Aptana Studio 829
4 Titanium SDK/CLI 2251
5 DuraCloud 666
6 Bamboo 521
7 Clover 384
8 JIRA Software 352
9 Moodle 1166
10 Data Management 4667
11 Mule 889
12 Mule Studio 732
13 Spring XD 3526
14 Talend Data Quality 1381
15 Talend ESB 868
Total 23.313
Table 1: Number of textual requirements (user stories) and Figure 3: Histogram representing the number of words in
description of each of the 16 projects used in the experi- each sentence of the dataset of training and testing.
ments [17].
Figure 4 shows the frequency of distribution of the effort size

Typically, user’s stories are measured on a scale based on a in the training and testing database (corp_SE). It can be seen that
series of Fibonacci [64], called Planning Poker (e.g. 1, 2, 3, 5, 8, most of the samples have smaller efforts (between 1 and 8 points
13, 21, 40, 100) [20]. As there is no standardized use of this scale per story).
among the projects used to create corp_SE, there was no way of The cross validation k-fold method was applied in order to parti-
approximating the points by the history provided. Therefore, 100 tion the data set used to carry out the experiments (corp_SE). Thus,
possible predictions were considered, distributed over the corp_SE, a number of equal subsets (nsplit) was defined, represented by k
of which some are nonexistent, as can be seen in the histogram of with a value of 10. Thus, the data from the corp_SE were divided
Figure 4. In this way, the effort estimate was treated as a regression into a set of training and validation (90% of texts), and a set of tests
problem. (10% of texts). For each subset of the data, the mean and standard
For the fine-tuning process, which makes up the proposed model, deviation for the metrics applied in the performance evaluation
a corpus of specific texts from software engineering, the corpPret_SE were obtained, as described in the section 4.1.
2020-07-01 00:53. Page 7 of 1–17.
• A vocabulary file (vocab.txt) for mapping WordPiece [72] for

word identification.
• A configuration file (bert_config.json) that specifies the model’s
hyperparameters.
Then, these two generic models go through a fine-tuning process,
as presented in the section 4.4.
4.4 Fine-tuning
It is worth noting that the fine-tuning process consists of the use
of a pre-trained embedding model (trained on a generic dataset)
in an unsupervised way, which is adjusted, that is, retrained on a
known data set that is specific to the area of interest. In this case, the
fine-tuning was performed on the generic models word2vec_base
and BERT_base, using corpus corpPret_SE (shown in Table 2).
The following are the pipelines (Figures 5 and 6) used to adjust
Figure 4: Histogram representing the size of the effort in re- and generate the representation of the texts corp_SE. This represen-
lation to its frequency in the corp_SE. tation was used as input to the proposed sequential architecture.
4.3 Textual Representation Model

The purpose of the procedures described in this section is to gen-
erate models to represent the texts that make up the training and
test corpus. These representation models will be obtained from the
generic and adjusted pre-trained embeddings models, that must
consider the diversity of existing contexts. Thus, the representation
models (according to the Figure 2) serve as input to the proposed
sequential architecture.
To perform the experiments, two pre-trained generic word em-
beddings models were applied, one using the context-less approach
(Word2Vec) and the other the contextualized approach (BERT ).
Thus, for the context-less approach, a pre-trained model called
word2vec_base was used, the specifications of which are shown in
Table 3. As a contextualized model, a generic BERT model (BERT_base
uncased) had previously been pre-trained and made available by its Figure 5: Pipeline of the word2vec_base fine-tuning process
authors [23] for free use in PLN tasks, as specified in Table3. and generation of textual representation for the corp_SE.
Pre-trained model Specification
word2vec_base trained on a corpus from Wikipedia us-
ing the Word2Vec [51] algorithm. For
this, the following hyperparameters
A fine-tuning of the generic model word2vec_base (Figure 5) was
were used: number of dimensions of the performed using specific methods for this purpose provided by the
hidden layer = 100; method applied to Gensim library in the Python language. This process generated the
the learning task = cbow.
BERT_base Bert_base uncased: 12 layers for each to- word2vec_SE model. This adjusted model was used to generate the
ken, 768 hidden layers, 12 heads of at- average representation (see section 4.5) for each requirement text
tention, 110 million parameters. of the training and testing corpus (corp_SE).
The uncased specification means that
the text was converted to lower case The fine-tuning process of the pre-trained BERT model consists
before tokenization based on WordPiece, of two main steps [23]:
in addition, removes any accent marks.
This model was trained with english (1) Preparation of data for pre-training: initially the input
texts (Wikipedia) with lowercase letters.
data is generated for pre-training. This is done by converting
Table 3: Pre-trained models used in the performed experi- the input sentences into the format expected by the BERT
ments. model (using create_pretraining_data algorithm). As BERT
can receive one or two sentences as input, the model expects
an input format in which special tokens mark the beginning
The BERT_base model, as well as the other pre-trained BERT and end of each sentence, as shown in Table 4. In addition, the
models, offers 3 components [23]: tokenization process needs to be performed. BERT provides
• A TensorFlow checkpoint (bert_model.ckpt) that contains pre- its own tokenizer, which generates output as shown in Table
trained weights (consisting of 3 files). 5.
2020-07-01 00:53. Page 8 of 1–17.
Entry of two sentences Entry of a sentence

[CLS] The man went to the store. [SEP] [CLS] The man went to the store. [SEP]
After performing the fine-tuning, two new pre-trained models
He bought a gallon of milk. [SEP] are available, as shown in Table 6, which will also compose the
Table 4: Example of formatting input texts for pre-training experiments.
with BERT. It is noteworthy that the proposed model requires pre-training
only for the embedding layer. This allows, for example, for this pre-
trained model to be made available for other software engineering
tasks, or even for different effort estimation tasks. Thus, this pre-
Input sentence "Here is the sentence I want embeddings for."
Text after tokenizer [’[CLS]’, ’here’, ’is’, ’the’, ’sentence’, ’i’, ’want’, ’em’,
trained model may undergo successive adjustments, according to
’##bed’, ’##ding’, ’##s’, ’for’, ’.’, ’[SEP]’] the need of the task to which it will be applied.
Table 5: Example application of tokenizer provided by BERT.
Pre-trained models Specification
word2vec_SE consists of the word2vec_base model after fine-tuning with
the corpus corpPret_SE.
BERT_SE consists of the BERT_base model after fine-tuning with
the corpus corpPret_SE.
(2) Application of the pre-training method: the method used
for pre-training by BERT (run_pretraining) was made avail- Table 6: Adjusted pre-trained models applied to the per-
able by its authors. The necessary hyperparameters were formed experiments.
informed, the most important being:
• input_file: directory containing pre-formatted pre-training
data (as per step 1).
• output_dir: output file directory. 4.5 Obtaining Characteristics
• max_seq_length: defining the maximum size of the input After the fine-tuning was completed, processing was performed to
texts (set at 100). obtain textual representations from the four models of embeddings
• batch-size: maximum lot size (set at 32, per use guidance (word2vec_base, BERT_base, word2vec_SE and BERT_SE). For the
of the pre-trained model BERT_base. context-less embeddings model, represented by the word2vec_base
• bert_config_file: BERT model configuration file, supplied and word2vec_SE models (as shown in Figure 5), the embeddings
with the pre-trained model (bert_config.json). vectors of the words contained in each text were averaged [55, 70].
• init_checkpoint: files of the pre-trained model used con- As for the contextualized embeddings model, the textual rep-
taining the weights (bert_model.ckpt). resentations generated by the BERT model (according to Figure
For the generic BERT model (Table 3, we opted for its ver- 6) present a different structure from the context-less embeddings
sion Uncased_L-12_base, here called BERT_base (Table 3). The fine- models (e.g. Word2Vec). This is primarily due to the fact that the
tuning process for the BERT model also used the corpPret_SE (Figure number of dimensions of the embeddings vectors is not defined
6) and was performed as shown above. by the user, but by the model itself. Therefore, the number of di-
mensions of word embeddings for the model BERT_base is defined
in 768. In addition, each word in this model is represented by 12
layers (standard for BERT_base), with the need to pool [44] the
embeddings of some of the layers for each word. In order to define
the pooling strategies to be applied, the article by [23] was taken
as a basis.
Thus, one of the proposed strategies was used, considering that
there were no significant differences between the results obtained
with the other tested strategies. Thus, for the models BERT_base
and BERT_SE, the strategy chosen was to use the penultimate layer
of each word to generate a vector of average embeddings for each
sentence. More details on pooling strategies can be found in [2, 44].
4.6 Exploratory data analysis

Before presenting the results of the effort estimation, it is first
Figure 6: Pipeline of the fine-tuning process of the important to highlight some observable aspects regarding the vector
BERT_base model and generation of the textual representa- of textual representation obtained after the embeddings layer. These
tion for the corp_SE. sentence embeddings were generated from the models specified
in Tables 3 and 6, that is, generic and fine-tuned models for both
approaches (without context and contextualized).
The entire process, from data preparation to fine-tuning the In order to show the characteristic of the generated embeddings,
BERT model, used the algorithms produced in the repository [22], the t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm
in which [23] provides the full framework developed in the Python was applied. This algorithm has been used to represent complex
language. data graphically and in smaller dimensions, while preserving the
2020-07-01 00:53. Page 9 of 1–17.
relationships between neighboring words [46], which greatly facili- obtained according to the procedures presented in section 4.5, are
tates their understanding. given as a parameters to the layers dense of the architecture deep
Thus, Figure 7 shows sentence embeddings generated from the learning, as shown in Figures 8 and 9.
BERT_SE for each textual requirement. Only 100 instances of re-
quirements were included for each of the 16 projects analyzed (Table
1) (represented by “idProj”). The t_SNE algorithm reduced the 768-
dimension model (BERT standard) to just two dimensions, which
allowed for a visual analysis of the representations obtained and,
based on these representations, some conclusions were reached.
Figure 8: Architecture deep learning with the pre-trained em-

bedding layer using Word2Vec.
Figure 7: Embeddings generated by BERT_SE. The points rep-

resent the effort for each requirement, according to its size. Figure 9: Architecture deep learning with the pre-trained em-
The larger the point size, the greater the effort. bedding layer using BERT.
Table 7 present examples of requirement texts extracted from The embedding() layers (Figures 8 and 9) are represented by
2 different groups according to Figure 7. The texts, in each of the the vector of pre-trained weights for each sentence. These vectors
tables, have minimum distances between them. are processed by two dense nonlinear layers, followed by a linear
In this example (Table 7), the context of the requirements pre- regression layer. The output is the estimate of the predicted effort
sented is related to connection and database operations. Among (e.g. points per story).
the highlighted words (in bold), one can find, for example "sqoop". Each instance of textual requirement submitted to the Embed-
This word is part of to an application that transfers data between ding () layer is represented by an average vector representing each
relational databases and Hadoop 1 . Therefore, the identification of sentence. A sentence consists of a maximum of 100 words, each
similar contexts is more clearly perceived, even with very different of which is represented by 100 text embeddings for the Word2Vec
words, as is the case of "sqoop", "sqlserver" and "persistent". These models and 768 dimension embeddings according to BERT models,
are different words, but part of the same context. Thus, although as justified below.
the groupings did not demonstrate clear groups, either by project The maximum number of words per text was defined based on
or by effort, the groupings demonstrated, at a certain level, a rep- the representation of the histogram shown in Figure 3, which shows
resentation of requirements from the same context, even if from that this number would include most of the sentences used in the
different projects and/or efforts. training and test database, with reduced data loss.
As presented in the architecture specification SE 3 M (section 4),
4.7 Effort Estimation Model Settings it is a very simple architecture, post layer of embedding, consisting
only of two dense non-linear layers and a linear regression layer.
In this section, the stages of the SE 3 M model will be presented,
The fact that the deep learning architecture is very simplified was
in which representation vectors for each textual requirement are
purposeful, considering that the objective of this thesis is to present
1 Hadoop is an open source software platform for the storage and distributed process- how to infer effort estimates in software projects by analogy using
ing of large databases. The services Hadoop provides includes storage, processing,
access, governance, security and data operations, making use of powerful hardware pre-trained embeddings models, verifying whether these models
architectures, which are usually on loan. are promising for text-based software effort estimation. In this way,
2020-07-01 00:53. Page 10 of 1–17.
ID Text
Texto 18994 for the sqlserver and postgresql connection can not show the structure correctly.1. create apostgresql sqlserver connection set or not set the catalog
parameter(doesn’t set the textbfschema). 2. check the structure, only one schema show under each catalog. in fact i have several. please check it same issue as
informix db.
Texto 17552 sqoop - unable to create job using merge command as a user, i need to use xd sqoop module to support the merge command. currently, the sqoop runner
createfinalarguments method forces the requirement for connect, username and password options which are not valid for the merge option. a check of the
module type to not force these options being assigned to sqoop arg list would be preferred.
Texto 15490 need to create a persistent-job-registry in order to hook up the to get access to all the jobs available the job registry has to be shared. currently the only
implementation is the mapjobregistry. testability. the admin will need to be see all jobs created by its containers.
Table 7: Contextual similarity between grouped texts.
a more robust architecture could mask the results generated at each Table 9 presents the average values for the MAE, MdAE and MSE
of the different network entrances. measurements obtained after the application of cross-validation 10-
A textual requirement, also referred to in this study as a sentence, fold during the execution of the proposed sequential architecture,
is represented by an average vector of the word embeddings that in which each model was used as input for pre-trained embedding
compose this sentence. This average vector is generated from each available for each of the experiments (according to Table 8).
of the generated models (Tables 3 and 6, considering the particular- As can be seen in Table 9, the models that underwent fine-tuning
ities of each applied approach (without context and contextualized), (with SE suffix), regardless of the approach used, gave better results
as presented in section 4.5. Therefore, each generated embedding for MAE, MSE and MdAE. Thus, it is clear that a pre-trained embed-
model is represented as a matrix, where each line represents an ding model, in which fine-tuning is performed with a specific corpus
average embedding vector for a given textual requirement. Thus, of the task domain, presents a better performance than a pre-trained
each embedding model has the same number of samples, and what model with a generic corpus, as is the case with word2vec_base and
varies is the dimension applied. BERT_base.
Previous tests were performed using the grid search method. For This can be proven by comparing the generic context-less embed-
the embeddings models without a context, tests were performed ding model (word2vec_base) to the same fine-tuned model (word2vec_SE),
with dimensions 50, 100, and 200 with the best results presented by where a 5% improvement is seen in relation to the MAE value for
dimensions 50 and 100. As there was no significant variation for the second model. Likewise, when applying the contextualized em-
the MAE between both dimensions using the same neural network bedding model with fine-tuning (BERT_SE), an improvement of
architecture, the number of dimensions of embeddings was fixed 7.8% for the MAE was observed in relation to the generic model
at 100. For the transformers BERT the standard dimension defined BERT_base. Thus, it is noted that, in general, the results achieved
for the BERT_base model was used, that is, equal to 768. improved after adjusting the models with a specific corpus of the
For the training of the learning model, it was necessary to con- domain .
figure some hyperparameters. Therefore, the Adam optimizer [40], If the MAE value obtained for the best model in Table 9 (BERT_SE),
learning rate 0.002 was used and the size of the batch_size was 128. which was 4.03, is applied in practice, this will indicate that, for a
Twenty epochs and an early stopping mechanism were defined, given user story, in if the real effort is 5 points per story, the esti-
in which, if the MAE value remains stable for 5 epochs, the best mated effort value could be between 1 and 9 points per story. This
results are averaged. is one of the reasons why human participation in the estimation
process is indicated, in order to calibrate these effort values to new
4.8 Experiment Settings requirements.
The results obtained in this study will be presented below in order In terms of MSE, the improvement was of 3.5% for the word2vec_SE
to make a comparative analysis with the most similar study [17] we model, when compared to word2vec_base, and 13.5% for MdAE.
identified, in addition to answering the following research questions When evaluating MSE and MdAE for the contextualized approach,
defined initially. the fine-tuned model outperforms the generic model by 14.6% and
14.8%, respectively.
Thus, we conclude that the representation of the training and
5 RESULTS AND DISCUSSIONS
test model (corp_SE), when generated from pre-trained and adjusted
The results will be presented below, to make a comparative analysis embeddings models, improves the performance of activities such as
with the most similar study [17], in addition to answering the the ABSEE. To this end, it is estimated that the greater the volume
following research questions defined initially. and diversity of samples in the corpus used in fine-tuning (corp-
RQ1. Does a generically pre-trained word embedding model Pret_SE, the better the results can be. Therefore, this fact must be
show similar results to a software engineering pre-trained considered when giving continuity to this task in future studies.
model? RQ2. Would embedding models generated by context-less
To answer this question, experiments (E1, E2, E3 and E4) were methods (i.e. Word2Vec) be effective as models generated by
performed, with pre-trained models with and without fine-tuning, contextualized methods (i.e. BERT)?
for context-less and contextualized approaches, a task that aimed As stated by Ruder [31], “it only seems to be a question of time
to determine effort estimation. The approach consists of using a until pre-trained word embeddings (i.e. word2vec and similar) will
pre-trained embedding model as the only source of input in the be dethroned and replaced by pre-trained language models (i.e.
deep learning architecture used.
2020-07-01 00:53. Page 11 of 1–17.
Experiments Pre-trained model Sequential network architecture

E1 word2vec_base Embedding() + LSTM + Dense (not-linear) + Dense (not-linear) + Dense (linear)
E2 word2vec_SE Embedding() + LSTM + Dense (not-linear) + Dense (not-linear) + Dense (linear)
E3 BERT_base Embedding() + LSTM + Dense (not-linear) + Dense (not-linear) + Dense (linear)
E4 BERT_SE Embedding() + LSTM + Dense (not-linear) + Dense (not-linear) + Dense (linear)
E5 BERT_SE Embedding() + LSTM + Dense (not-linear) + Dense (not-linear) + Dense (softmax)
Table 8: Description of performed experiments.
Approach Pre-trained model MAE MSE MdAE

Context-less word2vec_base 4.66 ±0.14 100.26 ±7.04 2.9
word2vec_SE 4.36 ±0.31 89.9 ±14.37 2.5
Contextualized BERT_base 4.52 ±0.094 100.95 ±7.3 2.7
BERT_SE 4.25 ±0.17 86.15 ±1.66 2.3
Table 9: Evaluation of the results obtained for experiments E1, E2, E3 and E4. Bold are the best results for each pre-trained
embedding model. For all the metrics used, the lower the value, the better the result.
Text 22810 Add that contact as a favorite notice that the images for contacts
BERT) in the toolbox of every NLP practitioner.” Thus, the objective (driven by remote url) change unexpectedly. Under the covers all
of this study was to analyze if context-less embeddings models that is happening, is that the data of the list view is refreshed.
are effective in effort estimates, as compared to contextualized Text 23227 Add qparam to skip retrieving metadata and graph edges if the
qparam is not there, use current default behavior.
embeddings models in a specific corpus.
As can be seen in Table 9 the results obtained by the contextual- Table 10: Example of requirement texts that have polysemy
ized models (BERT_base and BERT_ES), surpass the results of the - same words and different contexts.
models without context.
When comparing MAE values, the BERT_base model shows a 3%
Text 18 create new titanium studio splash screen there is a placeholder
improvement over word2vec_base. Likewise, for the values of MdAE image...
and MSE there was an improvement of 0.7% and 6.9%, respectively. Text 22810 build the corporate directory app for ios...
When comparing how these metrics between the models with Table 11: Example of requirement texts that have ambiguity
fine tuning (Word2vec_ES and BERT_ES), as improvements of the - different words in the same context.
contextualized model in relation to no context were 2.5%, 4.2% and
8% for the values of MAE, MSE and MdAE, respectively.
It is believed that this result is due to the fact that methods
context-less (ex. Word2Vec) allow to represent a single context for a RQ3. Are pre-trained embeddings models useful to a text-
given word in a set of texts. This aspect causes a lot of information based software effort estimation?
to be lost. In an effort estimate, based on the requirements texts (e.g. To answer this question regarding the perspectives of contex-
user stories), a contextualized strategy certainly produces better tualized pre-trained models applied to ABSEE, it is necessary to
results. Unlike the Word2vec approach, BERT methods offer this observe whether MAE, MSE and MdAE obtained good results. As
dynamic context, allowing the actual contexts of each word to be shown in Table 9, the best MAE value was 4.25, which means that
represented in each text. a software effort of 6 will be predicted between 2 and 10.
This aspect is important, as textual software requirements (i.e. When questioning whether this result is good or bad, can be
user stories, use cases) generally have short texts and little vocab- observed that, considering the small number of samples in the
ulary, which means that many words are common to the field of training set and tests, the high degree of imbalance between classes,
software engineering and are repeated in many texts. This is the and the high variability of the text, this result is quite adequate. It is
importance of identifying different contexts of use for each word, in precisely due to the perception that, at least currently, there is little
order to differentiate them. Contextualized methods like BERT guar- data to estimate the effort involved in software development that
antee this dynamic treatment of each word, addressing problems motivated these researchers to investigate pre-trained embedding
of polysemy and ambiguity in an intrinsic way. models, so as to solve the proposed problem.
This is possible due to the fact that the contextualized models
Method SE 3 M (multi-repositories) Deep-SE (cross-repository)
use a model of deep representation of each word in the text (that is,
MAE 4.25 ± 0.17 3.82 ± 1.56
12 or 24 layers), unlike the models of embeddings context-less that
present a superficial representation (of a single layer) for a word. In Table 12: Mean Absolute Error (MAE) obtained for BERT_SE
addition, contextualized models use an attention model that allows compared to the Deep-SE model [17]
verifying whether the same word occurred previously in the same
context or not (example in Table 10), or whether different words
can present the same context (e.g. create, implement, generate) - When comparing the best MAE results, obtained though the
example in Table 11. BERT_SE model, with the MAE results given by the Deep-SE [17]
model, the latter of which was the study found to be most similar
to the present research, some aspects stand out, as shown below.
2020-07-01 00:53. Page 12 of 1–17.
As can be seen in Table 12, the MAE obtained by BERT_SE was

slightly higher than Deep-SE. However, one should note that, to
obtain this result, [17] inferred the effort estimates between projects
(e.g. Moodle/Titanium, Mesos/Mule). Therefore, there is a large
chance that two projects share a similar context, which would
make predictability easier for projects in different contexts. This
statement is reflected directly in the standard deviation of the MAE
values (e.g. 5.37, 6.36, 5.55, 2.67, 4.24) for the Deep-SE between
projects, which is 1.56. One can see that some values of MAE are
relatively low (e.g. 2.67), while others are higher (ex. 6.36). This
means that there may be a higher variation for the estimates inferred
by Deep-SE, which, in the worst case, can cause a requirement
whose effort is 7 points per story, to return 1 or 13 points per story.
Thus, it is suggested that if the Deep-SE model is applied using a
single repository approach, as well as the SE 3 M, the MAE values
may be even higher.
SE 3 M, on the other hand, uses a single repository approach, that
is, all requirements are independent of project or repository, aiming
to generalize the model. Thus, although the SE 3 M MAE is close
to the Deep-SE value, the standard deviation obtained is smaller
(0.17), or almost nonexistent (ex. 3.87, 4.21, 4.15, 4.12, 3.97, 4.25, Figure 10: Confusion matrix for the E5 experiment using the
4.01), which can be proven by observing the pattern of MAE values 9 classes from Planning Poker.
obtained by the model. In this sense, it can be said that the pro-
posed method is more generic and applicable to different problems,
demonstrating a greater degree of reliability than Deep-SE. This percentages there was confusion among the smaller classes. This
is mainly due to the ability of the BERT Transformer mechanism aspect leads us to suggest that human intervention at the end of
(attention mechanism) to resolve long-term dependency and “van- the process is needed, with the aim of approving/modifying the
ishing gradient” [30] and [9], that presents itself as limitations in estimate generated, according to the user’s working reality. Thus,
recurrent network architectures, as is the case with RHN and LSTM, considering its application for the end user, the proposed method
used by [17]. The “vanishing gradient” is the loss of relevant context would be classified as semi-automated.
information, used to identify the semantics of a given word in a Another aspect to be observed in the results of Figure 10 is an
text. Therefore, the attention mechanism allows for one to ignore indication that the larger the data set representing each of the labels,
irrelevant information and focus on what is relevant, making it the better the results. This study argues that effective techniques for
possible to connect two related words, even if they are not located increasing data in texts can improve this result. Another alternative
one after the other. would be to collect a more significant number of real samples, com-
In order to reinforce the positive trend of applying pre-trained plementing the pre-training and fine-tuning data set corpPret_SE, as
embeddings models in the process of inferring effort estimation, the well as sets for training and testing (corp_SE). Thus, the generated
E5 experiment was performed. In this case, the same set of train- embeddings will be more representative, that is, they can better
ing and testing data was used, applying a classification layer (with represent each situation found in the requirements.
softmax activation) instead of linear regression. This required some RQ4. Are pre-trained embeddings models useful to a text-
modifications to the dataset. First, there was a need to make the based software effort estimation, both on new and existing
dataset more homogeneous and bulky as compared to the existing projects?
labels. Then the closest estimates were grouped, considering the Another aspect that can be observed is the indication of the SE 3 M
series of Fibonacci proposed for the Planning Poker [20] estimation model to estimates new and non-existing requirements. For this, an
method. As a result of this process, the data set only had 9 possi- additional experiment was carried out containing the same config-
ble labels: 1, 2, 3, 5, 8, 13, 20, 40, 100. Therefore, compared to the uration as E4, and only changing the type of data partitioning. In
regression problem, each label had its data volume increased and this experiment cross validation was applied by projects in order to
balanced in relation to the existing classes, mainly for the smaller evaluate the results of inferring the estimates for each project, that
labels (between 1 and 8), as shown in Table 13. is, for completely new projects. Thus, during each of the iterations
The Figure 10 shows the confusion matrix generated for the one of the projects was considered a target for the estimates (test
experiment E5. One can see that the greatest confusion occurs set), while the others were considered a source (training set). Table
between the lowest efforts (1, 2, 3, 5 and 8). Considering the MAE 14 shows the results obtained for the MAE in each project.
value of the regression experiment (4.25), it is possible to understand When considering the cross-project estimation approach, if tar-
this bias in the confusion matrix; after all, as explained previously, get projects are considered (UG, ME, AP, TI, AS, TI, MS, MU), as
an effort of 4 points per story could be 1 or 8. listed in Table 9 of the article by [17], the authors were able to
One can also observe that, similarly, the larger classes (13, 20, 40 an average MAE value of 3.82 for the Deep-SE model, while the
and 100) became more confused with each other and, in very low SE 3 M model presented a MAE of 3.4 (Table 15). It is observed that
2020-07-01 00:53. Page 13 of 1–17.
Number of labels 1 2 3 5 8 13 20 40 100 Total

Number of textual requirements/labels 4225 3406 4809 4725 3588 1238 706 451 165 23.313
Table 13: Number of textual requirements allocated to each Planning Poker class.
ID project Description Num. require- Med Std MAE

ment
project in relation to the other, which leads, for example, to the use
AS Appcelerator Studio 2919 5.63 3.32 2.50 of different terms for the same purpose. Thus, there is a need for a
AP Aptana Studio 829 8.01 5.95 4.18 balanced volume of representative samples from each sub-domain,
BB Bamboo 521 2.41 2.14 2.76
CV Clover 384 4.6 6.54 3.87 so that the model can learn properly.
DM Data Management 4667 9.56 16.6 7.78
DC DuraCloud 666 2.12 2.03 3.79 Source projects Target projects MAE (Deep-SE) MAE (SE 3 M)
JI JIRA Software 352 4.43 3.51 3.13 UG 1.57 3.24
ME Mesos 1680 3.08 2.42 3.39 ME 2.08 3.39
MD Moodle 1166 15.54 21.63 11.99 AP 5.37 4.18
MU Mule 889 5.08 3.49 3.51 15 remaining projects TI 6.36 3.49
MS Mule Studio 732 6.39 5.38 3.51 AS 5.55 2.50
XD Spring XD 3526 3.69 3.22 3.16 TI 2.67 3.49
TD Talend Data Quality 1381 5.92 5.19 4.04 MS 4.24 3.51
TE Talend ESB 868 2.16 1.49 3.42 MU 2.70 3.51
TI Titanium SDK/CLI 2251 6.31 5.09 3.49 Avg 3.82 3.40
UG Usergrid 482 2.85 1.40 3.24
Table 15: Mean Absolute Error (MAE) in estimating effort for
Table 14: MAE values when estimating the effort for each
new projects.
project in relation to the others. The number of require-
ments by project (Num. requirement), the average (med) and
standard deviation (std) of the effort by requirement in each
project are presented. Although the results of this study are not very different from
that obtained in the article by [17], the proposed approach presents
a much simpler and potentially more robust network architecture.
This is possible because part of the textitfeature learning process
this comparison was made only with the results obtained in the previously performed, which extracts the contextualized textual rep-
mode between repositories of [17], and for the results obtained with resentation for new project requirements, is performed only once,
SE 3 M, all 15 projects were considered as source projects remaining, during the process of generating the pre-trained model (BERT_SE).
while the target projects are the same chosen by the authors. This model is then employed as a parameter in the embedding layer
It is still observed that the authors of Deep-SE used a source of the sequential architecture used. This aspect is important, consid-
project for training and another target project for tests, that is, ering that there is the need to feed the training database with new
the diversity of characteristics that the learning method needs to cases of requirements, as well as with the cases’ respective efforts,
deal with, is limited only by the domain of a single project. In which makes it possible to increase precision in effort estimates for
other words, the model needs to deal with less variability of data, new projects.
which supposedly can facilitate learning, since some important RQ5. Are the results found generalizable, aiming to gen-
relationships between these data can be discovered more easily by erate estimates of effort between companies?
the learning method. However, the proposal presented here uses all In our approach, the results show that even a new project can
projects (except one) for training, that is, the variety of relationships have its effort predicted without any pre-existing data (as shown
that the model must deal with is much greater, when compared to in RQ4). Although the MAE still does not deliver a perfect result,
the Deep-SE approach. From a certain point of view this is good, as a good estimate of the effort can be achieved and used, in a semi-
the learning method should learn a better generalization, for any automated way, by companies, as explained in the RQ3 response.
type of problem presented. On the other hand, however, learning is Additionally, when considering the application of the model for
hampered due to curse of dimensionality, in which many important multiple companies, it is necessary to consider the possibility that
relationships can be more difficult to be inferred automatically, existing requirements contain different metrics (e.g. function points,
mainly due to the very small corpus. The explanation for this is that story points), since the data-set would be fed by requirements
in a smaller data set, basic relationships are easily learned, whereas from different companies. To address this question, the means of
more specific relationships are often not sufficiently representative. converting these metrics into a standard form, which, if obtained
In the case of software projects, each project used as training and from the estimator, could be converted into the format to be used
testing data is in fact an another domain (when compared to another by the user, is suggested. This conversion of software effort metrics
project), which can be composed of several sub-domains, such as: is proposed by [59].
application areas of the project, programming languages (e.g. Java, Thus, generalizing the proposed method so that is can be used
Python, C), databases (e.g. SQL Server, Oracle, MySQL, MongoDB), by several companies in a web application, for example, would
application modalities (e.g. web, mobile, desktop), development be perfectly possible. This claim is supported by the fact that the
teams with different characteristics (e.g. beginner, full), among method is based on a single repository (i. e. grouping multiple
others. Therefore, each sub-domain can be composed of different repositories), in which different projects’ requirements are met,
relationships, since different aspects can change the context of one regardless of the format for registering the requirements texts.
2020-07-01 00:53. Page 14 of 1–17.
Furthermore, considering its practical application, the proposed 23.313 requirements from sixteen open source projects, which differ
model can be adjusted by the user. Therefore, real estimates gen- significantly in size, complexity, developer team and community
erated and approved by specialists can be fed back, making the [17]. It is observed that open source projects do not present the same
system more and more adjusted to a specific company, for example. aspects as commercial projects in general, especially in relation
to the human resources involved, which requires more research.
6 THREATS TO VALIDITY It would be prudent to test the SE 3 M model with a database of
Thus as several research in the area of software engineering that commercial projects only (with data available containing the same
apply machine learning techniques for the use with the texts, this attributes used in this research) and then with all types of projects
paper propose a kind of software estimate or effort by analogy from (commercial and open source) in order to check for significant
requirements texts, difficult approaches regarding the availability of differences.
real data. These data should reflect the reality of software projects
in different areas, levels of complexity, uses of technologies, among 7 CONCLUSIONS AND FUTURE WORK
other attributes. Typically, these difficulties are related to the vol-
ume of data, the language in which the data is given, the quality of The main objective of this research is to evaluate if pre-trained
the data (for example, text formats, completed, nonexistent labels, embeddings models are promising for the inference of text-based
among others). This way, after accomplished a search for textual software effort estimation, evaluating two approaches to embed-
requirements databases, we decided to use database used by [17], dings: context-less and contextualized. The study obtained positive
which provided the first database at the level of requirements for results for the pre-trained models for the ABSEE task, particularly
the realization of area research. Since the model proposed in this the contextualized models, such as BERT. As predicted in the litera-
article (SE 3 M) aims to compare the results obtained for estimating ture, the contextualized methods demonstrate the best performance
software effort with the Deep-SE model, proposed by [17], it was numbers. In addition, we show that fine-tuning the embedding
defined by using same database. layers can help improve the results. All of these results can be
Therefore, actions to containing or reducing threats to validity improved, especially if trained with more data and/or using some
carried out by [17], were adopted and maintained for this work. effective data augmentation.
Are they: The researchers observed that the database was a limitation
of this research, particularly because it is not very bulky, which
• It is used the Actual project requirements data, which were prevents the results from being even better, especially when using
obtained from large and different open source projects. cutting-edge PLN techniques (e.g. pre-trained models, fine-tuning
• The story of points per user, that accompany each textual and deep learning). When observing the volume of data used in
requirement, were first estimated by human teams, and there- fine-tuning tasks for specific domains, such as [7] and [43], it is
fore, may not be accurate in some situations. [17] performed clear that even though the domain’s base is considered to be light,
two tests to mitigate this threat: one with the original story they contain billions of words. On the other hand, the domain-
points, the other with normalized and adjusted story points. specific database applied in the fine-tuning of the BERT_SE model,
With that it was verified that the proportions of points for has around 800 thousand words. Thus, it was observed that the
history attributed to each requirement were adequate. results obtained with the use of BERT could be improved if there
• It was observed that project managers and analysts deter- was a more voluminous and diverse set of data, in which it was
mine the estimate for a new requirement, based on their possible to better adjust the model for different problems, or even
comparison with requirements already implemented in the train a model of its own (from scratch), which would present an
past, and thus carry out the estimate consistently. In this even greater level of adjustment according to the domain area.
way, the problem is indicated for a machine apprentice, since Thus, this study argues that the SE 3 EM model analyzed in this
the training and testing database presents the description of article can best adapt to different project contexts (for example,
the requirements, accompanied by their respective efforts in agile development). Estimation of story points or another similar
points by history. Thus, new requirements can be estimated, estimation metric (e.g. use case points or function points) have a
as long as they have these two attributes. fine granularity (i.e. they are assigned to each user requirement).
In order to perform the experiments, appropriate metrics were But this same inference method can be applied to estimate a coarser
applied to evaluate regression models, which are commonly used granularity element. An example would be a sprint in agile mod-
to evaluate models of software effort estimates [17] which attest els [54], which is estimated by the sum of the smaller tasks that
to the validity of the model. In addition, different forms of data compose it.
partitioning (cross-validation) were used, in order to validate the Compared with the results obtained by Choetkiertikul et al. [17],
results obtained. the most similar work considering the objectives, note that SE 3 M
In order to compare the results obtained in this work with those used a pre-trained contextualized incorporation layer, which went
obtained by [17], considering that our implementation may not through a fine tuning process, without the need to add any noise to
present all the details that Deep-SE presents, our model was tested the texts. In addition, the proposed architecture is more simplified,
using the same data set as the authors. Thus, it was possible to state with just one recurring layer. As a result of the feature learning
that our results are consistent. process, which applies a contextualized incorporation approach, we
In order to validate the possibility of generalization, it should have a more generalized and multi-project method, which can be
be noted that the training and testing data set is composed of applied even in new projects. Thus, the method has more flexibility
2020-07-01 00:53. Page 15 of 1–17.
regarding the format of the input and multi-project texts, allowing [9] Bengio, Y., Simard, P., and Frasconi, P. Learning long-term dependencies with
for interference to be used in any new requirement, even during gradient descent is difficult. IEEE transactions on neural networks 5, 2 (1994),
157–166.
the initial stage of development. [10] Boehm, B., Abts, C., and Chulani, S. Software development cost estimation
Using the pre-trained BERT model (even the generic one), there approaches—a survey. Annals of software engineering 10, 1-4 (2000), 177–205.
[11] Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. Enriching word vectors
is no need for prior training of a specific corpus, or one that requires with subword information. Transactions of the Association for Computational
a large volume of data. This has the advantage of involving no pre- Linguistics 5 (2017), 135–146.
training cost (which takes days [31]). Rather, there is only the need [12] Bordes, A., Glorot, X., Weston, J., and Bengio, Y. Joint learning of words and
meaning representations for open-text semantic parsing. In Artificial Intelligence
for fine-tuning, which takes a few hours ([23]). That is, there is no and Statistics (2012), pp. 127–135.
need for training from scratch, as performed by [17]. [13] Calefato, F., Lanubile, F., Maiorano, F., and Novielli, N. [journal first]
Thus, we provide the pre-trained BERT_SE model that can be sentiment polarity detection for software development. In 2018 IEEE/ACM 40th
International Conference on Software Engineering (ICSE) (2018), IEEE, pp. 128–128.
used in various software engineering tasks, in addition to allowing [14] Chen, C., Gao, S., and Xing, Z. Mining analogical libraries in q&a discussions–
for further adjustments, if necessary. In addition, the SE 3 M model incorporating relational and categorical knowledge into word embedding. In
2016 IEEE 23rd international conference on software analysis, evolution, and reengi-
can be applied in a generic way and, in addition to being reliable, it neering (SANER) (2016), vol. 1, IEEE, pp. 338–348.
is a cheap and provides for a computationally fast solution, due to [15] Chen, X., Liu, Z., and Sun, M. A unified model for word sense representation
the fine tuning process. and disambiguation. In Proceedings of the 2014 Conference on Empirical Methods
in Natural Language Processing (EMNLP) (2014), pp. 1025–1035.
The results demonstrated that this is a promising research field [16] Chiu, N.-H., and Huang, S.-J. The adjusted analogy-based software effort
with many available resources and room for innovation. Therefore, estimation based on similarity distances. Journal of Systems and Software 80, 4
several research possibilities are presented as future work: (2007), 628–640.
[17] Choetkiertikul, M., Dam, H. K., Tran, T., Pham, T. T. M., Ghose, A., and
• Collect more textual requirement data to balance the dataset Menzies, T. A deep learning model for estimating story points. IEEE Transactions
on Software Engineering (2018).
against existing labels, and thereby increase the number of [18] Choi, Soonhwang, P. S. P. S. V. A rule-based approach for estimating software
contexts; or apply data augmentation techniques to improve development cost using function point and goal and scenario based requirements.
results. Expert Systems with Applications 39, 1 (2012), 406–418.
[19] Clark, K., Luong, M.-T., Manning, C. D., and Le, Q. V. Semi-supervised sequence
• update BERT_base vocabulary, including specific vocabulary, modeling with cross-view training. arXiv preprint arXiv:1809.08370 (2018).
and then fine-tune. [20] Cohn, M. Agile estimating and planning. Pearson Education, 2005.
[21] Dai, A. M., and Le, Q. V. Semi-supervised sequence learning. In Advances in
• perform fine tuning, such as with model BERT_large, and neural information processing systems (2015), pp. 3079–3087.
compare the results. [22] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bidirectional encoder
• study and apply effective data augmentation techniques in representations from transformers, 2018.
[23] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of
order to balance the number of samples in each existing deep bidirectional transformers for language understanding. North American
effort class, and thus obtain possible improvements in the Association for Computational Linguistics (NAACL) (2019).
results. [24] Efstathiou, V., Chatzilenas, C., and Spinellis, D. Word embeddings for the
software engineering domain. In Proceedings of the 15th International Conference
• study and apply different combinations of the layers in the on Mining Software Repositories (2018), pp. 38–41.
BERT model, in order to evaluate the performance of the [25] Fellbaum, C. Wordnet. The Encyclopedia of Applied Linguistics (2012).
[26] Ferrari, A., and Esuli, A. An nlp approach for cross-domain ambiguity detection
model regarding fine-tuning and its effects on EESA. in requirements engineering. Automated Software Engineering 26, 3 (2019), 559–
• study and apply pre-trained "light" (e.g. ALBERT [42]) to the 598.
SE 3 M model, and evaluate its performance aiming to achieve [27] Graves, A. Supervised sequence labelling. In Supervised sequence labelling with
recurrent neural networks. Springer, 2012, pp. 5–13.
EESA. [28] Hamdy, A. Genetic fuzzy system for enhancing software estimation models.
International Journal of Modeling and Optimization 4, 3 (2014), 227–232.
[29] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recog-
REFERENCES nition. In Proceedings of the IEEE conference on computer vision and pattern
[1] Abrahamsson, P., Fronza, I., Moser, R., Vlasenko, J., and Pedrycz, W. Pre- recognition (2016), pp. 770–778.
dicting development effort from user stories. In 2011 International Symposium on [30] Hochreiter, S. Untersuchungen zu dynamischen neuronalen netzen. Diploma,
Empirical Software Engineering and Measurement (2011), IEEE, pp. 400–403. Technische Universität München 91, 1 (1991).
[2] Alammar, J. The illustrated bert, elmo, and co.(how nlp cracked transfer learning). [31] Howard, J., and Ruder, S. Universal language model fine-tuning for text classi-
December 3 (2018), 2018. fication. arXiv preprint arXiv:1801.06146 (2018).
[3] Amazal, F. A., Idri, A., and Abran, A. Improving fuzzy analogy based software [32] Huang, E. H., Socher, R., Manning, C. D., and Ng, A. Y. Improving word
development effort estimation. In Software Engineering Conference (APSEC), 2014 representations via global context and multiple word prototypes. In Proceedings
21st Asia-Pacific (2014), vol. 1, IEEE, pp. 247–254. of the 50th Annual Meeting of the Association for Computational Linguistics: Long
[4] Ayyildiz, T. E., and Koçyigit, A. A case study on the utilization of problem and Papers-Volume 1 (2012), Association for Computational Linguistics, pp. 873–882.
solution domain measures for software size estimation. In 2016 42th Euromicro [33] Hussain, I., Kosseim, L., and Ormandjieva, O. Approximation of cosmic func-
Conference on Software Engineering and Advanced Applications (SEAA) (2016), tional size to support early effort estimation in agile. Data & Knowledge Engi-
IEEE, pp. 108–111. neering 85 (2013), 2–14.
[5] Azzeh, M. Adjusted case-based software effort estimation using bees optimization [34] Idri, A., Abnane, I., and Abran, A. Missing data techniques in analogy-based
algorithm. Knowlege-Based and Intelligent Information and Engineering Systems software development effort estimation. Journal of Systems and Software 117
(2011), 315–324. (2016), 595–611.
[6] Bardsiri, V. K., Jawawi, D. N. A., Bardsiri, A. K., and Khatibi, E. Lmes: A local- [35] Idri, A., azzahra Amazal, F., and Abran, A. Analogy-based software devel-
ized multi-estimator model to estimate software development effort. Engineering opment effort estimation: A systematic mapping and review. Information and
Applications of Artificial Intelligence 26, 10 (2013), 2624–2640. Software Technology 58 (2015), 206–230.
[7] Beltagy, I., Lo, K., and Cohan, A. Scibert: A pretrained language model for [36] Idri, A., Hosni, M., and Abran, A. Systematic literature review of ensemble
scientific text. In Proceedings of the 2019 Conference on Empirical Methods in effort estimation. Journal of Systems and Software 118 (2016), 151–175.
Natural Language Processing and the 9th International Joint Conference on Natural [37] Ionescu, V.-S. An approach to software development effort estimation using
Language Processing (EMNLP-IJCNLP) (2019), pp. 3606–3611. machine learning. In Intelligent Computer Communication and Processing (ICCP),
[8] Bengio, Y., Courville, A. C., and Vincent, P. Unsupervised feature learning 2017 13th IEEE International Conference on (2017), IEEE, pp. 197–203.
and deep learning: A review and new perspectives. CoRR, abs/1206.5538 1 (2012), [38] Kågebäck, M., Johansson, F., Johansson, R., and Dubhashi, D. Neural context
2012. embeddings for automatic discovery of word senses. In Proceedings of the 1st
2020-07-01 00:53. Page 16 of 1–17.
Workshop on Vector Space Modeling for Natural Language Processing (2015), pp. 25– [64] Scott, T. C., and Marketos, P. On the origin of the fibonacci sequence. MacTutor
32. History of Mathematics (2014).
[39] Kazemifard, M., Z. A. N. M. A. M. F. Fuzzy emotional cocomo ii software [65] Shepperd, M., Schofield, C., and Kitchenham, B. Effort estimation using anal-
cost estimation (fecsce) using multi-agent systems. Applied Software Computing ogy. In Proceedings of IEEE 18th International Conference on Software Engineering
Journal 11, 12 (2011), 2260–2270. (1996), IEEE, pp. 170–178.
[40] Kingma, D. P., and Ba, J. Adam: A method for stochastic optimization. arXiv [66] Shivhare, J., and Rath, S. K. Software effort estimation using machine learning
preprint arXiv:1412.6980 (2014). techniques. In Proceedings of the 7th India Software Engineering Conference (2014),
[41] Kocaguneli, E., Gay, G., Menzies, T., Yang, Y., and Keung, J. W. When to use ACM, p. 19.
data from other projects for effort estimation. In Proceedings of the IEEE/ACM [67] Simonyan, K., and Zisserman, A. Very deep convolutional networks for large-
international conference on Automated software engineering (2010), ACM, pp. 321– scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
324. [68] Tian, Y., Lo, D., and Lawall, J. Sewordsim: Software-specific word similarity
[42] Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., and Soricut, R. Albert: database. In Companion Proceedings of the 36th International Conference on
A lite bert for self-supervised learning of language representations. arXiv preprint Software Engineering (2014), ACM, pp. 568–571.
arXiv:1909.11942 (2019). [69] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N.,
[43] Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., and Kang, J. Biobert: a Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in neural
pre-trained biomedical language representation model for biomedical text mining. information processing systems (2017), pp. 5998–6008.
Bioinformatics 36, 4 (2020), 1234–1240. [70] Wieting, J., and Gimpel, K. Revisiting recurrent networks for paraphrastic
[44] Lev, G., Klein, B., and Wolf, L. In defense of word embedding for generic text sentence embeddings. arXiv preprint arXiv:1705.00364 (2017).
representation. In International Conference on Applications of Natural Language [71] Yang, X., Lo, D., Xia, X., Bao, L., and Sun, J. Combining word embedding
to Information Systems (2015), Springer, pp. 35–50. with information retrieval to recommend similar bug reports. In 2016 IEEE 27th
[45] Liu, Q., Jiang, H., Wei, S., Ling, Z.-H., and Hu, Y. Learning semantic word International Symposium on Software Reliability Engineering (ISSRE) (2016), IEEE,
embeddings based on ordinal knowledge constraints. In Proceedings of the 53rd pp. 127–137.
Annual Meeting of the Association for Computational Linguistics and the 7th [72] Zhang, B. Google’s neural machine translation system: Bridging the gap between
International Joint Conference on Natural Language Processing (Volume 1: Long human and machine translation. arXiv preprint arXiv:1609.08144 (2016).
Papers) (2015), vol. 1, pp. 1501–1511. [73] Zhang, C., Tong, S., Mo, W., Zhou, Y., Xia, Y., and Shen, B. Esse: an early
[46] Maaten, L. v. d., and Hinton, G. Visualizing data using t-sne. Journal of software size estimation method based on auto-extracted requirements features.
machine learning research 9, Nov (2008), 2579–2605. In Proceedings of the 8th Asia-Pacific Symposium on Internetware (2016), ACM,
[47] Manikavelan, D., and Ponnusamy, R. Software cost estimation by analogy pp. 112–115.
using feed forward neural network. In Information Communication and Embedded [74] Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A.,
Systems (ICICES), 2014 International Conference on (2014), IEEE, pp. 1–5. and Fidler, S. Aligning books and movies: Towards story-like visual explanations
[48] McCann, B., Bradbury, J., Xiong, C., and Socher, R. Learned in translation: by watching movies and reading books. In Proceedings of the IEEE international
Contextualized word vectors. In Advances in Neural Information Processing conference on computer vision (2015), pp. 19–27.
Systems (2017), pp. 6294–6305. [75] Zwillinger, D., and Kokoska, S. CRC standard probability and statistics tables
[49] Melamud, O., Goldberger, J., and Dagan, I. context2vec: Learning generic and formulae. Crc Press, 1999.
context embedding with bidirectional lstm. In Proceedings of the 20th SIGNLL
conference on computational natural language learning (2016), pp. 51–61.
[50] Melamud, O., McClosky, D., Patwardhan, S., and Bansal, M. The role of
context types and dimensionality in learning word embeddings. arXiv preprint
arXiv:1601.00893 (2016).
[51] Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word
representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
[52] Moharreri, K., Sapre, A. V., Ramanathan, J., and Ramnath, R. Cost-effective
supervised learning models for software effort estimation in agile environments.
In 2016 IEEE 40th Annual Computer Software and Applications Conference (COMP-
SAC) (2016), vol. 2, IEEE, pp. 135–140.
[53] Ochodek, M. Functional size approximation based on use-case names. Informa-
tion and Software Technology 80 (2016), 73–88.
[54] Paasivaara, M., Durasiewicz, S., and Lassenius, C. Using scrum in distributed
agile development: A multiple case study. In 2009 Fourth IEEE International
Conference on Global Software Engineering (2009), IEEE, pp. 195–204.
[55] Palangi, H., Deng, L., Shen, Y., Gao, J., He, X., Chen, J., Song, X., and Ward, R.
Deep sentence embedding using long short-term memory networks: Analysis
and application to information retrieval. IEEE/ACM Transactions on Audio, Speech
and Language Processing (TASLP) 24, 4 (2016), 694–707.
[56] Pennington, J., Socher, R., and Manning, C. Glove: Global vectors for word
representation. In Proceedings of the 2014 conference on empirical methods in
natural language processing (EMNLP) (2014), pp. 1532–1543.
[57] Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and
Zettlemoyer, L. Deep contextualized word representations. arXiv preprint
arXiv:1802.05365 (2018).
[58] Porru, S., Murgia, A., Demeyer, S., Marchesi, M., and Tonelli, R. Estimating
story points from issue reports. In Proceedings of the The 12th International
Conference on Predictive Models and Data Analytics in Software Engineering (2016),
pp. 1–10.
[59] Pressman, R., and Maxim, B. Engenharia de Software-8ª Edição. McGraw Hill
Brasil, 2016.
[60] Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving
language understanding by generative pre-training. URL https://s3-us-west-2.
amazonaws. com/openai-assets/researchcovers/languageunsupervised/language un-
derstanding paper. pdf (2018).
[61] Raschka, S. Python machine learning. Packt Publishing Ltd, 2015.
[62] Sak, H., Senior, A., and Beaufays, F. Long short-term memory recurrent neural
network architectures for large scale acoustic modeling. In Fifteenth annual
conference of the international speech communication association (2014), pp. 338–
342.
[63] Sarro, F., Petrozziello, A., and Harman, M. Multi-objective software effort es-
timation. In 2016 IEEE/ACM 38th International Conference on Software Engineering
(ICSE) (2016), IEEE, pp. 619–630.
2020-07-01 00:53. Page 17 of 1–17.

SE3M

Uploaded by

Copyright:

Available Formats

SE3M

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

SE3M

Uploaded by

Copyright:

Available Formats

SE3M: A Model for Software Effort Estimation Using Pre-trained

Estimating effort based on requirement texts presents many chal-

that sequence, generating a representation for the sentence

4.1 Evaluation Metrics

Corpus Specification Aplication Labeled

Figure 4 shows the frequency of distribution of the effort size

• A vocabulary file (vocab.txt) for mapping WordPiece [72] for

4.3 Textual Representation Model

Entry of two sentences Entry of a sentence

4.6 Exploratory data analysis

Figure 8: Architecture deep learning with the pre-trained em-

Figure 7: Embeddings generated by BERT_SE. The points rep-

Experiments Pre-trained model Sequential network architecture

Approach Pre-trained model MAE MSE MdAE

As can be seen in Table 12, the MAE obtained by BERT_SE was

Number of labels 1 2 3 5 8 13 20 40 100 Total

ID project Description Num. require- Med Std MAE

You might also like