Standardization of Engineering Requirements Using Large Language Models
A Dissertation
Presented to
The Academic Faculty
By
In Partial Fulfillment
of the Requirements for the Degree
Doctor of Philosophy in the
Daniel Guggenheim School of Aerospace Engineering
Aerospace Systems Design Laboratory
May 2023
Thesis committee:
This dissertation represents the culmination of a journey that has been both in-
tellectually challenging and emotionally rewarding. I have learned to embrace failure
as an integral part of the research process, to be persistent in the face of obstacles,
and to celebrate small wins. The road to this point has been long, winding, and at
times, daunting. But thanks to the support of so many wonderful people, I stand
here today, having completed one of the most significant achievements of my life.
First and foremost, I would like to express my deepest gratitude to my advisor,
Prof. Dimitri Mavris, whose support, mentorship, and guidance have been instru-
mental in shaping the person I have become. His encouragement to explore multiple
research directions and collaborate with industry sponsors has been pivotal in
shaping my research and in helping me see the common ground between industry and
academia, thus breaking down the silos between the two. I could not have asked for
a better advisor.
I would like to express my appreciation to my committee members, Prof. Dimitri
Mavris, Prof. Daniel Schrage, Dr. Ryan White, Dr. Bjorn Cole, and Dr. Olivia Pinon
Fischer, for their invaluable feedback, guidance, and support throughout this journey.
Their expertise, insights, and constructive criticism have been instrumental in shaping
my research and helping me grow as a researcher. I am humbled by their dedication
and commitment to my professional development.
I would like to extend a heartfelt shoutout to Dr. Olivia Pinon Fischer, who
has been an invaluable mentor and friend throughout this journey. Her kindness,
generosity, and unwavering encouragement have been a constant source of inspiration
and motivation for me. I feel truly fortunate to have had the opportunity to work
with such a talented, supportive, and caring individual.
I express my sincere appreciation to Dr. Bjorn Cole for his valuable contribution to
bringing the industry perspective to my dissertation. I am grateful for his willingness
to engage in thoughtful discussions and provide insightful opinions whenever needed.
Additionally, I am thankful for his exceptional co-authorship on the joint papers we
worked on together.
I would like to express my sincere gratitude to Dr. Ryan White, who has been an
invaluable guide and friend throughout my undergraduate and graduate studies. I
am grateful for his willingness to devote his time and energy to my work, and for the
many valuable conversations that we have shared. Thank you, Ryan, for all that you
do for me and for the rest of your students.
I would like to extend a special thank you to Adrienne Durham, Tanya Ard-Smith,
and Brittany Hodges who have made my life as an international student a little
easier. Their invaluable assistance with paperwork, and other administrative tasks
was instrumental in ensuring a smooth journey throughout my doctoral program.
To my friends, Rubanya Nanda, Lijing Zhai (and Sunny), Patsy Jammal, Stella
Kampezidou, Nathaniel Omoarebun, Rosa Galindo, and Shubhneet Singh who have
been my pillars of strength, my sounding board, and my cheerleaders throughout this
journey – thank you. Your support and encouragement have helped me through the
most challenging times, and I am grateful for the laughs, tears, and memories we’ve
shared along the way.
I am also immensely grateful to my family, who have been my source of support,
love, and encouragement. Their constant belief in me, even when I doubted myself,
has meant the world to me. They have shared the ups and downs of this journey, and
I know that I could not have made it without them.
Last but certainly not least, I want to express my heartfelt gratitude to my hus-
band Anirudh Bhat, who has been my ardent supporter and partner throughout this
journey. His countless sacrifices, whether it was staying up late into the night dis-
cussing ideas or helping me debug my code, have been instrumental in making this
achievement possible. I could not have done this without him, and I am forever
grateful for his presence in my life. I eagerly look forward to embarking on the next
chapter of our lives together.
TABLE OF CONTENTS
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxvi
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.9 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Chapter 4: Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3 Step 2: Fine-tuning BERT for aerospace requirement classification
(aeroBERT-Classifier) . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.4 Step 3: Fine-tuning BERT for POS tagging of aerospace text (aeroBERT-
POStagger) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Chapter 5: Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.2.1 Preparing the dataset for fine-tuning BERT for aerospace NER 88
Chapter 6: Results and Discussion . . . . . . . . . . . . . . . . . . . . . 120
8.3.2 aeroBERT-Classifier . . . . . . . . . . . . . . . . . . . . . . . 164
Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
LIST OF TABLES
5.5 NER tags and their counts in the aerospace corpus for aeroBERT-NER 84
5.11 Tokens, token ids, tags, tag ids, and attention masks . . . . . . . . . 92
5.12 Breakdown of the “types” of requirements in the training and test set 94
5.13 Tokens, token IDs, and attention masks for requirements classification 96
6.7 List of requirements (from test set) that were misclassified (0: Design;
1: Functional; 2: Performance) . . . . . . . . . . . . . . . . . . . . . . 131
6.8 Requirement table populated by using various LMs . . . . . . . . . . 135
LIST OF FIGURES
2.11 Mazo and Jaramillo template . . . . . . . . . . . . . . . . . . . . . . 37
3.11 Using transfer learning to fine-tune LMs for tasks in the aerospace
domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3 Methodology for obtaining aeroBERT-Classifier . . . . . . . . . . . . 77
5.5 Choosing maximum length of the input sequence for training aeroBERT-
NER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.7 Figure showing the distribution of sequence lengths in the training set
used for training aeroBERT-Classifier . . . . . . . . . . . . . . . . . . 95
5.11 Sankey diagram showing the POS tag patterns in design requirements. 102
5.12 Sankey diagram showing the POS tag patterns in functional requirements . . 103
5.13 Sankey diagram showing the POS tag patterns in performance require-
ments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.15 Sankey diagram showing the text chunk patterns in design requirements . . 106
5.17 Text chunk analysis for design requirements (Part 2) . . . . . . . . . 107
5.18 Sankey diagram showing the text chunk patterns in functional require-
ments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.20 Sankey diagram showing the text chunk patterns in performance re-
quirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.3 Confusion matrix showing the breakdown of the true and predicted
labels by the aeroBERT-Classifier on the test data . . . . . . . . . . . 131
6.4 Confusion matrix showing the breakdown of the true and predicted
labels by the bart-large-mnli model on the test data . . . . . . . . . . 133
6.10 Functional Requirements: Boilerplate 4 . . . . . . . . . . . . . . . . . 145
C.3 Word sense disambiguation - A computer mouse vs. a mouse (rodent) 172
C.4 Cased and uncased text for BERT language model . . . . . . . . . . . 172
C.12 BERT Embeddings deep-dive . . . . . . . . . . . . . . . . . . . . . . 174
C.15 Example showing Query (Q) and Key (K) vectors and attention score
calculation - Part 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
C.16 Example showing Query (Q) and Key (K) vectors and attention score
calculation - Part 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
LIST OF ACRONYMS
HMM Hidden Markov Models
INCOSE International Council on Systems Engineering
I-NRDM Information-Based Needs and Requirements Definition and Manage-
ment
IPPD Integrated Product & Process Design
K Key
LLM Large Language Model
LM Language Model
LOC Location
LSRE Large Scale Requirements Engineering
LSTM Long Short-Term Memory
MBSE Model-Based Systems Engineering
MISC Miscellaneous
ML Machine Learning
MLM Masked Language Modeling
MSRE Medium-Scale Requirements Engineering
NASA National Aeronautics and Space Administration
NE Named Entities
NER Named-Entity Recognition
NFR Non-Functional Requirements
NL Natural Language
NLG Natural Language Generation
NLI Natural Language Inference
NLP Natural Language Processing
NLP4RE Natural Language Processing for Requirements Engineering
NLTK Natural Language Toolkit
NLU Natural Language Understanding
NN Neural Network
NN Noun
NoRBERT Non-functional and functional Requirements classification using
BERT
NP Noun Phrases
NSP Next Sentence Prediction
OEM Original Equipment Manufacturer
ORG Organizations
PER Person
POS Parts-Of-Speech
PP Prepositional Phrase
Q Query
QA Question Answering
QFD Quality Function Deployment
RES Resource
RNN Recurrent Neural Network
RQ Research Question
RS Requirement Specification
SADT Structured Analysis Design Technique
SBAR Subordinate Clause
SDM System Design Methodology
SDMC Sequential Data Mining under Constraints
SL Supervised Learning
SME Subject Matter Expert
SOTA State-Of-The-Art
SRS Software Requirement Specification
SSRE Small-Scale Requirements Engineering
SVM Support Vector Machines
SYS System
SysML Systems Modeling Language
T5 Text-to-Text Transfer Transformer
TN True Negative
TP True Positive
UML Unified Modeling Language
V Value
VAL Value
VLSRE Very Large-Scale Requirements Engineering
VP Verb Phrases
ZSL Zero-Shot Learning
SUMMARY
Requirements serve as the foundation for all systems, products, services, and
enterprises. A well-formulated requirement conveys information that must be necessary, clear, traceable, verifiable, and complete for the respective stakeholders. Various types of requirements, such as functional, non-functional, design, quality, performance, and certification requirements, are used to define system functions and objectives based on the domain of interest and the system being designed.
Organizations predominantly use natural language (NL) for requirements elici-
tation since it is easy to understand and use by stakeholders with varying levels of
experience. In addition, NL lowers the barrier to entry when compared to model-
based languages such as Unified Modeling Language (UML) and Systems Modeling
Language (SysML), which require training. Despite these advantages, NL require-
ments bring along many drawbacks such as ambiguities associated with language,
a tedious and error-prone manual examination process, difficulties associated with
verifying requirements completeness, and failure to recognize and use technical terms
effectively. While the drawbacks associated with using NL for requirements engineer-
ing are not limited to a single domain or industry, the focus of this dissertation will
be on aerospace requirements.
Most of the systems in the present-day world are complex and warrant an inte-
grated and holistic approach to their development to capture the numerous interre-
lationships. To address this need, there has been a paradigm shift towards a model-
centric approach to engineering as compared to traditional document-based methods.
The promise shown by the model-centric approach is huge, however, the conversion
of NL requirements into models is hindered by the ambiguities and inconsistencies in
NL requirements. This necessitates the use of standardized/semi-machine-readable
requirements for transitioning to Model-Based Systems Engineering (MBSE).
As such, the objective of this dissertation is to identify, develop, and implement
tools and techniques to enable/support the automated translation of NL requirements
into semi-machine-readable requirements. This will contribute to the mainstream
adoption of MBSE.
Given the close relationship between NL and requirements, researchers have been
striving to develop Natural Language Processing (NLP) tools and methodologies for
processing and managing requirements since the 1970s. Despite the interest in using
NLP for requirements engineering, the inadequate developments in language process-
ing technologies thwarted progress. However, the recent developments in this field
have propelled NLP for Requirements Engineering (NLP4RE) into an active area of
research. Hence, NLP techniques are strong candidates for the standardization of NL
requirements and are the focus of this dissertation.
One of the central ideas in NLP is the neural language model (LM), which leverages neural networks to simultaneously learn lower-dimensional word embeddings and estimate the conditional probability of the next word using gradient-based supervised learning. This opened the door to ever-more-complex and effective language models performing an expanding array of NLP tasks, progressing from distinct word embeddings to recurrent neural networks (RNNs) and LSTM encoder-decoders to attention mechanisms. These models did not stray far from the N-gram statistical language modeling paradigm, with advances such as beam search and sequence-to-sequence learning that allowed text generation beyond a single next word. These ideas can be applied to distinct NLP tasks. In 2017, the
Transformer architecture was introduced which improved computational paralleliza-
tion capabilities over recurrent models and therefore enabled the successful optimiza-
tion of larger models. Transformers consist of stacks of encoders (encoder block) and
stacks of decoders (decoder block), where the encoder block receives the input from
the user and outputs a matrix representation of the input text. The decoder takes the
input representation produced by the encoder stack and generates outputs iteratively.
BERT, a transformer-based model, was selected for this research because 1) it can be fine-tuned for a variety of language tasks such as named-entity recognition (NER), parts-of-speech (POS) tagging, and sentence classification, and 2) it can achieve state-of-the-art (SOTA) results. In addition, its bidirectional transformer-based architecture enables it to better capture the context in a sentence. BERT is pre-trained on BookCorpus and English Wikipedia (general-domain text) and, as a result, needs to be fine-tuned using an aerospace corpus to generalize to the aerospace domain.
To fine-tune BERT for different NLP tasks, two annotated aerospace corpora were
created. These corpora contain text from Parts 23 and 25 of Title 14 of the Code
of Federal Regulations (CFRs) and publications by the National Academy of Space
Studies Board. Both corpora were open-sourced to make them available to other
researchers to accelerate research in the field of Natural Language Processing for
Requirement Engineering (NLP4RE).
First, the corpus annotated for aerospace-specific named entities (NEs), was used
to fine-tune different variants of the BERT LM for the identification of five categories
of named entities, namely, system names (SYS), resources (RES), values (VAL), or-
ganization names (ORG), and datetime (DATETIME). The extracted named entities
were used to create a glossary, which is expected to improve the quality and un-
derstandability of aerospace requirements by ensuring uniform use of terminologies.
Second, the corpus annotated for aerospace requirements classification was used to
fine-tune BERT LM to classify requirements into different types such as design re-
quirements, functional requirements, and performance requirements. Being able to
classify requirements will improve the ability to conduct redundancy checks, evaluate
consistency, and identify boilerplates, which are pre-defined linguistic patterns for
standardizing requirements. Third, an off-the-shelf model (flair/chunk-english)
was used for identifying the different sentence chunks in a requirement sentence, which
is helpful for ordering phrases in a sentence and hence useful for the standardization
of requirements.
The capability to classify requirements, identify named entities occurring in requirements, and extract different sentence chunks in aerospace requirements facilitated the creation of a requirements table and boilerplates for the conversion of NL requirements into semi-machine-readable requirements. Based on the frequency of different linguistic patterns, boilerplates were constructed for various types of requirements.
In summary, this effort resulted in the development of the first open-source annotated aerospace corpora along with two LMs (aeroBERT-NER and aeroBERT-Classifier). Various methodologies were developed to use the fine-tuned LMs to standardize requirements by making use of requirements boilerplates. As a result, this research will help speed up the design and development process by reducing the ambiguities and inconsistencies associated with requirements. In addition, it will reduce the workload on engineers who manually evaluate large numbers of requirements by facilitating the conversion of NL aerospace requirements into standardized requirements.
CHAPTER 1
INTRODUCTION
This chapter serves as an introductory section that covers the fundamental concepts
of Integrated Product and Process Development (IPPD), the Systems engineering
process, and Quality Function Deployment (QFD). It highlights the criticality of re-
quirements engineering in the design of systems, products, and enterprises. Addition-
ally, it underscores the challenges and constraints associated with the use of natural
language (NL) for requirements elicitation. Furthermore, it explores the advantages
of adopting model-based methodologies and evaluates the obstacles such a shift faces
because of the inherent ambiguities in NL requirements. The chapter culminates with
a brief summary of the dissertation’s main focus.
In the next few sections the Integrated Product and Process Development (IPPD)
process, Quality Function Deployment (QFD), and Systems Engineering Process will
be discussed. These are interconnected and complementary processes in product
development.
Organizations use the Integrated Product and Process Development (IPPD) [1] method
to ensure that the development and production functions, along with the Systems En-
gineering Process and Quality Function Deployment (QFD), are integrated to create a
high-quality product that satisfies customer demands while minimizing development
time and expenses. IPPD is utilized to achieve cost, schedule, and quality objectives
while designing and developing products that meet customer requirements. The pro-
cess involves the collaboration of all stakeholders, including customers, suppliers, and
team members, to develop a product that satisfies all stakeholders. Figure 1.1 shows
the IPPD process in detail.
Figure 1.1: Integrated Product and Process Development (IPPD) [1]. The primary focus of this dissertation is on the Requirements and Functional Analysis phase of the IPPD process within the systems engineering domain (as highlighted in red in the figure).
2. Define Objectives: Once customer requirements are identified, the team must
define objectives for the product, including cost, schedule, and quality objec-
tives.
3. Design and Develop the Product: The next step is to design and develop
the product, taking into account the customer requirements and the defined
objectives.
4. Test and Validate: After the product is designed and developed, it must be
tested and validated to ensure that it meets customer requirements and the
defined objectives.
Quality Function Deployment (QFD) (Figure 1.2) is a methodology used in the IPPD
process to ensure that customer requirements are translated into the design of the
product. QFD uses a matrix to organize and prioritize customer requirements and to
link them to specific design characteristics of the product.
The QFD matrix commonly consists of the following components:
1. Customer Requirements: This is the first row of the matrix, which lists all
the requirements that the customer has for the product.
2. Importance Rating: This is the second row of the matrix, which assigns a
weight or importance rating to each customer requirement. The importance
rating is usually based on how critical the requirement is to the customer.
Figure 1.2: House of Quality, a design tool for QFD [2]
3. Design Characteristics: This is the third row of the matrix, which lists
all the design characteristics that must be addressed to meet the customer
requirements.
4. Relationship Matrix: This is the main body of the matrix, which links the
customer requirements to the design characteristics. The relationship matrix
shows how well each design characteristic satisfies each customer requirement.
5. Technical Response: This is the last row of the matrix, which shows how
well the design characteristics are being met by the technical response.
The QFD matrix helps to ensure that the design of the product is aligned with
the customer’s requirements and that the product meets the defined objectives for
cost, schedule, and quality.
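For readers less familiar with QFD, the minimal sketch below shows how the importance ratings and the relationship matrix combine into technical priority scores for the design characteristics. The requirements, characteristics, and ratings are hypothetical and serve only to illustrate the mechanics.

```python
import numpy as np

# Hypothetical customer requirements and their importance ratings (1-5).
requirements = ["Quiet cabin", "Long range", "Low operating cost"]
importance = np.array([3, 5, 4])

# Hypothetical design characteristics (columns of the relationship matrix).
characteristics = ["Engine bypass ratio", "Fuel capacity", "Empty weight"]

# Relationship matrix: strength with which each design characteristic addresses
# each customer requirement (0 = none, 1 = weak, 3 = moderate, 9 = strong).
relationship = np.array([
    [9, 0, 1],   # Quiet cabin
    [3, 9, 3],   # Long range
    [1, 3, 9],   # Low operating cost
])

# Technical priority of each design characteristic: importance-weighted column sums.
priority = importance @ relationship
for name, score in zip(characteristics, priority):
    print(f"{name}: {score}")
```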
1.3 The Systems Engineering Process
• Transform needs and requirements into a set of system product and process
descriptions (adding value and more detail with each level of development),
Figure 1.3: The Systems Engineering Process [3]. This dissertation centers around
the initial phase of systems engineering, known as Requirements Analysis, with some
emphasis on the Requirements Loop as a whole.
functions. In QFD, requirements are gathered from customers and translated into
specific engineering requirements to ensure that the final product meets the customers’
needs. In the Systems Engineering Process, requirements analysis is the first step, and
it involves developing functional and performance requirements that define what the
system must do and how well it must perform. The IPPD process requires integrated
development, which means that all life cycle needs must be considered concurrently
during the development process, and requirements analysis is a critical component
of this process. In all three methodologies, requirements serve as a guide for the
design, development, and testing of a product or system and ensure that it meets the
intended purpose and specifications.
The remainder of this dissertation will concentrate on requirements, specifically
addressing challenges that may arise in natural language (NL) requirements and ex-
ploring how modern NLP techniques can mitigate these challenges.
1.4 Requirements: Background and definitions
Requirements serve as the foundation for all systems, products, services, and enter-
prises. INCOSE [4] defines a requirement as “a statement that identifies a system,
product or process characteristic or constraint, which is unambiguous, clear, unique,
consistent, stand-alone (not grouped), and verifiable, and is deemed necessary for
stakeholder acceptability.” Requirements shall be [5], [6]:
• Clear: able to convey the desired goal to the stakeholders by being simple and
concise
the stakeholder expectations and then converting these expectations into technical
requirements [7]. In other words, it involves defining, documenting, and maintaining
requirements throughout the engineering lifecycle [8]. External stakeholders that con-
tribute to the requirements generation process are competitors, regulatory authorities,
operators, shareholders, subcontractors, component providers, consumers, etc. Some
of the internal stakeholders are technology research and development teams, sourcing
teams, supply and manufacturing teams, engineers, etc. In the case of large-scale systems, the number of stakeholders increases and so does the number of requirements needed to achieve the desired system [9]. The process followed for defining, documenting,
and maintaining requirements is called requirements engineering [8]. Require-
ments engineering can be classified into different categories based on the number of
requirements, as shown in Table 1.1 [9].
The number of requirements serves as a proxy for the system complexity. Hence,
the higher the number of requirements, the higher the system complexity.
In the aerospace engineering domain, most systems require Very Large-Scale Requirements Engineering (VLSRE), where system requirements are predominantly written in natural language [9], [11], [12]. This is done to ensure that the requirements are easy to write and understand for stakeholders with varying levels of requirements engineering experience [11]. However, the use of natural language for requirements engineering can introduce ambiguities and inconsistencies that reduce system quality and lead to cost overruns or even system failure [13].
Observation 1
Requirements are written in Natural Language (NL) to make them more ac-
cessible to different stakeholders; however, this introduces unintended inconsis-
tencies and ambiguities.
tion, inspection, and testing [14], leading to dramatic engineering and programmatic
consequences when caught late in the product life cycle [13], [15]. When “require-
ments engineering” is practiced as a separate effort on large teams, typically with a
dedicated team or at least a lead, it becomes very process-focused. Configuration
management, customer interviews, validation and verification planning, and matura-
tion sessions all take place within the effort. Specialized software packages such as
DOORS [16] have been crafted over many years to service the needs of processing
requirements as a collection of individual data records. However, one connection that
is often lost is that between the requirements development team and the architectural
team. Because Natural Language (NL) is predominantly used to write requirements
[11], requirements are commonly prone to ambiguities and inconsistencies. This in
turn increases the likelihood of errors and issues in the way requirements are for-
mulated. The ambiguities of the requirements text are eventually resolved, and not
entirely satisfactorily, in test definition. At this point, misunderstandings and misses
are quite expensive. If the misses don’t prevent the system from being fielded, they
will often be overlooked and instead simply become potentially lost value to the end
user. This overall process orientation can make requirements development look like it
is just part of the paperwork overhead of delivering large projects to institutional cus-
tomers rather than the vital part of customer needs understanding and embodiment
that it is.
According to a study by NASA [15], the cost of fixing requirements errors during the requirements generation phase is minimal but can grow by a factor of 29x to 1500x in the operations phase (Table 1.2). According to data from industry, 50% of product defects and 80% of rework effort can be traced back to errors made during the requirements engineering phase [14]. The stakes are even higher when it comes to safety-critical systems² – 40% of accidents involving these systems have resulted from poor requirements [14].
² Systems whose failure can lead to catastrophic damage to life, property, and the environment. Examples: medical devices, nuclear power plant systems, aircraft flight control systems [17].
This emphasizes the importance of requirements and of fixing errors at an early stage to save time and cost – which, consequently, means allocating a larger share of project costs to the requirements definition phase [15].
Table 1.2: Cost to fix requirements error at NASA (in ratios) [15]
Observation 2
The cost of fixing errors in requirements goes up exponentially as we progress
across the project life cycle.
As mentioned, Natural Language (NL) has been used for capturing requirements as
a means to make them more accessible to stakeholders [11]. This is in comparison
to requirements defined in modeling languages such as Unified Modeling Language
(UML) and Systems Modeling Language (SysML), which require special training [18].
While the benefits offered by NL are alluring, NL requirements also present some challenges.
NL can be ambiguous [11], [19], [20] – capable of being understood in two or more
ways [21]. For example – The display must have a “user-friendly” interface. Here,
the word “user-friendly” can mean different things to different people, hence leading
to ambiguities in requirements.
The second problem associated with NL requirements is unrecognized disambigua-
tion [22] – the reader uses the first meaning that comes to their mind as the only
meaning of a certain word, abbreviation, sentence, etc. For example: there is a differ-
ence in meaning between the two words “FARs” (Federal Aviation Regulations) and
“FARS” (Fatality Analysis Reporting System). The reader would have completely
misunderstood the context if they understood FARs as Fatality Analysis Report-
ing System (FARS) when talking about the aerospace domain and vice-versa when
referring to self-driving car systems.
In addition, manual examination of NL requirements for checking completeness
and consistency is tedious, and the effort goes up exponentially when the number
of requirements increases [23]. Inconsistent, missing, and duplicate requirements are
hard to fix manually [11] and can lead to catastrophic outcomes such as the loss of the Mars Climate Orbiter (launched in 1998) due to confusion regarding units between the stakeholders involved in the project [24].
Lastly, failure to recognize technical terms and their meanings can lead to a reduced understanding of requirements in a particular domain. For example, not understanding and effectively using terms such as FAA, NASA, and ATC in the aerospace domain can reduce requirements quality by introducing ambiguities and inconsistencies.
Most of the systems in the present-day world are complex and hence need a com-
prehensive approach to their design and development [25]. To accommodate this need,
there has been a drive toward the development and use of Model-Based Systems Engi-
neering (MBSE) principles and tools, where activities that support the system design
process are accomplished using models as compared to traditional document-based
methods [26]. Models capture the requirements as well as the domain knowledge and
make them accessible to all stakeholders [27], [28]. While MBSE shows great promise,
the ambiguities and inconsistencies inherent to NL requirements hinder their direct
conversion to models [29]. Hand-crafting models are time-consuming and require
highly specialized subject matter expertise. As a result, there is a need to convert
NL requirements into a semi-machine-readable form (which involves being able to
extract information from NL requirements as well as converting them into a stan-
dardized form) so as to facilitate their integration and use in an MBSE environment.
The need to access data within requirements rather than treating the statement as a
standalone object has also been recognized by the International Council on Systems Engineering's (INCOSE) Requirements Working Group in its recent publication
[30]. The document envisions an informational environment where requirements are
not linked only to each other or test activities but also to architectural elements. This
is represented in the figure below, which envisions a web of interconnections from a
new kind of requirement object, which is a fusion of the natural language statement
and machine-readable attributes that can connect to architectural entities such as
interfaces and functions.
The approach of creating and maintaining requirements as more information-rich
objects than natural language sentences has been called “Information-Based Needs
and Requirements Definition and Management” (I-NRDM). In the INCOSE manual
[30], model-based design (working on architecture and analysis) is recommended to
combine with I-NRDM to be the full definition of MBSE. The “Property-Based Re-
quirement” of recent SysML revisions can serve as the “Requirements Expression”,
as shown in Figure 1.5.
Despite these identified needs and ongoing developments, there are no standard
tools or methods for converting NL requirements into a machine-readable/semi-machine-
Figure 1.5: Information-based Requirement Development and Management Model
[31]
readable form.
Observation 3
As systems become more complex and the number of requirements increases, it
becomes difficult to evaluate requirement completeness and consistency manu-
ally, hence the need for automatic evaluation of requirements arises.
1.7 Summary
The requirements engineering phase of a project is crucial, and ambiguities in this
phase can affect various downstream tasks such as system architecting, system design,
testing, analysis, and inspection [14], ultimately resulting in a reduced quality system,
cost overruns, and system failure [13]. The cost of fixing errors in requirements goes
up exponentially as we move forward in the design and development process for a
system [15].
Most of the requirements are written in natural language because of the low barrier
to adoption as compared to model-based languages such as UML and SysML [11],
[18]. Despite this advantage, NL requirements present a number of drawbacks, such as ambiguities associated with language [11], [19], [20], a tedious and error-prone manual examination process, difficulties associated with verifying the system description for completeness [23], and the failure to recognize and effectively use domain-specific technical terms [11].
To address the drawbacks associated with NL requirements there has been a shift
towards machine-readable/semi-formal requirements, where data within requirements
can be accessed as compared to treating requirement sentences as standalone objects
[30].
This leads us to the research objective for this thesis:
Research Objective
Figure 1.6: An illustration of how to convert the contents of a natural language re-
quirement into data objects can be seen in this example: In the given requirement
“The air-taxi shall have a configuration that can seat five passengers within its pas-
senger cabin”, the air-taxi and passenger cabin are considered as SYSTEM, while five
passengers is classified as a VALUE.
Figure 1.7: Different steps of requirements engineering, starting with gathering re-
quirements from various stakeholders, followed by using Natural Language Process-
ing (NLP) techniques to standardize them and lastly converting the standardized
requirements into models. The main focus of the dissertation was to convert NL re-
quirements into semi-machine-readable requirements (where parts of the requirement
become data objects) as shown in Step 2.
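To make Step 2 concrete, the sketch below shows one possible semi-machine-readable rendering of the requirement from Figure 1.6, with parts of the sentence turned into data objects. The field names and structure are illustrative assumptions, not the exact schema developed later in this work.

```python
# Illustrative (hypothetical) semi-machine-readable form of the Figure 1.6 requirement.
requirement = {
    "id": "REQ-001",  # hypothetical identifier
    "text": ("The air-taxi shall have a configuration that can seat "
             "five passengers within its passenger cabin"),
    "named_entities": [
        {"span": "air-taxi",        "label": "SYSTEM"},
        {"span": "five passengers", "label": "VALUE"},
        {"span": "passenger cabin", "label": "SYSTEM"},
    ],
    "type": "design",  # output of a requirements classifier
}
```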
• Chapter 1 provides an introduction to the field of requirements engineering.
It discusses the importance of NL in requirements elicitation and also its draw-
backs. Lastly, this chapter sets the research objective for this dissertation, based
on various observations.
• Chapter 2 provides the background into the historical and current-day use
of NLP in the field of requirements engineering. In addition, it discusses the
relevance of pre-trained LMs and domain-specific corpus for analyzing aerospace
requirements. The chapter concludes by outlining the observations made and
identifying gaps in the current literature.
• Chapter 3 outlines the research plan for this dissertation, which is based on the
observations and identified gaps in the previous chapters. The chapter identifies
the research questions and their corresponding hypotheses.
• Chapter 6 discusses the results regarding the various LMs that were developed for the conversion of NL requirements into standardized requirements. It also compares the performance of these models to that of off-the-shelf models. Lastly, the requirements table and the identified boilerplates are presented and explained.
1.9 Publications
• Tikayat Ray, A., Pinon Fischer, O.J., Mavris, D.N., White, R.T. and Cole,
B.F., “aeroBERT-NER: Named-Entity Recognition for Aerospace Requirements
Engineering using BERT”, AIAA 2023-2583. AIAA SCITECH 2023 Forum.
January 2023.
• Tikayat Ray, A., Cole, B.F., Pinon Fischer, O.J., White, R.T., and Mavris,
D.N., “aeroBERT-Classifier: Classification of Aerospace Requirements Using
BERT”, MDPI Aerospace 2023, 10, 279.
• Tikayat Ray, A., Pinon Fischer, O.J., White, R.T., Cole, B.F., and Mavris,
D.N., “aeroBERT-NER: Aerospace Corpus and Language Model for Named-
Entity-Recognition for Aerospace Requirements Engineering”, [Under Review].
CHAPTER 2
PRELIMINARIES AND LITERATURE REVIEW
This chapter provides the necessary background on the historical and current-day use of NLP in the field of requirements engineering. The following sections and subsections describe the evolution of NLP and establish the relevance of pre-trained large language models (LLMs) and the importance of a domain-specific corpus for analyzing aerospace requirements. Various NLP tasks, such as named-entity recognition (NER) and classification, are discussed with regard to aerospace requirements. Lastly, this chapter summarizes the observations and identifies the gaps in the literature.
Requirements are almost always written in NL [32] to make them accessible to differ-
ent stakeholders. According to various surveys, NL was deemed to be the best way
to express requirements [33], and 95% of 151 software companies surveyed revealed
that they were using some form of NL to capture requirements [12]. Given the ease of
using NL for requirements elicitation, researchers have been striving to come up with
NLP tools for requirements processing dating back to the 1970s. Tools such as the
Structured Analysis Design Technique (SADT), and the System Design Methodology
(SDM) developed at MIT, are systems that were created to aid in requirement writing
and management [33]. Despite the interest in applying NLP techniques and models to
the requirements engineering domain, the slow development of natural language tech-
nologies thwarted progress until recently [11]. The availability of NL libraries/toolkits
(Stanford CoreNLP [34], NLTK [35], spaCy [36], etc.), and off-the-shelf transformer-
based [37] pre-trained language models (LMs) (BERT [38], BART [39], etc.) have
propelled NLP4RE into an active area of research [32]. NLP4RE deals with applying
Natural Language Processing (NLP) tools and techniques to the field of requirements
engineering [40].
A recent survey performed by Zhao et al. reviewed 404 NLP4RE studies conducted
between 1983 and April 2019 and reported on the developments in this domain [40].
Figure 2.1 shows a clear increase in the number of published studies in NLP4RE
over the years. This underlines the critical role that NLP plays in requirements
engineering, a role that is expected to become more important with time as the
availability of off-the-shelf language models increases.
Among those 404 NLP4RE studies, 370 were classified based on the main NLP4RE task being performed. As illustrated in Figure 2.2, the most common
focus of the studies was the detection of linguistic issues in requirements, such as the
occurrence of ambiguous phrases, the conformance to pre-defined templates, etc. The
classification task was the second most popular task and dealt with the classification
of requirements into various categories. Extraction dealt with the identification of key
domain concepts. Finally, modeling, tracing and relating, and search and retrieval
are some of the other NLP4RE tasks of interest.
Figure 2.2: Distribution of selected studies based on NLP4RE task [40]
outperformed the crowd-workers on Amazon Mechanical Turk (MTurk) in four
of the five tasks. However, the success of ChatGPT for text annotation is not
expected to extend to technical domains.
Advanced NLP tools and techniques have the potential to revolutionize NLP4RE
research (a field that is known to have limited annotated datasets) in the aerospace
requirements engineering domain and are the subject of this dissertation.
spell check, etc.
NLP can be further divided into Natural Language Understanding (NLU) and
Natural Language Generation (NLG) [51] (Figure 2.3). NLU involves understanding
and interpreting text whereas NLG uses the meaning of text (as understood by NLU)
in order to produce text [51].
NLP tools and techniques have a strong potential to enable/support the automatic
translation of NL requirements into machine-readable requirements by extracting and
converting information present in a requirement into data objects. For example, word
sense disambiguation and Named-entity recognition (NER) can help with reducing
ambiguities associated with requirements; parts-of-speech (POS) tagging, text chunk-
ing, and dependency parsing can aid the requirements examination or re-write pro-
cess; classification of requirements and boilerplate identification will be crucial for
standardizing requirements (Figure 2.4). These aspects are discussed in the following
sections.
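As a minimal illustration of these building blocks, the snippet below applies an off-the-shelf, general-domain spaCy model (assuming the en_core_web_sm model has been downloaded) to a requirement to obtain POS tags and named entities. A general-domain model of this kind typically misses aerospace-specific entities, which is precisely the gap addressed later in this work.

```python
import spacy

# Load a small general-domain English pipeline (POS tagger, NER, etc.).
nlp = spacy.load("en_core_web_sm")
doc = nlp("The air-taxi shall have a configuration that can seat "
          "five passengers within its passenger cabin.")

print([(token.text, token.pos_) for token in doc])   # POS tags
print([(ent.text, ent.label_) for ent in doc.ents])  # general-domain named entities
```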
Figure 2.4: Problems with NL requirements and potential solutions provided by NLP
attention mechanisms [60]. These models did not stray too far from the N-gram
statistical language modeling paradigm, with advances that allowed text generation
beyond a single next word with for example beam search in [59] and sequence-to-
sequence learning in [61]. These ideas were applied to distinct NLP tasks.
language models, and T5 character-level language model [64]. These sophisticated
language models break the single dataset-single task modeling paradigm of most
mainstream models in the past. They employ self-supervised pre-training on mas-
sive unlabeled text corpora. For example, BERT is trained on Book Corpus (800M
words) and English Wikipedia (2500M words) [38]. Similarly, GPT-3 is trained on
500B words gathered from datasets of books and the internet [63].
These techniques set up automated supervised learning tasks, such as masked lan-
guage modeling (MLM), next-sentence prediction (NSP), and generative pre-training.
No labeling is required as the labels are automatically extracted from the text and
hidden from the model, and the model is trained to predict them. This enables the
models to develop a deep understanding of language, independent of the NLP task.
These pre-trained models are then fine-tuned on much smaller labeled datasets, lead-
ing to advances in the state-of-the-art (SOTA) for nearly all downstream NLP tasks
(Figure 2.6), such as Named-entity recognition (NER), text classification, language
translation, question answering, etc. [38], [65], [66].
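The short sketch below illustrates the masked language modeling objective using the Hugging Face fill-mask pipeline with pre-trained BERT; the example sentence is arbitrary. Because the hidden word comes from the text itself, no manual labeling is required.

```python
from transformers import pipeline

# BERT predicts the token hidden behind [MASK] from its left and right context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The pilot shall be able to [MASK] the aircraft."):
    print(prediction["token_str"], round(prediction["score"], 3))
```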
2.2.3 Importance of domain-specific corpus
Observation 4
Language models are trained on general-domain text, which leads to poor per-
formance by these models when applied to aerospace requirements due to a lack
of domain knowledge.
Stakeholders often use different terms to refer to the same entity, leading to ambiguities [11], [19], [20]. As such, there is a need to build a glossary of terms occurring in aerospace requirements; however, doing this manually is an arduous task when dealing with a large number of requirements [69]. According to Arora et al. [69], a glossary can be obtained by putting NL requirements through an NLP pipeline, which first tokenizes the text, followed by POS tagging, NER, and text chunking, as shown in Figure 2.7. The output produced by the pipeline is a set of annotated tokens (POS tags, NEs, and text chunks), of which only noun phrases (NPs) are of interest. A noun phrase (NP) can be the object or subject of a verb, while a verb phrase (VP) consists of the verb along with its modals, auxiliaries, and modifiers.
Figure 2.7: NLP pipeline for text chunking and NER [69]
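A minimal sketch of the NP-extraction step of such a pipeline is shown below, using spaCy's built-in noun-chunk iterator (assuming the en_core_web_sm model is installed); the two example requirements are hypothetical. The extracted noun phrases serve as candidate glossary terms.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
requirements = [
    "The air-taxi shall report its status to the ground station every second.",
    "The ground station shall log the air-taxi status.",
]

# Collect noun phrases across all requirements as candidate glossary terms.
glossary_candidates = set()
for req in requirements:
    for chunk in nlp(req).noun_chunks:
        glossary_candidates.add(chunk.text.lower())

print(sorted(glossary_candidates))
```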
Arora et al. [69] carried out a clustering task to account for different variations of a term. For example, "status of the air-taxi" and "air-taxi status" mean the same thing even though the ordering of the tokens/phrases is different. One limitation of this work is that it focuses only on extracting and including NPs in the glossary. A second limitation is that the authors used several tools in conjunction (openNLP, GATE, and JAPE heuristics for term extraction; SimPack and SEMILAR for similarity computation; and R for selecting the number of clusters) to create clusters of extracted terms rather than providing a list of glossary terms.
Hence, to mitigate the vagueness and ambiguities of aerospace requirements, a
methodology is needed for a more comprehensive glossary creation that contains dif-
ferent types of NEs (such as names of organizations, systems, resources, values, etc.)
pertaining to the aerospace domain. Due to the advancements in NLP, the feasi-
bility of a comprehensive and automated tool for glossary creation also needs to be
explored.
Observation 5
Stakeholders use different terms/words to refer to the same entity/idea when
framing requirements leading to ambiguities.
hence being able to classify them makes this task easier [76]. The following section
will discuss more about requirements classification.
encryption, and data integrity by using J48 decision trees. They removed stopwords
from the security requirement descriptions, which might not be a good idea when it
comes to analyzing aerospace requirements. Indeed, terms like not, must, when, etc.
are stop-words and add meaning to requirement specifications.
The field of requirements classification has moved from manual to semi-automatic
to automatic by making use of breakthroughs in the field of machine learning (ML). A
literature review published by Binkhonain et al. [82] looked into 24 selected studies,
all of which used ML algorithms for analyzing NFRs. Out of these 24 studies, 17
used supervised learning (SL) (71%) and SVM proved to be the most popular ML
algorithm to be used. All the studies used pipelines, where the first step was to pre-
process NL requirements, followed by a feature extraction phase, a learning phase
where the ML models were trained, and the last step was model evaluation, where the
model’s performance was evaluated on a test dataset [82]. The pipeline is represented
in Figure 2.8.
In a recent study, Hey et al. fine-tuned the BERT language model on the PROMISE
NFR dataset [49] to obtain NoRBERT (Non-functional and functional Requirements
classification using BERT) - a model capable of classifying requirements [83]. NoR-
BERT is capable of performing four tasks, namely, (1) binary classification of re-
quirements into two classes (functional and non-functional); (2) binary and multi-
class classification of four non-functional requirement classes (Usability, Security,
Operational, and Performance); (3) multi-class classification of ten non-functional
requirement types; and (4) binary classification of the functional and quality aspect
of requirements. NoRBERT was able to achieve an average F1 score of 0.87 on
the most frequently occurring classes of requirements (Usability, Performance, Op-
erational, and Security). In particular, it demonstrates the relevance and potential
of transfer-learning approaches to requirements engineering research as a means to
address the limited availability of labeled data. The PROMISE NFR dataset [49], which was used in [83], contains 625 requirements in total (255 functional and 370 non-functional, which are further broken down into different "sub-types"). Table 2.1
provides some examples from the PROMISE NFR dataset.
Table 2.1: Requirements examples from the PROMISE NFR dataset [49]
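The condensed sketch below shows what fine-tuning BERT for requirement classification can look like with the Hugging Face Trainer API, in the spirit of NoRBERT. The two in-line examples and the binary label set are placeholders; a real run would use a labeled corpus such as the PROMISE NFR dataset (or, later in this work, annotated aerospace requirements).

```python
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

labels = ["functional", "non-functional"]          # placeholder label set
texts = ["The system shall refresh the display every 60 seconds.",
         "The product shall be available 99.9 percent of the time."]
label_ids = [0, 1]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels))
encodings = tokenizer(texts, truncation=True, padding=True)

class ReqDataset(torch.utils.data.Dataset):
    """Wraps tokenized requirements and labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=ReqDataset(encodings, label_ids),
)
trainer.train()
```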
based methods. Yin et al. proposed a method for zero-shot text classification using
pre-trained Natural Language Inference (NLI) models [85]. The bart-large-mnli model
was obtained by training bart-large [39] on the MultiNLI (MNLI) dataset, which is
a crowd-sourced dataset containing 433,000 sentence pairs annotated with textual
entailment information [0: entailment; 1: neutral; 2: contradiction] [86]. For example,
to classify the sentence “The United States is in North America” into one of the
possible classes, namely, politics, geography, or film, we could construct a hypothesis
such as "This text is about geography." The probabilities for the entailment and contradiction of the hypothesis are then converted to probabilities associated with each of the labels provided.
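A brief sketch of this zero-shot setup applied to a requirement is shown below; the example requirement and the hypothesis template wording are illustrative.

```python
from transformers import pipeline

# No task-specific fine-tuning: candidate labels are supplied at inference time.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(
    "The aircraft shall maintain a climb gradient of at least 3.3 percent.",
    candidate_labels=["design", "functional", "performance"],
    hypothesis_template="This requirement is a {} requirement.",
)
print(list(zip(result["labels"], [round(s, 3) for s in result["scores"]])))
```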
Alhosan et al. [48] performed a preliminary study for the classification of require-
ments using ZSL in which they classified non-functional requirements into two cate-
gories, namely usability, and security. An embedding-based method was used where
the probability of the relatedness (cosine similarity) between the text embedding layer
and the tag (or label) embedding layer was calculated to classify the requirement into
either of the two categories. This work, which leveraged a subset of the PROMISE NFR dataset, achieved a recall and F-score of 82%. The authors acknowledge that certain LMs (RoBERTa-Base and XLM-RoBERTa) seemed to favor one class over the other, hence classifying all the requirements as either usability or security. This was attributed to the fact that LLMs are trained on general-domain data and hence might not perform well in specialized domains [49].
The classification of requirements is an important task, and extensive research exists for software requirements. However, research on leveraging requirements classification for consistency and redundancy checks and for the identification of requirements boilerplates is scarce in both the software and aerospace domains. Hence, ideas from the above literature can be leveraged to classify aerospace requirements, which will be a stepping stone toward converting NL aerospace requirements into semi-machine-readable requirements.
Observation 6
There is limited work on the classification of aerospace requirements, which hinders the ability to conduct redundancy checks, evaluate consistency, and standardize requirements.
Figure 2.9: Rupp’s Boilerplate [18], [88], [92]
interface, user interaction, and autonomous [93]. The different parts of this boilerplate
structure are described below [18], [88]:
• Additional details about the object: more information regarding the object
Similarly, the EARS boilerplate can be divided into four parts [88]. The first part
is an optional condition block, followed by the system/subsystem name, the degree
of obligation, and the response of the system.
Despite the benefits offered by requirement boilerplates (Rupp's and EARS), they can be restrictive for certain types of functional and non-functional requirements and do not allow for the inclusion of constraints [18]. In some cases, the restrictions imposed by Rupp's boilerplate can lead to inconsistencies, such as the exclusion of ranges of values and bi-conditionals and the lack of references to external systems that the original system interacts with [18]. Mazo et al. [18] improved on Rupp's template and came up with a new boilerplate that addresses these shortcomings. It consists of eight blocks, compared to six in Rupp's (Figure 2.11).
Observation 7
Research focusing on the definition of appropriate, relevant, and reliable boil-
erplate is scarce; Rupp’s and EARS boilerplates do not capture all aspects of
a requirement.
Arora et al. [95] stress the importance of verifying whether requirements actually
conform to a given boilerplate, which is important for quality assurance purposes. In
Figure 2.11: Mazo and Jaramillo template [18]
their paper, they check the conformance of requirements to Rupp’s boilerplate using
text chunking (NPs and VPs) [95]. In addition, they suggest that this is a robust
method for checking conformance in the absence of a glossary [95]. In another paper
by the same authors [88], they develop a conformance checking methodology for Rupp's
and EARS boilerplates using the NLP pipeline described in Figure 2.7.
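As a simplified illustration of such conformance checking, the sketch below tests requirements against a rough, EARS-like pattern (optional condition block, system name, degree of obligation, response) using a regular expression. This is an approximation for illustration only and not the rule set used by Arora et al.

```python
import re

# Rough EARS-like structure: [optional condition,] the <system> shall <response>.
EARS_LIKE = re.compile(
    r"^(?:(?:When|While|Where|If)\s+.+?,\s*)?"  # optional condition block
    r"[Tt]he\s+[\w\- ]+?\s+"                    # system / subsystem name
    r"(?:shall|should|will)\s+"                 # degree of obligation
    r".+\.$"                                    # system response
)

requirements = [
    "When the battery level drops below 20 percent, the air-taxi shall "
    "issue a low-power warning.",
    "Low-power warnings are issued by the air-taxi sometimes.",
]
for req in requirements:
    print(bool(EARS_LIKE.match(req)), "-", req)
```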
Boilerplate identification has been significantly limited due to variations in re-
quirements across different industries and institutions. Furthermore, previous studies
have primarily concentrated on software requirements, which can differ substantially
from aerospace requirements, the focus of this study.
Observation 8
There has been limited work focusing on identifying appropriate boilerplates given a type of requirement, e.g., converting an NL requirement into an appropriate boilerplate structure.
requirement.
1. Requirements are written in Natural Language (NL) to make them more accessi-
ble to different stakeholders; however, this introduces unintended inconsistencies
and ambiguities.
4. Language models are trained on general-domain text, which leads to poor per-
formance by these models when applied to aerospace requirements due to a lack
of domain knowledge.
5. Stakeholders use different terms/words to refer to the same entity/idea when
framing requirements leading to ambiguities.
The above observations can be summarized into the following two gaps:
CHAPTER 3
RESEARCH FORMULATION
This chapter describes the research plan for the dissertation based on the observations and gaps identified in the last two chapters. The goal of this chapter is four-fold, namely:
Research Objective
3. Use the LMs created in the previous step for the standardization of aerospace requirements. This involves being able to extract information from requirements and convert it into data objects, as well as identify text patterns in NL requirements.
In order to achieve the above objectives, the gaps identified in Chapter 2 must be
addressed. Hence, two research questions were formulated along with their respective
sub-questions and corresponding hypotheses to achieve the objectives. A road map
for this dissertation is outlined in Figure 3.1.
Figure 3.1: Research road-map
Research Question 1
How can engineering/aerospace-specific domain knowledge be integrated into
language models?
To answer the research question posed above, corpus, LMs (BERT specifically),
and fine-tuning of LMs need to be further discussed.
A corpus is a collection of texts (for example, all the text available on Wikipedia) on which LMs can be trained.
As mentioned previously, a suitable corpus is required to train LMs [96]. However, there is a lack of annotated corpora in engineering domains [97], [98] such as aerospace, making it difficult to tackle NLP tasks such as the identification of NEs, the classification of requirements, etc.
SciBERT is a well-known language model trained on scientific text [97]. It was trained on a corpus that contains 18% text from the computer science domain and 82% from the biomedical domain [97]. Its vocabulary size is 30,000 tokens, the same as BERT's, and there is a 42% overlap between the two vocabularies, meaning that 58% of the tokens specific to the scientific domain were absent from the general-domain vocabulary [97].
FinBERT (Financial BERT) [99] uses a financial-services corpus to train the LM; BioBERT [100] uses biomedical literature for training; ClinicalBERT [101] uses clinical notes; and PatentBERT [102] fine-tunes the pre-trained BERT model for patent classification. The existence of these LMs stresses the fact that a domain-specific corpus is crucial for domains that are not well represented in general-domain text (newspaper articles, Wikipedia, etc.).
Hence, in the context of this research, an LM is needed that has the capability to perform well on aerospace text. Doing so involves the following two sub-tasks, namely:
power necessary to do so. For example, GPT-3 required 355 GPU years and cost $4.6
million [67] to train. BERT has two pre-trained models - BERTBASE with 110 million
parameters, which took between $2.5k - $50k to train and BERTLARGE , which has
340 million parameters and took between $10k - $200k to train [103].
The high cost of training LMs makes it prohibitive to train a model from scratch for the purpose of this research; hence, the focus will be on fine-tuning a LM for various NLP tasks using small labeled corpora. The task at hand is to choose a LM that satisfies certain criteria of interest, as shown in Table 3.1.
Criteria – Description
• Pre-trained neural LM – Ability to learn common language representations by using a large corpus of unlabeled data in an unsupervised/semi-supervised manner [65]
• Can be fine-tuned for a variety of tasks – Fine-tuning is the process of tuning the pre-trained model's parameters by using data of interest (aerospace in our case)
• Bidirectional transformer-based architecture – A bidirectional architecture makes sure that the text is looked at in both the forward and backward directions in order to better capture the context; transformers make use of an attention mechanism to make model training faster [104]
• State-of-the-art (SOTA) results for NL tasks such as NER, text classification, etc. – SOTA models can achieve higher scores when it comes to certain NLP tasks of interest
Considering the above criteria, Google’s BERT LM was selected. BERT stands
for Bidirectional Encoder Representations from Transformers [38]. It is capable of
pre-training deep bidirectional representations from unlabeled text by jointly incorpo-
rating both left and right context of a sentence in all layers [38]. In addition, BERT
has a fixed vocabulary of 30,000 tokens and can be fine-tuned for various NLP tasks
such as question answering, NER, classification, Masked Language Modeling, next
sentence prediction, etc. [38].
Figure 3.2 illustrates what is meant by bidirectionality.
Figure 3.2: Example explaining the meaning of bidirectional [105]
In the example given, LMs can predict the next word, and hence having information regarding what comes both before and after the word Teddy will help with the prediction. If we know that the word following Teddy is Roosevelt, then the context is about the former president of the United States as compared to a teddy bear.
A simple example of transformers at work is shown in Figure 3.3, where it is
being used for a language translation task [104]. It primarily consists of two compo-
nents, namely an encoder stack and a decoder stack, where these stacks can consist
of multiple encoders and decoders. The structures of the encoder layers are identical; however, they do not share the same weights [104].
Figure 3.4 takes us one step deeper into encoders and decoders showing how the
layers come together to form an encoder or decoder block [104]. The encoder block
can be broken down into self-attention and feed forward neural network layers. The
self-attention layer encodes the different tokens in an input sequence while looking
at other words in the sequence to make sure that the context is accounted for [104].
The output of the self-attention layer is then passed on to the feed forward NN.
The decoder has the same layers as the encoder except for an extra encoder-decoder
attention layer which helps with focusing on the appropriate part of the input sequence
[104].
BERT uses only encoder layers and has two variations (Figure 3.5) [37], [38], [104]:
Both BERT variants come in cased and uncased versions (Appendix C, Fig-
ure C.4). In the case of the cased model, the input text stays the same (containing
both upper and lowercase letters). In the uncased model, however, the text is lower-cased before being tokenized by the WordPiece tokenizer. Hence, there are four variants of BERT in total, as described below:
3. BERTBU : BERT base model trained on lower-cased English text
comes to tasks such as Question Answering (QA) and Natural Language Inference
(NLI) [38].
Figure 3.6: Pre-training and fine-tuning BERT language model [38], [104]
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (3.1)$$
where $d_k$ is the dimensionality of the key vectors; its square root is used to normalize the attention scores.
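To make Equation 3.1 concrete, a minimal NumPy sketch of scaled dot-product attention is given below; the matrix shapes and values are illustrative only and are not taken from BERT.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V as in Equation 3.1."""
    d_k = K.shape[-1]                        # dimensionality of the key vectors
    scores = Q @ K.T / np.sqrt(d_k)          # raw attention scores, scaled by sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                       # weighted sum of the value vectors

# Toy example: 3 tokens, d_k = 4 (values are illustrative)
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```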
Appendix C (Figure C.14, Figure C.15, Figure C.16) provides a comprehensive
illustration of the matrix multiplication process involved in computing self-attention.
Multi-headed attention scores can be calculated by computing multiple attention
vectors in parallel using Equation 3.2 [37].
Figure 3.7: The 9th attention head of layer 3 in the encoder of BERTBASE-UNCASED illustrates how the query (Q) for the word “record” attends to the concept of “what should be recorded” in BERT’s internal representation of the requirement, “Each cockpit voice recorder shall record voice communications transmitted from or received in the airplane by radio.”
Figure 3.7 depicts the Query (Q), Key (K), and Value (V) vectors as vertical bands,
with the intensity of each band corresponding to its magnitude. The lines connect-
ing these bands are weighted based on the attention between the tokens. Figure 3.7
shows the inner working of the 3rd encoder’s 9th attention head which captures how
the query vector “record” focuses on “what should be recorded” in BERT’s internal
representation of the requirement, “Each cockpit voice recorder shall record voice com-
munications transmitted from or received in the airplane by radio.” This behavior ex-
hibited by the attention heads occurs based solely on the self-supervised pre-training
of BERT. Additional information regarding the matrix multiplications utilized in
computing multi-headed attention is provided in Appendix C (Figure C.17).
BERT LM expects the input sequence to be in a certain format given the tasks
it is pre-trained on (MLM and NSP). Details about the input format are discussed
below:
– [SEP] : This special token is used to separate one sequence from the next
[38].
The use of special tokens and BERT embeddings is illustrated in Figure 3.8.
Figure 3.8: BERT embeddings and use of special tokens [38]
however, these models are trained on entity-annotated news articles from a specific
time-frame [38], [107]. Hence they do not generalize when it comes to the aerospace
domain. Table 3.2 shows the NE categories that a BERT NER model fine-tuned on
the CoNLL-2003 Named Entity Recognition dataset (English) can identify [38], [107].
Figure 3.10 shows the results obtained by applying the fine-tuned BERT NER model to general-domain text (Example 1) and to aerospace-specific text (Examples 2, 3, and 4). The LM is able to identify the ORG (Delta Airlines) and LOC (Atlanta, GA) with very high confidence in Example 1. However, when used on aerospace-specific text, the same LM is unable to identify terms such as AoA (Example 2) and instead breaks it into two sub-words, ‘A’ and ‘oA’, which is not helpful. In Example 3, the model is unable to correctly identify terms like ETOPS and resources such as Part 121 and Part 135. In addition, ISO 10303-233:2012 (from Example 4) was identified
as MISC even though it is a resource.
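For illustration, the short sketch below runs an off-the-shelf CoNLL-2003 BERT NER model on both kinds of text. The checkpoint name dslim/bert-base-NER is only an example of such a publicly available model on the Hugging Face Hub, not necessarily the exact model used to produce Figure 3.10.

```python
from transformers import pipeline

# Off-the-shelf BERT NER model fine-tuned on CoNLL-2003 (assumed example checkpoint).
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

# General-domain text: ORG and LOC entities are typically found with high confidence.
print(ner("Delta Airlines is headquartered in Atlanta, GA."))

# Aerospace-specific text: domain terms and resources tend to be missed or mislabeled.
print(ner("The airplane must meet ETOPS requirements under Part 121."))
```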
Figure 3.11: BERT LM can be fine-tuned for tasks such as NER, Classification, etc.
in the aerospace domain despite having been pre-trained on general English corpora.
Due to the lack of large annotated datasets in the aerospace requirements engineering
domain, transfer learning was chosen as the path forward.
Research Question 1.1
How can we fine-tune the BERT language model to identify named-entities
specific to the aerospace domain?
Being able to identify NEs will help with effective communication between different
stakeholders due to the use of consistent language [69]. In addition, it will make the
task of finding resources, system names, quantities, etc. being used in a system
description much easier [69].
Traditional text extraction tools and LMs trained on non-aerospace text lead to poor recall when applied to aerospace requirements [69], which translates into an inability to identify relevant terms. However, as a prerequisite to fine-tuning the BERT LM for NER in the aerospace domain, a corpus with annotated aerospace NEs is required. Such a corpus does not exist and will be created as part of this dissertation. This will involve the collection of texts from the aerospace domain followed by an annotation task, which is discussed in Chapter 4.
The identification of NEs is difficult for two main reasons, namely Segmentation and Type Ambiguity [54], which are discussed below.
Segmentation: Every token in a sentence has one associated POS; however, this is not the case when it comes to NEs. In addition, multiple tokens might form one NE. Hence, it is helpful to address this issue during the text annotation phase by using the BIO tagging scheme, which is a standard for NER [38], [68] (Table 3.4). For the sentence “I like Atlanta.”, the POS and BIO tagging is shown in Table 3.5.
Table 3.5: BIO tagging NE annotation task [38], [68]
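As a minimal illustration of the BIO scheme, the sketch below tags a single-token entity and a multi-token entity; the tag names follow the CoNLL-style convention (the SYS class mirrors the aerospace label set introduced later in Table 5.4), and the examples are illustrative rather than taken from the annotated corpus.

```python
# BIO tagging: B- marks the beginning of an entity, I- marks a continuation, O is outside any entity.
tokens = ["I", "like", "Atlanta", "."]
bio_tags = ["O", "O", "B-LOC", "O"]   # "Atlanta" is a single-token location entity

# A multi-token entity spans one B- tag followed by I- tags:
tokens_2 = ["The", "auxiliary", "power", "unit", "must", "start", "."]
bio_tags_2 = ["O", "B-SYS", "I-SYS", "I-SYS", "O", "O", "O"]

for tok, tag in zip(tokens_2, bio_tags_2):
    print(f"{tok:>10}  {tag}")
```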
Type Ambiguity: This type of ambiguity arises when there can be multiple
meanings for a given token. Let us try to understand this with the help of two sentences:
Sentence A: Washington was the first president of the USA.
Sentence B: The airplane is headed to Washington.
The word Washington occurs in both Sentences A and B; however, they mean
different things given the different contexts. Washington in Sentence A refers to a
person (PER), whereas it is a location (LOC) in Sentence B. This ambiguity stresses
the fact that context is crucial for understanding a sentence. BERT is capable of making this distinction because of its bidirectional property and its use of the self-attention mechanism.
The above discussion leads to the first hypothesis:
Hypothesis 1.1: If an annotated aerospace NE corpus is used to fine-tune BERT,
then we will be able to identify NEs specific to the domain.
To validate Hypothesis 1.1, a NER LM with aerospace domain knowledge needs to
be developed. This new model will be named aeroBERT-NER. The performance
of BERTBASE NER LM will be compared with aeroBERT-NER’s performance on
aerospace text for validation of the hypothesis. An overview of Experiment 1.1 which
will facilitate the testing of Hypothesis 1.1 is provided below.
Experiment 1.1: Establish the capability to identify aerospace-specific NEs from
aerospace requirements and text. This capability or model can be validated by applying
it to test aerospace requirements and measuring the performance by examining the
following:
1. Ability to extract NEs from aerospace requirements in a reliable and repeatable
manner.
If the above capabilities are realized, then it can be concluded that the fine-tuning of the BERTBASE LM for NER with the annotated aerospace NE corpus improved the performance of the model on aerospace text, hence substantiating Hypothesis 1.1. The hypothesis will be rejected if the above capabilities cannot be achieved.
The experimental steps for Experiment 1.1 are stated below:
– The dataset was split into training and test sets to avoid data leakage
– The maximum length of the input sequence was selected based on the
distribution of lengths of the sequences in the training set
– The input sequence was tokenized and special tokens such as [CLS],
[SEP], and [PAD] were added, as required
• Step 4: Metrics such as precision, recall, and F1 score were used to measure
the performance of aeroBERT-NER against BERTBASE NER.
The confusion matrix is shown in Figure 3.12 and the equations for evaluation
metrics (accuracy, precision, recall, and F1 score) are provided below:
Figure 3.12: Confusion Matrix
• Accuracy: Defined by the summation of true positives (TP) and true negatives
(TN) divided by the total number of items as shown in Equation 3.3.
Accuracy is not a good metric when dealing with an imbalanced class problem.
$$\mathrm{Accuracy} = \frac{TP + TN}{\text{Total number of items}} \qquad (3.3)$$
• Precision: Defined by the number of true positives (TP) divided by the sum-
mation of true positives and false positives (FP) (Equation 3.4). The higher
the number of TPs, the higher the precision. The higher the number of FPs,
the lower the precision.
$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (3.4)$$
• Recall: Defined by the number of true positives (TP) divided by the summation
of true positives and false negatives (FN) (Equation 3.5). The higher the number
of TPs, the higher the recall. The higher the number of FNs, the lower the recall.
$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (3.5)$$
$$\mathrm{F1\ Score} = \frac{2TP}{2TP + FP + FN} \qquad (3.6)$$
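A minimal sketch computing these metrics directly from confusion-matrix counts (Equations 3.3 through 3.6) is shown below; the counts passed in at the end are illustrative and are not taken from any experiment in this work.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 score as in Equations 3.3-3.6."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return accuracy, precision, recall, f1

# Illustrative counts only
print(classification_metrics(tp=80, fp=10, fn=5, tn=105))
```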
Research Question 2
How can NL requirements be converted into appropriate boilerplates of interest?
To answer the above question, we need to understand the NLP sub-tasks that can
facilitate the conversion of NL requirements into standardized form. The sub-tasks
are stated below:
2. Identifying named entities (NEs) in aerospace requirements: Being
able to identify and extract NEs from aerospace requirements will help with
accessing data within requirements rather than treating the statement as a
standalone object [30]. The identification of NEs was addressed in RQ 1.1.
Figure 3.13 shows the steps that need to be followed for answering RQ 2. The
process for the identification of boilerplates and requirement standardization begins
with the classification of requirements into various types. After the classification,
the POS tags, text chunks, and NEs associated with each token in the requirement
sequences for each type of requirement are obtained. Based on the observed textual
patterns, the conversion of requirement elements into data objects and boilerplates can be developed for each type of requirement in a semi-automated way. More than one boilerplate pattern might exist for each requirement type. In
Figure 3.13, the raw NL requirement first passes through the classification algorithm
and is classified as a Type 1 requirement. This requirement is then passed through a
POS tagger, text chunker, and aeroBERT-NER and tags are assigned to each token.
Based on the frequency of different linguistic patterns (patterns in the tags and NEs), boilerplates can be constructed in a semi-automated and agile manner for various types of requirements. This methodology makes it faster and easier to construct boilerplates that are more tailored for use at a particular industry/organization as compared to Rupp’s or EARS boilerplate structures.
The following discusses in more detail the first step to address RQ 2: requirement
classification.
Text classification is widely used in various fields such as newsgroup filtering,
sentiment classification, marketing, medical diagnosis, etc. [110], [111]. It has also
proven to be critical even in the oil industry for the classification of failed occupational
health control and for resolving accidents [112]. While using text classification for various tasks is not new, there has been no prior study on the classification of aerospace requirements.
As discussed previously, BERT can be fine-tuned for text classification in a super-
vised way [38]. Text classification can be of two types, namely, binary and multi-class
classification. Binary classification has two classes, whereas multi-class classification
deals with multiple classes. For the problem at hand, there is an inclination towards multi-class classification given that there are more than two types of aerospace requirements. Another way to categorize classification algorithms is as hard or soft classification. A hard classification algorithm assigns one class to each instance, whereas a soft algorithm assigns probabilities of a requirement belonging to each class [110].
This leads to the first sub-question under RQ 2:
Research Question 2.1
How can BERT language model be fine-tuned for classifying aerospace require-
ments?
Requirement – Label
• The air-taxi should have 5-passenger configuration according to layout mentioned in Document 2.3.4. – Type 1
• The measurement system shall include a FAA approved one-third octave band analysis system. – Type 2
• Step 1: Aerospace requirements are labeled based on the type of requirement; this serves as an input for the fine-tuning process.
• Step 2: WordPiece Tokenizer [38] will be used to tokenize the input sequence
and special tokens such as [CLS], [SEP], and [PAD] will be added.
• Step 3: The dataset will be split into a training and test set. Stratified sampling
might be used in case of an unbalanced class problem.
• Step 5: Metrics such as precision, recall, and F1 score will be used to measure
the performance of aeroBERT-Classifier on the test set.
The next step is to develop the capability to tag every token in a sequence with
its corresponding POS tag.
There are eight main parts of speech in English, namely, nouns, pronouns, adjectives, verbs, adverbs, prepositions, conjunctions, and interjections [113]. Flair [114], developed by researchers at Humboldt University in Berlin, is one of the state-of-the-art models when it comes to sequence tagging. A subset of the different POS tags used by the Flair model is listed in Table 3.7.
Figure 3.14 shows some sentences from the aerospace domain along with their
associated POS tags. The Flair [73] model was used for the POS tagging task. This
sequence tagging model will be used for POS tagging aerospace text, which will then
be used to fine-tune BERT LM.
This leads to the second sub-question under RQ 2:
Table 3.7: POS tagging notation used by Flair [114]
The answer to RQ 2.2 will help map the tokens in a sequence to their respective POS tags. The information provided by POS tags can improve syntactic parsing [54], which in turn aids the ordering of words in a sentence. In addition, POS tags can be helpful for measuring similarities/differences between sentences [54].
POS tagging has been done reliably (with high accuracy) by various supervised
machine learning models such as Hidden Markov Models (HMM), RNNs, etc. [115].
However, certain pain points persist: while 85% of word types are unambiguous, the remaining 15% of word types that are ambiguous are very commonly used and can have different meanings depending upon the context. Figure 3.15 shows the POS tags for
a simple sentence “Alex will drive the car”. The word “will” will always be an auxil-
iary. On the other hand, a verb would not follow the word “the”. These are some of
the ways in which POS tags are helpful in ordering words in a sentence.
Figure 3.16: Example showing words with POS tags based on context [115]
Figure 3.16 shows that the word “back” can take up different POS tags based on
the context [115], making POS tagging not very straightforward. Hence, the use of
BERT LM for POS tagging can help with the problem at hand.
This leads to Hypothesis 2.2:
Hypothesis 2.2: If an annotated POS tagged aerospace corpus is used to fine-tune
BERT, then the identification of POS tags for tokens in NL aerospace requirements
will be possible.
To validate Hypothesis 2.2, a fine-tuned BERT POS tagger needs to be devel-
oped. This model will be called aeroBERT-POStagger. Accuracy, precision, re-
call, and F1 score metrics will be used for measuring the performance of the model.
An overview of Experiment 2.2, which will facilitate the testing of Hypothesis 2.2, is
provided below.
Experiment 2.2: Establish the capability to tag POS for aerospace requirements
in an automated manner.
Hypothesis 2.2 will be substantiated upon the development of a version of the BERT LM that is capable of POS tagging tokens in aerospace requirement sequences. The hypothesis will be rejected if the above capabilities cannot be achieved.
The detailed experimental steps for Experiment 2.2 are stated below:
• Step 1: The Flair sequence labeling model will be used for POS tagging of the
aerospace corpus.
• Step 2: Special tokens such as [CLS], [SEP], and [PAD] will be added.
• Step 3: The dataset will be split into a training and testing set.
• Step 5: Metrics such as precision, recall, and F1 score will be used to measure
the performance of aeroBERT-POStagger on the test set.
Following the capability to classify aerospace requirements and tag NEs and POS
for tokens, requirements can be converted into a standardized form.
To reiterate the discussion from Section 2.3, there exist two well-known boilerplates, namely Rupp’s and EARS. However, these boilerplates can be restrictive and as such can contribute to ambiguities. In addition, Rupp’s does not have blocks
to include quantities, ranges of values, and references to external systems/devices/re-
sources [18]. There is a need for customized boilerplates based on the contents of a NL
requirement since requirement structures can vary significantly from one organization
to another and even within an organization.
This leads to the third sub-question under RQ 2:
Research Question 2.3
How can requirements classification, NER, POS tagging, and text chunking be
used for constructing boilerplates?
The answer to RQ 2.3 will help identify different boilerplate structures for non-
standardized aerospace requirements. To that end, a corpus-driven approach will be employed [108] for analyzing requirements to identify linguistic constructs. A semi-automatic, bottom-up approach for defining elements of boilerplates using sequential text mining techniques has been shown to be useful [109]. Warnier et al. [109] used the sequential data mining tool SDMC (Sequential Data Mining under Constraints) to find patterns in requirements text (in French) for two satellite projects. They initially ended up with ∼160,000 textual patterns and narrowed these down to 3,854 patterns by keeping only the common patterns observed in requirements belonging to both projects [109]. Finally, the number of observed linguistic patterns was reduced to 2,441 after discarding the patterns that also occurred in general-domain text such as newspaper articles [109].
In another study focused on requirement boilerplate structure, Kravari et al. [116]
proposed that a requirement can be divided into three parts, namely, a prefix, a main
part, and a suffix. While it is mandatory for a requirement to have one main part, it
can have multiple prefixes (preconditions) or suffixes (additional information about
a system’s action, etc.). An ontology, as well as user input, was used for identifying
boilerplate structures. While ontologies can be helpful, they are time-consuming to
create and do not translate from one system to another.
According to another study by Ibrahim et al. [93], predefined boilerplate struc-
tures were helpful for novice systems engineers in writing requirements in a consistent
and repeatable manner, hence reducing ambiguities and inconsistencies in NL require-
ments. The authors proposed boilerplates for both functional and non-functional
requirements (performance, specific quality, and constraint). NL requirements pertaining to an industrial case study (the MediNET healthcare system) were used to validate the proposed boilerplate structures [93]. The study did not provide insights into how these boilerplate structures were decided upon or whether they can be tailored to other systems of interest, hence making the methodology opaque.
All the studies discussed above focus on rule-based approaches to developing boilerplates or requirement templates, which reduces their viability in real-world industrial use cases [41]–[43], [117]. This is because the success of requirement boilerplates is heavily dependent on the consistency of requirements with the defined boilerplate structures. Oftentimes, NL requirements have variations and hence might not perfectly match the pre-defined boilerplates, which is usually the case in large development projects (with less control over requirement authoring environments) [32], [41]–[44], [95], [117], [118], leading to lower accuracy. Hence, there is a need for
an agile methodology for the creation of boilerplates/templates that are based on
dynamically identified syntactic patterns in requirements, which is a more adaptive
approach when compared to their rule-based counterparts.
Keeping in mind the need for an agile methodology for boilerplate creation, Fig-
ure 3.17 shows the process for the identification of linguistic constructs given different
variations of a requirement text. Despite the variations, the linguistic pattern ([Deter-
miner][Noun][Modal][Verb][Cardinal], etc.) stays the same as shown in the example.
Based on the frequency of different linguistic patterns along with information from
the NER and text chunking models, boilerplates can be constructed for various types
of requirements.
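As a simple illustration of this frequency-based idea, the sketch below counts how often each linguistic (here, chunk-level) pattern occurs across a set of requirements of one type; the pattern tuples are illustrative stand-ins for actual tagger/chunker outputs rather than results from this work.

```python
from collections import Counter

# Each requirement is reduced to the sequence of its sentence-chunk labels.
chunk_sequences = [
    ("NP", "VP", "NP", "PP", "NP"),   # e.g., "The system shall provide a means for ventilation"
    ("NP", "VP", "NP"),
    ("NP", "VP", "NP", "PP", "NP"),
    ("PP", "NP", "VP", "NP"),
]

pattern_counts = Counter(chunk_sequences)
for pattern, count in pattern_counts.most_common():
    print(count, " -> ".join(pattern))
# The most frequent patterns suggest the ordering of boilerplate elements for that requirement type.
```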
This leads to Hypothesis 2.3:
Hypothesis 2.3: If sequential text mining techniques are employed to detect lin-
guistic constructs, then boilerplates can be constructed based on the observed patterns
in the requirements.
To validate Hypothesis 2.3, sequential text mining techniques need to be developed
to identify linguistic constructs which will aid in the boilerplate construction process.
Experiment 2.3: Establish the capability to employ sequential text mining tech-
niques for identification of linguistic patterns in a semi-automated manner.
Hypothesis 2.3 will be substantiated upon the development of a methodology for
the identification of linguistic patterns for construction of boilerplates for standard-
izing requirements. The hypothesis will be rejected if the above capabilities cannot be achieved.
The detailed experimental steps for Experiment 2.3 are stated below:
• Step 3: aeroBERT-POStagger will be used for POS tagging the tokens in the
requirements.
• Step 5: Based on the frequencies of different linguistic patterns, elements of
the boilerplate and their order will be decided upon.
• Step 7: Iterations of the above steps will be carried out to make sure that the
boilerplate construction system is robust.
• Step 8: The validity and accuracy of the boilerplates generated will be verified
with the help of a Subject Matter Expert (SME).
3.5 Summary of observations, gaps, and research questions
The summary of the observations, gaps, and research questions is provided in Fig-
ure 3.18.
Figure 3.18: This figure presents a summary of the observations made, gaps identified, and research questions formulated.
Specifically, Gap 1 was identified and Research Question 1 was formulated based on observations 4 and 5, while Gap 2 was
identified, and Research Question 2 and its sub-research questions were formulated based on observations 6, 7, and 8.
CHAPTER 4
METHODOLOGY
Step Associated RQ
0 Creation of an annotated aerospace corpus
1 RQ 1.1: Fine-tuning of BERT for aerospace NER
2 RQ 2.1: Fine-tuning of BERT for classification of aerospace requirements
3 RQ 2.2: Fine-tuning of BERT for POS tagging
4 RQ 2.3: Creation of aerospace requirement boilerplates
4.1 Step 0: Development of an annotated aerospace corpus
Part Subject
Part 1 Definitions and Abbreviations
Part 21 Certification Procedures for Products and Parts
Part 23 Airworthiness Standards: Normal, Utility, Acrobatic and Commuter Airplanes
Part 27 Airworthiness Standards: Normal Category Rotorcraft
Part 33 Airworthiness Standards: Aircraft Engines
Part 36 Noise Standards: Aircraft Type and Airworthiness Certification
Part 39 Airworthiness Directives
To reiterate, three types of aerospace text, namely, technical aerospace texts such
as research papers, more general aerospace texts such as aviation news, and lastly
certification requirements from Title 14 CFR are used to create the corpus. The
following steps are proposed to develop an annotated aerospace corpus:
• Step 1: Aerospace texts are collected for the creation of the corpora.
• Step 2: The collected texts are pre-processed to remove equations, etc., and
to get the text into the correct format for the next step.
• Step 3: Annotation criteria for different downstream tasks are decided upon,
for example, the types of NEs that are of interest, types of requirements, and
type of POS tags to be considered. This is likely to be an iterative process.
– Different types of aerospace requirements are selected and labeled for the
classification task
The Python programming language is used for all the coding tasks required since
it is open-source and highly customizable for various tasks.
This section presents the methodology to carry out Experiment 1.1 and addresses RQ
1.1.
Figure 4.2 describes the experimental steps. The experiment begins with using
the annotated NE text corpus from Step 0 as an input for the fine-tuning process.
Various pre-processing steps are carried out, such as splitting the dataset into training
and test sets, deciding on the maximum length of the input sequences, and obtaining
the BERT embeddings. The pre-processed training set is used to fine-tune the BERT
Figure 4.2: Methodology for obtaining aeroBERT-NER
parameters for NER. Evaluation metrics such as precision, recall, and F1 score are
used to evaluate the model performance on the validation and test sets. In addition, the
performance of aeroBERT-NER [120] is compared with that of BERTBASE NER on
an aerospace requirements test set. Various iterations of the model training/testing
are carried out to make sure that the results are robust and reliable.
This section presents the methodology to carry out Experiment 2.1 and addresses RQ
2.1.
Figure 4.3 illustrates the experimental steps for Experiment 2.1, which aims at
developing a classifier for the automated classification of aerospace requirements.
The methodology starts with labeled aerospace requirements, which are divided into
training and test sets. The labeled training set of aerospace requirements is used
for fine-tuning the parameters of BERT LM for the classification task. Evaluation
metrics such as accuracy, precision, recall, and F1 score are used to evaluate the
model performance on the test set to account for the imbalanced dataset. Various
iterations of the model training/testing are carried out to make sure that the results are robust and reliable.
Figure 4.3: Methodology for obtaining aeroBERT-Classifier
Figure 4.4 provides a detailed view of the classification algorithm. The process
starts with an input sequence/sentence, which is tokenized using a WordPiece tok-
enizer. Special tokens such as [CLS], [SEP], and [PAD] are added to the tokenized
sequence to mark the beginning, the end, and the padding tokens. The [PAD] token is used to pad the sequence up to the set maximum length, which is dependent upon the distribution of sequence lengths. This pre-processed sequence serves as an input to the fine-tuning process. Only the [CLS] token is used for the classification task since
it is a pooled output of all tokens. An activation function is used for outputting the
probabilities of a given requirement belonging to each of the pre-defined classes. The
requirement is classified into the class which has the highest probability.
4.4 Step 3: Fine-tuning BERT for POS tagging of aerospace text (aeroBERT-
POStagger)
This section presents the methodology to carry out Experiment 2.2 and addresses RQ
2.2.
Figure 4.5 illustrates the various steps to carry out Experiment 2.2. The annotated
corpus with POS tags (from Step 0) serves as input to fine-tune BERT for POS
tagging. Evaluation metrics such as precision, recall, and F1 score are used to evaluate
the model performance on the test set. The fine-tuned model receives the test set
(which it has not seen before) and performs the POS tag predictions on it. Various
iterations of the model training/testing are carried out to make sure that the results
are robust and reliable.
4.5 Step 4: Creation of aerospace requirements boilerplates
Figure 4.6 lays out the steps to carry out Experiment 2.3, which contributes to the
goal of converting NL aerospace requirements into standardized/machine-readable
requirements. This step needs inputs from Steps 1, 2, and 3 (discussed previously).
Given multiple requirements belonging to a certain type, the frequency of different
linguistic patterns is identified. Based on the observed frequencies of different lin-
guistic patterns, the ordering of elements for boilerplates is identified. This process
is repeated for different types of requirements. Lastly, the verification of the validity
and accuracy of the proposed boilerplates is carried out by SMEs.
CHAPTER 5
IMPLEMENTATION
This chapter discusses the implementation of the methodology presented in the pre-
vious chapter.
A LM such as BERT can be fine-tuned for various downstream tasks, such as NER,
text classification, POS tagging, question answering, etc. Pre-trained parameters are
used to initialize the BERT LM and these parameters are then fine-tuned with the
help of labeled data for the downstream task. The use of different types of labeled
data will lead to different models that perform different tasks even though they were
initialized with the same parameters. For the purpose of this work, BERT LM is fine-
tuned for three different downstream tasks, namely, NER, requirements classification,
and POS tagging. As a result, three different annotated corpora need to be created,
as discussed in the following sections.
Table 5.1: Resources for creation of annotated aerospace NER corpus
A total of 1432 sentences related to the aerospace domain were incorporated into the corpus. Because annotating the corpus is a time-consuming task, a number of sentences sufficient to demonstrate the methodology was selected. For
a detailed breakdown of the types of sentences included, please refer to Figure 5.1.
These sentences were obtained by modifying the original text (when required) into
a proper format for the purpose of corpus creation. For example, 14 CFR §23.1457(g)
is shown in its original and modified form in Table 5.2. In this example, the original
text was converted into three distinct requirements so that each forms a complete sentence.
Table 5.2: Text modification and requirement creation
In addition to modifying the original text (Table 5.2), figures, tables, and equations were discarded from the original text. Some other changes were made to certain
entities, as shown in Table 5.3. Dots (‘.’) that were not sentence endings were replaced with a ‘-’ so as not to cause any confusion for the model, which is pre-trained on general-domain corpora. In addition, the symbol for a section (‘§’) was replaced with the word ‘Section’ to make it more intuitive for the model to learn patterns.
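A minimal sketch of this normalization step is shown below. The regular expressions are an assumption about one reasonable implementation (dots are only replaced when they sit between digits, as in section references), not the exact scripts used for corpus creation.

```python
import re

def normalize_aerospace_text(text):
    """Sketch of the text normalization described above (assumed regexes)."""
    # Replace the section symbol with the word "Section".
    text = text.replace("\u00a7", "Section ")
    # Replace dots that are not sentence endings (e.g., "23.1457" or "25.341")
    # with a hyphen so they are not mistaken for sentence boundaries.
    text = re.sub(r"(?<=\d)\.(?=\d)", "-", text)
    # Collapse any repeated whitespace introduced by the replacements.
    return re.sub(r"\s+", " ", text).strip()

print(normalize_aerospace_text("See \u00a7 23.1457(g) and Section 25.341."))
# -> "See Section 23-1457(g) and Section 25-341."
```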
The aerospace corpus obtained was then annotated using the BIO tagging scheme
(Table 3.5). For the purpose of aeroBERT-NER, five classes of NEs were identified
based on their frequency of occurrence in aerospace texts (Table 5.4): System (SYS),
Value (VAL), Date time (DATETIME), Organization (ORG), and Resource (RES).
After identifying the NE classes of interest, the aerospace corpus was annotated.
NEs were identified in the corpus and added to distinct lookup .txt files for each
class. For example, auxiliary power unit was added to the systems.txt file since it was
identified as a system. A detailed chronology for the creation of lookup files is shown
Table 5.4: Types of named-entities identified
Category | NER Tags | Example
System | B-SYS, I-SYS | exhaust heat exchangers, powerplant, auxiliary power unit
Value | B-VAL, I-VAL | 1.2 percent, 400 feet, 10 to 19 passengers
Date time | B-DATETIME, I-DATETIME | 2013, 2019, May 11, 1991
Organization | B-ORG, I-ORG | DOD, Ames Research Center, NOAA
Resource | B-RES, I-RES | Section 25-341, Sections 25-173 through 25-177, Part 23 subpart B
in Figure 5.2. These lookup files were then used for semi-automated NE annotation
of the aerospace corpus.
Figure 5.2: Flowchart showing creation of lookup files for NER annotation
For the NER annotation, if a token or sequence of tokens is found in the lookup .txt
files, then it is tagged according to the name of the .txt file in which the token/tokens
were found. For example, if exhaust system is found in the systems.txt file, then
it will be tagged as a system. The flowchart detailing the NER annotations for
the requirement “The exhaust system, including exhaust heat exchangers for each
powerplant or auxiliary power unit, must provide a means to safely discharge potential
harmful material”, is shown in Figure 5.3. All the annotations were done according
to the BIO tagging scheme provided in Table 3.5.
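A minimal sketch of this lookup-based annotation idea is shown below. The file names (e.g., systems.txt) follow the examples given in the text, while the matching logic (longest phrase match first, everything else tagged "O") is an assumption about one reasonable implementation rather than the exact annotation scripts used.

```python
def load_lookup(path, label):
    """Read one lookup file (one entity phrase per line) and pair it with its NE label."""
    with open(path, encoding="utf-8") as f:
        return [(line.strip().split(), label) for line in f if line.strip()]

def bio_annotate(tokens, lookups):
    tags = ["O"] * len(tokens)
    # Try longer phrases first so multi-token entities are not split up.
    for phrase, label in sorted(lookups, key=lambda x: -len(x[0])):
        n = len(phrase)
        for i in range(len(tokens) - n + 1):
            window = [t.lower() for t in tokens[i:i + n]]
            if window == [p.lower() for p in phrase] and all(t == "O" for t in tags[i:i + n]):
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{label}"
    return tags

# Hypothetical usage with the lookup files named in the text:
# lookups = load_lookup("systems.txt", "SYS") + load_lookup("resources.txt", "RES")
lookups = [(["auxiliary", "power", "unit"], "SYS"), (["exhaust", "heat", "exchangers"], "SYS")]
tokens = "The exhaust heat exchangers for each auxiliary power unit must be safe .".split()
print(list(zip(tokens, bio_annotate(tokens, lookups))))
```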
Lastly, Table 5.5 shows the number of times each NER tag occurs in the aerospace
corpus created for aeroBERT-NER. System names (B-SYS, I-SYS) occur the most
often, whereas DATETIME entities occur the least often in the corpus. The dataset
for NER contains a total of 44,033 tokens.
Figure 5.3: Flowchart showing NER annotation methodology for the requirement
“The exhaust system, including exhaust heat exchangers for each powerplant or auxil-
iary power unit, must provide a means to safely discharge potential harmful material.”
It took a total of four months to collect data and perform the initial annotation for
Named Entity Recognition (NER). The annotation task was carried out by one human
annotator who possessed knowledge of the aerospace domain. To ensure consistency,
a second and third review of the annotation was conducted.
Table 5.5: NER tags and their counts in aerospace corpus for aeroBERT-NER
5.1.2 Annotated corpus for aerospace requirements classification
Table 5.6: Resources used for the creation of aerospace requirements classification
corpus
Table 5.7: Definitions used for labeling/annotating requirements [5], [7], [121]
Example: The airplane must be free from flutter, control reversal, and
divergence for any configuration and condition of operation.
Interface: Defines the interaction between systems [122];
Figure 5.4: Six “types” of requirements were initially considered for the classifica-
tion corpus. Due to the lack of sufficient examples for Interface, Environment, and
Quality requirements, these classes were dropped at a later phase. However, some of
the Interface requirements (23) were rewritten (or reclassified) to convert them into
either Design or Functional requirements to keep them in the final corpus, which only
contains Design, Functional, and Performance requirements.
To obtain a more balanced dataset, Environment, and Quality requirements were
dropped completely. However, some of the Interface requirements (23) were rewritten
(or reclassified) as Design and Functional requirements, as shown in Table 5.8. The
rationale for this reclassification was that it is possible to treat the interface as a thing
being specified rather than as a special requirement type between two systems.
Table 5.8: Examples showing the modification of Interface requirements into other
“types” of requirements
The final form of the dataset is shown in Table 5.9.
Requirements – Label
• Each cockpit voice recorder shall record voice communications transmitted from or received in the airplane by radio. – 1
• Each recorder container must be either bright orange or bright yellow. – 0
• Single-engine airplanes, not certified for aerobatics, must not have a tendency to inadvertently depart controlled flight. – 2
• Each part of the airplane must have adequate provisions for ventilation and drainage. – 0
• Each baggage and cargo compartment must have a means to prevent the contents of the compartment from becoming a hazard by impacting occupants or shifting. – 1
The requirements dataset was collected and annotated by a single human anno-
tator with expertise in the aerospace domain, which took a total of two months. An
SME was consulted during the process and various iterations of labeling and review
were conducted to ensure the consistency of the labeling.
After having developed two different aerospace corpora, the next step was to
use them to fine-tune BERT LM for aerospace NER and requirements classification.
Various variants of BERT were fine-tuned for aerospace NER and requirements clas-
sification, the detailed methodology for which is discussed in the next sections.
The detailed methodology for fine-tuning BERT for aerospace named-entity recogni-
tion (NER) is discussed below.
5.2.1 Preparing the dataset for fine-tuning BERT for aerospace NER
BERT LM expects the input sequence to be in a certain format. The 1432 sentences
in the aerospace corpus were divided into training (90%) and validation sets (10%).
The text in the training set was then tokenized using the WordPiece tokenizer. If
a word is not present in BERT’s vocabulary, the WordPiece tokenizer splits it into
subwords. The “##” prefix denotes that a token continues the preceding token without intervening whitespace (Figure C.7). Therefore, tokens with the “##” prefix should be concatenated with the preceding token when converting back to a string. It is important to use the same
tokenizer that the model was pre-trained with.
The maximum length of the input sequence was set to 175 after examining the distribution of lengths of all sequences in the training set (Figure 5.5). The length of the longest sequence in the aerospace corpus was found to be 172, while the 95th percentile was found to be 68. If a sequence has a length less than the set maximum length, it is post-padded with [PAD] tokens until the sequence length is equal to the maximum length. Sequences longer than 175 were truncated. It is crucial to consider the lengths of sequences when working with
language models since they have limitations in processing text of certain lengths. If
a sequence exceeds the maximum length the language model can handle, the excess
text is truncated, resulting in the loss of valuable information.
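The sketch below illustrates this step with the HuggingFace tokenizer: the distribution of tokenized lengths is inspected to choose a maximum length, and sequences are then padded or truncated to that length. The two sentences shown stand in for the full training split, and the variable names are illustrative.

```python
import numpy as np
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
sentences = [
    "Each cockpit voice recorder shall record voice communications transmitted "
    "from or received in the airplane by radio.",
    "Trim control systems must be designed to prevent creeping in flight.",
]  # in practice: the full training split of the aerospace corpus

# Inspect the distribution of tokenized sequence lengths.
lengths = [len(tokenizer.tokenize(s)) for s in sentences]
print("max:", max(lengths), "95th percentile:", np.percentile(lengths, 95))

# Choose a maximum length at or above the longest training sequence (175 was used here);
# shorter sequences are post-padded with [PAD] and longer ones are truncated.
MAX_LEN = 175
encoded = tokenizer(sentences, padding="max_length", truncation=True, max_length=MAX_LEN)
print(len(encoded["input_ids"][0]))  # 175
```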
Special tokens such as [CLS], [SEP], and [PAD] were added to every requirement.
Post-padding was performed. All the tokens were converted into their respective “ids”
(numbers), and all the [PAD] tokens were tagged as 0. This gives the model an idea
about which tokens carry “actual” information as compared to padding tokens. In
addition, the tags associated with each sequence were converted into “ids” as well, as
shown in Table 5.10. Lastly, attention masks (1 for “real” tokens, and 0 for [PAD]
tokens) were obtained for all the sequences in the training set.
Table 5.11 shows an example of a sequence, its associated ids, tags, and attention
mask for the sentence, “It must be shown by analysis or test, or both, that each operable
reverser can be restored to the forward thrust position.” Three columns serve as an
input to the BERT model for fine-tuning, namely token ID, tag ID, and attention
mask.
Figure 5.5: Choosing maximum length of the input sequence for training aeroBERT-NER
The text corpus with annotated named entities was used for the fine-tuning pro-
cess during which the BERT LM parameters were fine-tuned for NER within the
aerospace domain. All four variants of BERT were fine-tuned, namely BERTLU, BERTLC, BERTBU, and BERTBC, to obtain their aeroBERT-NER counterparts. A full pass over the training set was performed in each epoch. The batch size was set to 32. The Adam optimizer with a learning rate of 3 × 10^-5 was used and the model was trained for 20 epochs. The dropout rate was set to the default value of 0.1 to
promote the generalizability of the model and avoid overfitting. The training losses
were tracked and the norms of the gradients were clipped to avoid the “exploding
gradient” problem. The model performance was measured on the validation set for
each epoch. The performance metrics precision, recall, and F1 score were used to
Table 5.10: NER tags and their corresponding IDs that were used for this work
evaluate the model on the validation set since the dataset was imbalanced in regard
to different named entity classes. Model training/validation was carried out many
times to make sure that the results were robust and reliable. The fine-tuning process
took 268.6 seconds for aeroBERT-NERBC on an NVIDIA Quadro RTX 8000 GPU
with 40 GB VRAM. This training time, however, more than doubles to 873.9 seconds
for aeroBERT-NERLC on the same GPU. This difference in training times can be
attributed to the fine-tuning of 110M parameters for aeroBERT-NERBC as compared
to 340M for aeroBERT-NERLC .
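A minimal sketch of such a token-classification fine-tuning loop, using HuggingFace's BertForTokenClassification with the hyperparameters stated above (batch size 32, Adam with a learning rate of 3 × 10^-5, gradient-norm clipping), is shown below. The batch keys and the label count are assumptions about the data layout (ten B-/I- tags from Table 5.4 plus the O tag), not the exact training script used in this work.

```python
import torch
from torch.optim import Adam
from transformers import BertForTokenClassification

NUM_LABELS = 11  # 5 entity classes x (B-, I-) + the O tag (assumed label count)

model = BertForTokenClassification.from_pretrained("bert-base-cased", num_labels=NUM_LABELS)
optimizer = Adam(model.parameters(), lr=3e-5)

def train_one_epoch(model, optimizer, loader, device="cpu"):
    """One full pass over the training set (one epoch)."""
    model.train()
    for batch in loader:  # assumed batch keys: token IDs, attention masks, and tag IDs
        optimizer.zero_grad()
        outputs = model(
            input_ids=batch["input_ids"].to(device),
            attention_mask=batch["attention_mask"].to(device),
            labels=batch["labels"].to(device),
        )
        outputs.loss.backward()
        # Clip gradient norms to avoid the "exploding gradient" problem.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
```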
Figure 5.6 shows the methodology used for fine-tuning BERT for NER for the
requirement, “Trim control systems must be designed to prevent creeping in flight”.
The embeddings for the WordPiece tokens are fed into the pre-trained LM to compute
the token representations by the BERT encoder. The representations are then fed
through a linear classification layer. Here, scores measuring how likely the token
belongs to each named entity category are computed. The token is classified into
the category with the highest score. For example, Trim is the beginning of a system
name (B-SYS) in this case. This process is repeated for every token in the sequence
to classify it into a particular NER category.
To explore the trends in F1 scores on the validation set, several subsets of the
Table 5.11: Tokens, token ids, tags, tag ids, and attention masks
Figure 5.6: The detailed methodology used for full-fine-tuning of BERT LM for aerospace NER is shown here. E[name]
represents the embedding for that particular WordPiece token which is a combination of position, segment, and token
embeddings. R[name] is the representation for every token after it goes through the BERT model. This representation then
passes through a linear layer and is classified into one of the NER categories by the highest estimated probability.
NER on aerospace text.
The detailed methodology for fine-tuning BERT for aerospace requirements classifi-
cation is discussed below.
5.3.1 Preparing the dataset for fine-tuning BERT for classification of aerospace
requirements
As mentioned previously, LMs expect inputs in a certain format, and this may vary
from one LM to another based on how the model was pre-trained. The dataset
was split into training (90%) and test (10%) sets containing 279 and 31 samples,
respectively (the corpus contains a total of 310 requirements). Table 5.12 provides a
detailed breakdown of the count of each type of requirement in the training and test
sets. The LM was fine-tuned on the training set, whereas the model performance was
tested on the test set, which the model had not been exposed to during training.
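A minimal sketch of this 90/10 split is shown below; a stratified split (mentioned in Experiment 2.1 as an option for the imbalanced classes) is used for illustration, and the two repeated example requirements merely stand in for the 310 annotated requirements and their labels.

```python
from sklearn.model_selection import train_test_split

# Stand-ins for the 310 annotated requirements and their class labels (0, 1, 2).
requirements = ["Each recorder container must be either bright orange or bright yellow.",
                "Each cockpit voice recorder shall record voice communications transmitted "
                "from or received in the airplane by radio."] * 155
labels = [0, 1] * 155

train_texts, test_texts, train_labels, test_labels = train_test_split(
    requirements, labels,
    test_size=0.10,          # 31 of the 310 requirements are held out for testing
    stratify=labels,         # keep the class proportions similar in both splits
    random_state=42,
)
print(len(train_texts), len(test_texts))  # 279 31
```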
Table 5.12: Breakdown of the types of requirements in the training and test set for
aeroBERT-Classifier
The text in the training set was then tokenized using the WordPiece tokenizer.
The maximum length for the input sequences was set to 100 after examining the
distribution of lengths of all sequences (requirements) in the training set (Figure 5.7).
All the sequences with a length less than the set maximum length were post-padded
Figure 5.7: Figure showing the distribution of sequence lengths in the training set
used for training aeroBERT-Classifier. The 95th percentile was found to be 62. The
maximum sequence length was set to 100 for the aeroBERT-Classifier model.
with [PAD] tokens till the sequence length was equal to the maximum length. The
sequences that were longer than 100 were truncated. Special tokens such as [CLS],
[SEP], and [PAD] were added to every requirement.
All four variants of BERT were fine-tuned, namely BERTLU, BERTLC, BERTBU, and BERTBC, to obtain their aeroBERT-Classifier counterparts. Hence, the input
sequences were either cased or uncased depending on the BERT variant being fine-
tuned. The tokens present in every sentence were mapped to their respective IDs in
BERT’s vocabulary of 30,000 tokens. Lastly, attention masks (1 for “real” tokens,
and 0 for [PAD] tokens) were obtained for all the sequences in the training set. Only
the token ID and the attention mask columns are used for fine-tuning BERT for
requirements classification, as shown in Table 5.13. In addition to these columns, the
model is provided with a label for each requirement example.
Table 5.13: Tokens, token IDs, and attention masks for the requirement “It must be
shown by analysis or test, or both, that each operable reverser can be restored to the
forward thrust position” is shown. Only token IDs and attention masks along with
the requirement label are provided as inputs to the BERT model for fine-tuning.
5.3.2 Fine-Tuning BERT LM for aerospace requirements classification
A pre-trained BERT model with a linear classification layer on top is loaded from the HuggingFace Transformers library (BertForSequenceClassification). This
model and the untrained linear classification layer (full-fine-tuning) are trained on the
classification corpus created previously (Table 5.9).
Figure 5.8: The example demonstrates a classification dataset with a batch size of
three sentences, along with the input IDs for the corresponding tokens and their
respective attention masks. However, it’s important to note that the batch size for
the aeroBERT-Classifier is actually 16. The special tokens [CLS], [PAD], and [SEP]
are assigned the IDs 101, 0, and 102 respectively.
The batch size was set to 16 (Figure 5.8) and the model was trained for 20 epochs.
The model was supplied with three tensors for training: 1) input IDs; 2) attention
masks; and 3) labels for each example. The AdamW optimizer [124] with a learning
rate of 2 × 10^-5 was used. The previously calculated gradients were cleared before
performing each backward pass. In addition, the norm of gradients was clipped to 1.0
to prevent the “exploding gradient” problem. The dropout rate was set to the default
value of 0.1 (after experimenting with other rates) to promote the generalizability
of the model and speed up the training process. The model was trained to minimize
the cross-entropy loss function.
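A minimal sketch of a single training step with BertForSequenceClassification and the hyperparameters stated above (AdamW with a learning rate of 2 × 10^-5, gradient clipping to 1.0, three classes) is shown below for one example requirement. It is a simplified illustration, not the exact training script; in practice the step runs over mini-batches of 16 for 20 epochs.

```python
import torch
from torch.optim import AdamW
from transformers import BertForSequenceClassification, BertTokenizer

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
optimizer = AdamW(model.parameters(), lr=2e-5)

requirement = ("Each cockpit voice recorder shall record voice communications "
               "transmitted from or received in the airplane by radio.")
inputs = tokenizer(requirement, padding="max_length", truncation=True,
                   max_length=100, return_tensors="pt")
labels = torch.tensor([1])  # the label given to this requirement in Table 5.9

optimizer.zero_grad()                       # clear previously calculated gradients
outputs = model(**inputs, labels=labels)    # cross-entropy loss is computed internally
outputs.loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip gradient norm to 1.0
optimizer.step()

predicted_class = outputs.logits.argmax(dim=-1).item()  # class with the highest probability
```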
Figure 5.9: The detailed methodology used for full-fine-tuning of BERT LM for aerospace requirements classification is shown here. E[name] represents the embedding for that particular WordPiece token, which is a combination of position, segment, and token embeddings. R[name] is the representation for every token after it goes through the BERT model. Only R[CLS] is used for requirement classification since its hidden state contains the aggregate sequence representation.
The model performance on the test set was measured by calculating metrics,
including the F1 score, precision, and recall. Model training and testing were carried
out multiple times to make sure that the model was robust and reliable. The fine-
tuning process took only 39 seconds for aeroBERT-ClassifierBU and 174 seconds for
aeroBERT-ClassifierLU on an NVIDIA Quadro RTX 8000 GPU with a 40 GB VRAM.
The short training time can be attributed to the small training set.
Figure 5.9 shows a rough schematic of the methodology used for fine-tuning the
BERT LM for requirement classification for the requirement, “Trim control systems
must be designed to prevent creeping in flight”. The token embeddings are fed into the
pre-trained LM and the representations for these tokens are obtained after passing
through 12 encoder layers. The representation for the first token (R[CLS] ) contains the
aggregate sequence representation and is passed through a pooler layer (with a Tanh
activation function) and then a linear classification layer. Class probabilities for the
requirement belonging to the three categories (design, functional, and performance)
are estimated and the requirement is classified into the category with the highest
estimated probability, ‘Design’ in this case.
In order to evaluate its overall performance, the various variants of aeroBERT-
Classifier were compared with other language models that employ different architec-
tures. Details of the models used in the comparison are presented in Table 5.14.
While two of the models employed for comparison with aeroBERT-Classifier are
based on transformers (Table 5.14), the Bi-LSTM (with GloVe) utilizes pre-trained
GloVe embeddings to train the Bi-LSTM from scratch. By leveraging the knowl-
edge embedded in the pre-trained word embeddings, which have been trained on a large corpus of text data, this initialization accelerates training and boosts the performance of the model, particularly when training data is limited.
Table 5.14: To evaluate the performance of aeroBERT-Classifier, several language
models with distinct architectures and training methods were employed for bench-
marking purposes.
The initial proposition was to fine-tune BERT LM for POS tagging of aerospace text.
Upon further inspection, however, an off-the-shelf LM called flair/pos-english
[114] was deemed adequate for POS tagging aerospace text. This LM is trained on
Ontonotes [125], which contains text from news articles and broadcasts, telephone conversations, etc. Despite the differences between the data used for training flair/pos-english and aerospace text, this LM is expected to work since POS tags lie at the very basis of the English language and hence are expected to generalize
fairly well to previously unseen text.
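Using the off-the-shelf model is straightforward, as the short sketch below illustrates for one aerospace requirement; the flair API calls follow the library's standard usage, and the example sentence is illustrative.

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# POS tagging an aerospace requirement with the off-the-shelf flair/pos-english model.
tagger = SequenceTagger.load("flair/pos-english")
sentence = Sentence("Trim control systems must be designed to prevent creeping in flight.")
tagger.predict(sentence)
print(sentence.to_tagged_string())  # each token followed by its predicted POS tag
```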
Another reason for using an off-the-shelf LM for POS tagging is that training/fine-tuning a POS model requires a large amount of annotated text, given the amount of intraclass variation within each POS class. Hence, training such a model from scratch
would require a substantial amount of annotated aerospace POS datasets as compared
to the small annotated corpora used for aeroBERT-NER and aeroBERT-Classifier.
Figure 5.10 shows two examples illustrating intraclass variations seen in POS tags
using flair/pos-english LM [114]. In example 1, “measurement”, “system”, etc.
are tagged as Nouns (NN). However, in example 2, “5-passenger” was also tagged as a
Noun (NN). Despite both being nouns, the word “5-passenger” looks very different as
compared to the other two. Similarly, “one-third” (from example 1) and “2.3.4” (from
example 2) are both tagged as Cardinal numbers (CD), despite the variation observed
in them. Hence, a copious amount of annotated text data is required for training/fine-
tuning a POS (or text chunking model) to capture the intraclass variations. The
examples discussed here give a preview of the variations and are not all-encompassing.
The main idea behind using POS tags (and NER) is to be able to identify linguis-
tic patterns (if any) in different types of aerospace requirements. Hence, aeroBERT-
Classifier [126] was first used to classify requirements into various types before per-
forming any POS tag or NER-based analysis. Tokens in the three types of require-
ments (design, function, and performance) were tagged with their respective POS and
NEs and then analyzed to observe any patterns that might emerge. These patterns
will then be used for the standardization of requirements.
5.4.1 Observations regarding POS tags and patterns in requirements
Figure 5.11 shows a Sankey diagram representing the observed POS tag patterns
in all of the requirements that were classified as design requirements by aeroBERT-
Classifier. As can be seen from the figure, most of the requirements start with a
determiner (DET), namely, each, the, etc. DETs are mostly followed by nouns, which can be system names or parts of a system name when it spans more than one word.
Figure 5.11: Sankey diagram showing the POS tag patterns in design requirements.
It is difficult to observe any discernable patterns due to the noise in the POS tags.
A part of the figure is shown due to space constraints, however, the full diagram can
be found here.
Similarly, Figure 5.12 shows a Sankey diagram representing the observed POS tag
patterns in all of the requirements that were classified as functional requirements by
the aeroBERT-Classifier. Patterns similar to that observed for design requirements
were observed for functional requirements as well.
Lastly, Figure 5.13 shows the POS patterns for performance requirements.
As can be observed from Figure 5.11, Figure 5.12, and Figure 5.13, POS tags
by themselves can be noisy (since each word has a POS tag associated with it),
Figure 5.12: Sankey diagram showing the POS tag patterns in functional require-
ments. Most of the requirements start with a determiner (DET), followed by a NOUN
which is usually a system name. A part of the figure is shown due to space constraints,
however, the full diagram can be found here.
Figure 5.13: Sankey diagram showing the POS tag patterns in performance require-
ments. A part of the figure is shown due to space constraints, however, the full
diagram can be found here.
making them less useful when it comes to identifying patterns in requirements for
standardization. For example, in the first requirement in Figure 5.10, the system name
(measurement system which is a noun) spans two words as compared to the system
name in the second requirement (air-taxi) which is just one word. The differences
in the spans of different words (such as system names, resource names, etc.) and
hence their POS tags, render pattern-matching exercises difficult. While the example presented here demonstrates this in the case of nouns, the same issue exists for other
POS tags as well. Therefore, sentence chunks (Figure 5.14), which are an aggregation
of POS tags [54], were deemed more useful when it comes to identifying patterns
in NL requirements and reorganizing requirement structures to make them follow
a standardized template. Sentence/text chunks provide information regarding the
syntactic structure of a sentence while reducing the variability as seen in the case of
POS tags, hence rendering pattern matching easier.
Figure 5.14: An aerospace requirement along with its POS tags and sentence chunks.
Each word has a POS tag associated with it; these tags can then be combined to
obtain a higher-level representation called sentence chunks (NP: Noun Phrase; VP:
Verb Phrase; PP: Prepositional Phrase).
Various LMs exist for sentence/text chunking [114] and can be accessed via the Hugging Face platform (flair/chunk-english). Since NL aerospace requirements are written in English, these models are expected to extract the appropriate text chunks without requiring any fine-tuning on aerospace text. Some of the text chunks, along with their definitions and examples, are shown in Table 5.15.
Table 5.15: A subset of text chunks along with definitions and examples is shown here [114], [127]. The bold text highlights the type of text chunk of interest.
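To illustrate how these chunks can be extracted in practice, the snippet below is a minimal sketch that applies the off-the-shelf flair/chunk-english model to a sample requirement. The requirement text is illustrative, and the span label key ("np") follows the model card for this chunker; it may differ slightly across Flair versions.

from flair.data import Sentence
from flair.models import SequenceTagger

# Load the off-the-shelf English chunking model from the Hugging Face Hub
tagger = SequenceTagger.load("flair/chunk-english")

# An illustrative aerospace requirement
requirement = Sentence("Each cockpit voice recorder shall record voice communications received in the airplane by radio.")

# Predict sentence chunks (NP, VP, PP, SBAR, etc.)
tagger.predict(requirement)

# Print each chunk span together with its predicted tag
for chunk in requirement.get_spans("np"):
    print(chunk)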
Design Requirements
Of the 149 design requirements (Table 5.12), 139 started with a noun phrase (NP), 7
started with a prepositional phrase (PP), 2 started with a subordinate clause (SBAR),
and only 1 started with a verb phrase (VP). In 106 of the requirements, NPs were
followed by a VP or another NP. The detailed sequence of patterns is shown in
105
Figure 5.15.
Figure 5.15: Sankey diagram showing the text chunk patterns in design requirements.
A part of the figure is shown due to space constraints; however, the full diagram can
be found here.
Figure 5.16: Examples 1 and 2 show design requirements beginning with a prepositional phrase (PP) and a subordinate clause (SBAR), respectively, which is uncommon in the requirements dataset used for this work. The uncommon starting sentence chunks (PP,
SBAR) are however followed by a noun phrase (NP) and verb phrase (VP). Most of
the design requirements start with an NP.
Figure 5.17: Example 3 shows a design requirement starting with a verb phrase (VP).
Example 4 shows the requirement starting with a noun phrase (NP) which was the
most commonly observed pattern.
In Example 3 (Figure 5.17), the chunking model tagged the requirement as starting with a verb phrase (VP); to the best of the author's knowledge, this is a mis-tagging of the system name “Balancing tabs”. Such discrepancies can be resolved by simultaneously accounting for the named entities (NEs) identified by aeroBERT-NER for the same requirement. For this example, “Balancing tabs” was identified as a system name (SYS) by aeroBERT-NER and should therefore be an NP by default. In places where the text chunking and NER models
disagree, the results from the NER model take precedence since it is fine-tuned on
an annotated aerospace corpus and hence has more context regarding the aerospace
domain.
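A minimal sketch of such a precedence rule is shown below. It assumes that the outputs of both models have already been converted into (start, end, label) character spans; the function and example values are illustrative, not the exact implementation used in this work.

from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label) character offsets

def merge_spans(chunk_spans: List[Span], ner_spans: List[Span]) -> List[Span]:
    """Combine chunker and NER spans, letting the NER spans override any
    chunk span they overlap with, since aeroBERT-NER is fine-tuned on an
    annotated aerospace corpus and is therefore trusted more."""
    def overlaps(a: Span, b: Span) -> bool:
        return a[0] < b[1] and b[0] < a[1]

    merged = list(ner_spans)
    for chunk in chunk_spans:
        if not any(overlaps(chunk, ne) for ne in ner_spans):
            merged.append(chunk)
    return sorted(merged, key=lambda s: s[0])

# Illustrative example: the chunker mis-tags "Balancing tabs" (characters 0-14)
# as part of a VP, while aeroBERT-NER tags it as a system name (SYS); SYS wins.
chunk_output = [(0, 14, "VP"), (15, 60, "VP")]
ner_output = [(0, 14, "SYS")]
print(merge_spans(chunk_output, ner_output))  # [(0, 14, 'SYS'), (15, 60, 'VP')]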
Functional Requirements
Of the 100 requirements classified (by aeroBERT-Classifier) as functional, 84 started
with a noun phrase (NP), 10 started with a prepositional phrase (PP), and 6 started
with a subordinate clause (SBAR). The majority of the NPs are followed by a VP
(69). The detailed sequence of patterns is shown in Figure 5.18.
Figure 5.18: Sankey diagram showing the text chunk patterns in functional require-
ments. A part of the figure is shown due to space constraints; however, the full
diagram can be found here.
As can be observed from Figure 5.18, the functional requirements used for this work either start with an NP, a PP, or an SBAR. Figure 5.19 shows examples of functional
requirements beginning with these three types of sentence chunks.
The functional requirements beginning with an NP have system names in the be-
ginning (Example 1 of Figure 5.19). However, this is not the case for requirements
that start with a condition, as shown in Example 3.
Performance Requirements
Of the 61 requirements classified as performance, 53 started with a noun phrase (NP),
and 8 started with a prepositional phrase (PP). The majority of the NPs are followed
by a VP (39). The detailed sequence of patterns is shown in Figure 5.20.
Figure 5.19: Examples 1, 2, and 3 show functional requirements starting with an NP,
PP, and SBAR. Most of the functional requirements start with an NP, however.
Figure 5.20: Sankey diagram showing the text chunk patterns in performance re-
quirements. A part of the figure is shown due to space constraints; however, the full
diagram can be found here.
Figure 5.21: Examples 1 and 2 show performance requirements starting with NP and
PP respectively.
In Example 2 (Figure 5.21), quantities containing cardinal numbers, such as “400 feet” and “1.5 percent”, are both tagged as NP; however, there is no way to distinguish between the different variations in NPs (NPs containing cardinal numbers vs. those that do not). Using aeroBERT-NER in conjunction with flair/chunk-english is expected to distinguish different types of entities beyond their text chunk tags, which is helpful for ordering entities in a
requirement. The same idea applies to resources (RES, for example, Section 25-395)
as well.
The outputs of the language models described above can be leveraged in two ways (Figure 5.22), namely, 1) the creation of a requirements table, and 2) boilerplate identification.
The subsequent sections provide background on requirements tables and describe the approach for constructing a requirements table using language models.
Background
A requirement table can be created in SysML and is used to capture requirements
in a spreadsheet-like environment. Each row in the table represents a requirement,
and additional columns can be added to capture the attributes assigned to the re-
quirement or system model elements related to it. This table can be used to filter
requirements of interest and their associated properties. In addition, given its tab-
ular format, the requirements table can be easily exported into a Microsoft Excel
spreadsheet [128]. This table can be used for automated requirements analysis and
modeling as requirements change with time, hence leading to time and cost savings
[129]. A typical SysML requirements table is shown in Figure 5.23.
Figure 5.23: A SysML requirements table with four columns, namely, Id, Name, Text,
and Satisfied by. It typically contains these four columns by default; however, more
columns can be added to capture other properties pertaining to the requirement.
Riesener et al. [130] illustrate a pipeline for the generation of the SysML require-
ments table using a dictionary-based method. Relevant domain-specific keywords
were added to a dictionary, and these words were then extracted when they occurred
in mechatronics requirements text. Technical units along with related quantities,
material properties, and manufacturing processes were the named-entity types of in-
terest for this work [130]. The effort required to create a dictionary for the extraction of words of interest is substantial, and such a dictionary does not generalize across different projects. In addition, the dictionary needs to be updated from time to time
to keep up with the occurrence of new units, etc. Therefore, employing language models to extract pertinent named entities (such as system names, values, and units) is more adaptable and scalable than dictionary-based methods, which are either static or require laborious manual revisions.
Despite the advantages offered by a requirement table, the creation of such a table is largely manual and hence prone to human error, given the repetitive tasks involved in populating the various columns of interest. Therefore, various LMs can be used to automate the creation of an Excel spreadsheet (requirement table) containing some of the columns of interest, which in turn aids the creation of a similar table in SysML.
The order of the language models used for identifying requirements boilerplate tem-
plates is depicted in Figure 5.24.
Table 5.16: List of language models used to populate the columns of requirement
table.
Initially, all the requirements underwent classification into three categories, namely
design, functional, and performance, using the aeroBERT-Classifier language model.
Out of the total requirements, 149 were classified as design, 100 as functional, and
61 as performance. Additionally, to maintain consistency among the requirements,
all variations such as “should have”, “must have”, etc., were changed to “shall have”.
Following this, the requirements were tagged for sentence chunks and named enti-
ties using flair/chunk-english and aeroBERT-NER, respectively. An illustration
exemplifying this process is presented in Figure 5.25.
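The classification and normalization steps can be sketched as follows. The local model path, the returned label names, and the regular expression are illustrative assumptions; they stand in for the fine-tuned aeroBERT-Classifier checkpoint and the exact normalization rules used in this work.

import re
from transformers import pipeline

# Illustrative path to a locally saved, fine-tuned aeroBERT-Classifier checkpoint
classifier = pipeline("text-classification", model="./aeroBERT-classifier")

def normalize_modal(requirement: str) -> str:
    """Replace modal variations such as 'should' or 'must' with 'shall'
    so that the requirements remain consistent before further processing."""
    return re.sub(r"\b(should|must)\b", "shall", requirement, flags=re.IGNORECASE)

req = "Each cockpit voice recorder must record voice communications received in the airplane by radio."
req = normalize_modal(req)
print(req)              # "... shall record ..."
print(classifier(req))  # e.g., [{'label': 'functional', 'score': ...}]; label names depend on the saved model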
This was followed by going through the original requirement text and identifying broad, high-level patterns. After analyzing the three types of requirements, it was discovered that there was a general textual pattern irrespective of the type, as shown in Figure 5.26.
Figure 5.25: Example demonstrating the boilerplate identification pipeline for the requirement “Each cockpit voice recorder shall record voice communications transmitted from or received in the airplane by radio.”
Figure 5.26: The general textual pattern observed in the requirements was 〈prefix〉
+ 〈body〉 + 〈suffix〉 out of which prefix and suffix are optional and can be used to
provide more context about the requirement.
The Body section of the requirement usually starts with an NP (system name), which in most cases contains a SYS named entity, whereas the beginning of a Prefix or Suffix is usually marked by a subordinate clause (SBAR) or prepositional phrase (PP), namely, ‘so that’, ‘so as’, ‘unless’, ‘while’, ‘if’, ‘that’, etc. Both the Prefix and Suffix provide more context for a requirement and are thus likely to contain conditions, exceptions, the state of the system after a function is performed, etc. The Suffix can contain various types of NEs, such as names of resources (RES), values (VAL), and other system or sub-system names (SYS), that add more context to the requirement. It is mandatory for a requirement to have a Body; however, both prefixes and suffixes are optional.
The contents of an aerospace requirement can be broadly classified into various
elements, such as system, functional attribute, state, condition, etc. The details
about the different elements of a requirement are described in Table 5.17 along with
examples. The presence or absence of these elements, along with their sequence/order
is a distinguishing factor for requirements belonging to different types as well as
requirements within a type, hence giving rise to different boilerplate structures.
Element | Definition
〈condition〉 | Specifies details about the external circumstance, system configuration, or current system activity under which a system is present while performing a certain function, etc.
〈sub-system/system〉 | Specifies any additional system/sub-system that the main system shall include, or shall protect in case of a certain operational condition, etc.
The information about the sentence chunks and NEs helped in the identification
of elements (described in Table 5.17) present in different requirements and hence
contributed to the identification of relevant boilerplate patterns. The methodology
used for the identification of patterns in requirements is discussed in the following
subsection.
5.5.3 Identification of boilerplates by examining patterns in requirements text
This study focuses on developing an agile methodology to identify and create boiler-
plate templates based on observed patterns in well-written requirements as opposed
to proposing generalized boilerplates. Consequently, this initiative is anticipated to
aid diverse organizations in devising bespoke boilerplates by examining textual pat-
terns in requirements that are specific to their internal needs. This is beneficial as the
structures and types of requirements can differ not only between organizations but
also within an organization. Certification requirements from Parts 23 and 25 of Title 14 CFR were used for this work due to their availability to the author.
The requirements are first classified using the aeroBERT-Classifier. The patterns
in the sentence chunk and NE tags are then examined for each of the requirements
and the identified patterns are used for boilerplate template creation (Figure 5.24,
Figure 5.25). Based on these tags, it was observed that a requirement with a 〈condition〉 at the beginning usually starts with a prepositional phrase (PP) or subordinate clause (SBAR). Requirements with a condition at the beginning were, however, rare in the dataset used for this work. The initial 〈condition〉 is almost always followed by a noun phrase (NP), which contains a 〈system〉 name that can be distinctly
identified within the NP by using the NE tag (SYS). The initial NP is always suc-
ceeded by a verb phrase (VP) which contains the term “shall [variations]”. These
“[variations]” (shall be designed, shall be protected, shall have, shall be capable of,
shall maintain, etc.) were crucial for the identification of different types of boiler-
plates since they provided information regarding the action that the 〈system〉 should
be performing (to protect, to maintain, etc.). The VP is followed by a subordinate
clause (SBAR) or prepositional phrase (PP) but this is optional. The SBAR/PP is
succeeded by an NP or adjective phrase (ADJP) and can contain either a 〈functional
attribute〉, 〈state〉, 〈design attribute〉, 〈user〉, or a 〈sub-system/system〉, depending
on the type/sub-type of a requirement. This usually brings an end to the main Body
of the requirement. The Suffix is optional and can contain additional information
such as operating and environmental conditions (can contain VAL named entity),
resources (RES), and context.
The process of categorizing requirements, chunking, performing Named Entity
Recognition (NER), and identifying requirement elements is iterated for all the re-
quirements. The elements are recognized by combining information obtained from
the sentence chunker and aeroBERT-NER. The requirements are subsequently clas-
sified into groups based on the identified elements to form boilerplate structures.
Optional elements are included in the structures to cover any variations that may
occur, allowing more requirements to be accommodated under a single boilerplate.
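The grouping step described above can be partially automated. The sketch below assumes that each requirement has already been reduced to a sequence of chunk tags; it builds a coarse pattern signature and groups requirements that share it, so that frequent groups can be inspected as boilerplate candidates. All names and tag sequences are illustrative.

from collections import defaultdict
from typing import Dict, List

def pattern_signature(tags: List[str]) -> str:
    """Collapse consecutive duplicate tags so that, e.g.,
    ['NP', 'NP', 'VP', 'PP', 'NP'] becomes 'NP-VP-PP-NP'."""
    collapsed = []
    for tag in tags:
        if not collapsed or collapsed[-1] != tag:
            collapsed.append(tag)
    return "-".join(collapsed)

def group_by_pattern(requirements: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Group requirement identifiers by their pattern signature;
    frequent groups are candidates for boilerplate templates."""
    groups = defaultdict(list)
    for req_id, tags in requirements.items():
        groups[pattern_signature(tags)].append(req_id)
    return dict(groups)

# Illustrative chunk-tag sequences for three requirements
reqs = {
    "R1": ["NP", "VP", "NP", "PP", "NP"],
    "R2": ["NP", "NP", "VP", "NP", "PP", "NP"],
    "R3": ["SBAR", "NP", "VP", "NP"],
}
print(group_by_pattern(reqs))  # R1 and R2 share the signature 'NP-VP-NP-PP-NP'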
CHAPTER 6
RESULTS AND DISCUSSION
This chapter discusses in detail the results regarding the various corpora and LMs that
were developed for the conversion of NL aerospace requirements into semi-machine-
readable/standardized requirements. This conversion was facilitated by the devel-
opment of aeroBERT-NER, aeroBERT-Classifier, and the use of sentence chunking,
which will be discussed in the following sections.
Two different annotated aerospace corpora (or datasets) were created as a part of this dissertation and can be accessed via the Hugging Face platform using the Python scripts provided in Appendix A: an annotated named entity recognition corpus (archanatikayatray/aeroBERT-NER) and an annotated requirements classification corpus (archanatikayatray/aeroBERT-classification).
Due to the technical nature of the data, a single human annotator manually annotated both of the aforementioned corpora, a process that took over six months. The annotations were subsequently evaluated by subject matter experts
(SMEs) who provided feedback. This feedback led to further revisions being made to
ensure consistency across the annotations.
Open-source datasets are scarce when it comes to requirements engineering, es-
pecially those specific to aerospace applications. Hence, the above datasets not only
enable the present research but are also expected to promote research in the aerospace
requirements engineering domain.
6.2 aeroBERT-NER
The fine-tuning of different BERT variants was carried out using datasets of different
sizes, with the goal of observing the trends in learning and answering the question, “how
much data is enough for the LM to gain aerospace NER domain knowledge?”. The F1
scores were considered to be an accurate representation of the model’s performance
on the validation set since NER is a token classification task. The fine-tuning process
was carried out for 20 epochs for each dataset size containing 250, 500, 750, 1000,
1250, and 1423 sentences, respectively. It is important to understand that some of
the sentences in the dataset did not contain any named entity. These sentences were
included to make sure the model learns that there can be aerospace sentences that
do not contain any of the five named entities considered for this work.
Each of these datasets was divided into training (90%) and validation (10%) sets.
Figure 6.1 shows the trends in F1 scores for different dataset sizes when the model is
trained on the training set and tested on the validation set for 20 epochs.
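A minimal sketch of this evaluation setup is shown below. It assumes the annotated sentences and their BIO tags are available as parallel Python lists and uses the seqeval library, which computes entity-level precision, recall, and F1 for token-classification tasks; the toy data and placeholder predictions are illustrative.

from sklearn.model_selection import train_test_split
from seqeval.metrics import f1_score, classification_report

# Toy placeholders: token lists and their BIO-tag lists (repeated for illustration)
sentences = [["The", "airplane", "shall", "comply", "with", "Section", "25-1309", "."]] * 10
labels = [["O", "B-SYS", "O", "O", "O", "B-RES", "I-RES", "O"]] * 10

# 90/10 split into training and validation sets, as done for each dataset size
train_x, val_x, train_y, val_y = train_test_split(sentences, labels, test_size=0.1, random_state=42)

# ... fine-tune the model on (train_x, train_y) and predict tags for val_x ...
predicted_y = val_y  # placeholder: replace with the model's predicted tag sequences

# Entity-level F1 on the validation set
print(f1_score(val_y, predicted_y))
print(classification_report(val_y, predicted_y))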
Table 6.1: Trends observed in F1 scores with the increase in dataset size for training
and testing. 90% of the dataset was used for training and 10% was used for testing.
Figure 6.2: Original, cased, and uncased versions for the word “Georgia Tech”
Table 6.2: aeroBERT-NER model performance on the validation set. The highest score for each metric per NER type is shown in bold. Each group of three columns (P, R, F1) corresponds to one of the five named entity types, and the final column is the overall F1 score. (P = Precision, R = Recall, F1 = F1 Score)
P R F1 P R F1 P R F1 P R F1 P R F1 F1
aeroBERT-NERLU 0.81 0.88 0.84 0.68 0.67 0.68 0.96 0.90 0.93 0.80 0.83 0.81 0.86 0.87 0.86 0.83
aeroBERT-NERLC 0.81 0.92 0.86 0.88 0.88 0.88 0.97 0.98 0.97 0.90 0.92 0.91 0.83 0.87 0.85 0.90
aeroBERT-NERBU 0.67 0.75 0.71 0.58 0.74 0.65 0.95 0.95 0.95 0.81 0.80 0.80 0.87 0.89 0.88 0.83
aeroBERT-NERBC 0.88 0.88 0.88 0.98 0.92 0.95 0.98 0.98 0.98 0.93 0.91 0.92 0.89 0.91 0.90 0.92
The training time for aeroBERT-NERBC and aeroBERT-NERLC were 268.6 and
873.9 seconds, respectively. For the rest of this section, aeroBERT-NERBC will be
referred to as aeroBERT-NER since it was deemed to be the best-performing model
out of all BERT variants that were fine-tuned for the NER task.
One goal of this research is to facilitate the creation of a glossary after using aeroBERT-
NER for identifying NEs in aerospace requirements. To test this capability, a separate
test set containing 20 requirements was created. The requirements in the test set are
provided in Table B.1 in Appendix B. These requirements are chosen from 14 CFR
§25.1301 through §25.1360. aeroBERT-NER is first used to identify different NEs.
A Python script is then used to concatenate the NE subwords (or WordPiece to-
kens) identified by aeroBERT-NER. These concatenated NEs are then used to create
a glossary in which they are grouped following the category they belong to (SYS,
ORG, VAL, DATETIME, and RES). A manual check is then performed to identify
the NEs in the test set and compare them with the NEs that were identified by
aeroBERT-NER and added to the glossary.
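A sketch of the subword-concatenation and glossary-building step is shown below. The token dictionaries mimic the output format of a Hugging Face token-classification pipeline (a "word" and an "entity" key per token), and the example tokens are illustrative rather than actual aeroBERT-NER output.

from collections import defaultdict

def build_glossary(ner_output):
    """Merge BIO-tagged WordPiece tokens (subwords start with '##') back into
    full entity strings and group them by NE category (SYS, RES, VAL, ...)."""
    glossary = defaultdict(set)
    current_text, current_type = "", None
    for token in ner_output:
        label = token["entity"]            # e.g., 'B-SYS', 'I-SYS', 'O'
        word = token["word"]
        if label.startswith("B-"):
            if current_type:
                glossary[current_type].add(current_text)
            current_text, current_type = word, label[2:]
        elif label.startswith("I-") and current_type:
            current_text += word[2:] if word.startswith("##") else " " + word
        else:
            if current_type:
                glossary[current_type].add(current_text)
            current_text, current_type = "", None
    if current_type:
        glossary[current_type].add(current_text)
    return {ne_type: sorted(entities) for ne_type, entities in glossary.items()}

# Illustrative tokens for "Each airspeed indicating instrument ... Section 25-1303 ..."
tokens = [
    {"word": "airspeed", "entity": "B-SYS"},
    {"word": "indicating", "entity": "I-SYS"},
    {"word": "instrument", "entity": "I-SYS"},
    {"word": "Section", "entity": "B-RES"},
    {"word": "25", "entity": "I-RES"},
    {"word": "-", "entity": "I-RES"},
    {"word": "130", "entity": "I-RES"},
    {"word": "##3", "entity": "I-RES"},
]
print(build_glossary(tokens))
# {'RES': ['Section 25 - 1303'], 'SYS': ['airspeed indicating instrument']}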
Out of the 32 SYS names, 20 SYS names were identified verbatim. Of the remain-
ing SYS NEs, subwords of 5 system names were identified. In other words, 62.5% of all the SYS NEs were identified verbatim; however, the percentage rises to 78.13% if the subwords are included as well (Table 6.3). Table 6.4 shows some of the NEs that were identified by aeroBERT-NER and added to the glossary. The NEs in this table are newly identified; the model was not trained on these specific NEs (system names, etc.).
Table 6.3: Percentage of NEs identified by aeroBERT-NER in the test set
Type of NE | Test Set (total count) | aeroBERT-NER (verbatim) | aeroBERT-NER (including subwords)
SYS | 32 | 20 (62.5%) | 25 (78.13%)
RES | 11 | 11 (100%) | -
DATETIME | 2 | 1 (50%) | 2 (100%)
ORG | 0 | - | -
VAL | 0 | - | -
The test set did not contain any ORG NEs, which is typical of aerospace requirements. aeroBERT-NER did not falsely identify any ORGs, which underscores the model's robustness. Despite the sparse presence of ORG entities in requirements, ORG was included as a category in case aeroBERT-NER is used on aerospace texts other than requirements. Similarly, no VAL NEs occurred in the requirements in the test set
and as a result, none were identified by the model. Out of the two DATETIME NEs
that occurred in the test set, one was identified verbatim, whereas only subwords of
the other entity were identified. Lastly, 100% of the RES NEs occurring in the test
set were identified verbatim.
The creation of a glossary helps automatically determine the systems that the text
or requirement is referring to. For example, the glossary in Table 6.4 indicates the
requirements in the test set are about flight deck controls, electrical system, autopilot
system, etc. In addition, the resources referred to in the requirements are also identified, which helps narrow down the resources of interest in case someone wants to investigate further for more context.
6.2.3 Comparison between aeroBERT-NER and BERTBASE-NER
To compare the performance of both models on aerospace text in the test dataset, the NE types were dropped and the absolute number of entities identified by each model was considered, since the two models are not intended to identify the same
types of NEs. The entities identified by each of the models were then compared with
the entities manually identified during the glossary creation step.
aeroBERT-NER was able to identify 71% (32 out of 45) of the relevant NEs
(not including the subwords), a sample of which is shown in Table 6.4. However,
BERTBASE-NER was unable to identify any NEs apart from two subwords (“Section 25”, “Section”). Both of the identified subwords were tagged as MISC. A summary of
the comparison is shown in Table 6.5. This illustrates the superior performance of
aeroBERT-NER when compared to BERTBASE-NER on aerospace text despite being
fine-tuned on a small annotated NE aerospace corpus. This showcases the potential
of transfer learning in the realm of NLP.
6.3 aeroBERT-Classifier
Table 6.6: Requirements classification results on the aerospace requirements dataset. The highest score for each metric per class is shown in bold. Each group of three columns (P, R, F1) corresponds to one requirement class, and the final column is the overall average F1 score. (P = Precision, R = Recall, F1 = F1 Score)
aeroBERT-ClassifierLC 0.86 0.92 0.89 0.82 0.90 0.86 0.83 0.63 0.71 0.82
aeroBERT-ClassifierBU 0.80 0.92 0.86 0.89 0.80 0.84 0.86 0.75 0.80 0.83
aeroBERT-ClassifierBC 0.79 0.85 0.81 0.80 0.80 0.80 0.86 0.75 0.80 0.80
GPT-2 0.67 0.60 0.63 0.67 0.67 0.67 0.70 0.78 0.74 0.68
Bi-LSTM (GloVe) 0.75 0.75 0.75 0.75 0.60 0.67 0.43 0.75 0.55 0.68
bart-large-mnli 0.43 0.25 0.32 0.38 0.53 0.44 0.0 0.0 0.0 0.34
Of all the variants of aeroBERT-Classifier evaluated, uncased variants outper-
formed the cased variants. This indicates that “casing” is not as important to text
classification as it would be in the case of a NER task. In addition, the overall av-
erage F1 scores obtained by aeroBERT-ClassifierLU and aeroBERT-ClassifierBU were
the same. This suggests that the base-uncased model is capable of learning the desired
patterns in aerospace requirements in less training time and is hence preferred.
Various iterations of the model training and testing were performed, and the model
performance scores were consistent. In addition, the aggregate precision and recall
were not very far off from each other, giving rise to a high F1 score (harmonic mean of
precision and recall). Since the difference between the training and test performance
is low despite the small size of the dataset, it is expected that the model will generalize
well to unseen requirements belonging to the three categories.
Table 6.7 provides a list of requirements from the test set that were misclassified
(Predicted label ≠ Actual label) by aeroBERT-ClassifierBU. A confusion matrix sum-
marizing the classification task is shown in Figure 6.3. It is important to note that
some of the requirements were difficult to classify even by SMEs with expertise in
requirements engineering.
The test set contained 13 design, 10 functional, and 8 performance requirements
(Table 5.12). As seen in Table 6.7 and Figure 6.3, out of the 13 design require-
ments, only one was misclassified as a performance requirement. Of the 8 perfor-
mance requirements, 2 were misclassified, and 2 of the 10 functional requirements were misclassified.
The training and testing were carried out multiple times, and the requirements shown in Table 6.7 were consistently misclassified, which might have been due to ambiguity in the labeling. Hence, it is important to have a human-in-the-loop (preferably a Subject Matter Expert (SME)) who can make a judgment call on whether a certain requirement was labeled incorrectly or whether the requirement should be rewritten to resolve ambiguities.
Table 6.7: List of requirements (from the test set) that were misclassified by aeroBERT-ClassifierBU (0: Design; 1: Functional; 2: Performance)
Figure 6.3: Confusion matrix showing the breakdown of the true and predicted labels by the aeroBERT-Classifier on the test data
Figure 6.4: Confusion matrix showing the breakdown of the true and predicted labels
by the bart-large-mnli model on the test data
Figure 6.4 shows the true and the predicted labels for all the requirements in the
test set by bart-large-mnli. A comparison of Figure 6.3 and Figure 6.4 shows that aeroBERT-Classifier correctly classified 83.87% of the requirements, compared to 43.39% when bart-large-mnli was used. The latter model seemed to be biased towards
classifying most of the requirements as functional requirements. Had bart-large-mnli
classified all the requirements as functional, the zero-shot classifier would have rightly
classified 32.26% of the requirements. This illustrates the superior performance of the
aeroBERT-Classifier despite it being trained on a small labeled dataset. Hence, while
bart-large-mnli performs well on more general tasks like sentiment analysis, classifi-
cation of news articles into genres, etc., using zero-shot classification, its performance
is degraded in tasks involving specialized and structured texts such as aerospace re-
quirements.
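For reference, the zero-shot baseline can be reproduced along the lines of the sketch below; facebook/bart-large-mnli is the off-the-shelf model, and the candidate label names are assumed to mirror the three requirement types used in this work.

from transformers import pipeline

# Off-the-shelf zero-shot classifier (no aerospace fine-tuning)
zero_shot = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

candidate_labels = ["design requirement", "functional requirement", "performance requirement"]

requirement = ("Each magnetic direction indicator must be installed so that its accuracy "
               "is not excessively affected by the airplane's vibration or magnetic fields.")

result = zero_shot(requirement, candidate_labels)
# The label with the highest score is taken as the predicted requirement type
print(result["labels"][0], result["scores"][0])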
It is important to have an SME (Subject Matter Expert) overseeing the require-
ment classification process, who can make a decision on whether a requirement is
being mislabeled or misclassified. Additionally, if necessary, requirements can be
rewritten to eliminate any ambiguities.
6.4 Standardization of requirements
Table 6.8 shows a requirement table with five requirements belonging to various types
and their corresponding properties. The various columns of the table were populated
by extracting information from the original requirement text using different LMs
(aeroBERT-NER and aeroBERT-Classifier) that were fine-tuned on an aerospace-
specific corpus. This table can be exported as an Excel spreadsheet, which can then
be verified by a subject matter expert (SME) and any missing information can be
added.
The creation of a requirement table, as described above, is an important step to-
ward the standardization of requirements by aiding the creation of tables and models
in SysML. The use of various LMs automates the process and limits the manual way
of populating the table. In addition, aeroBERT-NER and aeroBERT-Classifier generalize well and are capable of identifying named entities and classifying requirements despite the noise and variations that can occur in NL requirements. This methodology for extracting information from NL requirements and storing it in tabular format is superior to a dictionary-based approach, which needs constant updating as the requirements evolve.
Table 6.8: Requirement table containing columns extracted from NL requirements using language models. This table can be
used to aid the creation of a SysML requirement table.
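A minimal sketch of how such a table could be assembled and exported is shown below. It assumes that the outputs of aeroBERT-Classifier and aeroBERT-NER have already been collected for each requirement; the column names follow the five columns described in Chapter 7, and the row contents are illustrative.

import pandas as pd

# Per-requirement outputs collected from aeroBERT-Classifier and aeroBERT-NER (illustrative)
rows = [
    {
        "Name": "cockpit voice recorder",
        "Requirement Text": "Each cockpit voice recorder shall record voice communications "
                            "transmitted from or received in the airplane by radio.",
        "Type of Requirement": "Functional",
        "Property": {"RES": [], "VAL": [], "DATETIME": [], "ORG": []},
        "Related To": "cockpit voice recorder",
    },
]

table = pd.DataFrame(rows, columns=["Name", "Requirement Text", "Type of Requirement", "Property", "Related To"])

# Export to an Excel spreadsheet for review by an SME (requires the openpyxl package)
table.to_excel("requirements_table.xlsx", index=False)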
The requirements were first classified into various types using the aeroBERT-Classifier.
Boilerplate templates for various types of requirements were then determined by uti-
lizing sentence chunks and named entities to detect patterns. To account for the
diversity of these requirements, multiple templates were recognized for each type.
Table 6.9: Summary of boilerplate template identification task. Two, five, and three
boilerplate templates were identified for Design, Functional, and Performance require-
ments that were used for this study.
Table 6.9 shows a breakdown of the number of boilerplate templates that were
identified for each requirement type and the percentage of requirements covered by the
boilerplate templates. Two boilerplates were identified for design requirements that
were used for this study. Five and three boilerplates were identified for Functional
and Performance requirements respectively. A greater variability was observed in
the textual patterns occurring in Functional requirements which resulted in a greater
number of boilerplates for this particular type. The identified templates are discussed
in detail in the following subsections.
Design Requirements
In analyzing the design requirements as they were presented, it was discovered that two separate boilerplate structures accounted for roughly 55% of the requirements used in the study. These two structures were able to encompass the majority of the requirements, and incorporating additional boilerplate templates would have resulted in overfitting them to only a handful of requirements each. This would have compromised their ability to be applied broadly, reducing their overall generalizability.
The first boilerplate is shown in Figure 6.5 and focuses on requirements that
mandate the “way” a system should be designed and/or installed, its location, and
whether it should protect another system/sub-system from a certain 〈condition〉 or
〈state〉. The NE and sentence chunk tags are displayed above and below the boil-
erplate structure. Based on these tags, it was observed that a requirement with a
〈condition〉 in the beginning usually starts with a prepositional phrase (PP) or sub-
ordinate clause (SBAR). This is followed by a noun phrase (NP), which contains a
〈system〉 name, which can be distinctly identified within the NP by using the NE tag
(SYS). The initial NP is always succeeded by a verb phrase (VP) which contains the
term “shall [variations]”. These “[variations]” (shall be designed, shall be protected,
shall have, shall be capable of, shall maintain, etc.) were crucial for the identification
of different types of boilerplates since they provided information regarding the action
that the 〈system〉 should be performing (to protect, to maintain, etc.).
In the case of Figure 6.5, the observed “[variations]” were be designed, be designed
and installed, installed, located, and protected respectively. The VP is followed by a
subordinate clause (SBAR) or prepositional phrase (PP) but this is optional. This is
then followed by an NP or adjective phrase (ADJP) and can contain either a 〈functional
attribute〉, 〈state〉, 〈design attribute〉, or a 〈sub-system/system〉. This brings an end
to the main Body of the requirement. The Suffix is optional and can contain additional
information such as operating and environmental conditions, resources, and context.
The second boilerplate for design requirements is shown in Figure 6.6. This boil-
erplate accounts for the design requirements that mandate a certain 〈functional at-
tribute〉 that a system should have, a 〈sub-system/system〉 it should include, and any
〈design attribute〉 it should have by design. Similar to the previous boilerplate, the
NEs and sentence chunk tags are displayed above and below the structure.
Figure 6.5: The schematics of the first boilerplate for design requirements along with some examples that fit the boilerplate are shown here. This boilerplate accounts for 74 of the 149 design requirements (∼50%) used for this study and is tailored toward requirements that mandate the way a 〈system〉 should be designed and/or installed, its location, and whether it should protect another 〈system/sub-system〉 given a certain 〈condition〉 or 〈state〉. Parts of the NL requirements shown here are matched with their corresponding boilerplate elements via the use of the same color scheme. In addition, the sentence chunk and named entity (NE) tags are displayed below and above the boilerplate structure, respectively.
Figure 6.6: The schematics of the second boilerplate for design requirements along with some examples that fit the boilerplate are shown here. This boilerplate accounts for 8 of the 149 design requirements (∼5%) used for this study and focuses on requirements that mandate a 〈functional attribute〉, a 〈design attribute〉, or the inclusion of a 〈system/sub-system〉 by design. Two of the example requirements highlight the 〈design attribute〉 element, which emphasizes additional details regarding the system design to facilitate a certain function. The last example shows a requirement where a 〈sub-system〉 is to be included in a system by design.
The rest of the design requirements were examined; however, no common patterns were observed in most of them to warrant the creation of boilerplates specific to these requirements. Boilerplates, if created, would have had fewer requirements compatible with
them, which could have undermined their capacity to be applied more generally. As
a result, the overall generalizability of the templates would have been reduced.
Functional Requirements
In analyzing the NL functional requirements as they appeared in Parts 23 and
25 of Title 14 CFRs, the study identified five separate boilerplate structures that
encompassed a total of 63% of the functional requirements. However, introducing
more boilerplate templates would have led to fitting a smaller number of requirements
to these structures, potentially limiting their overall applicability and generalizability.
The first boilerplate is shown in Figure 6.7 and is tailored toward requirements
that describe the capability of a 〈system〉 to be in a certain 〈state〉 or perform a
certain 〈function〉. The example requirement (especially 1) focuses on the handling
characteristics of the system (airplane in this case). The associated sentence chunks
and NEs for each of the elements of the boilerplate are also shown.
The second boilerplate for functional requirements is shown in Figure 6.8 and
focuses on requirements that require the 〈system〉 to have a certain 〈functional at-
tribute〉 or maintain a particular 〈state〉. This boilerplate structure accounts for 15%
of all the functional requirements.
Figure 6.9 shows the third boilerplate for functional requirements and is tailored
toward requirements that require the 〈system〉 to protect another 〈sub-system/system〉
or 〈user〉 against a certain 〈state〉 or another 〈sub-system/system〉. This boilerplate
structure accounts for 7% of all the functional requirements.
Figure 6.7: The schematics of the first boilerplate for functional requirements along with some examples that fit the boilerplate are shown here. This boilerplate accounts for 20 of the 100 functional requirements (20%) used for this study and is tailored toward requirements that describe the capability of a 〈system〉 to be in a certain 〈state〉 or have a certain 〈functional attribute〉. The example requirement (especially 1) focuses on the handling characteristics of the system (airplane in this case).
Figure 6.8: The schematics of the second boilerplate for functional requirements along with some examples that fit the boilerplate are shown here. This boilerplate accounts for 15 of the 100 functional requirements (15%) used for this study and is tailored toward requirements that require the 〈system〉 to have a certain 〈functional attribute〉 or maintain a particular 〈state〉.
Figure 6.9: The schematics of the third boilerplate for functional requirements along with some examples that fit the boilerplate are shown here. This boilerplate accounts for 7 of the 100 functional requirements (7%) used for this study and is tailored toward requirements that require the 〈system〉 to protect another 〈sub-system/system〉 or 〈user〉 against a certain 〈state〉 or another 〈sub-system/system〉.
Figure 6.10 shows the fourth boilerplate for functional requirements and is tailored toward requirements that require the 〈system〉 to provide a certain 〈functional attribute〉 given a certain 〈condition〉. This boilerplate structure accounts for 15% of
all the functional requirements.
Figure 6.11 shows the fifth boilerplate for functional requirements and is specifi-
cally focused on requirements related to the cockpit voice recorder since a total of six
requirements in the entire dataset were about this particular system and its 〈func-
tional attribute〉 given a certain 〈condition〉. This boilerplate structure accounts for
6% of all the functional requirements. Although it is generally not recommended to
have a boilerplate template that is specific to a particular system, in this case, it was
deemed acceptable because a significant portion of the requirements pertained to that
system, and the dataset used was relatively small.
Performance Requirements
Three distinct boilerplates were identified for performance requirements, which accounted for a total of ∼58% of all the requirements belonging to this type.
The first boilerplate for performance requirements is shown in Figure 6.12. This
particular boilerplate has the element 〈system attribute〉, which is unique as com-
pared to the other boilerplate structures. In addition, this boilerplate caters to the
performance requirements specifying a 〈system〉 or 〈system attribute〉 to satisfy a
certain function or 〈condition〉. Approximately 33% of all the performance requirements match this
template.
Figure 6.13 shows the second boilerplate for performance requirements. This boil-
erplate accounts for 12 of the 61 performance requirements (∼20%) used for this
study. This boilerplate focuses on performance requirements that specify a 〈func-
tional attribute〉 that a 〈system〉 should have or maintain given a certain 〈state〉 or
〈condition〉.
Lastly, Figure 6.14 shows the third boilerplate for performance requirements.
Figure 6.10: The schematics of the fourth boilerplate for functional requirements along with some examples that fit
the boilerplate are shown here. This boilerplate accounts for 15 of the 100 functional requirements (15%) used for this study
and is tailored toward requirements that require the 〈system〉 to provide a certain 〈functional attribute〉 given a certain
〈condition〉.
Figure 6.11: The schematics of the fifth boilerplate for functional requirements along with some examples that fit the
boilerplate are shown here. This boilerplate accounts for 6 of the 100 functional requirements (6%) used for this study and
is specifically focused on requirements related to the cockpit voice recorder since a total of six requirements in the entire
dataset were about this particular system and its 〈functional attribute〉 given a certain 〈condition〉.
Figure 6.12: The schematics of the first boilerplate for performance requirements along with some examples that fit
the boilerplate are shown here. This boilerplate accounts for 20 of the 61 performance requirements (∼33%) used for this
study. This particular boilerplate has the element 〈system attribute〉 which is unique as compared to the other boilerplate
structures. In addition, this boilerplate caters to the performance requirements specifying a 〈system〉 or 〈system attribute〉
to satisfy a certain 〈condition〉 or have a certain 〈functional attribute〉.
Figure 6.13: The schematics of the second boilerplate for performance requirements along with some examples that fit
the boilerplate are shown here. This boilerplate accounts for 12 of the 61 performance requirements (∼20%) used for this
study. This boilerplate focuses on performance requirements that specify a 〈functional attribute〉 that a 〈system〉 should
have or maintain given a certain 〈state〉 or 〈condition〉.
Figure 6.14: The schematics of the third boilerplate for performance requirements along with some examples that fit
the boilerplate are shown here. This boilerplate accounts for 3 of the 61 performance requirements (∼5%) used for this
study and focuses on a 〈system〉 being able to withstand a certain 〈condition〉 with or without ending up in a certain
〈state〉.
This boilerplate accounts for 3 of the 61 performance requirements (∼5%) used for this
study and focuses on a 〈system〉 being able to withstand a certain 〈condition〉 with
or without ending up in a certain 〈state〉.
To summarize, the study found two boilerplate structures for design requirements,
five for functional requirements, and three for performance requirements. A larger
number of boilerplate structures were identified for functional requirements due to
their greater variability. These structures were identified based on patterns observed
in sentence chunks and named entities (NEs). The boilerplates can be utilized to cre-
ate new requirements that follow the established structure or to assess the conformity
of natural language requirements with the identified boilerplates. These activities are
valuable for standardizing requirements on a larger scale and at a faster pace, and are
expected to contribute to the adoption of Model-Based Systems Engineering (MBSE)
in a more streamlined manner. Subject matter experts (SMEs) should review the
identified boilerplates to ensure their accuracy and consistency.
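One way to operationalize such a conformance check is sketched below: an identified boilerplate is expressed as a regular expression over the collapsed chunk-tag signature of a requirement. The specific pattern shown is illustrative and is not one of the exact templates identified in this study.

import re

def signature(tags):
    """Collapse consecutive duplicate chunk tags into a signature string."""
    out = []
    for tag in tags:
        if not out or out[-1] != tag:
            out.append(tag)
    return "-".join(out)

# Illustrative boilerplate: an optional leading condition (PP/SBAR), then NP + VP,
# followed by an optional suffix of PP/NP/ADJP chunks.
BOILERPLATE = re.compile(r"^((PP|SBAR)-)?NP-VP(-(PP|NP|ADJP))*$")

tags = ["NP", "VP", "PP", "NP"]  # chunk tags of a candidate requirement
print(bool(BOILERPLATE.fullmatch(signature(tags))))  # True: the requirement conforms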
CHAPTER 7
PRACTITIONER’S GUIDE
This chapter offers a condensed version of the methodologies (using flowcharts) de-
vised in this dissertation, to facilitate their implementation by industry practitioners.
The chapter is structured into three main sections. The first section covers the
development of language models. The second section discusses how the outputs from
these models can be used to generate a requirements table. Finally, the third section
demonstrates the creation of boilerplates through the use of three language models:
aeroBERT-NER [120], aeroBERT-Classifier [126], and flair/chunk-english [114].
Figure 7.1 illustrates the process for developing aeroBERT-NER [120] and aeroBERT-
Classifier [126], and provides references to the corresponding section numbers for more
detailed information. To begin with, text pertaining to the aerospace domain was
gathered, which was then used to construct two separate corpora: one containing
definitions and other aerospace-related texts (the NER corpus), and the other con-
taining only requirements (the requirements corpus). Both corpora were manually
examined to identify the relevant named entities and requirement types for each.
To annotate the named entity corpus, individual “.txt” files were generated for
each type of entity, allowing for easy differentiation. These files were utilized by a
Python script that could match and tag the text accordingly. The annotation followed
a BIO-tagging scheme to identify named entities. Likewise, the requirements within
the requirements corpus were categorized and labeled according to their respective
types. For instance, design requirements were labeled as ‘0’, functional requirements
were labeled as ‘1’, and performance requirements were labeled as ‘2’. This completes the data annotation phase.
Figure 7.1: Practitioner’s Guide to the creation of aeroBERT-NER and aeroBERT-Classifier. A zoomed-in version of this figure can be found here.
The NER corpus, which had been previously annotated, underwent pre-processing
to prepare it for the fine-tuning of BERT for the identification of named entities within
the aerospace domain. The corpus was split into training and validation sets, and
the training set was tokenized with the BertTokenizer. To determine the optimal
input sequence length, the distribution of sequence lengths within the training set
was analyzed. Special tokens were added, and the sequences were post-padded. The
resulting token IDs, tag IDs, and attention masks for each token were then generated
as input to the BERT model. The same pre-processing steps were applied to the
classification corpus, with the exception that only token IDs, attention masks, and
labels for each requirement were utilized as inputs to the BERT model.
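A minimal sketch of these pre-processing steps with the Hugging Face BertTokenizer is shown below; the example sentence and the maximum sequence length are illustrative, and the parallel list of tag IDs required for the NER task is only indicated in a comment.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sentence = "Each pressure altimeter must be approved."

# Tokenize, add the special [CLS]/[SEP] tokens, and post-pad to a fixed length
encoding = tokenizer(
    sentence,
    padding="max_length",   # post-padding up to max_length
    truncation=True,
    max_length=32,          # chosen from the sequence-length distribution (illustrative)
)

print(encoding["input_ids"])       # token IDs fed to BERT
print(encoding["attention_mask"])  # 1 for real tokens, 0 for padding
# For the NER task, a parallel list of tag IDs (aligned to the WordPiece tokens)
# would be constructed from the BIO annotations as well.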
The Transformers library was used to import BertForTokenClassification.
After selecting various hyperparameters, including the batch size, number of epochs,
optimizer, and learning rate, the pre-trained parameters of BERT were fine-tuned
using the annotated NER corpus (archanatikayatray/aeroBERT-NER) to create
aeroBERT-NER. In a similar manner, BertForSequenceClassification was used
to fine-tune BERT on the annotated aerospace requirements corpus
(archanatikayatray/aeroBERT-classification), to obtain aeroBERT-Classifier.
The models were assessed using evaluation metrics such as Precision, Recall, and F1
score.
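The fine-tuning step can be sketched as follows. The hyperparameter values and the label counts are illustrative (eleven token-level labels correspond to five entity types in BIO format plus ‘O’), and the dataset objects are assumed to hold the pre-processed, annotated corpora.

from transformers import (BertForTokenClassification,
                          BertForSequenceClassification,
                          TrainingArguments, Trainer)

# NER head: one label per token (five entity types in BIO format plus 'O', i.e., 11 labels)
ner_model = BertForTokenClassification.from_pretrained("bert-base-cased", num_labels=11)

# Classification head: one label per requirement (design, functional, performance)
clf_model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

# Illustrative hyperparameters (batch size, number of epochs, learning rate)
args = TrainingArguments(
    output_dir="./aeroBERT-NER",
    per_device_train_batch_size=16,
    num_train_epochs=20,
    learning_rate=3e-5,
)

# trainer = Trainer(model=ner_model, args=args,
#                   train_dataset=train_ds, eval_dataset=val_ds)
# trainer.train()   # train_ds / val_ds hold the pre-processed, annotated corpus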
aeroBERT-NER was trained in this study to recognize five named entity types
(SYS, VAL, ORG, DATETIME, and RES), while aeroBERT-Classifier was trained
to categorize requirements into three types (design, functional, and performance).
Nonetheless, with adequate labeled training data, both models can be trained to
identify additional types of named entities and requirements beyond those mentioned.
Findings:
1. The NER task yielded better results with the cased variants of aeroBERT-NER
compared to the uncased variants. This indicates that maintaining the case of
the text is crucial for NER.
Figure 7.2 outlines the steps taken to create the requirements table, with references
to relevant sections of the dissertation provided alongside. The process involves in-
putting requirements into both aeroBERT-Classifier and aeroBERT-NER, as shown
in the flowchart for a single example, although multiple requirements can be processed
simultaneously using both models.
For this thesis, a requirements table was created that has five columns: Name,
Requirement Text, Type of Requirement, Property, and Related To. Further details
on these columns can be found in Table 5.16. The initial entry in the Name column
corresponds to the first system name (SYS named entity) detected by aeroBERT-
NER [120]. The Type of Requirement column is filled in once aeroBERT-Classifier
[126] classifies the requirement. The Property column is populated with a Python
dictionary format that includes all the named entities (excluding SYS) identified by
aeroBERT-NER, with the named entity type (RES, VAL, DATETIME, ORG) set as
the dictionary key and the identified named entities presented as values in a Python
list format. Lastly, the system name identified by aeroBERT-NER [120] is used to
populate the Related to column.
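A small sketch of how the Property column entry could be assembled from the named entities identified for one requirement is shown below; the input format and the example entities are illustrative.

def build_property_entry(named_entities):
    """Group all non-SYS named entities for a requirement into the
    Python-dictionary format used for the Property column."""
    property_entry = {"RES": [], "VAL": [], "DATETIME": [], "ORG": []}
    for text, ne_type in named_entities:
        if ne_type in property_entry:
            property_entry[ne_type].append(text)
    return property_entry

# Named entities identified for one requirement (illustrative values)
entities = [("flight deck controls", "SYS"), ("Section 25-1309", "RES"), ("December 1, 2012", "DATETIME")]
print(build_property_entry(entities))
# {'RES': ['Section 25-1309'], 'VAL': [], 'DATETIME': ['December 1, 2012'], 'ORG': []}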
The requirements table can be expanded with additional columns as necessary,
although doing so may entail developing further language models capable of extract-
ing the desired data. Moreover, aeroBERT-NER and aeroBERT-Classifier can be
enhanced with additional named entities or requirement types to extract other perti-
nent information.
Figure 7.2: Practitioner’s Guide to the creation of requirements table. A zoomed-in version of this figure can be found
here.
7.3 Identification of Requirements boilerplates using language models
The flowchart in Figure 7.3 demonstrates the process of identifying boilerplate tem-
plates from well-written requirements. While the example shown in the flowchart
illustrates the steps involved in using a single requirement, it is crucial to note that
in order to generate boilerplate structures that can be applied more broadly, patterns
observed across multiple requirements need to be identified. Moreover, for those
who seek more information, the relevant section numbers from this dissertation are
provided along with the flowchart.
The process of identifying boilerplate templates begins with the classification of re-
quirements into different types utilizing aeroBERT-Classifier [126]. Next, text chunks,
including Noun Phrases, Verb Phrases, etc., are identified using flair/chunk-english,
and aerospace named entities are detected using aeroBERT-NER [120]. It is impor-
tant to note that aeroBERT-NER is only capable of recognizing five types of named
entities for which it was fine-tuned. Additionally, identifying elements in the NL
requirements, such as 〈condition〉, 〈system〉, etc., is done manually because these
elements may differ from one organization to another, or even within an organiza-
tion. Once the elements have been identified, they are matched with the previously
identified text chunks and named entities to partially automate the boilerplate identi-
fication process. This helps with the identification of boilerplate templates and with
checking the conformance of requirements to the identified templates. The aggre-
gation of various patterns in text chunks, named entities, and element sequences or
their presence or absence is utilized to identify boilerplate templates.
Findings:
1. The determination of the threshold for the number of requirements that need to follow a specific pattern for it to be considered a boilerplate template is at the discretion of the organization or practitioner.
Figure 7.3: Practitioner’s Guide to the identification of requirements boilerplate templates. A zoomed-in version of the figure can be found here.
2. Boilerplate structures may differ from one requirement type to another, and
furthermore, there may be multiple boilerplate templates for each requirement
type.
CHAPTER 8
CONCLUDING REMARKS
8.1 Conclusions
The field of Natural Language Processing (NLP) has seen limited application in the
aerospace industry, and its potential use in aerospace requirements engineering re-
mains largely unexplored. Despite the crucial role of NL in various requirements
engineering tasks throughout the system lifecycle, the aerospace industry has yet to
fully exploit the potential of NLP in this area. This research aims to fill the gap in
the literature by exploring the application of NLP techniques, including the use of
LLMs, to aerospace requirements engineering.
Transfer learning has made it easier to use Large Language Models (LLMs) in domains other than those they were originally trained on. They are trained on
extensive text corpora, and as a result, they possess a comprehensive understanding
of language rules, which allows them to be fine-tuned on smaller labeled datasets
for various downstream tasks, including specialized domains like medicine, law, en-
gineering, etc. This is especially beneficial in domains with limited resources, such
as aerospace requirements engineering, where acquiring large labeled datasets can be
difficult due to the proprietary nature of the requirements and the subject matter
expertise needed to create and annotate them. Therefore, the primary objective of
this thesis was to develop and apply tools, techniques, and methodologies centered
around LLMs that simplify the conversion of Natural Language (NL) requirements
into semi-machine-readable requirements (Figure 1.7). The adoption of this approach
is anticipated to promote the widespread utilization of LLMs for handling require-
ments on a larger scale and at a faster pace.
The certification requirements in Parts 23 and 25 of Title 14 CFR were the source
of the requirements used in this study, as system requirements are generally pro-
prietary. To familiarize readers with the requirements used in this study, numerous
examples were included throughout this dissertation. Additionally, the annotated cor-
pora employed for both the NER and classification tasks have been made open-source
(refer to Appendix A), with the aim of supporting future research in this field.
The developed corpora were utilized to fine-tune the BERT LM for two distinct
downstream tasks - identifying named entities specific to the aerospace domain, and
classifying aerospace requirements into different types. Their performance was com-
pared to that of other models, including some fine-tuned on the same corpora and
off-the-shelf models. For NER and requirements classification on aerospace text, both
models performed better than off-the-shelf models, despite being trained on small la-
beled datasets.
aeroBERT-NER and aeroBERT-Classifier have the ability to extract information
from aerospace text and requirements and store it in a more accessible format,
such as a Python list or dictionary. This extracted information was then utilized to
showcase the methodology for creating a requirements table, which is a tabular format
for storing requirements and their associated properties. This table can assist in the
creation of a SysML requirements table, potentially reducing the time, resources, and
manual labor involved in creating such a table from scratch.
Boilerplate templates for various types of requirements were identified using fine-
tuned models for classification and named entity recognition, as well as an off-the-shelf
sentence chunking model (flair/chunk-english). To account for variations within
each type of requirement, multiple boilerplate templates were obtained. The use of
these templates, particularly by inexperienced engineers working with requirements,
will ensure that requirements are written in a standardized form from the beginning.
In doing so, this dissertation democratizes a methodology for the identification of
boilerplates given a set of requirements, which can vary from one company or industry
to another.
This dissertation presents a comprehensive methodology that outlines the collec-
tion of text, annotation, training, and validation of language models for aerospace
text. Although the models developed may not be directly transferable to the propri-
etary system requirements used by aerospace companies, the methodology presented
herein is still relevant and can be easily reproduced.
Finally, the benefits offered by using NLP for standardizing aerospace require-
ments contribute to speeding up the design and development process and reducing
the workload on engineers. In addition, standardized requirements, if/when made semi-machine-readable, support a model-centric approach to engineering.
The methodology used for data collection, cleaning, and annotation is described
in detail to facilitate the reproducibility of corpus creation. The corpora gen-
erated in this study can be utilized for fine-tuning other language models for
downstream tasks, such as named entity recognition (NER) and requirements
classification in the aerospace industry.
The developed language models can also aid in the creation of model-based (e.g., SysML) requirement objects by extracting relevant phrases (system
names, resources, quantities, etc.) from free-form NL requirements in an
automated way.
Limitations of this work and avenues for future research are discussed
below.
8.3.1 aeroBERT-NER
This study focused on identifying five specific types of named entities (SYS, RES,
VAL, DATETIME, and ORG) and demonstrated the effectiveness of the proposed
methodology. The approach and resulting model, aeroBERT-NER, exhibit general-
izability. To build on this work, it would be valuable to train or fine-tune language
models that can identify additional types of named entities, which would aid in stan-
dardizing requirements further. For instance, named entities related to a system’s
functional attribute (FUNC) or the performance conditions (COND) under which a
system must operate could be of particular interest.
Figure 8.1: Example showing overlapping named entities (VAL and COND) which is
an avenue for furthering this work.
This work only considered non-overlapping named entities. However, it was noted
that there may be instances where two named entities overlap, particularly if entities
like FUNC and COND are included. As shown in the example in Figure 8.1, the
named entities COND and VAL overlap. Thus, another area for future research is to
develop a method for identifying overlapping aerospace named entities that will be
helpful for the standardization of requirements or for converting information present
in NL requirements into data objects. However, another way to achieve this same
functionality might be by employing a text chunking model along with a NER model, as demonstrated in this work.
This dissertation focused on fine-tuning a pre-trained BERT language model
for NER tailored to the aerospace domain. Hence, training an LM on aerospace
text from scratch, fine-tuning it for the NER task (using the annotated dataset
archanatikayatray/aeroBERT-NER), and comparing its performance to that of aeroBERT-NER would be interesting to explore.
8.3.2 aeroBERT-Classifier
With adequate labeled training data, aeroBERT-Classifier could also be extended to categorize requirements into types beyond the three considered in this work.
An interesting area for further investigation would be to compare the performance
of aeroBERT-Classifier with an LM that is trained from scratch on aerospace require-
ments and then fine-tuned for the requirements classification task using the annotated
dataset archanatikayatray/aeroBERT-classification.
Appendices
APPENDIX A
DATASETS
A.0.1 Accessing the open-source aerospace named entity recognition (NER) dataset
from datasets import load_dataset
import pandas as pd

dataset = load_dataset("archanatikayatray/aeroBERT-NER")

# Converting the dataset into a pandas DataFrame
dataset = pd.DataFrame(dataset["train"]["text"])
dataset = dataset[0].str.split("*", expand=True)

# Getting the headers from the first row
header = dataset.iloc[0]

# Excluding the first row since it contains the headers
dataset = dataset[1:]

# Assigning the header to the DataFrame
dataset.columns = header

# Viewing the last 10 rows of the annotated dataset
dataset.tail(10)
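If desired, the recovered structure can be checked before use; the snippet below relies only on generic pandas calls and makes no assumption about the dataset's column names.

# Basic sanity checks on the recovered DataFrame
print(dataset.shape)          # number of rows and columns after the split
print(list(dataset.columns))  # column headers recovered from the first row
print(dataset.head())         # first few annotated rows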
A.0.2 Accessing the open-source aerospace requirements classification dataset
from datasets import load_dataset
import pandas as pd

dataset = load_dataset("archanatikayatray/aeroBERT-classification")

# Converting the dataset into a pandas DataFrame
dataset = pd.DataFrame(dataset["train"]["text"])
dataset = dataset[0].str.split("*", expand=True)

# Getting the headers from the first row
header = dataset.iloc[0]

# Excluding the first row since it contains the headers
dataset = dataset[1:]

# Assigning the header to the DataFrame
dataset.columns = header

# Viewing the last 10 rows of the annotated dataset
dataset.tail(10)
APPENDIX B
TEST SET FOR IDENTIFICATION OF NAMED ENTITIES
Serial No. Aerospace requirements
10 Before December 1, 2012, an electrical or electronic system that per-
forms a function whose failure would prevent the continued safe flight
and landing of an airplane may be designed and installed without meet-
ing the provisions of paragraph (a) provided the system has previously
been shown to comply with special conditions for HIRF, prescribed
under Section 21-16, issued before December 1, 2007.
11 Each flight, navigation, and powerplant instrument for use by any pilot
must be plainly visible to him from his station with the minimum
practicable deviation from his normal position and line of vision when
he is looking forward along the flight path.
12 The flight instruments required by Section 25-1303 must be grouped
on the instrument panel and centered as nearly as practicable about
the vertical plane of the pilot’s forward vision.
13 Each airspeed indicating instrument must be approved and must be
calibrated to indicate true airspeed (at sea level with a standard atmo-
sphere) with a minimum practicable instrument calibration error when
the corresponding pitot and static pressures are applied.
14 Each pressure altimeter must be approved and must be calibrated to
indicate pressure altitude in a standard atmosphere, with a minimum
practicable calibration error when the corresponding static pressures
are applied.
15 If a flight instrument pitot heating system is installed, an indication
system must be provided to indicate to the flight crew when that pitot
heating system is not operating.
16 Each magnetic direction indicator must be installed so that its accuracy
is not excessively affected by the airplane’s vibration or magnetic fields.
17 The effects of a failure of the system to disengage the autopilot or
autothrust functions when manually commanded by the pilot must be
assessed in accordance with the requirements of Section 25-1309.
18 Under normal conditions, the disengagement of any automatic control
function of a flight guidance system may not cause a transient response
of the airplane’s flight path any greater than a minor transient.
19 Each powerplant and auxiliary power unit instrument line must meet
the requirements of Sections 25-993 and 25-1183.
20 Each powerplant and auxiliary power unit instrument that utilizes
flammable fluids must be installed and located so that the escape of
fluid would not create a hazard.
APPENDIX C
DEFINITIONS AND NLP CONCEPTS
• Corpus: A collection of texts (e.g., all the text available on Wikipedia) on which language models are trained
• Uncased vs. cased text: For a BERT LM, the text has to be either all cased or all uncased (lowercased) before being tokenized, depending on the model variant (Figure C.4)
Figure C.3: Word sense disambiguation - A computer mouse vs. a mouse (rodent)
Figure C.4: Cased and uncased text for BERT language model
• Tokenization: Separating text into smaller units (tokens) such as words, characters, or sub-words (Figure C.6); a brief tokenizer sketch is shown after Figure C.7
Figure C.7: WordPiece Tokenizer
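As a brief illustration of sub-word tokenization, the snippet below applies the WordPiece tokenizer that ships with the uncased BERT checkpoint in the Hugging Face transformers library; the example sentence is arbitrary.

from transformers import BertTokenizer

# WordPiece tokenization: rare words are split into sub-word pieces prefixed with "##"
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("The altimeter must be calibrated before each flight."))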
• Stop words: Words that contribute very little to the overall meaning of a
sentence and hence can be removed depending on the use case (Example: the,
a, an, etc.) (Figure C.8)
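A minimal, library-free sketch of stop-word removal is shown below; the abbreviated stop-word list and example tokens are placeholders.

# Filtering stop words from a tokenized sentence with a simple set lookup
stop_words = {"the", "a", "an", "of", "and", "to"}       # abbreviated example list
tokens = ["the", "altimeter", "must", "be", "calibrated"]
content_tokens = [token for token in tokens if token not in stop_words]
print(content_tokens)                                    # ['altimeter', 'must', 'be', 'calibrated']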
Figure C.10: Word Embeddings Word2Vec
Word2Vec encodes both senses of “mouse” (Figure C.3) as the same vector irrespective of the context, which is not helpful. Hence, BERT embeddings, which do not have this pitfall, are considered next.
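The sketch below, assuming the Hugging Face transformers library and the uncased BERT checkpoint, illustrates that BERT assigns different vectors to "mouse" depending on its sentence context, unlike a static Word2Vec embedding.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

sentences = ["The mouse ran across the floor.", "Click the left mouse button."]
# Assumes "mouse" appears as a single token in BERT's WordPiece vocabulary
mouse_id = tokenizer.convert_tokens_to_ids("mouse")
with torch.no_grad():
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt")
        hidden_states = model(**inputs).last_hidden_state[0]
        position = inputs["input_ids"][0].tolist().index(mouse_id)
        # The contextual embedding of "mouse" differs between the two sentences
        print(sentence, hidden_states[position][:5])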
• Large Language Models (LLMs): Language models trained on massive amounts of text data (Wikipedia, BookCorpus, etc.) that have the ability to generate coherent and contextually appropriate language responses, making them ideal for a wide range of NLP tasks such as language translation, question-answering, text summarization, and sentiment analysis. Examples of LLMs include GPT-3, BERT, and XLNet.
Figure C.13: Transformer Architecture showing the encoder and decoder blocks [37]
• Encoder Block: The encoder block is responsible for processing the input sequence and extracting its features. It consists of a stack of identical layers, with each layer containing two sub-layers: a self-attention mechanism and a feedforward neural network. The self-attention mechanism allows the encoder to attend to different parts of the input sequence, while the feedforward network processes the attended information [37] (Figure C.13). BERTBASE contains 12 encoder blocks, while BERTLARGE contains 24 [38].
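The number of encoder blocks can be read directly from a model's configuration; the short check below assumes the Hugging Face transformers library and the bert-base-uncased checkpoint.

from transformers import BertConfig

# BERT-base ships with 12 encoder blocks; BERT-large would report 24
config = BertConfig.from_pretrained("bert-base-uncased")
print(config.num_hidden_layers)   # 12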
• Decoder Block: The decoder block is responsible for generating the output sequence based on the extracted features from the encoder. It also consists of a stack of identical layers, each containing a masked self-attention mechanism that allows the decoder to attend to its own previously generated outputs, a cross-attention mechanism that allows it to attend to the encoder’s output, and a feedforward neural network [37] (Figure C.13).
Figure C.14: Query, Key, and Value matrices [37]
In order to map the tokenized input (X) onto the query, key, and value spaces, three weight matrices, W^Q, W^K, and W^V, are multiplied with the input matrix, resulting in Q, K, and V, respectively.
In this example, the Key and Value vectors have a dimension of 3 (d_k = 3), and the square root of d_k is used to normalize the attention scores via Equation 3.1. Lastly, the attention scores obtained are multiplied with the Value matrix (V) to generate the “context-ful” embedding for the word “like”. This process is repeated for each word/token present in the sequence (a short code sketch follows Figure C.16).
Figure C.15: Example showing Query (Q) and Key (K) vectors and attention score calculation for the word “like” in the sentence “I like dogs”. Key and Query live in different vector spaces. These figures are for demonstration purposes only.
Figure C.16: Example showing Query (Q) and Key (K) vectors and attention score
calculation for the word “like” in the sentence “I like dogs”. The equations for the
calculation of attention are shown. Key and Query live in different vector spaces.
These figures are for demonstration purposes only.
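The attention computation described above can be reproduced in a few lines; the sketch below implements softmax(QK^T / sqrt(d_k)) V with random toy matrices for a three-token sequence, mirroring (but not reproducing) the worked example in the figures.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V for a single attention head
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5    # raw attention scores
    weights = F.softmax(scores, dim=-1)              # normalized attention weights
    return weights @ V                               # "context-ful" embeddings

# Toy Query, Key, and Value matrices for three tokens ("I", "like", "dogs"), d_k = 3
Q, K, V = torch.randn(3, 3), torch.randn(3, 3), torch.randn(3, 3)
print(scaled_dot_product_attention(Q, K, V))         # one 3-dimensional vector per token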
• Multi-headed Attention: Multi-headed attention improves the performance of self-attention by allowing the model to attend to different parts of the input sequence simultaneously. In multi-headed attention, the input sequence is transformed into multiple representations (or “heads”) using different sets of learned Query (Q), Key (K), and Value (V) matrices. Each head operates independently, allowing the model to capture different types of relationships between the input sequence elements. The outputs of the heads are then concatenated and transformed into the final output. This approach enhances the model’s capacity to attend to different aspects of the input sequence and extract more complex relationships between them (Figure C.17). Multi-headed attention has been shown to be effective in various natural language processing tasks, such as machine translation, text classification, and named entity recognition; a short PyTorch sketch follows Figure C.17.
Figure C.17: Matrix multiplication intuition for the calculation of multi-headed at-
tention [104]
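For intuition, PyTorch's built-in multi-headed attention module can be exercised on a toy sequence; the dimensions below are arbitrary, and passing the same tensor as Query, Key, and Value makes it self-attention.

import torch
import torch.nn as nn

# Two heads over an 8-dimensional embedding; batch_first keeps (batch, tokens, features)
attention = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
sequence = torch.randn(1, 3, 8)                       # one sentence with three tokens
output, weights = attention(sequence, sequence, sequence)
print(output.shape)                                   # torch.Size([1, 3, 8])
print(weights.shape)                                  # torch.Size([1, 3, 3]), averaged over the heads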
REFERENCES
[5] INCOSE, “INCOSE Infrastructure Working Group charter,” pp. 3–5, (accessed: 01.10.2023).
[6] NASA, “Appendix C: How to write a good requirement,” no. 5, pp. 115–119, (accessed: 01.05.2022).
[7] NASA, “2.1 The common technical processes and the SE engine,” J. Object Technol., vol. 4, no. 1, (accessed: 01.10.2023).
[9] B. Regnell, R. B. Svensson, and K. Wnuk, “Can we beat the complexity of very
large-scale requirements engineering?” In International Working Conference on
Requirements Engineering: Foundation for Software Quality, Springer, 2008,
pp. 123–128.
[10] Google, “Natural language — Google Arts & Culture,” Google, (accessed: 01.10.2022).
[13] T. E. Bell and T. A. Thayer, “Software requirements: Are they really a prob-
lem?” In Proceedings of the 2nd international conference on Software engineer-
ing, 1976, pp. 61–68.
[19] H. Yang, A. De Roeck, V. Gervasi, A. Willis, and B. Nuseibeh, “Speculative
requirements: Automatic detection of uncertainty in natural language require-
ments,” in 2012 20th IEEE International Requirements Engineering Confer-
ence (RE), IEEE, 2012, pp. 11–20.
[24] J. Oberg, “Why the Mars probe went off course [accident investigation],” IEEE Spectrum, vol. 36, no. 12, pp. 34–39, 1999.
[29] L. Lemazurier, V. Chapurlat, and A. Grossetête, “An MBSE approach to pass from requirements to functional architecture,” IFAC-PapersOnLine, vol. 50, no. 1, pp. 7260–7265, 2017.
[31] L. Wheatcraft, M. Ryan, J. Llorens, and J. Dick, “The need for an information-
based approach for requirement development and management,” in INCOSE
International Symposium, Wiley Online Library, vol. 29, 2019, pp. 1140–1157.
[37] A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention is all you need,” Ad-
vances in neural information processing systems, vol. 30, 2017.
[38] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. arXiv: 1810.04805 [cs.CL].
[39] M. Lewis, Y. Liu, N. Goyal, et al., “BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” CoRR, vol. abs/1910.13461, 2019. arXiv: 1910.13461.
[40] L. Zhao, W. Alhoshan, A. Ferrari, et al., “Natural language processing for re-
quirements engineering: A systematic mapping study,” ACM Computing Sur-
veys (CSUR), vol. 54, no. 3, pp. 1–41, 2021.
[41] Z. Liu, B. Li, J. Wang, and R. Yang, “Requirements engineering for crossover
services: Issues, challenges and research directions,” IET Software, vol. 15,
no. 1, pp. 107–125, 2021. eprint: https://ietresearch.onlinelibrary.wiley.com/
doi/pdf/10.1049/sfw2.12014.
[44] R. Sonbol, G. Rebdawi, and N. Ghneim, “The use of NLP-based text representation techniques to support requirement engineering tasks: A systematic mapping review,” IEEE Access, vol. PP, pp. 1–1, 2022.
[47] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., “Lan-
guage models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8,
p. 9, 2019.
[53] Y. Goldberg, “Neural network methods for natural language processing,” Syn-
thesis lectures on human language technologies, vol. 10, no. 1, pp. 1–309, 2017.
[54] D. Jurafsky and J. H. Martin, “Speech and language processing (draft),” 2021.
[57] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word
representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
[59] K. Cho, B. van Merrienboer, Ç. Gülçehre, et al., “Learning phrase representations using RNN encoder–decoder for statistical machine translation,” in EMNLP, 2014.
[63] T. Brown, B. Mann, N. Ryder, et al., “Language models are few-shot learn-
ers,” in Advances in Neural Information Processing Systems, H. Larochelle, M.
Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33, Curran Associates,
Inc., 2020, pp. 1877–1901.
[65] C. Sun, X. Qiu, Y. Xu, and X. Huang, “How to fine-tune BERT for text classification?” In China national conference on Chinese computational linguistics, Springer, 2019, pp. 194–206.
[66] J. Alammar. “The illustrated BERT, ELMo, and co. (How NLP cracked transfer learning).” (2018). (accessed: 02.21.2022).
[74] J. D. Palmer, Y. Liang, and L. Want, “Classification as an approach to re-
quirements analysis,” Advances in Classification Research Online, vol. 1, no. 1,
pp. 131–138, 1990.
[77] J. Cleland-Huang, R. Settimi, X. Zou, and P. Solc, “The detection and clas-
sification of non-functional requirements with application to early aspects,”
in 14th IEEE International Requirements Engineering Conference (RE’06),
IEEE, 2006, pp. 39–48.
[82] M. Binkhonain and L. Zhao, “A review of machine learning algorithms for
identification and classification of non-functional requirements,” Expert Sys-
tems with Applications: X, vol. 1, p. 100 001, 2019.
[89] A. Rajan and T. Wahl, CESAR: Cost-efficient methods and processes for
safety-relevant embedded systems. Springer, 2013.
[90] A. Ruiz, M. Sabetzadeh, P. Panaroni, et al., “Challenges for an open and evo-
lutionary approach to safety assurance and certification of safety-critical sys-
tems,” in 2011 First International Workshop on Software Certification, IEEE,
2011, pp. 1–6.
[96] T. Rose, N. Haddock, and R. Tucker, “The effects of corpus size and homo-
geneity on language model quality,” 1997.
[97] I. Beltagy, K. Lo, and A. Cohan, “SciBERT: A pretrained language model for scientific text,” arXiv preprint arXiv:1903.10676, 2019.
[99] D. Araci, “FinBERT: Financial sentiment analysis with pre-trained language models,” arXiv preprint arXiv:1908.10063, 2019.
[102] J.-S. Lee and J. Hsiang, “PatentBERT: Patent classification with fine-tuning a pre-trained BERT model,” arXiv preprint arXiv:1906.02124, 2019.
[103] O. Sharir, B. Peleg, and Y. Shoham, “The cost of training NLP models: A concise overview,” arXiv preprint arXiv:2004.08900, 2020.
[109] M. Warnier and A. Condamines, “Improving requirement boilerplates using
sequential pattern mining,” in Europhras 2017, 2017.
[114] A. Akbik, D. Blythe, and R. Vollgraf, “Contextual string embeddings for se-
quence labeling,” in COLING 2018, 27th International Conference on Com-
putational Linguistics, 2018, pp. 1638–1649.
[115] D. Jurafsky. “Speech and language processing – chapter 8 slides.” (). (accessed:
02.21.2022).
[118] Y. Wautelet, S. Heng, M. Kolp, and I. Mirbel, “Unifying and extending user
story models,” M. Jarke, J. Mylopoulos, C. Quix, et al., Eds., pp. 211–225,
2014.
[119] FAA, Overview — Title 14 of the Code of Federal Regulations (14 CFR), 2013, (accessed: 02.21.2022).
[122] L. S. Wheatcraft. “Everything you wanted to know about interfaces, but were
afraid to ask.” (2017). (accessed: 01.10.2023).
[131] X. Wang, Y. Jiang, N. Bach, et al., “Automated concatenation of embeddings
for structured prediction,” in Proceedings of the 59th Annual Meeting of the
Association for Computational Linguistics and the 11th International Joint
Conference on Natural Language Processing (Volume 1: Long Papers), Online:
Association for Computational Linguistics, Aug. 2021, pp. 2643–2660.
[132] J. Pennington, R. Socher, and C. Manning, “GloVe: Global vectors for word
representation,” in Proceedings of the 2014 Conference on Empirical Meth-
ods in Natural Language Processing (EMNLP), Doha, Qatar: Association for
Computational Linguistics, Oct. 2014, pp. 1532–1543.
[135] C. McCormick and N. Ryan. “Bert word embeddings tutorial.” (2019). (ac-
cessed: 02.21.2022).