Pierre M. Nugues
Python for Natural Language Processing
Programming with NumPy, scikit-learn, Keras, and PyTorch
Third Edition
Cognitive Technologies
Editor-in-Chief
Daniel Sonntag, German Research Center for AI, DFKI, Saarbrücken, Saarland,
Germany
Titles in this series now included in the Thomson Reuters Book Citation Index and
Scopus!
The Cognitive Technologies (CT) series is committed to the timely publishing
of high-quality manuscripts that promote the development of cognitive technolo-
gies and systems on the basis of artificial intelligence, image processing and
understanding, natural language processing, machine learning and human-computer
interaction.
It brings together the latest developments in all areas of this multidisciplinary
topic, ranging from theories and algorithms to various important applications. The
intended readership includes research students and researchers in computer science,
computer engineering, cognitive science, electrical engineering, data science and re-
lated fields seeking a convenient way to track the latest findings on the foundations,
methodologies and key applications of cognitive technologies.
The series provides a publishing and communication platform for all cognitive
technologies topics, including but not limited to these most recent examples:
• Interactive machine learning, interactive deep learning, machine teaching
• Explainability (XAI), transparency, robustness of AI and trustworthy AI
• Knowledge representation, automated reasoning, multiagent systems
• Common sense modelling, context-based interpretation, hybrid cognitive tech-
nologies
• Human-centered design, socio-technical systems, human-robot interaction, cog-
nitive robotics
• Learning with small datasets, never-ending learning, metacognition and intro-
spection
• Intelligent decision support systems, prediction systems and warning systems
• Special transfer topics such as CT for computational sustainability, CT in
business applications and CT in mobile robotic systems
The series includes monographs, introductory and advanced textbooks, state-
of-the-art collections, and handbooks. In addition, it supports publishing in Open
Access mode.
Pierre M. Nugues
Department of Computer Science
Lund University
Lund, Sweden
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2006, 2014, 2024
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface to the Third Edition

Many things have changed since the last edition of this book. In all the areas of
natural language processing, progress has been astonishing. Recent achievements
in text generation even spurred a media interest that went beyond the traditional
academic circles. Text processing has become more of a mainstream industrial tool,
and countless companies now use it to some extent. A revision of this book was
therefore necessary to adapt it to these recent developments.
As in the first two editions, the intention is to expose the reader to the theories
used in natural language processing, but also to programming examples that are
essential to a good understanding of the concepts. Although present in the previous
editions, machine learning is now even more pervasive and has replaced many of
the earlier techniques to process text. Machine learning relies on mathematical
principles that we cannot assume every reader knows. New ways to combine the
processing modules have also appeared, and Python has emerged as the dominant
programming language in the field. This made a complete rewrite of the book even
more indispensable to replace the programming language, present the mathematical
background of machine learning, describe the new architectures, and update all the
programming parts accordingly.
Throughout the chapters, the reader will become familiar with new topics and
mathematical models, as well as with programs to experiment with. Many new techniques build on
the availability of text. Using Python notebooks, the reader will load small corpora,
format the text, apply the models through the execution of pieces of code, discover
gradually the theoretical parts, possibly modify the code or the parameters, and
traverse theories and concrete problems through a constant interaction between the
user and the machine. We tried to keep all the data sizes and hardware requirements
reasonable so that a user can see instantly, or at least quickly, the results of most
experiments on most machines.
All these programs are available in the form of Python notebooks from the
GitHub repository: https://github.com/pnugues/pnlp. We hope the reader will enjoy
them and that they will spark ideas to modify and improve the code. Notebooks are
a wonderful tool that makes it so easy and engaging to test hypotheses. Ideally, they
will inspire the reader to come up with new algorithms and, hopefully, results better
than ours. While this book is intended to be a textbook, contrary to the two previous
editions, we did not include exercises at the end of chapters as we felt the notebooks
were already exercises in themselves.
No preface goes without acknowledgments, and I would like to thank the PhD
candidates I supervised between the two editions of this book, for all the stimulating
discussions and work we did together; in chronological order: Peter Exner, Dennis
Medved, Håkan Jonsson, and Marcus Klang. I would also like to thank Maj
Stenmark and John Rausér Porsback for pointing out typos or mistakes in the last
drafts, as well as Marcus again for advice on some programs.
A last word on me: August 15, 2016, was an unfortunate day in my life, when
employees from a renovation company demolished the window of my office without
proper warning, signaling, or securing the workplace. This left me with mutilated
inner ears and a debilitating tinnitus. It considerably delayed this edition. Only the
help of my wife, Charlotte, and the attention of my children, Andreas and Louise,
helped me overcome this maddening condition. I hope one day research will find a
way to cure it (Wang et al., 2019).
Preface to the Second Edition

Eight years, from 2006 to 2014, is a very long time in computer science. The trends
I described in the preface of the first edition have not only been confirmed, but
accelerated. I tried to reflect this with a complete revision of the techniques exposed
in this book: I redesigned or updated all the chapters, I introduced two new ones,
and, most notably, I considerably expanded the sections using machine-learning
techniques. To make room for them, I removed a few algorithms of lesser interest.
This enabled me to keep the size of the book to ca. 700 pages. The programs and
companion slides are available from the book web site at http://ilppp.cs.lth.se/.
This book corresponds to a course in natural language processing offered at Lund
University. I am grateful to all the students who took it and helped me write this new
edition through their comments and questions. Curious readers can visit the course
site at http://cs.lth.se/EDAN20/ and see how we use this book in a teaching context.
I would like to thank the many readers of the first edition who gave me feedback
or reported errors, the anonymous copy editor of the first and second editions,
Richard Johansson and Michael Covington for their suggestions, as well as Peter
Exner, the PhD candidate I supervised during this period, for his enthusiasm. Special
thanks go to Ronan Nugent, my editor at Springer, for his thorough review and
copyediting along with his advice on style and content.
This preface would not be complete without a word to those who passed away,
my aunt, Madeleine, and my father, Pierre. There is never a day I do not think of
you.
Preface to the First Edition
In the past 20 years, natural language processing and computational linguistics have
considerably matured. The move has mainly been driven by the massive increase of
textual and spoken data and the need to process them automatically. This dramatic
growth of available data spurred the design of new concepts and methods, or their
improvement, so that they could scale up from a few laboratory prototypes to proven
applications used by billions of people. Concurrently, the speed and capacity of
machines became an order of magnitude larger, enabling us to process gigabytes of
data and billions of words in a reasonable time, to train, test, retrain, and retest
algorithms like never before. Although systems entirely dedicated to language
processing remain scarce, there are now scores of applications that, to some extent,
embed language processing techniques.
The industry trend, as well as the user’s wishes, toward information systems able
to process textual data has made language processing a new requirement for many
computer science students. This has shifted the focus of textbooks from readers
being mostly researchers or graduate students to a larger public, from readings by
specialists to pragmatism and applied programming. Natural language processing
techniques are not completely stable, however. They consist of a mix that ranges
from well-mastered and routine to rapidly changing. This makes the existence of a
new book an opportunity as well as a challenge.
This book tries to take on this challenge and find the right balance. It adopts
a hands-on approach. It is a basic observation that many students have difficulties
going from an algorithm exposed using pseudocode to a runnable program. I did
my best to bridge the gap and provide the students with programs and ready-made
solutions. The book contains real code the reader can study, run, modify, and run
again. I chose to write examples in two languages to make the algorithms easy to
understand and encode: Perl and Prolog.
One of the major driving forces behind the recent improvements in natural
language processing is the increase of text resources and annotated data. The huge
amount of texts made available by the Internet and never-ending digitization led
many practitioners to evolve from theory-oriented, armchair linguists to frantic
empiricists. This book does its best to reflect this trend and
stresses the importance of corpora, annotation, and annotated corpora. It also tries
to go beyond English only and present examples in two other languages, namely
French and German.
The book was designed and written for a quarter or semester course. At Lund,
I used it when it was still in the form of lecture notes in the EDA171 course. It
comes with a companion web site where slides, programs, corrections, an additional
chapter, and Internet pointers are available: http://www.cs.lth.se/~pierre/ilppp/. All
the computer programs should run with Perl (available from www.perl.com) or
Prolog. Although I only tested the programs with SWI-Prolog, available from www.
swi-prolog.org, any Prolog compatible with the ISO reference should work.
Many people helped me during the last 10 years when this book took shape,
step-by-step. I am deeply indebted to my colleagues and to my students in classes
at Caen, Nottingham, Stafford, Constance, and now in Lund. Without them, it could
never have existed. I would like most specifically to thank the PhD students I
supervised, in chronological order, Pierre-Olivier El Guedj, Christophe Godéreaux,
Dominique Dutoit, and Richard Johansson.
Finally, my acknowledgments would not be complete without the names of the
people I most cherish and who give meaning to my life: my wife, Charlotte, and my
children, Andreas and Louise.
Chapter 1
An Overview of Language Processing
Γνῶθι σεαυτόν
‘Know thyself’
Inscription at the entrance to Apollo’s Temple at Delphi
Spelling and grammar checkers. These programs are now ubiquitous in text
processors, and hundreds of millions of people use them every day. Spelling
checkers have been based primarily on computerized dictionaries. They can
now remove most misspellings that occur in documents. Grammar checkers,
although not perfect, have improved to a point that many users could not write a
single e-mail without them. Grammar checkers use logical rules or mathematical
language models to detect common grammar and style errors.
Text indexing and information retrieval from the internet. These programs
are among the most popular on the Web. They are based on crawlers that visit
internet sites and download the texts they contain. Crawlers track the links
occurring on the pages and thus explore the Web. Many of these systems carry
out a full text indexing of the pages. Users ask questions and text retrieval
systems return the internet addresses of documents containing words of the
question or related concepts. Using statistics on words, popularity measures, and
user interactions, text retrieval systems rank the documents and present their
results instantly to the user.
Speech transcription. These systems are based on speech recognition. Instead
of typing using a keyboard, speech dictation systems allow a user to dictate
reports and transcribe them automatically into a written text. Systems like
Microsoft’s Speech Recognition or Google’s Voice Search have high performance
and recognize many languages. Some systems transcribe radio and TV broadcast
news, and even songs, thus providing automatic subtitles.
Voice control of domestic devices. These systems are embedded in objects to
provide them with a friendlier interface, or to operate them when your hands are
busy, as when driving a car. Many people find electronic devices complicated and are unable to
use them satisfactorily. A spoken interface would certainly be an easier means
to control them. One challenge they still have to overcome is to operate in noisy
environments that impair the recognition.
Well-known examples of such evaluation campaigns include the Conference
on Machine Translation (WMT) and the shared tasks of the Conference on Natural
Language Learning (CoNLL).
1.3.1 Ambiguity
The spoken sentence The boys eat the sandwiches, for instance, shows eight other
alternative readings at the word decoding stage:
*The boy seat the sandwiches.
*The boy seat this and which is.
*The boys eat this and which is.
The buoys eat the sandwiches.
*The buoys eat this and which is.
The boys eat the sand which is.
*The buoys seat this and which is.
Natural language processing often starts with the choice or the development of a
formal model and its algorithmic implementation. In any scientific discipline, good
models are difficult to design. This is specifically the case with language. Language
is closely tied to human thought and understanding, and in some instances such
models also involve the study of the human mind. This gives a measure of the
complexity of the description and the representation of language.
The NLP field has seen many theories and models come and go. Unfortunately,
few of them have been elaborate enough to encompass and describe language
effectively. Some models have also been misleading. This explains somewhat the
failures of early attempts in language processing. In addition, many of the current or
potential models require massive computing power. Processors and storage able to
support their implementation with substantial corpora, for instance, were not widely
available until recently.
However, in the last two decades models have matured, data has become
available, and computing power has become inexpensive. Although models and
implementations are rarely perfect, they now enable us to obtain exceptional results,
unimaginable a few years ago. Most use a limited set of techniques pertaining
to the theory of probability, statistics, and machine learning that we will consider
throughout this book.
In this book, we will cover recent models and architectures that have been
instrumental in the recent progress of natural language processing. Each chapter
is dedicated to a specific topic that we will relate to applications and illustrate with
Python programs. As interaction is of primary importance to analyze the behavior
of an algorithm or a piece of code, the programs are all available in the form of
notebooks where readers can run the different pieces, modify them, see the results,
and ultimately understand them.
This book does not presuppose a deep knowledge of Python. Chapter 2, A Tour of
Python, gives an introduction to this language, especially directed to text processing.
This will enable us to touch all the data structures and programming constructs we
need to implement our algorithms.
Large collections of texts form the raw material of natural language processing.
Chapter 3, Corpus Processing Tools, as a follow up to the Python introduction,
describes regular expressions, both a simple and efficient tool to process text. We
will see how to use them in Python to match patterns in text and apply some
transformations. This chapter also reviews techniques to carry out approximate
string matching.
The internet is multilingual and languages use different scripts. Chapter 4,
Encoding and Annotation Schemes, describes Unicode, a standard to encode nearly
all the existing characters, and the Unicode regular expressions. Once encoded, most
texts not only contain sequences of characters, but also embed a structure in the form
of markups. This chapter outlines them as well as elementary techniques to collect
corpora from the internet and parse their markup.
Machine learning is the main technique we use in this book. Chapter 5, Python
for Numerical Computations, describes NumPy arrays and PyTorch tensors, the
fundamental structures to represent and process numerical data in Python. This
chapter also provides a reminder on the mathematical operations on vectors and
matrices and how to apply them with Python.
Chapter 6, Topics in Information Theory and Machine Learning, proceeds with
elementary concepts of information theory such as entropy and perplexity. Using
these concepts, a small dataset, and the scikit-learn machine-learning toolkit, we
will create our first classifier based on decision trees.
Logistic regression is a linear classification technique and a fundamental element
of many machine-learning systems. Chapter 7, Linear and Logistic Regression,
introduces it as well as gradient descent, a technique to fit the parameters of a
logistic regression model to a dataset. Using scikit-learn again, we will train a
logistic regression classifier to determine the language of a text from its character
counts.
Logistic regression is an effective, but elementary technique. Chapter 8, Neural
Networks, describes how we can extend it by stacking more layers and functions,
and create a feed-forward neural network. This chapter also describes the backprop-
agation algorithm to adapt gradient descent to networks with multiple layers. Using
Keras and PyTorch this time, we will train new neural networks to classify our texts.
In the chapters so far, we have not used the concept of a word. Chapter 9, Counting
and Indexing Words, describes how to segment a text into words and sentences, either
using regular expressions or a classifier. It also explains how to count the words
of a text and index all the words contained in a collection of texts. Finally, it
outlines a document representation using vectors of word frequencies and applies
this representation in a new PyTorch model to classify texts.
Starting from the word counts of the previous chapter, Chap. 10, Word Sequences,
introduces word sequences, N-grams, and language models. It illustrates
them with programs to derive sequence probabilities and smooth their distributions.
We finally use these probabilities to generate text sequences automatically.
In the two previous chapters, we represented the words as strings or Boolean vec-
tors. Chapter 11, Dense Vector Representations, describes techniques to convert the
words into relatively small numerical vectors also called dense representations. We
examine more specifically principal component analysis as well as neural network
alternatives, GloVe and CBOW. The chapter contains programs to experiment with
these representations.
The internet has become the main source of pedagogical and technical references:
on-line courses, digital libraries, general references, corpus and software resources,
together with registries and portals. Wikipedia contains definitions and general
articles on concepts and theories used in machine learning and natural language
processing.
Many programs are available as open source. They include speech synthesis and
recognition, language models, morphological analysis, parsing, transformers, and so
on. The spaCy library2 is an open-source NLP toolkit in Python that features many
components to train and apply models. The Natural Language Toolkit (NLTK)3 is
another valuable suite of open-source Python programs, datasets, and tutorials. It has
a companion book: Natural Language Processing with Python by Bird et al. (2009).
On the machine-learning side, the scikit-learn toolkit4 (Pedregosa et al. 2011) is
an easy-to-use, general purpose set of machine-learning algorithms. It has excellent
documentation and tutorials.
The source code of these three toolkits, spaCy, NLTK, and scikit-learn, is
available from GitHub and is worth examining, at least partly. To understand how a
program is designed, how its files are structured, and what the best coding practices
are, reading its code is always informative. Beyond these toolkits, GitHub has an amazing
wealth of open-source code and, when you need to understand an algorithm or a
model, searching GitHub may help you find the rare gem.
As the field is constantly evolving, reading scientific papers is necessary to keep
up with the changes. A starting point is the ACL anthology,5 an extremely valuable
source of research papers (journal and conferences) published under the auspices of
the ACL. Such papers are generally reliable as they are reviewed by other scientists
before they are published.
A key difference with scientific publishing from a few years ago is the speed of
dissemination and the emergence of a permanent benchmarking. Many authors, as
soon as they come up with a new finding or surpass the state of the art in a specific
application, publish a paper immediately and submit it to a conference later. arXiv6
is the major open-access repository of such papers.
2 https://spacy.io/.
3 https://www.nltk.org/.
4 https://scikit-learn.org/.
5 https://aclanthology.org/.
6 https://arxiv.org/.
Chapter 2
A Tour of Python
不闻不若闻之，闻之不若见之，见之不若知之，知之不若行之，学至于行之而止矣。
Chinese tenet sometimes attributed to Xunzi, usually abridged in English as
I hear and I forget, I see and I remember, I do and I understand.
Python has become the most popular scripting language. Perl, Ruby, or Lua have
similar qualities and goals, sport active developer communities, and have application
niches. Nonetheless, none of them can claim the spread and universality of Python.
Python’s rigorous design, ascetic syntax, simplicity, and the availability of scores of
libraries made it the language chosen by almost 70% of the American universities
for teaching computer science (Guo 2014). This makes Python inescapable when it
comes to natural language processing.
We used Perl in the first editions of this book as it featured rich regular
expressions and a support for Unicode; they are still unsurpassed. Python later
adopted these features to a point that now makes the lead of Perl in these areas
less significant. And the programming style conveyed by Python, both Spartan and
elegant, eventually prevailed. The purpose of this chapter is to provide a quick
introduction to Python’s syntax to readers with some knowledge in programming.
Python comes in two flavors: Python 2 and Python 3. In this book, we only
use Python 3 as Python 2 does not properly support Unicode. Moreover, given a
problem, there are often many ways to solve it. Among the possible constructs,
some are more conformant to the spirit of Python. van Rossum et al. (2013) wrote a
guide on the Pythonic coding style that we try to follow in this book.
$ python
>>> a = 1 We create variable a and assign it with 1
>>> b = 2 We create b and assign it with 2
>>> b + 1 We add 1 to b
3 And Python returns the result
>>> c = a / (b + 1) We carry out a computation and assign it to c
>>> c We print c
0.3333333333333333
>>> text = 'Result:' We create text and assign it with a string
>>> print(text, c) And we print both text and c
Result: 0.3333333333333333
>>> quit()
$
Like all the structured languages, programs in Python consist of blocks, i.e. bodies
of contiguous statements together with control statements. In Python, these blocks
are defined by an identical indentation: We create a new block by adding an
indentation of four spaces from the previous line. This indentation is decreased by
the same number of spaces to mark the end of the block.
The program below uses a loop to print the numbers of a list. The loop starts
with the for and in statements ended with a colon. After this statement, we add an
indentation of four spaces to define the body of the loop: The statements executed
by this loop. We remove the indentation when the block has ended:
for i in [1, 2, 3, 4, 5, 6]:
print(i)
print(’Done’)
The next program introduces a condition with the if and else statements, also
ended with a colon, and the modulo operator, %, to print the odd and even numbers:
for i in [1, 2, 3, 4, 5, 6]:
if i % 2 == 0:
print(’Even:’, i)
else:
print(’Odd:’, i)
print(’Done’)
2.4 Strings
Let us store the first sentence of the Iliad in a string. With triple quotes, we can
write it across two lines, and the resulting string will contain the line break:
iliad_opening = """Sing, O goddess, the anger of Achilles son of
Peleus, that brought countless ills upon the Achaeans."""
If we prefer a single-line string and still want to wrap the line so that it fits our text
editor, we will use the backslash continuation character, \, as in:
iliad_opening2 = ’Sing, O goddess, the anger of Achilles son of \
Peleus, that brought countless ills upon the Achaeans.’
where the line break is ignored and iliad_opening2 is equivalent to one single
line. We can use any type of quote then.
We access the characters in a string using their index enclosed in square brackets,
starting at 0:
alphabet = ’abcdefghijklmnopqrstuvwxyz’
alphabet[0] # ’a’
alphabet[1] # ’b’
alphabet[25] # ’z’
We can use negative indices, that start from the end of the string:
alphabet[-1] # the last character of a string: ’z’
alphabet[-2] # the second last: ’y’
alphabet[-26] # ’a’
An index outside the range of the string, like alphabet[27], will throw an index
error.
The length of a string is given by the len() function:
len(alphabet) # 26
There is no limit to this length; we can use a string to store a whole corpus, provided
that our machine has enough memory.
Once created, strings are immutable and we cannot change their content:
alphabet[0] = ’b’ # throws an error
Strings come with a set of built-in operators and functions. We concatenate and
repeat strings using + and * as in:
’abc’ + ’def’ # ’abcdef’
’abc’ * 3 # ’abcabcabc’
We can iterate over the characters of a string using a for in loop, and for
instance extract all its vowels as in:
text_vowels = ’’
for c in iliad_opening:
if c in ’aeiou’:
text_vowels = text_vowels + c
print(text_vowels) # ’ioeeaeoieooeeuaououeiuoeaea’
Python has a shorthand for this kind of accumulation: the augmented assignment
operators. They let us rewrite the statement
text_vowels = text_vowels + c
into
text_vowels += c
The same shorthand exists for all the arithmetic operators: -=, *=, /=, **=, and %=.
2.4.3 Slices
We can extract substrings of a string using slices: A range defined by a start and an
end index, [start:end], where the slice will include all the characters from index
start up to index end - 1:
alphabet[0:3] # the three first letters of alphabet: ’abc’
alphabet[:3] # equivalent to alphabet[0:3]
alphabet[3:6] # substring from index 3 to index 5: ’def’
alphabet[-3:] # the three last letters of alphabet: ’xyz’
alphabet[10:-10] # ’klmnop’
alphabet[:] # all the letters: ’a...z’
The characters in the strings are interpreted literally by Python, except the quotes
and backslashes. To create strings containing these two characters, Python defines
two escape sequences: \’ to represent a quote and \\ to represent a backslash as in:
’Python\’s strings’ # "Python’s strings"
This expression creates the string Python’s strings; the backslash escape charac-
ter tells Python to read the quote literally instead of interpreting it as an end-of-string
delimiter.
We can also use literal single quotes inside a string delimited by double quotes
as in:
"Python’s strings" # "Python’s strings"
Python also has raw strings, prefixed with the letter r, as in r'\n', where the
backslashes are read literally instead of being interpreted as escape sequences.
These raw strings will be useful to write regular expressions; see Sect. 3.4.
Python can interpolate variables inside strings. This process is called formatting
and uses the str.format() function. The positions of the variables in the string
are given by curly braces: {} that will be replaced by the arguments in format() in
the same order as in:
begin = ’my’
’{} string {}’.format(begin, ’is empty’)
# ’my string is empty’
format() has many options like reordering the arguments through indices:
begin = ’my’
’{1} string {0}’.format(’is empty’, begin)
# ’my string is empty’
If the input string contains braces, we escape them by doubling them: {{ for a
literal { and }} for }.
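Beyond reordering, format() also accepts format specifications after a colon inside
the braces, for instance to control the number of decimals or the field width; two
quick illustrations:
'{:.3f}'.format(1/3) # '0.333'
'{:>8}'.format('abc') # '     abc', right-aligned in a field of eight characters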
Python uses objects to model data. This means that all the variables we have used
so far refer to objects. Each Python object has an identity, a value, and a type to
categorize it. The identity is a unique number that is assigned to the object at runtime
when it is created so that there is no ambiguity on it. The type defines what kind of
operations we can apply to this object.
We return the object identity with the id() function and its type with type().
The respective value, identity, and type of 12 are:
12 # 12
id(12) # 140390267284112
type(12) # <class ’int’>
When we assign the object to a (or we give the name a to the Python object 12),
we have the same identity:
a = 12
id(a) # 140390267284112
type(a) # <class ’int’>
The identity value is unique, but it will depend on the machine the program is
running on.
In addition to integers, int, the primitive types include:
• The floating point numbers, float.
• The Boolean type, bool, with the values True and False;
• The None type with the None value as unique member, equivalent to null in C or
Java;
A few examples:
type(12.0) # <class ’float’>
type(True) # <class ’bool’>
type(1 < 2) # <class ’bool’>
type(None) # <class ’NoneType’>
We have also seen the str string data type consisting of sequences of Unicode
characters:
id(’12’) # 140388663362928
type(’12’) # <class ’str’>
alphabet # ’abcdefghijklmnopqrstuvwxyz’
id(alphabet) # 140389733251552
type(alphabet) # <class ’str’>
Python supports the conversion of types using a function with the type name as
int() or str(). When the conversion is not possible, Python throws an error:
int(’12’) # 12
str(12) # ’12’
int(’12.0’) # ValueError
int(alphabet) # ValueError
int(True) # 1
int(False) # 0
bool(7) # True
bool(0) # False
bool(None) # False
Like in other programming languages, the Boolean True and False values have
synonyms in the other types:
False: 0 for int, 0.0 for float, and None for the None type. Empty data structures
are in general also synonyms of False, such as the empty string (str) '' and the
empty list, [];
True: everything else.
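A few quick checks of these synonyms:
bool('') # False
bool([]) # False
bool('abc') # True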
2.6.1 Lists
Lists in Python are data structures that can hold any number of elements of any
type. Like in strings, each element has a position, where we can read data using the
position index. We can also write data to a specific index and a list grows or shrinks
automatically when elements are appended, inserted, or deleted. Python manages
the memory without any intervention from the programmer.
We create a list by enclosing its elements, possibly of different types, in square
brackets, for instance:
list2 = [1, 2, 3]
list3 = [1, 3.14, 'Prolog', 'my string']
Reading or writing a value at a position of the list is done using its index between
square brackets, starting from 0. If an element is read or assigned to a position that
does not exist, Python returns an index error:
list2[1] # 2
list2[1] = 8
list2 # [1, 8, 3]
list2[4] # Index error
As with strings, we can extract sublists from a list using slices. The syntax is the
same, but unlike strings, we can also assign a list to a slice:
list3[1:3] # [3.14, ’Prolog’]
list3[1:3] = [2.72, ’Perl’, ’Python’]
list3 # [1, 2.72, ’Perl’, ’Python’, ’my string’]
Lists can also contain other lists, as in list4 = [list2, list3],
where we access the elements of the inner lists with a sequence of indices between
square brackets:
list4[0][1] # 8
list4[1][3] # ’Python’
We can also assign a complete list to a variable and a list to a list of variables as
in:
list5 = list2
[v1, v2, v3] = list5
where list5 and list2 refer to the same list, and v1, v2, v3 contain, respectively,
1, 8, and 3.
The statement list5 = list2 creates a variable referring to the same list as
list2. In fact, list5 and list2 are two names for the same object. To create a
new shallow copy of a list, we need to use list.copy(). For example
list6 = list2.copy()
At this point, list2 and list6 are only equal, while list2 and list5 are
identical. We check the equality with == and the identity with is:
list2 == list5 # True
list2 == list6 # True
list2 is list5 # True
list2 is list6 # False
If we now restore list2 with list2[1] = 2, list5 follows since it is the same
object, whereas list6, the copy, is unaffected:
list2 # [1, 2, 3]
list5 # [1, 2, 3] same as list2
list6 # [1, 8, 3] independent list
Note that copy() does not copy the inner objects of the list, only the identities:
id(list2) # 140388791294656
id(list4.copy()[0]) # 140388791294656
This means that if we modify the content of list2, a shallow copy of list4 will
also change.
To create a deep copy of a list, or recursive copy, we need to use
copy.deepcopy():
import copy
id(copy.deepcopy(list4)[0]) # 140460465905280
Now the deep copy is completely independent from the original and the first item of
list4 is no longer related to list2.
Lists have built-in operators and functions. Like for strings, we can use the + and *
operators to concatenate and repeat lists:
list2 # [1, 2, 3]
list3[:-1] # [1, 2.72, ’Perl’, ’Python’]
[1, 2, 3] + [’a’, ’b’] # [1, 2, 3, ’a’, ’b’]
list2[:2] + list3[2:-1] # [1, 2, ’Perl’, ’Python’]
list2 * 2 # [1, 2, 3, 1, 2, 3]
[0.0] * 4 # Initializes a list of four 0.0s
# [0.0, 0.0, 0.0, 0.0]
To know all the functions associated with a type, we can use dir(), as in:
dir(list)
or
dir(str)
To read the documentation of a specific function, we use help(), as in:
help(list.append)
2.6.4 Tuples
Tuples are sequences enclosed in parentheses. They are very similar to lists, except
that they are immutable. Once created, we access the elements of a tuple, including
slices, using the same notation as with the lists.
tuple1 = () # An empty tuple
tuple1 = tuple() # Another way to create an empty tuple
tuple2 = (1, 2, 3, 4)
tuple2[3] # 4
tuple2[1:4] # (2, 3, 4)
tuple2[3] = 8 # Type error: Tuples are immutable
2.6.5 Sets
Sets are collections that have no duplicates. We create a set with a sequence enclosed
in curly braces or an empty set with the set() function:
set1 = set() # An empty set
set2 = {’a’, ’b’, ’c’, ’c’, ’b’} # {’a’, ’b’, ’c’}
type(set2) # <class ’set’>
We can then add and remove elements with the add() and remove() functions:
set2.add(’d’) # {’a’, ’b’, ’c’, ’d’}
set2.remove(’a’) # {’b’, ’c’, ’d’}
Sets are useful to extract the unique elements of lists or strings as in:
list9 = [’a’, ’b’, ’c’, ’c’, ’b’]
set3 = set(list9) # {’a’, ’b’, ’c’}
iliad_chars = set(iliad_opening.lower())
# Set of unique characters of the iliad_opening string
Sets are unordered. We can create a sorted list of them using sorted() as in:
>>> sorted(iliad_chars)
[’\n’, ’ ’, ’,’, ’.’, ’a’, ’b’, ’c’, ’d’, ’e’, ’f’,
’g’, ’h’, ’i’, ’l’, ’n’, ’o’, ’p’, ’r’, ’s’, ’t’, ’u’]
2.6.7 Dictionaries
Dictionaries are collections, where the values are indexed by keys instead of ordered
positions, like in lists or tuples. Counting the words of a text is a very frequent
operation in natural language processing, as we will see in the rest of this book.
Dictionaries are the most appropriate data structures to carry this out, where we use
the keys to store the words and the values to store the counts.
We create a dictionary by assigning it a set of initial key-value pairs, possibly
empty, where keys and values are separated by a colon, and then adding keys and
values using the same syntax as with the lists. The statements:
wordcount = {} # We create an empty dictionary
wordcount = dict() # Another way to create a dictionary
wordcount[’a’] = 21 # The key ’a’ has value 21
wordcount[’And’] = 10 # ’And’ has value 10
wordcount[’the’] = 18
create the dictionary wordcount and add three keys: a, And, the, whose values are
21, 10, and 18.
We refer to the whole dictionary using the notation wordcount.
wordcount # {’the’: 18, ’a’: 21, ’And’: 10}
type(wordcount) # <class ’dict’>
The order of the keys is not defined at run-time and we cannot rely on it.
The values of the resulting dictionary can be accessed by their keys with the same
syntax as with lists:
wordcount[’a’] # 21
wordcount[’And’] # 10
A dictionary entry is created when a value is assigned to it. Its existence can be
tested using the in Boolean function:
’And’ in wordcount # True
’is’ in wordcount # False
Just like indices for lists, the key must exist to access it, otherwise it generates an
error:
wordcount[’is’] # Key error
To access a key in a dictionary without risking an error, we can use the get()
function that has a default value if the key is undefined:
• get(’And’) returns the value of the key or None if undefined;
• get(’is’, val) returns the value of the key or val if undefined.
as in:
wordcount.get(’And’) # 10
wordcount.get(’is’, 0) # 0
wordcount.get(’is’) # None
The collections module provides a more radical way to handle missing keys:
defaultdict, a dictionary subclass that creates a default value when we access a
key that does not exist. We give the value type as argument, here int, whose default
value is 0:
from collections import defaultdict

missing_proof = defaultdict(int)
missing_proof['the'] # 0
Dictionaries have a set of built-in functions. The most useful ones are:
• keys() returns the keys of a dictionary;
• values() returns the values of a dictionary;
• items() returns the key-value pairs of a dictionary.
A few examples:
wordcount.keys() # dict_keys([’the’, ’a’, ’And’])
wordcount.values() # dict_values([18, 21, 10])
wordcount.items() # dict_items([(’the’, 18), (’a’, 21),
# (’And’, 10)])
Keys can be strings, numbers, or immutable structures. Mutable keys, like a list,
will generate an error:
my_dict = {}
my_dict[(’And’, ’the’)] = 3 # OK, we use a tuple
my_dict[[’And’, ’the’]] = 3 # Type error:
# unhashable type: ’list’
Let us finish with a program that counts the letters of a text. We use the for in
statement to scan the iliad_opening text set in lowercase letters; we increment
the frequency of the current letter if it is in the dictionary or we set it to 1, if we have
not seen it before.
The complete program is:
letter_count = {}
for letter in iliad_opening.lower():
if letter in alphabet:
if letter in letter_count:
letter_count[letter] += 1
else:
letter_count[letter] = 1
resulting in:
>>> letter_count
{’g’: 4, ’s’: 10, ’o’: 8, ’u’: 4, ’h’: 6, ’c’: 3, ’l’: 6,
’a’: 6, ’t’: 6, ’d’: 2, ’e’: 9, ’b’: 1, ’p’: 2, ’f’: 2,
’r’: 2, ’n’: 6, ’i’: 3}
To print the result in alphabetical order, we extract the keys; we sort them; and
we print the key-value pairs. We do all this with this loop:
for letter in sorted(letter_count.keys()):
print(letter, letter_count[letter])
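As a side note, the get() function described earlier lets us write such an update in
a single line and drop the inner if/else; a possible variant of the counting loop:
letter_count = {}
for letter in iliad_opening.lower():
    if letter in alphabet:
        letter_count[letter] = letter_count.get(letter, 0) + 1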
In Python, the control flow statements include conditionals, loops, exceptions, and
functions. These statements consist of two parts, the header and the suite. The header
starts with a keyword like if, for, or while and ends with a colon. The suite
consists of the statement sequence controlled by the header; we have seen that the
statements in the suite must be indented with four spaces.
At this point, we may wonder how we can break expressions in multiple lines, for
instance to improve the readability of a long list or long arithmetic operations. The
answer is to make use of parentheses, square or curly brackets. A statement inside
parentheses or brackets is equivalent to a unique logical line, even if it contains line
breaks.
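For instance, each of the two statements below forms a single logical line despite
the line breaks:
weekdays = ['Monday', 'Tuesday', 'Wednesday',
            'Thursday', 'Friday']
total = (1 + 2 + 3 +
         4 + 5 + 6)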
2.7.1 Conditionals
Python expresses conditions with the if, elif, and else statements as in:
digits = ’0123456789’
punctuation = ’.,;:?!’
char = ’.’
if char in alphabet:
print(’Letter’)
elif char in digits:
print(’Number’)
elif char in punctuation:
print(’Punctuation’)
else:
print(’Other’)
A for loop combined with the range() function, which generates a sequence of
integers, lets us, for instance, sum the integers from 0 to 99:
sum = 0
for i in range(100):
    sum += i
print(sum) # Sum of integers from 0 to 99: 4950
# Using the built-in sum() function,
# sum(range(100)) would produce the same result.
The range() behavior is comparable to that of a list, but as a list will grow
with its length, range() will use a constant memory. Nonetheless, we can convert
a range into a list:
list10 = list(range(5)) # [0, 1, 2, 3, 4]
We have seen how to iterate over a list and over indices using range(). Should
we want to iterate over both, we can use the enumerate() function. enumerate()
takes a sequence as argument and returns a sequence of (index, element) pairs,
where element is an element of the sequence and index, its index.
We can use enumerate() to get the letters of the alphabet and their index with
the program:
for idx, letter in enumerate(alphabet):
print(idx, letter)
that prints:
0 a
1 b
2 c
3 d
4 e
5 f
...
The while loop is an alternative to for, although less frequent in Python programs.
This loop executes a block of statements as long as a condition is true. We can
rewrite the previous sum with it:
sum, i = 0, 0
while i < 100:
    sum += i
    i += 1
print(sum)
Another possible structure is to use an infinite loop and a break statement to exit
the loop:
sum, i = 0, 0
while True:
sum += i
i += 1
if i >= 100:
break
print(sum)
2.7.4 Exceptions
Python has a mechanism to handle errors so that they do not stop a program. It
uses the try and except keywords. We saw in Sect. 2.5 that the conversion of the
alphabet and ’12.0’ strings into integers prints an error and exits the program.
We can handle it safely with the try/except construct:
try:
int(alphabet)
int(’12.0’)
except:
pass
print(’Cleared the exception!’)
where pass is an empty statement serving as a placeholder for the except block.
It is also possible, and better, to tell except to catch specific exceptions as in:
try:
int(alphabet)
int(’12.0’)
except ValueError:
print(’Caught a value error!’)
except TypeError:
print(’Caught a type error!’)
that prints:
Caught a value error!
2.8 Functions
We define a function in Python with the def keyword and we use return to return
the results. In Sect. 2.6.7, we wrote a small program to count the letters of a text.
Let us create a function from it that accepts any text instead of iliad_opening.
We also add a Boolean, lc, to set or not the text in lowercase:
def count_letters(text, lc):
letter_count = {}
if lc:
text = text.lower()
for letter in text:
if letter.lower() in alphabet:
if letter in letter_count:
letter_count[letter] += 1
else:
letter_count[letter] = 1
return letter_count
If most of the calls use a parameter with a specific value, we can use it as default
with the notation:
def count_letters(text, lc=True):
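With this default, we can call the function with or without the lc argument, for
instance on the iliad_opening string from Sect. 2.4:
count_letters(iliad_opening) # lowercases the text before counting
count_letters(iliad_opening, lc=False) # keeps uppercase and lowercase apart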
2.9.1 Docstrings
When writing large programs, it is essential to document the code with comments. In
the case of functions, Python has a specific documentation string called a docstring:
A string just below the function name that usually consists of a description of the
function, its arguments, and what the function returns as in:
def count_letters(text, lc=True):
"""
Count the letters in a text
Arguments:
text: input text
lc: lowercase. If true, sets the characters
in lowercase
Returns: The letter counts
"""
...
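Once the function is defined, we can read this documentation back:
help(count_letters) # pretty-prints the docstring
count_letters.__doc__ # returns the raw docstring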
2.10 Comprehensions and Generators

2.10.1 Comprehensions

List comprehensions are a compact notation to build a list from a sequence in a
single expression. As an example, let us split a word into all its possible pairs of a
prefix and a rest:
word = 'acress'
splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
where we iterate over the sequence of character indices and we create pairs
consisting of a prefix and a rest.
If the input word is acress, the resulting list in splits is:
[(’’, ’acress’), (’a’, ’cress’), (’ac’, ’ress’),
(’acr’, ’ess’), (’acre’, ’ss’), (’acres’, ’s’), (’acress’, ’’)]
Then, we apply the deletions, where we concatenate the prefix and the rest
deprived from its first character. We check that the rest is not an empty list:
deletes = [a + b[1:] for a, b in splits if b]
We can create set and dictionary comprehensions the same way by replacing the
enclosing square brackets with curly braces: {}.
2.10.2 Generators
List comprehensions are stored in memory. If the list is large, it can exceed the
computer capacity. Generators generate the elements on demand instead and can
handle much longer sequences.
Generators have a syntax that is identical to the list comprehensions except that
we replace the square brackets with parentheses:
splits_generator = ((word[:i], word[i:])
for i in range(len(word) + 1))
We can iterate over this generator exactly as with a list. The statement:
for i in splits_generator: print(i)
prints
(’’, ’acress’)
(’a’, ’cress’)
(’ac’, ’ress’)
(’acr’, ’ess’)
(’acre’, ’ss’)
(’acres’, ’s’)
(’acress’, ’’)
However, this iteration can only be done once. We need to create the generator again
to retraverse the sequence.
Finally, we can also use functions to create generators. We replace the return
keyword with yield to do this, as in the function:
def splits_generator_function():
for i in range(len(word) + 1):
yield (word[:i], word[i:])
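Calling this function returns a generator that we can iterate over exactly as before:
for split in splits_generator_function():
    print(split)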
2.10.3 Iterators
We just saw that we can iterate only once over a generator. Objects with this property
in Python are called iterators. We can think of an iterator as an open file (a stream
of data), where we read sequentially the elements until we reach the end of the file
and no data is available.
We can create iterators from certain existing objects such as strings, lists, or
tuples with the iter() function and return the next item with the next() function
as in:
my_iterator = iter(’abc’)
next(my_iterator) # a
next(my_iterator) # b
next(my_iterator) # c
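When the iterator is exhausted, a further call to next() raises a StopIteration
exception:
next(my_iterator) # raises StopIteration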
2.10.4 zip
Let us examine now a useful iterator: zip(). Let us first create three strings with
the Latin, Greek, and Russian Cyrillic alphabets:
latin_alphabet = ’abcdefghijklmnopqrstuvwxyz’
len(latin_alphabet) # 26
greek_alphabet = ’αβγδεζηθικλμνξοπρστυφχψω’
len(greek_alphabet) # 24
cyrillic_alphabet = ’абвгдеёжзийклмнопрстуфхцчшщъыьэюя’
len(cyrillic_alphabet) # 33
zip() weaves strings, lists, or tuples and creates an iterator of tuples, where
each tuple contains the items with the same index: latin_alphabet[0] and
greek_alphabet[0], latin_alphabet[1] and greek_alphabet[1], and so
on. If the strings are of different sizes, zip() will stop at the shortest.
The following code applies zip() to the three first letters of our alphabets:
la_gr = zip(latin_alphabet[:3], greek_alphabet[:3])
la_gr_cy = zip(latin_alphabet[:3], greek_alphabet[:3],
cyrillic_alphabet[:3])
Once created, we access the items with next() as we saw in our first example:
next(la_gr) # (’a’, ’α’)
next(la_gr) # (’b’, ’β’)
next(la_gr) # (’c’, ’γ’)
We can also convert the whole iterator to a list with the list() function:
list(la_gr_cy)
# [('a', 'α', 'а'), ('b', 'β', 'б'), ('c', 'γ', 'в')]
We must be aware that the list conversion runs the iterator through the sequence,
and if we try to convert la_gr_cy a second time, we just get an empty list:
list(la_gr_cy) # []
To restore the original alphabet sequences, we can use zip(*) as an inverse function.
As the previous iterator is exhausted, we first re-create it:
la_gr_cy = zip(latin_alphabet[:3], greek_alphabet[:3],
cyrillic_alphabet[:3])
# (’a’, ’α’, ’а’), (’b’, ’β’, ’б’), (’c’, ’γ’, ’в’)
zip(*la_gr_cy)
# (’a’, ’b’, ’c’), (’α’, ’β’, ’γ’), (’а’, ’б’, ’в’)
We can use this zip(*) construct to transpose a list of lists efficiently, that is to
move each item of index [i][j] in the input to index [j][i] in the output. This
operation is analogous to the transpose of a matrix that we will see in Sect. 5.5.8.
We first store the zipped tuples in a list:
la_gr_cy_list = list(zip(latin_alphabet[:3], greek_alphabet[:3],
                         cyrillic_alphabet[:3]))
la_gr_cy_list
# [('a', 'α', 'а'), ('b', 'β', 'б'), ('c', 'γ', 'в')]
list(zip(*la_gr_cy_list))
# [(’a’, ’b’, ’c’), (’α’, ’β’, ’γ’), (’а’, ’б’, ’в’)]
2.11 Modules
Python comes with a very large set of libraries called modules like, for example,
the math module that contains a set of mathematical functions. We load a module
with the import keyword and we use its functions with the module name as a prefix
followed by a dot:
import math
math.sqrt(2) # 1.4142135623730951
math.sin(math.pi/2) # 1.0
math.log(8, 2) # 3.0
type(math) # <class ’module’>
Modules are just files, whose names are the module names with the .py suffix.
To import a file, Python searches first the standard library, the files in the current
folder, and then the files in PYTHONPATH.
When Python imports a module, it executes its statements just as when we run:
$ python module.py
If we want to have a different execution when we run the program from the
command line and when we import it, we need to include this condition:
if __name__ == ’__main__’:
print("Running the program")
# Other statements
else:
print("Importing the program")
# Other statements
The first member is executed when the program is run from the command line and
the second one, when we import it.
Python comes with a standard library of modules like math. Although this library
is comprehensive, in the next chapters we will also use external libraries that are
not part of the standard release, such as the regex module in Chap. 4, Encoding and
Annotation Schemes. We can use pip, the Python package manager, to install the
modules we need: pip will retrieve them from the Python package index (PyPI) and
install them for us.
To install regex, we just run the command:
$ pip install regex
or
$ python -m pip install regex
Python has a set of built-in input/output functions to read and write files: open(),
read(), write(), and close(). Before you run the code below, you will need
a file. To follow the example, download Homer’s Iliad and Odyssey from the
department of classics at the Massachusetts Institute of Technology (MIT): https://
classics.mit.edu/ and store them on your computer.
The next lines open and read the iliad.mb.txt file that contains Homer's Iliad:
try:
f_iliad = open(’iliad.mb.txt’, ’r’, encoding=’utf-8’)
iliad_txt = f_iliad.read()
f_iliad.close()
except:
pass
where open() opens a file in the read-only mode, r, and returns a file object;
read() reads the entire content of the file and returns a string; and close() closes
the file object. In the code above, we used a try-except block in case the file does
not exist or we cannot read it.
It is easy to forget to close a file. The with statement is a shorthand to close it
automatically after the block. In the code below, we count the letters in the text with
count_letter() and we store the results in the iliad_stats.txt file: open()
creates a new file using the write mode, w, and write() writes the results as a string.
iliad_stats = count_letters(iliad_txt)
with open(’iliad_stats.txt’, ’w’) as f:
f.write(str(iliad_stats))
# we automatically close the file
In addition to these base functions, Python has modules to read and write a large
variety of file formats.
In the previous section, we used Homer’s Iliad that we manually downloaded from
the department of classics at the MIT and we stored it locally in the iliad.mb.txt
file. In practice, it is much easier to do it automatically with Python’s requests
module. Texts from classical antiquity are easily available. For example, in addition
to Homer, the MIT maintains a corpus of more than 400 works from Greek and
Latin authors translated into English. With the code below, we will collect five texts
from this corpus, Homer’s Iliad and Odyssey and Virgil’s Eclogue, Georgics, and
Aeneid and store them on our computer.
For this, we first create a dictionary with their internet addresses:
classics_url = {
’iliad’:’http://classics.mit.edu/Homer/iliad.mb.txt’,
’odyssey’:’http://classics.mit.edu/Homer/odyssey.mb.txt’,
’eclogue’:’http://classics.mit.edu/Virgil/eclogue.mb.txt’,
’georgics’:’http://classics.mit.edu/Virgil/georgics.mb.txt’,
’aeneid’:’http://classics.mit.edu/Virgil/aeneid.mb.txt’}
We then download the texts with the requests module and get() method. We
access the text content with the text attribute. We store the corpus in a dictionary
where the key is the title and the value is the text:
import requests
classics = {}
for key in classics_url:
classics[key] = requests.get(classics_url[key]).text
All the texts from the MIT contain license information at the beginning and at the
end that should not be part of the statistics. The text itself is enclosed between two
dashed lines. We extract it with a regular expression and the re.search() function.
We will describe them extensively in Chap. 3, Corpus Processing Tools. For now,
we just run this code:
import regex as re
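A minimal sketch of such an extraction is shown below; the dashed-line pattern is
an assumption and the code used later in the book may differ:
# Keep only the text enclosed between the two dashed lines (hypothetical pattern)
for key in classics:
    match = re.search(r'^-+$(.*?)^-+$', classics[key], re.M | re.S)
    if match:
        classics[key] = match.group(1).strip()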
We can now save the corpus on our machine. A first possibility is to write the
content in text files:
with open(’iliad.txt’, ’w’) as f_il, \
open(’odyssey.txt’, ’w’) as f_od:
f_il.write(classics[’iliad’])
f_od.write(classics[’odyssey’])
This will not preserve the dictionary structure, however, and a better way is to
use the Javascript object notation (JSON), a convenient data format, close to that of
Python dictionaries, that will keep the structure. Here, we output the corpus in the
classics.json file:
import json

with open('classics.json', 'w') as f:
    json.dump(classics, f)

2.15.1 Memo Functions

Memo functions are functions that remember a result instead of computing it. This
process is also called memoization. The Fibonacci series is a case, where memo
functions provide a dramatic execution speed up.
The Fibonacci sequence is defined by the recurrence relation:
F(n) = F(n - 1) + F(n - 2), with F(1) = F(2) = 1.
A direct recursive implementation follows this definition:
def fibonacci(n):
    if n == 1: return 1
    elif n == 2: return 1
    else:
        return fibonacci(n - 1) + fibonacci(n - 2)
However, this function has an expensive double recursion that we can drastically
improve by storing the results in a dictionary. This store, f_numbers, will save an
exponential number of recalculations:
f_numbers = {}
def fibonacci2(n):
if n == 1: return 1
elif n == 2: return 1
elif n in f_numbers:
return f_numbers[n]
else:
f_numbers[n] = fibonacci2(n - 1) + fibonacci2(n - 2)
return f_numbers[n]
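A quick check of the result:
fibonacci2(10) # 55
fibonacci2(30) # 832040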
2.15.2 Decorators
Python decorators are syntactic notations to simplify the writing of memo functions
(they can be used for other purposes too).
Decorators need a generic memo function to cache the results already computed.
Let us define it:
def memo_function(f):
cache = {}
def memo(x):
if x in cache:
return cache[x]
else:
cache[x] = f(x)
return cache[x]
return memo
Using this memo function, we can redefine fibonacci() with the statement:
fibonacci = memo_function(fibonacci)
that results in memo() being assigned to the fibonacci() function. When we call
fibonacci(), we in fact call memo() that will either lookup the cache or call the
original fibonacci() function.
One detail may be puzzling: How does the new function know of the cache
variable and its initialization as well as the value of the f argument, the original
fibonacci() function? This is because Python implements a closure mechanism
that gives the inner functions access to the local variables of their enclosing function.
Now the decorators: Python provides a short notation for memo functions;
instead of writing:
fibonacci = memo_function(fibonacci)
we just place the @memo_function decorator on the line above the function
definition:
@memo_function
def fibonacci(n):
    ...
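As a side note, the standard library also provides a ready-made memoizing
decorator, functools.lru_cache; a minimal sketch, where the function name
fibonacci_lru is ours:
from functools import lru_cache

@lru_cache(maxsize=None)
def fibonacci_lru(n):  # hypothetical name for this illustration
    if n == 1 or n == 2:
        return 1
    return fibonacci_lru(n - 1) + fibonacci_lru(n - 2)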
2.16 Object-Oriented Programming

We define our own classes with the class keyword. In Sect. 2.8, we wrote a
count_letters() function that basically is to be applied to a text. Let us reflect
this with a Text class and let us encapsulate this function as a method in this class.
In addition, we give the Text class four variables: The content, its length, and the
letter counts, which will be specific to each object and the alphabet string that
will be shared by all the objects. We say that alphabet is a class variable while
content, length, and letter_counts are instance variables.
We encapsulate a function by inserting it as a block inside the class. Among the
methods, one of them, the constructor, is called at the creation of an object. It has
the __init__() name. This notation in Python is, unfortunately, not as intuitive
as the rest of the language, and we need to add an extra self parameter to the
methods as well as to the instance variables. This self keyword denotes the object
itself. We use __init__() to assign an initial value to the content, length, and
letter_counts variables.
Finally, we have the class:
class Text:
    """Text class to hold and process text"""

    alphabet = 'abcdefghijklmnopqrstuvwxyz'

    def __init__(self, text: str = None):
        self.content = text
        self.length = len(text)
        self.letter_counts = {}

    def count_letters(self, lc: bool = True) -> dict:
        """Function to count the letters of a text"""
        letter_counts = {}
        if lc:
            text = self.content.lower()
        else:
            text = self.content
        for letter in text:
            if letter.lower() in self.alphabet:
                if letter in letter_counts:
                    letter_counts[letter] += 1
                else:
                    letter_counts[letter] = 1
        self.letter_counts = letter_counts
        return letter_counts
Finally, we added docstrings and signatures to the class and its methods. We
access the docstring using the .__doc__ variable as in:
Text.__doc__ # 'Text class to hold and process text'
Text.count_letters.__doc__
# 'Function to count the letters of a text'
2.16.2 Subclassing
Using classes, we can build a hierarchy, where the subclasses will inherit methods
from their superclass parents.
Let us create a Word class that we define as a subclass of Text. Each word has
a type, called a part of speech, such as verb, noun, pronoun, adjective, etc. Let us
add this part of speech as an instance variable part_of_speech and let us add an
annotate() function to assign a word with its part of speech. We have the new
class:
class Word(Text):

    def __init__(self, word: str = None):
        super().__init__(word)
        self.part_of_speech = None
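The annotate() method announced above is not shown in this excerpt; a minimal sketch, assuming it simply stores the part-of-speech tag, could be:

# Inside the Word class (a hypothetical completion)
def annotate(self, part_of_speech: str):
    """Assign the word its part of speech"""
    self.part_of_speech = part_of_speech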
We can now create a Word object, for instance word = Word('Muse'), and apply the
inherited methods:
word.count_letters(lc=False)
# {'M': 1, 'u': 1, 's': 1, 'e': 1}
After annotating the word with its part of speech, we have:
word.part_of_speech # Noun
The Text class we have created has a built-in equivalent in Python: Counter. It is a
subclass of dict with counting capabilities that we can apply to any iterable: string,
list, tuple, etc. We create a counter by passing it the iterable as an argument:
from collections import Counter
char_cnts = Counter(odyssey_opening)
char_cnts
# Counter({' ': 15,
#          'e': 9,
#          's': 9,
#          ...
#          '.': 1})
It returns a Counter object, here char_cnts, where the keys are the characters of
the string or the items of the list and the values are the counts. Counter has all the
methods of dictionaries and a few other ones. We have notably most_common(n)
that returns the n most common items and total() that returns the sum of the
values:
char_cnts.most_common(3) # [(' ', 15), ('e', 9), ('s', 9)]
char_cnts.total() # 100
Python provides some functional programming mechanisms with map and reduce
functions.
2.17.1 map()
map() enables us to apply a function to all the elements of an iterable, a list for
instance. The first argument of map() is the function to apply and the second one,
the iterable. map() returns an iterator.
Let us use map() to compute the length of a sequence of texts, in our case, the
first sentences of the Iliad and the Odyssey. We apply len() to the list of strings
and we convert the resulting iterator to a list to print it.
odyssey_opening = """Tell me, O Muse, of that many-sided hero who
traveled far and wide after he had sacked the famous town
of Troy."""
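The map() call itself is not reproduced here. A sketch, assuming iliad_opening holds the opening sentence of the Iliad defined earlier in this chapter:

list(map(len, [iliad_opening, odyssey_opening]))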
Let us now suppose that we have a list of files instead of strings, here iliad.txt
and odyssey.txt. To deal with this list, we can replace len() in map() with a
function that reads a file and computes its length:
def file_length(file: str) -> int:
return len(open(file).read())
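Applied to the two files we saved earlier, this could give (a sketch; the counts are those reported later in this section):

files = ['iliad.txt', 'odyssey.txt']
list(map(file_length, files)) # [807485, 610483]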
For such a short function, a lambda expression can do the job more compactly.
A lambda is an anonymous function, denoted with the lambda keyword, followed
by the function parameters, a colon, and the returned expression. To compute the
length of a file, we write the lambda:
lambda file: len(open(file).read())
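Used inside map(), this lambda is equivalent to file_length() (a sketch):

list(map(lambda file: len(open(file).read()), files))
# [807485, 610483]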
We can return multiple values using tuples. If we want to both keep the text and
its length in the form of a pair, (text, length), we just write:
text_lengths = (
    map(lambda x: (open(x).read(), len(open(x).read())),
        files))
text_lengths = list(text_lengths)
[text_lengths[0][1], text_lengths[1][1]] # [807485, 610483]
In the previous piece of code, we had to read the text twice: In the first element
of the pair and in the second one. We can use two map() calls instead: One to read
the files and a second to compute the lengths. This results in:
text_lengths = (
    map(lambda x: (x, len(x)),
        map(lambda x: open(x).read(), files)))
text_lengths = list(text_lengths)
[text_lengths[0][1], text_lengths[1][1]] # [807485, 610483]
2.17.3 reduce()
reduce() applies a function of two arguments cumulatively to the elements of an
iterable and reduces it to a single value. Here, we use it to sum the consecutive
elements, where the length of each file is the second element in the pair, the first
one being the text.
reduce() is part of the functools module and we have to import it. The
resulting code is:
import functools
# Note: the lambda below assumes exactly two (text, length) pairs
char_count = functools.reduce(
    lambda x, y: x[1] + y[1],
    map(lambda x: (x, len(x)),
        map(lambda x: open(x).read(), files)))
char_count # 1417968
2.17.4 filter()
filter() is a third function that we can use to keep the elements of an iterable that
satisfy a condition. filter() has two arguments: A function, possibly a lambda,
and an iterable. It returns the elements of the iterable for which the function returns true.
As an example of the filter() function, let us write a piece of code to extract
and count the lowercase vowels of a text.
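For a single string, a sketch of such an extraction could be:

vowels = ''.join(filter(lambda x: x in 'aeiou', odyssey_opening))
len(vowels)  # the number of lowercase vowels in the string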
We finally count the vowels in the two files using len() that we apply with a second
map():
list(map(len,
         map(lambda y:
             ''.join(filter(lambda x: x in 'aeiou',
                            open(y).read())),
             files))) # [230624, 176061]
2.18 Further Reading

Python has become very popular, and there are plenty of good books and tutorials to
complement this introduction. Python.org is the official site of the Python Software
Foundation, where one can find the latest Python releases, documentation, tutorials,
news, etc. It also contains masses of pointers to Python resources. One strength of
Python is the large number of libraries or modules available for it. Anaconda is a
Python distribution that includes many of them.1
Python comes with an integrated development environment (IDE) called IDLE
that fulfills basic needs. PyCharm is a more elaborate code editor with a beautiful
interface. It has a free community edition.2 Visual Studio Code is another
development environment. IPython is an interactive computing platform, where the
programmer can mix code and text in the form of notebooks.
There is an impressive number of teaching resources on Python that range from
introductions to very detailed or domain-oriented documents. Books by Matthes
1 https://www.anaconda.com/download/
2 https://www.jetbrains.com/pycharm/
(2019), Kong et al. (2020), and Lutz (2013), an older but valuable reference, are
three examples of this variety. There are also many free and high-quality online
courses from universities, pedagogical organizations, companies, or individuals.
Finally, the epigraph of this chapter inspires the entire book. Its origin is disputed,
though. For a discussion, see the blog post by Andrew Huang:3
Tracing the Origins of “I hear and I forget. I see and I remember. I do and I understand” –
Probably Wrongly Attributed to Confucius.
3 https://drandrewhuang.wordpress.com/2021/05/24/tracing-the-origins-of-i-hear-and-i-forget-i-
see-and-i-remember-i-do-and-i-understand-probably-wrongly-attributed-to-confucius/
Chapter 3
Corpus Processing Tools
3.1 Corpora
Table 3.1 List of the most frequent words in present texts and in the book of Genesis. After Crystal (1997)

                                       English  French        German
Most frequent words in a collection    the      de            der
of contemporary running texts          of       le (article)  die
                                       to       la (article)  und
                                       in       et            in
                                       and      les           des
Most frequent words in Genesis         and      et            und
                                       the      de            die
                                       of       la            der
                                       his      à             da
                                       he       il            er
Some corpora focus on specific genres: law, science, novels, news broadcasts,
electronic correspondence, or transcriptions of telephone calls or conversations.
Others try to gather a wider variety of running texts. Texts collected from a unique
source, say from scientific magazines, will probably be slanted toward some specific
words that do not appear in everyday life. Table 3.1 compares the most frequent
words in the book of Genesis and in a collection of contemporary running texts and
gives an example of such a discrepancy. The choice of documents to include in a
corpus must then be varied to survey language usage comprehensively and
accurately. This process is referred to as balancing a corpus.
Balancing a corpus is a difficult and costly task. It requires collecting data from
a wide range of sources: fiction, newspapers, technical, and popular literature.
Balanced corpora extend to spoken data. The Linguistic Data Consortium (LDC)
from the University of Pennsylvania and the European Language Resources Asso-
ciation (ELRA), among other organizations, distribute written and spoken corpus
collections. They feature samples of magazines, laws, parallel texts in English,
French, German, Spanish, Chinese, Arabic, telephone calls, radio broadcasts, etc.
In addition to raw texts, some corpora are annotated. Each of their documents,
paragraphs, sentences, and possibly words is labeled with a semantic category or a
linguistic tag, for instance a sentiment for a paragraph or a part of speech for a
word. The annotation is done either manually or semiautomatically. Spoken corpora
contain the transcription of spoken conversations. This transcription may be aligned
with the speech signal and sometimes includes prosodic annotation: pause, stress,
etc. Annotation tags, paragraph and sentence boundaries, parts of speech, syntactic
or semantic categories follow a variety of standards, which are called markup
languages.
Among annotated corpora, parsed corpora deserve a specific mention. They
are collections of syntactic structures of sentences. The production of a parsed
corpus requires a substantial manual annotation effort.
Lexicons and dictionaries are intended to give word lists, to provide a reader with
word senses and meanings, and to outline their usage. Dictionaries’ main purpose
is related to lexical semantics. Lexicography is the science of building lexicons and
writing dictionaries. It uses electronic corpora extensively.
The basic data of a dictionary is a word list. Such lists can be drawn manually
or automatically from corpora. Then, lexicographers write the word definitions and
choose citations illustrating the words. Since most of the time, current meanings
are obvious to the reader, meticulous lexicographers tended to collect examples—
citations—reflecting a rare usage. Computerized corpora can help lexicographers
avoid this pitfall by extracting all the citations that exemplify a word. An expe-
rienced lexicographer will then select the most representative examples that reflect
the language with more relevance. S/he will prefer and describe more frequent usage
and possibly set aside others.
Finding a citation involves sampling a fragment of text surrounding a given
word. In addition, the context of a word can be more precisely measured by finding
recurrent pairs of words, or most-frequent neighbors. The first process results in
concordance tables, and the second one in collocations.
A concordance is an alphabetical index of all the words in a text, or the most
significant ones, where each word is related to a comprehensive list of passages
where the word is present. Passages may start with the word or be centered on it and
surrounded by a limited number of words before and after it (Table 3.2 and incipit
of this chapter). Furthermore, concordances feature a system of reference to connect
each passage to the book, chapter, page, paragraph, or verse, where it occurs.
Table 3.2 Concordance of miracle in the Gospel of John. English text: King James version;
French text: Augustin Crampon; German text: Luther’s Bible
Language Concordances
English l now. This beginning of miracles did Jesus in Cana of Ga
name, when they saw the miracles which he did. But Jesus
for no man can do these miracles that thou doest, except
This is again the second miracle that Jesus did, when he
im, because they saw his miracles which he did on them th
French Galilée, le premier des miracles que fit Jésus, et il ma
que, beaucoup voyant les miracles qu’il faisait, crurent
nne ne saurait faire les miracles que vous faites, si Die
maison. Ce fut le second miracle que fit Jésus en revenan
parce qu’elle voyait les miracles qu’il opérait sur ceux
German alten. Das ist das erste Zeichen, das Jesus tat, geschehe
as zeigst du uns für ein Zeichen, daß du dies tun darfst?
seinen Namen, da sie die Zeichen sahen, die er tat. Aber
n; denn niemand kann die Zeichen tun, die du tust, es sei
h zu ihm: Wenn ihr nicht Zeichen und Wunder seht, so glau
Concordance tables were first produced for antiquity and religious studies. Hugh
of St-Cher is known to have directed the first concordance to the scriptures in
the thirteenth century. It comprised about 11,800 words ranging from A, a, a. to
Zorobabel and 130,000 references (Rouse and Rouse, 1974). Other more elaborate
concordances take word morphology into account or group words together into
semantic themes. d’Arc (1970) produced an example of such a concordance for
Bible studies.
Concordancing is a powerful tool to study usage patterns and to write definitions.
It also provides evidence on certain preferences between verbs and prepositions,
adjectives and nouns, recurring expressions, or common syntactic forms. These
couples are referred to as collocations. Church and Mercer (1993) cite a striking
example of idiosyncratic collocations of strong and powerful. While strong and
powerful have similar definitions, they occur in different contexts, as shown in
Table 3.3.
Table 3.3 Comparing strong and powerful. The German words eng and schmal ‘narrow’ are near-synonyms, but have different collocates

                English            French               German
You say         Strong tea         Thé fort             Schmales Gesicht
                Powerful computer  Ordinateur puissant  Enge Kleidung
You don't say   Strong computer    Thé puissant         Schmale Kleidung
                Powerful tea       Ordinateur fort      Enges Gesicht
Table 3.4 Word preferences of strong and powerful collected from the Associated Press corpus.
Numbers in columns indicate the number of collocation occurrences with word w. After Church
and Mercer (1993)
Preference for strong over powerful          Preference for powerful over strong
strong w    powerful w    w                  strong w    powerful w    w
161         0             showing            1           32            than
175         2             support            1           32            figure
106         0             defense            3           31            minority
...
Table 3.4 shows additional collocations of strong and powerful. These word
preferences cannot be explained using rational definitions, but can be observed in
corpora. A variety of statistical tests can measure the strength of pairs, and we can
extract them automatically from a corpus.
Learn from annotated corpora. With machine-learning techniques, annotated
corpora enable us to train the model parameters and identify those that influence
its performance the most.
Learn from raw corpora. We saw that raw corpora could help us identify the
most frequent usage of a word. Even if there is no annotation, they will also
enable us to learn models. A common training procedure is to teach the model
to guess words missing from a sentence or following a sequence of words. These
language models, if derived from large enough corpora, will eventually capture
the semantics of words.
As a summary, corpora, whether annotated or not, form the raw material of
natural language processing. They are repositories from which models can derive
language rules and encapsulate human knowledge.
3.2 Finite-State Automata

3.2.1 A Description
The most frequent operation we do with corpora consists in searching for words or
phrases. To be convenient, search must extend beyond fixed strings. We may want
to search for a word or its plural form, strings consisting of uppercase or lowercase
letters, expressions containing numbers, etc. This is made possible using finite-state
automata (FSA), which we introduce now. FSA are flexible tools to process texts
and are one of the most adequate ways to search strings.
FSA theory was designed in the beginning of computer science as a model
of abstract computing machines. It forms a well-defined formalism that has been
tested and used by generations of programmers. FSA stem from a simple idea.
These are devices that accept—recognize—or reject an input stream of characters.
FSA are very efficient in terms of speed and memory occupation. In addition to
text searching, they have many other applications: morphological parsing, part-of-
speech annotation, and speech processing.
Figure 3.1 shows an automaton with three states numbered from 0 to 2, where
state q0 is called the start state, and q2, the final state. An automaton has a single
start state and any number of final states, indicated by double circles. Arcs between
states designate the possible transitions. Each arc is annotated by a label, which
means that the transition accepts or generates the corresponding character.
An automaton accepts an input string in the following way: it starts in the initial
state, follows a transition where the arc character matches the first character of the
string, consumes the corresponding string character, and reaches the destination
state. It then makes a second transition with the second character of the string,
and continues in this way until it ends up in one of the final states and there is
no character left. The automaton in Fig. 3.1 accepts or generates strings such as:
ac, abc, abbc, abbbc, abbbbbbbbbbbbc, etc. If the automaton fails to reach a final
state, either because it has no more characters in the input string or because it is
trapped in a nonfinal state, it rejects the string.
As an example, let us see how the automaton accepts string abbc and rejects
abbcb. The input abbc is presented to the start state .q0 . The first character of
the string matches that of the outgoing arc. The automaton consumes character
a and moves to state q1. The remaining string is bbc. Then, the automaton loops
twice on state q1 and consumes bb. The resulting string is character c. Finally, the
automaton consumes c and reaches state q2, which is the final state. On the contrary,
the automaton does not accept string abbcb. It moves to states q0, q1, and q2, and
consumes abbc. The remaining string is letter b. Since there is no outgoing arc with
a matching symbol, the automaton is stuck in state q2 and rejects the string.
Automata may contain ε-transitions from one state to another. In this case, the
automaton makes a transition without consuming any character of the input string.
The automaton in Fig. 3.2 accepts strings a, ab, abb, etc., as well as ac, abc, abbc,
etc.
FSA have a formal definition. An FSA consists of five components (Q, Σ, q0, F, δ),
where:
1. Q is a finite set of states.
2. Σ is a finite set of symbols or characters: the input alphabet.
3. q0 is the start state, q0 ∈ Q.
4. F is the set of final states, F ⊆ Q.
5. δ is the transition function Q × Σ → Q, where δ(q, i) denotes the state where
the automaton moves when it is in state q and consumes the input symbol i.
The quintuple defining the automaton in Fig. 3.1 is Q = {q0, q1, q2}, Σ =
{a, b, c}, F = {q2}, and δ = {δ(q0, a) = q1, δ(q1, b) = q1, δ(q1, c) = q2}. The
state-transition table in Table 3.5 is an alternate representation of the δ function.
The automaton in Fig. 3.1 is said to be deterministic (DFSA) because given a state
and an input, there is one single possible destination state. On the contrary, a
nondeterministic automaton (NFSA) has states where it has a choice: the path is
not determined in advance.
Figure 3.3 shows an example of an NFSA that accepts the strings ab, abb, abbb,
abbbb, etc. Taking abb as input, the automaton reaches the state q1 consuming the
letter a. Then, it has a choice between two states. The automaton can either move
to state q2 or stay in state q1. If it first moves to state q2, there will be one character
left, and the automaton will fail. The right path is to loop on q1 and then to move
to q2. ε-transitions also cause automata to be nondeterministic as in Fig. 3.2, where
any string that has reached state q1 can also reach state q2.
A possible strategy to deal with nondeterminism is to use backtracking. When
an automaton has the choice between two or more states, it selects one of them and
remembers the state where it made the decision: the choice point. If it subsequently
fails, the automaton backtracks to the choice point and selects another state to go to.
In our example in Fig. 3.3, if the automaton moves first to state q2 with the string
bb, it will end up in a state without outgoing transition. It will have to backtrack and
select state q1.
Fig. 3.5 An automaton to search strings ac, abc, abbc, abbbc, etc., in a text
In doing this, we have built an NFSA that it is preferable to convert into a DFSA.
Hopcroft et al. (2007) describe the mathematical properties of such automata and
an algorithm to automatically build an automaton for a given set of patterns to
search. They notably report that resulting DFSA have exactly the same number of
states as the corresponding NFSA. We present an informal solution to determine the
transitions of the automaton in Fig. 3.4.
If the input text does not begin with an a, the automaton must consume the
beginning characters and loop on the start state until it finds one. Figure 3.5
expresses this with an outgoing transition from state 0 to state 1 labeled with an
a and a loop for the rest of the characters. Σ − a denotes the finite set of symbols
except a. From state 1, the automaton proceeds if the text continues with either a
b or a c. If it is an a, the preceding a is not the beginning of the string, but there
is still a chance because it can start again. This corresponds to the second loop on
state 1. Otherwise, if the next character falls in the set Σ − {a, b, c}, the automaton
goes back to state 0. The automaton successfully recognizes the string if it reaches
state 2. Then it goes back to state 0 and starts the search again, except if the next
character is an a, for which it can go directly to state 1.
FSA can be combined using a set of operations. The most useful are the union, the
concatenation, and the closure.
The union or sum of two automata A and B accepts or generates all the strings
of A and all the strings of B. It is denoted A ∪ B. We obtain it by adding a new
initial state that we link to the initial states of A and B (Fig. 3.6) using ε-transitions
(Fig. 3.7).
The concatenation or product of A and B accepts all the strings that are
concatenations of two strings, the first one being accepted by A and the second
one by B. It is denoted A.B. We obtain the resulting automaton by connecting all
the final states of A to the initial state of B using ε-transitions (Fig. 3.8).
The iteration or Kleene closure of an automaton A accepts the concatenations
of any number of its strings and the empty string. It is denoted A∗, where A∗ =
{ε} ∪ A ∪ A.A ∪ A.A.A ∪ A.A.A.A ∪ . . . We obtain the resulting automaton by
linking the final states of A to its initial state using ε-transitions and adding a new
initial state, as shown in Fig. 3.9. The new initial state enables us to obtain the empty
string.
The notation Σ∗ designates the infinite set of all possible strings generated from
the alphabet Σ. Other significant operations are:
• The intersection of two automata A ∩ B that accepts all the strings accepted both
by A and by B. If A = (Σ, Q1, q1, F1, δ1) and B = (Σ, Q2, q2, F2, δ2), the
resulting automaton is obtained from the Cartesian product of states (Σ, Q1 ×
Q2, ⟨q1, q2⟩, F1 × F2, δ3) with the transition function δ3(⟨s1, s2⟩, i) = {⟨t1, t2⟩ |
t1 ∈ δ1(s1, i) ∧ t2 ∈ δ2(s2, i)}.
• The difference of two automata A − B that accepts all the strings accepted by A
but not by B.
• The complementation of the automaton A in Σ∗ that accepts all the strings that
are not accepted by A. It is denoted Ā, where Ā = Σ∗ − A.
• The reversal of the automaton A that accepts all the reversed strings accepted by
A.
Two automata are said to be equivalent when they accept or generate exactly the
same set of strings. Useful equivalence transformations optimize computation speed
or memory requirements. They include:
• ε-removal, which transforms an initial automaton into an equivalent one without
ε-transitions;
• minimization, which determines among equivalent automata the one that has the
smallest number of states.
Optimization algorithms are outside the scope of this book. Hopcroft et al. (2007)
as well as Roche and Schabes (1997) describe them in detail.
3.3 Regular Expressions

The automaton in Fig. 3.1 generates or accepts strings composed of one a, zero or
more b’s, and one c. We can represent this set of strings using a compact notation:
ab*c, where the star symbol means any number of the preceding character. Such
a notation is called a regular expression or regex. Regular expressions are very
powerful devices to describe patterns to search in a text. Although their notation is
different, regular expressions can always be implemented in the form of automata,
and vice versa. However, regular expressions are much easier to use.
Regular expressions are composed of literal characters, that is, ordinary text
characters, like abc, and of metacharacters, like *, that have a special meaning.
The simplest form of regular expressions is a sequence of literal characters: letters,
numbers, spaces, or punctuation signs. The regexes regular and Prolog match,
respectively, the strings regular or Prolog contained in a text. Table 3.8 shows
examples of pattern matching with literal characters. Regular expressions are case-
sensitive and match the first instance of the string or all its instances in a text,
depending on the regex language that is used.
There are currently a dozen major regular expression dialects freely available.
Their common ancestor is grep, which stands for global/regular expression/print.
grep, together with egrep, a modern version of it, is a standard Unix tool that
prints out all the lines of a file that contain a given pattern. The grep user interface
conforms to the Unix command-line style. It consists of the command name, here
grep, options, and the arguments. The first argument is the regular expression
delimited by single straight quotes. The next arguments are the files where to search
the pattern:
grep 'regular expression' file1 file2 ... filen
For instance,
grep 'abc' myFile
prints all the lines of the file myFile containing the string abc, and
grep 'ab*c' myFile1 myFile2
prints all the lines of the files myFile1 and myFile2 containing the strings ac, abc, abbc,
abbbc, etc.
grep had a considerable influence, and most programming languages, including
Perl, Python, Java, and C#, have a regex library. All the regex variants—or flavors—
adhere to an analogous syntax, with some differences, however, that hinder a universal
compatibility.
In the following sections, we will use the syntax defined by Perl. Because of its
built-in support for regexes and its simplicity, Perl was immediately recognized as
a real innovation in the world of scripting languages and was adopted by millions
of programmers. It is probably Perl that made regular expressions a mainstream
programming technique and, in return, it explains why the Perl regex syntax became
a sort of de facto standard that inspires most modern regex flavors, including that
of Python. The set of regular expressions that follows Perl is also called Perl
compatible regular expressions (PCRE).
The dot . is also a metacharacter that matches one occurrence of any character of
the alphabet except a new line. For example, a.e matches the strings ale and ace in
the sentence:
The aerial acceleration alerted the ace pilot
as well as age, ape, are, ate, awe, axe, or aae, aAe, abe, aBe, a1e, etc. We can
combine the dot and the star in the expression .* to match any string of characters
until we encounter a new line.
If the pattern to search contains a character that is also a metacharacter, for instance,
"?", we need to indicate it to the regex engine using a backslash \ before it. We saw
that abc? matches ab and abc. The expression abc\? matches the string abc?. In the
same vein, abc\. matches the string abc., and a\*bc matches a*bc.
We call the backslash an escape character. It transforms a metacharacter into a
literal symbol. We can also say that we “quote” a metacharacter with a backslash.
In Python, we must use a backslash escape with the 14 following characters:
. ^ $ * + ? { } [ ] \ | ( )
In some cases, the greedy strategy is not appropriate. To display the sentence
They match as early and as many characters as they can.
in a web page with two phrases set in bold, we need specific tags that we will insert
in the source file. Using HTML, the language of the web, the sentence will probably
be annotated as
They match <b>as early</b> and <b>as many</b> characters as
they can.
where <b> and </b> mark respectively the beginning and the end of a phrase set in
bold. (We will see annotation frameworks in more detail in Chap. 4, Encoding and
Annotation Schemes.)
A regular expression to search and extract phrases in bold could be:
<b>.*</b>
Unfortunately, applying this regex to the sentence will match one single string:
<b>as early</b> and <b>as many</b>
which is not what we wanted. In fact, this is not a surprise. As we saw, the regex
engine matches as early as it can, i.e., from the first <b> and as many characters as it
can up to the second </b>.
A possible solution is to modify the behavior of repetition metacharacters and
make them “lazy.” They will then consume as few characters as possible. We create
the lazy variant of a repetition metacharacter by appending a question mark to it
(Table 3.9). The regex
<b>.*?</b>
now matches <b>as early</b> first and then <b>as many</b>, which is the result we wanted.
We saw that the dot, ., represented any character of the alphabet. It is possible
to define smaller subsets or classes. A list of characters between square brackets
[...] matches any character contained in the list. The expression [abc] means one
occurrence of either a, b, or c; [ABCDEFGHIJKLMNOPQRSTUVWXYZ] means one uppercase
unaccented letter; and [0123456789] means one digit. We can concatenate character
classes, literal characters, and metacharacters, as in the expressions [0123456789]+
and [0123456789]+\.[0123456789]+, that match, respectively, integers and decimal
numbers.
Character classes are useful to search patterns with spelling differences, such as
[Cc]omputer [Ss]cience, which matches four different strings:
Computer Science
Computer science
computer Science
computer science
We can define the complement of a character class, that is, the characters of the set
that are not member of the class, using the caret symbol, ^, as the first symbol inside
the square brackets. For example:
• the expression [^a] means any character that is not an a;
• [^0123456789] means any character that is not a digit;
• [^ABCD]+ means any string that does not contain A, B, C, or D.
Such classes are also called negated character classes.
Range of Characters
Inside square brackets, we can also specify ranges using the hyphen character: -.
For example:
• The expression [1-4] means any of the digits 1, 2, 3, or 4, and a[1-4]b matches
a1b, a2b, a3b, and a4b.
• The expression [a-zàâäæçéèêëîïôöœßùûüÿ] matches any lowercase accented or
unaccented letter of French and German.
Metacharacters
If we want to search for a hyphen itself inside a character class, we must quote it
with a backslash like this: \-. The expression [1\-4] means any of the characters 1,
-, or 4.
In addition to the hyphen, the other metacharacters used in character classes are:
the closing square bracket, ], the backslash, \, and the caret, ^. As
for carets, they need to be quoted to be treated as normal characters in a character
class. However, when they are in an unambiguous position, Python will interpret
them correctly even without the escape sign. For instance, if the caret is not the first
character after the opening bracket, Python will recognize it as a normal character.
The expression [a^b] matches either a, ˆ, or b.
Most regex flavors, such as Perl, Java, and POSIX, have predefined classes. Table 3.10 lists some
you can encounter in Python programs. Some classes are adopted by all the regex
variants, while some others are more specific. Their definition may vary also. For
instance, \w+ will match the accented letters in Python, but not in Perl or Java. In
case of doubt, refer to the appropriate documentation.
Python’s regex module also defines classes as properties using the \p{class}
construct that matches the symbols in class and \P{class} that matches symbols
not in class. To name the properties or classes, Python uses categories defined
by the Unicode standard that we will review in Chap. 4, Encoding and Annotation
Schemes. As a rule, you should always prefer the Unicode classes. They will yield
the same results across the programming languages and they will enable you to
handle non-Latin scripts more easily.
Table 3.10 Some predefined character classes with their definition in Perl, after Wall et al. (2000).
These classes are also available in Python’s regex module. Note however that the content of the
\w class is different in Python as it includes accented characters. Always prefer the Unicode classes
with the \p{...} notation that we will see in Chap. 4, Encoding and Annotation Schemes
Expression Description Equivalent
\w Any word character: letter, digit, or underscore [a-zA-Z0-9_]
\W Any nonword character [^\w]
\s Any whitespace character: space, tabulation, new line, carriage return, or form feed [ \t\n\r\f]
\S Any nonwhitespace character [^\s]
\d Any digit [0-9]
\D Any nondigit [^0-9]
\p{L} Any Unicode letter. It includes accented letters
\P{L} Any Unicode nonletter
\p{Ll} Any Unicode lowercase letter. It includes accented letters
\p{Lu} Any Unicode uppercase letter. It includes accented letters
\p{N} Any Unicode number
\p{P} Any Unicode punctuation sign
Regex languages use three main operators. Two of them are already familiar
to us. The first one is the Kleene star or closure, denoted *. The second one
is the concatenation, which is usually not represented. It is implicit in strings
like abc, which is the concatenation of characters a, b, and c. To concatenate
the word computer, a space symbol, and science, we just write them in a row:
computer science.
The third operation is the union and is denoted “|”. The expression a|b
means either a or b. We saw that the regular expression [Cc]omputer [Ss]cience
could match four strings. We can rewrite an equivalent expression using the
union operator: Computer Science|Computer science|computer Science|computer
science. A union is also called an alternation because the corresponding expression
can match any of the alternatives, here four.
3.4 Programming with Regular Expressions

We saw that regular expressions were devices to define and search patterns in texts.
If we want to use them for more elaborate text processing such as translating
characters, substituting words, or counting them, we need to incorporate them in
a full-fledged programming language. We now describe, in this chapter as well as in
the next one, the regular expression implementation in Python.
The two main regex operations are match and substitute. They are often abridged
using the Perl regex notations where:
• The m/pattern/ construct denotes a match operation with the regular expression
pattern.
• The s/pattern/replacement/ construct is a substitution operation. This statement
matches the first occurrence of pattern and replaces it by the replacement string.
• We can add a sequence of modifiers to the m// and s/// constructs just after the
last /. For instance, if we want to replace all the occurrences of a pattern, we use
the g modifier, where g stands for globally: s/pattern/replacement/g.
Python has two regex engines provided by the re and regex modules. The first
one is the standard engine, while the second has extended Unicode capabilities.
Outside Unicode, they have similar properties and are roughly interchangeable. As
Unicode is ubiquitous in natural language processing, we will use regex with the
statement:
import regex as re
3.4.1 Matching
The m/pattern/ Operator
The matching operation, m/pattern/, is carried out using the re.search() function
with two arguments: the pattern to search, pattern, and a string. It returns a match
object for the first match in the string, or None if there is no match.
The next program applies m/ac*e/ to the string The aerial acceleration alerted
the ace pilot as in Fig. 3.10:
import regex as re
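The rest of the program is omitted in this excerpt; a minimal sketch consistent with the description could be:

line = 'The aerial acceleration alerted the ace pilot'
match = re.search('ac*e', line)
match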
and finds a match object spanning from index 4 to 6 with the value ae.
In Sect. 3.3, we used the grep command to read files, search an expression, and print
the lines where we found it. We will now write a minigrep program to replicate this
with a pattern, for instance ac*e, given as an argument. From a Unix terminal, we
will run the Python program with the command:
python 03_minigrep.py 'ac*e' < file_name
We use a loop to read the lines and we implement m/pattern/ with the re.search()
function:
import regex as re
import sys
pattern = sys.argv[1]
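The loop itself is not reproduced here; a minimal sketch matching the description below could be:

for line in sys.stdin:
    if re.search(pattern, line):
        print(line, end='')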
The program first extracts the pattern to match from the command line argument
sys.argv[1]. Then the loop reads from the standard input, sys.stdin, and assigns
the current line from the input to the line variable. The for statement reads all the
lines until it encounters an end of file. re.search() searches the pattern in line and
returns the first matched object, or None if there is no match. The if statement tells
the program to print the input when it contains the pattern.
The re.search() function supports a set of flags as a third argument that modify
the match operation. These flags are equivalent to Perl's m/pattern/modifiers.
re.findall() and re.finditer(), which find all the occurrences of a pattern, have
the same flags as re.search(). We saw that they are equivalent to the m/pattern/g
(globally) modifier in the PCRE framework.
Useful modifiers are:
• Case insensitive: i. The instruction m/pattern/i searches pattern in the target
string regardless of its case. In Python, this corresponds to the flag: re.I.
• Multiple lines: m (re.M in Python). By default, the anchors ^ and $ match the start
and the end of the input string. The instruction m/pattern/m considers the input
string as multiple lines separated by new line characters, where the anchors ^ and
$ match the start and the end of any line in the string.
• Single line: s (re.S in Python). Normally, a dot symbol “.” does not match new
line characters. The s modifier makes a dot in the instruction m/pattern/s match
any character, including new lines.
Modifiers can be grouped in any order as in m/pattern/im, for instance, or
m/pattern/sm, where a dot in pattern matches any character and the anchors ^ and
$ match just after and before new line characters.
In Python, the modifiers (called flags) are specified as a sequence separated by
vertical bars: |.
The next program applies the patterns m/^S/g and m/^s/g to a text: It prints the
letters S or s if they start a string.
iliad_opening = """Sing, O goddess, the anger of Achilles
son of Peleus, that brought countless ills upon the Achaeans.
""".strip()
Only S starts the string. The second findall() returns an empty list. We now add
the case-insensitive modifier:
re.findall('^s', iliad_opening, re.I) # m/^s/ig
# ['S']
We match S again, but not s as it starts a line, but not the string. To get both, we add
the multiline modifier:
re.findall('^s', iliad_opening, re.I | re.M) # m/^s/img
# ['S', 's']
3.4.4 Substitutions
Python uses the re.sub() function to substitute patterns. It has three arguments:
pattern, replacement, and string, where the substitution occurs. It returns a new
string, where by default, it substitutes all the pattern matches in string with
replacement. Additionally, a fourth parameter, count, gives the maximal number
of substitutions and a fifth, flags, the match modifiers.
We shall write a program to replace all the occurrences of es+ with EZ in
iliad_opening. This corresponds to the s/es+/EZ/g operation:
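The statement itself is omitted here; a minimal version could be:

line = re.sub('es+', 'EZ', iliad_opening) # s/es+/EZ/g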
If we just want to replace the first occurrence, we use this statement instead:
# Replaces the first occurrence
line = re.sub('es+', 'EZ', iliad_opening, 1) # s/es+/EZ/
3.4.5 Backreferences
However, we do not know in advance how many c letters we will match with the c+
pattern. To tell Python to remember it, we put parentheses around this pattern. This
is called a capturing group: It creates a buffer to hold the pattern and we refer back
to it by the sequence \1. We access its value in the matched object with the group(1)
method:
match = re.search('a(c+)e', line)
match.group(1) # 'cc'
The regex engine assigns the first pair of parentheses to \1, the second pair to \2, the third to \3, etc. Once the pattern is applied, the
\<digit> reference is returned by group(<digit>):
match_object.group(1)
match_object.group(2)
match_object.group(3)
etc.
Let us change our initial pattern from ac+e to .c+e. To remember the value of
what is matched by the dot, we parenthesize it as well as c+ and we have:
match = re.search('(.)(c+)e', line)
match.group(1) # 'a'
match.group(2) # 'cc'
We can even use the backreferences in the pattern. For instance, let us imagine
that we want to find a sequence of three identical characters, which corresponds to
matching a character and checking if the next two characters are identical to the first
one.
The instruction m/(.)\1\1/ matches such sequences. As in the previous section,
we call the group(1) method of the returned match object to obtain the value in the
buffer. Let us apply it to the string accceleration (note the three c's):
match = re.search(r'(.)\1\1', 'accceleration')
match.group(1) # 'c'
In addition to the regular expression escape sequences, we can use the Python escape
sequences defined in Table 2.1 to match nonprintable or numerically-encoded
characters.
For instance, the regular expression using the Unicode number class
m/\p{N}+\t\p{N}+/
matches two integers separated by a tabulation. The equivalent Python code, applied
to a Frequencies string where tabulations separate the numbers, yields:
re.search(r'\p{N}+\t\p{N}+', 'Frequencies: 100\t200\t300')
# <regex.Match object; span=(13, 20), match='100\t200'>
The regex below matches amounts of money starting with the dollar sign with
parentheses around the integer and decimal parts:
m/\$ *([0-9]+)\.?([0-9]*)/
and in Python:
price = "We’ll buy it for $72.40"
re.sub(r’\$ *([0-9]+)\.?([0-9]*)’,
r’\1 dollars and \2 cents’, price)
# We’ll buy it for 72 dollars and 40 cents
We saw in Sect. 3.4.1 that the search() operations result in match objects. We used
their group() method to return the matched groups, where:
• match_object.group() or match_object.group(0) return the entire match;
• match_object.group(n) returns the nth parenthesized subgroup.
In addition, the match_object.groups() returns a tuple with all the groups and the
match_object.string instance variable contains the input string.
price = "We’ll buy it for $72.40"
match = re.search(r’\$ *([0-9]+)\.?([0-9]*)’, price)
match.string # We’ll buy it for $72.40
match.groups() # (’72’, ’40’)
Match objects also provide the start([group]) and end([group]) methods that return
the start and end indices of the match, where [group] is the group number and where
0 or no argument means the whole matched substring. We can use them with slices
to extract the strings before and after a matched pattern as in this program:
odyssey_opening = """Tell me, O muse, of that ingenious hero
who travelled far and wide after he had sacked
the famous town of Troy.""".strip()
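The program itself is not reproduced in this excerpt. A minimal sketch using start() and end(); the pattern hero is an assumption:

match = re.search('hero', odyssey_opening)
before = odyssey_opening[:match.start()] # 'Tell me, O muse, of that ingenious '
after = odyssey_opening[match.end():]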
It is frequently the case that we need to create a regular expression from parameters
like in the minigrep program in Sect. 3.4.2. This corresponds to a function that
returns a regular expression. For minigrep, the function was straightforward as it
used the pattern as is. Let us take a more complex example with a pattern that
matches a string with a certain number of characters to the left and to the right:
'.{0,width}string.{0,width}'
where string and width are the function parameters, for instance:
string = 'my string'
width = 20
To solve this, we can use str.format() (see Sect. 2.4.5). In our example, the
function
def make_regex(string, width):
    return ('.{{0,{width}}}{string}.{{0,{width}}}'
            .format(string=string, width=width))
called with the values above returns a regular expression that matches the string
my string with 0 to 20 characters to the left and to the right.
Note that we escaped the literal curly braces by doubling them.
If string contains metacharacters, for instance a dot as in my string., we need to
escape them as in: my string\.. Otherwise, the dot would match any character. We
can do this automatically with re.escape(string) as in:
string = 'my string.'
re.escape(string)
# 'my\\ string\\.'
Applying this regex to the Odyssey with the string Penelope yields:
pattern = make_regex('Penelope', 15)
re.search(pattern, odyssey, re.S).group()
# ' of his\nmother Penelope, who persist i'
These arguments are passed to Python by the operating system in the form of a list.
Now let us write a concordance program inspired by Cooper (1999). We use
three arguments in the command line: the file name, the pattern to search, and the
span size. Python reads them and stores them in the list with the reserved name
sys.argv; the slice sys.argv[1:] skips the program name. We assign these arguments,
respectively, to file_name, pattern, and width.
We open the file using the open() function, read all the text and we assign it to the
text variable. If open() fails, the program exits using except and prints a message
to inform us that it could not open the file.
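A minimal sketch of these two steps, with the variable names from the text (the exact error handling in the book's program may differ):

[file_name, pattern, width] = sys.argv[1:]
width = int(width)
try:
    text = open(file_name).read()
except OSError:
    print('Could not open file:', file_name)
    sys.exit(1)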
In addition to single words, we may want to search concordances of a phrase
such as the Achaeans. Depending on the text formatting, the phrase’s words can be
on the same line or spread on two lines of text as in:
I see that the Achaeans are subject to you in great
multitudes.
...
the banks of the river Sangarius; I was their ally,
and with them when the Amazons, peers of men, came up
against them, but even they were not so many as the
Achaeans."
The Python string ’the Achaeans’ matches the first occurrence of the phrase in the
text, but not the second one as the two words are separated by a line break.
There are two ways to cope with that:
1. We can modify pattern, the phrase to search, so that it matches across sequences
of line breaks, tabulations, or spaces. To do this, we replace the sequences of
spaces in pattern with the generic white space character class: s/ +/\\s+/g.
2. The second possibility is to normalize the text, text, so that the line breaks and
all kinds of white spaces in the text are replaced with a standard space:
s/\s+/ /g.
Both solutions can deal with the multiple conventions to mark line breaks, the two
most common ones being \n and \r\n adopted, respectively, by Unix and Windows.
Moreover, the text normalization makes it easier to format the concordance output
and print the results. In our program, we will keep both instructions, although they
are somewhat redundant.
Finally, we write a regular expression to search the pattern. To find all the
concordances in text, we use a for loop and the re.finditer() method that returns
all the match objects in the form of an iterator. We use the start and end indices of
the match object to extract and print the left and right contexts.
import re
import sys
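The remainder of the program is not reproduced here. A sketch consistent with the description, where the exact formatting details are assumptions:

pattern = re.sub(' +', r'\\s+', pattern)   # let the phrase span line breaks
text = re.sub(r'\s+', ' ', text)           # normalize the white spaces
concordance = '.{{0,{width}}}{pattern}.{{0,{width}}}'.format(
    pattern=pattern, width=width)
for match in re.finditer(concordance, text):
    # Use the start and end indices to extract and print the contexts
    print(text[match.start():match.end()])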
The finditer() function, just like findall(), scans the text from left to right and
finds all the nonoverlapping matches. If we match a string and its left and right
contexts with
’.{0,20}my string.{0,20}’
the search is started again from the end index of the match (see Sect. 3.4.10).
This nonoverlapping match means that when the interval between two occur-
rences of my string is smaller than the width of the right context, width, it will
skip the second one. In the excerpt below the two occurrences of Hector are six
characters apart:
Meanwhile great Ajax kept on trying to drive a spear into Hector, but Hector was so skilful
that he held his broad shoulders well under cover of his ox-hide shield, ever on the look-out
for the whizzing of the arrows and the heavy thud of the spears.
Searching Hector with the regex above and a width of 20 characters will miss the
second occurrence, as the code below shows.
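A minimal check, with the excerpt abridged into a single string (a sketch):

excerpt = ('Meanwhile great Ajax kept on trying to drive a spear '
           'into Hector, but Hector was so skilful that he held his '
           'broad shoulders well under cover of his ox-hide shield')
re.findall('.{0,20}Hector.{0,20}', excerpt)

returns a single match although Hector occurs twice: the right context of the first match consumes the second occurrence.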
3.7 Approximate String Matching

So far, we have used regular expressions to match exact patterns. However, in many
applications, such as in spell checkers, we need to extend the match span to search
a set of related patterns or strings. In this section, we review techniques to carry out
approximate or inexact string matching.
Table 3.12 Typographical errors (typos) and corrections. Strings differ by one operation. The
correction is the source and the typo is the target. Unless specified, other operations are just copies.
After Kernighan et al. (1990)
Typo (Target) Correction (Source) Target Source Position Operation
acress actress – t 2 Deletion
acress cress a – 0 Insertion
acress caress ac ca 0 Transposition
acress access r c 2 Substitution
acress across e o 3 Substitution
acress acres s – 4 Insertion
acress acres s – 5 Insertion
carried out from left to right using two pointers that mark the position of the next
character to edit in both strings:
• The copy operation is the simplest. It copies the current character of the source
string to the target string. Evidently, the repetition of copy operations produces
equal source and target strings.
• Substitution replaces one character from the source string by a new character
in the target string. The pointers are incremented by one in both the source and
target strings.
• Insertion inserts a new character in the target string. The pointer in the target
string is incremented by one, but the pointer in the source string is not.
• Deletion deletes the current character of the source string, i.e., the current
character is not copied to the target string. The pointer in the source string is
incremented by one, but the pointer in the target string is not.
• Reversal (or transposition) copies two adjacent characters of the source string
and transposes them in the target string. The pointers are incremented by two
characters.
Kernighan et al. (1990) illustrate these operations with the misspelled word acress
and its possible corrections (Table 3.12). They named the transformations from the
point of view of the correction, not from the typo.
Spell checkers identify the misspelled or unknown words in text. They are ubiqui-
tous tools that we now find in almost all word processors, messaging applications,
editors, etc. Spell checkers start from a pre-defined vocabulary (or dictionary) of
correct words, scan the words from left to right, look them up in their dictionary,
and for the words outside the vocabulary, supposedly typos, suggest corrections.
Given a typo, spell checkers find words from their vocabulary that are close in
terms of edit distance. To carry this out, they typically apply edit operations to the
3.7 Approximate String Matching 79
typo to generate a set of new strings called “edits.” They then look up these edits in
the dictionary, discard the unknown ones, and propose the rest to the user as possible
corrections.
If we allow only one edit operation on a source string of length n, and if we
consider an alphabet of 26 unaccented letters, the deletion will generate n new
strings; the insertion, (n + 1) × 26 strings; the substitution, n × 25; and the
transposition, n − 1 new strings. In the next sections, we examine how to generate
these candidates in Python.
Generating correction candidates is easy in Python. We propose here an implementation
by Norvig (2007)1 that uses list comprehensions, one comprehension
per edit operation: delete, transpose, replace, and insert. The edits1() function
first splits the input, the unknown word, at all possible points and then applies the
operations to the list of splits. See Sect. 2.10 for a description. Finally, the set()
function returns the set of unique candidates.
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits = [(word[:i], word[i:])
              for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:]
                  for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:]
                for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)
Applying edits1() to acress returns a list of 336 unique candidates that includes
the dictionary words in Table 3.12 and more than 330 other strings such as: aeress,
hacress, acrecs, acrehss, acwress, acrses, etc.
Now how do we extract acceptable words from the full set of edits? In his
program, Norvig (2007) uses a corpus of 1 million words to build a vocabulary
(see Sect. 9.4.2 for a program to carry this out); the candidates are looked up in
this dictionary to find the possible corrections. When there is more than one valid
candidate, Norvig (2007) ranks them by the frequencies he observed in the corpus.
For acress, the edit operations yield five possible corrections listed below, where
the figure is the word frequency in the corpus:
{'caress': 4, 'across': 223, 'access': 57, 'acres': 37, 'actress': 8}
The spell checker proposes across as correction as it is the most frequent word.
We also note that cress is not in the list as this word was not in the corpus.
If edits1() does not generate any known word, we can reapply it to the list
of edits. Studies showed that most typos can be corrected with less than two edit
operations (Norvig, 2007).
The minimum edit distance is computed with the recurrence:
edit_distance(i, j) = min(edit_distance(i − 1, j) + del_cost,
                          edit_distance(i − 1, j − 1) + subst_cost,
                          edit_distance(i, j − 1) + ins_cost).
The boundary conditions for the first row and the first column correspond to a
sequence of deletions and of insertions. They are defined as edit_distance(i, 0) = i
and edit_distance(0, j) = j.
We compute the cell values as a walk through the table from the beginning of
the strings at the bottom left corner, and we proceed upward and rightward to fill
adjacent cells from those where the value is already known. Arrows in Fig. 3.11
represent the three edit operations, and Table 3.13 shows the distances to transform
language into lineage. The value of the minimum edit distance is 5 and is shown at
the upper right corner of the table.
The minimum edit distance algorithm is part of the dynamic programming
techniques. Their principles are relatively simple. They use a table to represent data,
and they solve a problem at a certain point by combining solutions to subproblems.
Dynamic programming is a generic term that covers a set of widely used methods
in optimization.
To implement the minimum edit distance in Python, we compute the length of the
source and target with len(), we create the table as a list of lists, we initialize the
first row and the first column, and we fill the table with the edit distance equation:
[source, target] = ('language', 'lineage')
length_s = len(source) + 1
length_t = len(target) + 1
table = [None] * length_s      # the table as a list of lists
for i in range(length_s):
    table[i] = [None] * length_t
    table[i][0] = i
for j in range(length_t):
    table[0][j] = j
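The filling step is not shown in this excerpt. A sketch, assuming a cost of 1 for insertions and deletions and 2 for substitutions, which is consistent with the distance of 5 reported for language and lineage in Table 3.13:

for i in range(1, length_s):
    for j in range(1, length_t):
        subst_cost = 0 if source[i - 1] == target[j - 1] else 2
        table[i][j] = min(
            table[i - 1][j] + 1,               # deletion
            table[i - 1][j - 1] + subst_cost,  # substitution or copy
            table[i][j - 1] + 1)               # insertion
table[length_s - 1][length_t - 1] # 5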
Fig. 3.12 Alignments of lineage and language. The figure contains two possible representations
of them. In the upper row, the deletions in the source string are in italics, as are the insertions in the
target string. The lower row shows a synchronized alignment, where deletions in the source string
as well as the insertions in the target string are aligned with epsilon symbols (null symbols)
Once we have filled the table, we can search the operation sequences that correspond
to the minimum edit distance. Such a sequence is also called an alignment.
Figure 3.12 shows two examples of them.
A frequently used technique is to consider each cell in Table 3.13 and to store
the coordinates of all the adjacent cells that enabled us to fill it. For instance, the
program filled the last cell of coordinates (8, 7), containing 5 (table[8][7]), using
the content of cell (7, 6). The storage can be a parallel table, where each cell contains
the coordinates of the immediately preceding positions (the backpointers). Starting
from the last cell down to the bottom left cell, (0, 0), we traverse the table from
adjacent cell to adjacent cell to recover all the alignments. This program is left as an
exercise.
3.8 Further Reading

Corpora are now easy to obtain. Organizations such as the Linguistic Data Consor-
tium, ELRA, or the massive Common Crawl collect and distribute texts in many
languages. Although not widely cited, Busa (1974, 1996) is the author of the first
large computerized corpus, the Index Thomisticus, a complete edition of the works
of Saint Thomas Aquinas. The corpus, which is entirely lemmatized, is available
online.2 Frantext is also a notable early corpus of more than 100 million words. It
helped write the Trésor de la langue française (Imbs and Quemada, 1971–1994), a
comprehensive French dictionary. Other early corpora include the Bank of English,
which contributed to the Collins COBUILD Dictionary (Sinclair, 1987).
Concordancing plays a role today that goes well beyond lexicography. Google,
Bing, and other web search engines can be considered as modern avatars of concordancers.
2 https://www.corpusthomisticum.org/.
3 https://www.regular-expressions.info/.
4 https://regex101.com/.
5 https://www.pcre.org/.
6 https://www.let.rug.nl/~vannoord/Fsa/.
7 https://www.openfst.org/.
Chapter 4
Encoding and Annotation Schemes
At the most basic level, computers only understand binary digits and numbers.
Corpora as well as any computerized texts have to be converted into a digital format
to be read by machines. From their early American history, computers inherited
encoding formats designed for the English language. The most famous one is the
American Standard Code for Information Interchange (ASCII). Although well
established for English, the adaptation of ASCII to other languages led to awkward
evolutions and many variants. It ended (temporarily?) with Unicode, a universal
scheme compatible with ASCII and intended to cover all the scripts of the world.
We saw in Chap. 3, Corpus Processing Tools that some corpora include linguistic
information to complement raw texts. This information is conveyed through annotations
that describe a variety of structures. They range from the binary annotation
of sentences or words, to text organization, such as titles, paragraphs, and sentences,
to linguistic information including grammatical data or syntactic structures, etc. In
contrast to character encoding, no annotation scheme has yet reached a level where
it can claim to be a standard.
In this chapter, we will examine two ways to annotate data: tables and graphs.
Tables consist of rows and columns, where a row stores an observation, such as a
sentence or a word, as well as its properties in different columns. The second one,
more complex, embeds the annotation in the text in the form of brackets, also called
markup, ultimately defining a graph structure. To create these graphs, we will use
the Extensible Markup Language (XML), a language to define annotations, with
a shared markup syntax. XML in itself is not an annotation language. It is a scheme
that enables users to define annotations within a specific framework.
In this chapter, we will introduce the most useful character encoding schemes,
see how we can load and write tabular datasets, and review the basics of XML. We
will examine related topics of standardized presentation of time and date, and how
to sort words in different languages. We will finally create a small program to collect
documents from the Internet and parse their XML structure.
Table 4.1 The ASCII character set arranged in a table consisting of 6 rows and 16 columns. We
obtain the ASCII code of a character by adding the first number of its row and the number of the
column. For instance, A has the decimal code 64 + 1 = 65, and e has the code 96 + 5 = 101
      0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
32       !  "  #  $  %  &  '  (  )  *  +  ,  -  .  /
48    0  1  2  3  4  5  6  7  8  9  :  ;  <  =  >  ?
64    @  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O
80    P  Q  R  S  T  U  V  W  X  Y  Z  [  \  ]  ^  _
96    `  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o
112   p  q  r  s  t  u  v  w  x  y  z  {  |  }  ~
Table 4.3 The ISO Latin 1 character set (ISO-8859-1) covering most characters from Western
European languages
Table 4.4 The ISO Latin 9 character set (ISO-8859-15) that replaces rare symbols from Latin 1
with the characters œ, Œ, š, Š, ž, Ž, Ÿ, and €. The table only shows rows that differ from Latin 1
Manufacturers extended ASCII with the eighth, unoccupied bit to cover the values in the
range [128..255] (2⁸ = 256). Unfortunately, these extensions were not standardized and depended on
the operating system. The same character, for instance, ê, could have a different
encoding in the Windows, Macintosh, and Unix operating systems.
The ISO Latin 1 character set (ISO-8859-1) is a standard that tried to reconcile
Western European character encodings (Table 4.3). Unfortunately, Latin 1 was ill-designed
and omitted characters such as the French Œ and œ, the German opening quote „, or
the Dutch ij and IJ. Operating systems such as Windows and macOS used a variation
of it that they had to complement with the missing characters. Later, ISO Latin 9
(ISO-8859-15) updated Latin 1 (Table 4.4). It restored forgotten French and Finnish
characters and added the euro currency sign, €.
4.1.2 Unicode
While ASCII has been very popular, its 128 positions could not support the
characters of most languages in the world. Therefore a group of companies formed a
consortium to create a new, universal coding scheme: Unicode. Unicode has quickly
replaced older encoding schemes, and Windows, macOS, and Python have now
adopted it while sometimes ensuring backward compatibility.
The initial goal of Unicode was to define a superset of all other character sets,
ASCII, Latin 1, and others, to represent all the languages of the world. The Unicode
consortium has produced character tables of most alphabets and scripts of European,
Asian, African, and Near Eastern languages, and assigned numeric values to the
characters. Unicode started with a 16-bit code that could represent up to 65,000
characters. The code was subsequently extended to 32 bits with values ranging from
0 to 10FFFF in hexadecimal. The Unicode code space then occupies 21 bits out of
the 32, corresponding to a capacity of 1,114,112 valid code points.
The standardized set of Unicode characters is called the universal character set
(UCS). It is divided into several planes, where the basic multilingual plane (BMP)
contains all the common characters, with the exception of some Chinese ideograms.
Characters in the BMP fit on a 2-octet code (UCS-2). The 4-octet code (UCS-4)
can represent, as we saw, more than a million characters. It covers all the UCS-2
characters and rare characters: historic scripts, some mathematical symbols, private
characters, etc.
Unicode identifies each character or symbol by a code point and a name. The
code point consists of a U+ prefix and a hexadecimal number, starting at U+0000,
as in:
U+0041 LATIN CAPITAL LETTER A
U+0042 LATIN CAPITAL LETTER B
U+0043 LATIN CAPITAL LETTER C
...
U+0391 GREEK CAPITAL LETTER ALPHA
U+0392 GREEK CAPITAL LETTER BETA
U+0393 GREEK CAPITAL LETTER GAMMA
We obtain the code point of a character and the character corresponding to a code
point with the ord() and chr() functions, respectively:
ord('C'), ord('Γ')  # (67, 915)
hex(67), hex(915)   # ('0x43', '0x393')
chr(67), chr(915)   # ('C', 'Γ')
Table 4.5 Unicode subrange allocation of the universal character set (simplified)
Code Name Code Name
0000 Basic Latin 1400 Unified Canadian Aboriginal Syllabics
0080 Latin-1 Supplement 1680 Ogham
0100 Latin extended-A 16A0 Runic
0180 Latin extended-B 1780 Khmer
0250 IPA extensions 1800 Mongolian
02B0 Spacing modifier letters 1E00 Latin extended additional
0300 Combining diacritical marks 1F00 Greek extended
0370 Greek and Coptic 2000 General punctuation
0400 Cyrillic 2800 Braille patterns
0500 Cyrillic supplement 2E80 CJK radicals supplement
0530 Armenian 2F00 Kangxi Radicals
0590 Hebrew 3000 CJK Symbols and Punctuation
0600 Arabic 3040 Hiragana
0700 Syriac 30A0 Katakana
0750 Arabic supplement 3100 Bopomofo
0780 Thaana 3130 Hangul compatibility Jamo
07C0 NKo 3190 Kanbun
0800 Samaritan 31A0 Bopomofo extended
0900 Devanagari 3200 Enclosed CJK letters and months
0980 Bengali 3300 CJK Compatibility
0A00 Gurmukhi 3400 CJK unified ideographs extension A
0A80 Gujarati 4E00 CJK unified ideographs
0B00 Oriya A000 Yi syllables
0B80 Tamil A490 Yi radicals
0C00 Telugu AC00 Hangul syllables
0C80 Kannada D800 High surrogates
0D00 Malayalam E000 Private use area
0D80 Sinhala F900 CJK compatibility ideographs
0E00 Thai 10000 Linear B syllabary
0E80 Lao 10140 Ancient Greek numbers
0F00 Tibetan 10190 Ancient symbols
1000 Myanmar 10300 Old italic
10A0 Georgian 10900 Phoenician
1100 Hangul Jamo 10920 Lydian
1200 Ethiopic 12000 Cuneiform
13A0 Cherokee 100000 Supplementary private use area-B
Unicode allows the composition of accented characters from a base character and
one or more diacritics. That is the case for the French Ê or the Scandinavian Å. Both
characters have a single code point:
U+00CA LATIN CAPITAL LETTER E WITH CIRCUMFLEX
U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
and
U+0041 LATIN CAPITAL LETTER A
U+030A COMBINING RING ABOVE
as we have, with e_1 holding the precomposed character Ê and e_2 its decomposed
variant, E followed by the combining circumflex:
[hex(ord(cp)) for cp in e_1]  # ['0xca']
[hex(ord(cp)) for cp in e_2]  # ['0x45', '0x302']
Python's unicodedata module has a normalize() function to carry this out. As form,
we use 'NFD' to decompose the string into a canonical sequence:
import unicodedata

[hex(ord(cp)) for cp in unicodedata.normalize('NFD', e_1)]
# ['0x45', '0x302']
[hex(ord(cp)) for cp in unicodedata.normalize('NFD', e_2)]
# ['0x45', '0x302']
This will also make two look-alike characters equivalent, such as the angstrom sign Å,
with the U+212B code point, and the Swedish letter Å. As the decomposition always
follows the same fixed order, we will be able to compare strings:
unicodedata.normalize('NFC', e_1) == unicodedata.normalize('NFC', e_2)
# True
Unicode associates a list of properties to each code point. This list is defined in
the Unicode character database and includes the name of the code point (character
name), its so-called general category—whether it is a letter, digit, punctuation,
symbol, mark, or other—the name of its script, for instance Latin or Arabic, and
its code block (The Unicode Consortium 2012).
Each property has a set of possible values. Table 4.6 shows this set for the general
category, where each value consists of one or two letters. The first letter is a major
class and the second one, a subclass of it. For instance, L corresponds to a letter, Lu
to an uppercase letter; Ll, to a lowercase letter, while N corresponds to a number and
Nd, to a number, decimal digit.
In Python, we extract the character name and category with:
unicodedata.name('Γ')      # 'GREEK CAPITAL LETTER GAMMA'
unicodedata.category('Γ')  # 'Lu'
For a block, we build a Python regex by replacing property in \p{property} with the
block name from Table 4.5. Python also requires an In prefix and that white spaces
are replaced with underscores, as in InBasic_Latin or InLatin_Extended-A.
For example, \p{InGreek_and_Coptic} matches code points in the Greek and
Coptic block whose Unicode range is [0370..03FF]. This roughly corresponds
to the Greek characters. However, some of the code points in this block are not
assigned and some others are Coptic characters.
For a general category, we use either the long or the short name in Table 4.6, for
instance Letter or Lu. For example, \p{Currency_Symbol} matches currency
symbols and \P{L} all nonletters.
For a script, we use its name in Table 4.7. The regex will match all the code
points belonging to this script, even if they are scattered in different blocks.
For example, the regex \p{Greek} matches the Greek characters in the Greek
and Coptic, Greek Extended, and Ancient Greek Numbers blocks, respectively
[0370..03FF], [1F00..1FFF], and [10140..1018F], ignoring the unassigned code
points of these blocks and characters that may belong to another script, here
Coptic.
Practically, the three instructions below match lines consisting respectively of
ASCII characters, of characters in the Greek and Coptic block, and of Greek
characters:
import regex as re

alphabet = 'αβγδεζηθικλμνξοπρστυφχψω'
match = re.search(r'^\p{InBasic_Latin}+$', alphabet)
match  # None
match = re.search(r'^\p{InGreek_and_Coptic}+$', alphabet)
match  # matches alphabet
match = re.search(r'^\p{Greek}+$', alphabet)
match  # matches alphabet
Unicode offers three major different encoding schemes: UTF-8, UTF-16, and UTF-
32. The UTF schemes—Unicode transformation format—encode the same data by
units of 8, 16, or 32 bits and can be converted from one to another without loss.
UTF-16 was the original encoding scheme when Unicode started with 16 bits. It
uses fixed units of 16 bits—2 bytes—to encode directly most characters. The code
units correspond to the sequence of their code points using precomposed characters,
such as Ê in FÊTE
0046 00CA 0054 0045
Depending on the operating system, 16-bit codes like U+00CA can be stored
with the highest byte first—00CA—or last—CA00. To identify how an operating system
orders the bytes of a file, it is possible to insert a byte order mark (BOM), a
dummy character tag, at the start of the file. UTF-16 uses the code point U+FEFF
to tell whether the storage uses the big-endian convention, where the “big” part of
the code is stored first (FEFF), or the little-endian one (FFFE).
UTF-8 is a variable-length encoding. It maps the ASCII characters U+0000
to U+007F to their byte values 00 to 7F. It then takes on the legacy of ASCII. All
the other characters, in the range U+0080 to U+10FFFF, are encoded as sequences of
two or more bytes. Table 4.8 shows the mapping principles of the 32-bit character
code points to 8-bit units.
Let us encode FÊTE in UTF-8. The letters F, T, and E are in the range U-
00000000—U-0000007F. Their numeric code values are exactly the same in ASCII
and UTF-8. The code point of Ê is U+00CA and is in the range U-00000080 – U-
000007FF. Its binary representation is 0000 0000 1100 1010. UTF-8 uses the 11
rightmost bits of 00CA. The first five of these bits (00011) together with the prefix 110
Table 4.8 Mapping of 32-bit character code points to 8-bit units according to UTF-8. The xxx
corresponds to the rightmost bit values used in the character code points
Range Encoding
U-0000–U-007F 0xxxxxxx
U-0080–U-07FF 110xxxxx 10xxxxxx
U-0800–U-FFFF 1110xxxx 10xxxxxx 10xxxxxx
U-010000–U-10FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
form the octet 1100 0011 that corresponds to C3 in hexadecimal. The six remaining
bits (001010) with the prefix 10 form the octet 1000 1010 or 8A in hexadecimal. The
letter Ê is then encoded as 1100 0011 1000 1010 or C3 8A in UTF-8. Hence, the
word FÊTE and the code points U+0046 U+00CA U+0054 U+0045 are encoded as
46 C3 8A 54 45
UTF-32 represents the code points exactly by their code values. One question
remains: how does UTF-16 represent the code points above U+FFFF? The answer
is: it uses two surrogate positions consisting of a high surrogate in the range
U+D800 .. U+DBFF and a low surrogate in the range U+DC00 .. U+DFFF. This is
made possible because the Unicode consortium does not expect to assign characters
beyond the code point U+10FFFF. Using the two surrogates, characters between
U+10000 and U+10FFFF can be converted from UTF-32 to UTF-16, and vice versa.
Finally, the storage requirements of the Unicode encoding schemes are, of
course, different and depend on the language. A text in English will have approxi-
mately the same size in ASCII and in UTF-8. The size of the text will be doubled in
UTF-16 and four times its original size in UTF-32, because all characters take four
bytes.
A text in a Western European language will be larger in UTF-8 than in ASCII
because of the accented characters: a nonaccented character takes one octet, and
an accented one takes two. The exact size will thus depend on the proportion of
accented characters. The text size will be twice its ASCII size in UTF-16. Characters
in the surrogate space take 4 bytes, but they are very rare and should not increase the
storage requirements. UTF-8 is then more compact for most European languages.
This is not the case with other languages. A Chinese or Indic character takes, on
average, three bytes in UTF-8 and only two in UTF-16.
A basic sorting algorithm may suffice for some applications. However, most
of the time it would be unacceptable when the ordered words are presented to a
user. The result would be even more confusing with accented characters, since their
location is completely random in the extended ASCII tables.
In addition, the lexicographic ordering of words varies from language to lan-
guage. French and English dictionaries sort accented letters as nonaccented ones,
except when two strings are equal except for the accents. Swedish dictionaries treat
the letters Å, Ä, and Ö as distinct symbols of the alphabet and sort them after Z.
German dictionaries have two sorting standards. They process accented letters either
as single characters or as couples of nonaccented letters. In the latter case, Ä, Ö, Ü,
and ß are considered respectively as AE, OE, UE, and ss.
The Unicode consortium has defined a collation algorithm (Whistler and Scherer
2023) that takes into account the different practices and cultures in lexical ordering.
It can be parameterized to cover most languages and conventions. It uses three levels
of difference to compare strings. We outline their features for European languages
and Latin scripts:
• The primary level considers differences between base characters, for instance,
between A and B.
• If there are no differences at the first level, the secondary level considers the
accents on the characters.
• And finally, the third level considers the case differences between the characters.
These level features are general, but not universal. Accents are a secondary
difference in many languages, but we saw that Swedish sorts accented letters as
individual ones and hence sets a primary difference between A and Å, or o and Ö.
Depending on the language, the levels may have other features.
To deal with the first level, the Unicode collation algorithm defines classes of
letters that gather upper- and lowercase variants, accented and unaccented forms.
Hence, we have the ordered sets: {a, A, á, Á, à, À, etc.} < {b, B} < {c, C, ć, Ć, ĉ, Ĉ,
ç, Ç, etc.} < {e, E, é, É, è, È, ê, Ê, ë, Ë, etc.} < . . . .
The second level considers the accented letters if two strings are equal at the
first level. Accented letters are ranked after their nonaccented counterparts. The first
accent is the acute one (´), then come the grave accent (`), the circumflex (ˆ), and
the umlaut (¨). So, instances of letter E with accents, in lower- and uppercase have
the order: {e, E} << {é, É} << {è, È} << {ê, Ê} << {ë, Ë}, where << denotes a
difference at the second level. The comparison at the second level is done from the
left to the right of a word in English and most languages. It is carried out from the
right to the left in French, i.e., from the end of a word to its beginning.
Similarly, the third level considers the case of letters when there are no
differences at the first and second levels. Lowercase letters are before uppercase
ones, that is, {a} <<< {A}, where <<< denotes a difference at the third level.
Table 4.11 shows the lexical order of pêcher ‘peach tree’ and Péché ‘sin’,
together with various conjugated forms of the verbs pécher ‘to sin’ and pêcher ‘to
fish’ in French and English. The order takes the three levels into account and the
reversed direction of comparison in French for the second level. German adopts the
English sorting rules for these accents.
Some characters are expanded or contracted before the comparison. In French,
the letters Œ and Æ are considered as pairs of two distinct letters: OE and AE.
In traditional German used in telephone directories, Ä, Ö, Ü, and ß are expanded
into AE, OE, UE, and ss and are then sorted as an accent difference with the
corresponding letter pairs. In traditional Spanish, Ch is contracted into a single letter
that sorts between Cz and D.
The implementation of the collation algorithm (Whistler and Scherer 2023)
first maps the characters onto collation elements that have three numerical fields
to express the three different levels of comparison. Each character has constant
numerical fields that are defined in a collation element table. The mapping may
require a preliminary expansion, as for æ and œ into ae and oe or a contraction.
The algorithm then forms for each string the sequence of the collation elements of
its characters. It creates a sort key by rearranging the elements of the string and
concatenating the fields according to the levels: the first fields of the string, then
second fields, and third ones together. Finally, the algorithm compares two sort keys
using a binary comparison that applies to the first level, to the second level in case
of equality, and finally to the third level if levels 1 and 2 show no differences.
A locale value such as en_US.UTF-8 tells that the machine is using the UTF-8
encoding and that it will handle characters with the US English rules. The parameters
follow the POSIX standard that is organized into categories:
• LC_COLLATE defines how a string will be sorted;
• LC_TIME, how to format the time;
• LC_NUMERIC, how to format the numbers;
• LC_MONETARY, how to format monetary values, etc.
Each category can have a specific language. Normally, we use the same language
for all. We set a new locale with setlocale(). If we want to change all the parameter
values, we use LC_ALL as in
locale.setlocale(locale.LC_ALL, 'fr_FR.UTF-8')  # 'fr_FR.UTF-8'
locale.getlocale(locale.LC_COLLATE)             # 'fr_FR.UTF-8'
Note that Python just extracts these parameters from the operating system and by
default these settings follow its language variable stored in LANG.
Using these parameters, Python can format the time, the currency, etc. in a
specific language. Here we will focus on sorting strings.
By default, the Python sorted() function follows the code point order in the
Unicode table:
accented = 'aäeéAÄEÉ'
sorted(accented)  # ['A', 'E', 'a', 'e', 'Ä', 'É', 'ä', 'é']
where E is before a. To sort with a locale, we set the key to locale.strxfrm that
will transform the strings so that they can be compared with the Unicode collation
algorithm. With this key, we normally have the usual lexical order:
sorted(accented, key=locale.strxfrm)
# ['a', 'A', 'ä', 'Ä', 'e', 'E', 'é', 'É']
where a is before E.
Unfortunately, the locale module has different implementations on different
operating systems. macOS will not print the same results as Linux for instance. That
is why it is preferable to use the international components for Unicode (ICU), which
is part of the Unicode standard. ICU consists of Java and C++ classes to handle
Unicode text, including collation. It has a corresponding Python module called icu.
The statements below create an ICU collator from the French locale and use it as
a key to sort the letters. This yields the correct order on all the operating systems or
programming languages:
import icu

collator = icu.Collator.createInstance(icu.Locale('fr_FR.UTF8'))
sorted(accented, key=collator.getSortKey)
# ['a', 'A', 'ä', 'Ä', 'e', 'E', 'é', 'É']
Many datasets are available in the form of tabular data consisting of rows and
columns. In such tables, a row represents a sample or observation and the columns
contain the parameters of this sample. The construction of models in classification
tasks like the analysis of sentiment, the detection of spam, or the meaning
equivalence of two sentences often relies on tabular datasets.
Tabular datasets include product or movie reviews, where each row contains the
text of a review and its label, for instance positive, neutral, or negative; electronic
messages with the text of the message and whether it is spam or not; or pairs of
questions and whether they are equivalent or not. Tabular datasets may also contain
the annotation of a word with its part of speech and grammatical features as we will
see in Chap. 12, Words, Parts of Speech, and Morphology.
In this section, we will consider the Quora question pairs dataset (QQP) (Iyer
et al. 2017) as it does not require any preprocessing before we can load it. QQP
has more than 400,000 annotated samples and six columns, giving respectively the
index of the question pair, the index of the first question, the index of the second
one, the text of questions 1 and 2, and if the questions are duplicates. Table 4.12
shows examples of such pairs.
The QQP file uses tabulations to separate the values in a row. Such files often
have the TSV suffix, for tab-separated values. The comma is another frequent
separator, and we then have comma-separated values, or CSV, files.
Python has a csv module that can read and write TSV and CSV files. The Pandas
module can handle the same files with its DataFrame class and has more advanced
capabilities. Let us first have a look at csv.
Table 4.12 An excerpt from the Quora question pair dataset. After Iyer et al. (2017)
Id    Qid1  Qid2  Question1                        Question2                        Is_duplicate
447   895   896   What are natural numbers?        What is a least natural          0
                                                   number?
1518  3037  3038  Which pizzas are the most        How many calories does a         0
                  popularly ordered pizzas         Dominos pizza have?
                  on Domino's menu?
3272  6542  6543  How do you start a bakery?       How one can start a bakery       1
                                                   business?
3362  6722  6723  Should I learn python or         If I had to choose between       1
                  Java first                       Java and Python, what
                                                   should I choose to learn
                                                   first?
QQP is available from the internet. We could download it and read it from a local
file. Instead, we load it directly from the server as we did with the corpus of classics
in Sect. 2.14. We use the requests module to open the connection:
import csv
import requests

qqp_url = 'https://qim.fs.quoracdn.net/quora_duplicate_questions.tsv'
col_names = ['id', 'qid1', 'qid2', 'question1', 'question2',
             'is_duplicate']
qqp_reader = csv.DictReader(
    requests.get(qqp_url).text.splitlines(),
    delimiter='\t')
We iterate over the sequence of rows and we create a list from them. The list items
are dictionaries, where the keys are the column names and the values what the reader
extracts from the rows:
>>> qqp_dataset = [row for row in qqp_reader]
>>> qqp_dataset[447]
{'id': '447',
 'qid1': '892',
 'qid2': '893',
 'question1': 'What are natural numbers?',
 'question2': 'What is a least natural number?',
 'is_duplicate': '0'}
The csv module also has a writer class, csv.DictWriter(), with the
writeheader() and writerow() methods to write the header and the rows:
with open('qqp.tsv', 'w') as qqp_tsv:
    writer = csv.DictWriter(qqp_tsv, fieldnames=col_names,
                            delimiter='\t')
    writer.writeheader()
    for row in qqp_dataset:
        writer.writerow(row)
4.3.2 Pandas
import pandas as pd
from io import StringIO

qqp_pandas = pd.read_csv(
    StringIO(requests.get(qqp_url).text),
    sep='\t')
We extract the rows with the integer location slice method, iloc[]:
>>> qqp_pandas.iloc[447]
id 447
qid1 892
qid2 893
question1 What are natural numbers?
question2 What is a least natural number?
is_duplicate 0
Finally, we can also save the records, the list of dictionaries making the dataset,
as a JSON object as in Sect. 2.14:
import json
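The call itself is a minimal sketch; the output file name, qqp.json, is an assumption:
with open('qqp.json', 'w') as json_file:  # hypothetical file name
    json.dump(qqp_dataset, json_file)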
In tabular datasets, we separated the annotation from the raw text by writing it
in specific columns. Sometimes, this is not very convenient as when we want
to describe the property of a word or a sequence of words inside a sentence. A
more intuitive practice is to incorporate the annotation of words in the text in the
form of sets of labels, also called markup languages. Corpus markup languages
are comparable to those of standard word processors such as Microsoft Word or
LaTeX. They consist of tags inserted in the text that request, for instance, to start a
new paragraph, or to set a phrase in italics or in bold characters. The Office Open
XML from ISO/IEC (2016) and the (La)TeX format designed by Knuth (1986) are
widely used markup languages (Table 4.13).
While TeX was created by one person and is now maintained by a group of
users (TUG), Office Open XML has been adopted as an international standard.
A common point of many of the current standards is that they originate in the
Standard Generalized Markup Language (SGML). SGML could have failed and remained a
forgotten international initiative. But the Internet and the World Wide Web, which
use hypertext markup language (HTML), a specific implementation of SGML, have
ensured its posterity. In the next sections, we introduce the extensible markup
language (XML), which builds on the simplicity of HTML that has secured its
success, and extends it to handle any kind of data.
A document type definition (DTD) specifies how elements can combine with other tags,
for instance, to define what a chapter is and to verify that it contains
a title. Among the coding schemes defined by DTDs are:
• the extensible hypertext markup language (XHTML), a clean XML implementation
of HTML that models the Internet Web pages;
• the Text Encoding Initiative (TEI), which is used by some academic projects to
encode texts, in particular, literary works;
• DocBook, which is used by publishers and open-source projects to produce books
and technical documents.
A DTD is composed of three kinds of components called elements, attributes,
and entities. Comments of DTDs and XML documents are enclosed between the
<!-- and --> tags.
Elements
Elements are the logical units of an XML document. They are delimited by
surrounding tags. A start tag enclosed between angle brackets precedes the element
content, and an end tag terminates it. End tags are the same as start tags with a /
prefix. XML tags must be balanced, which means that an end tag must follow each
start tag. Here is a simple example of an XML document inspired by the DocBook
specification:
<!-- My first XML document -->
<book>
<title>Language Processing Cookbook</title>
<author>Pierre Cagné</author>
<!-- Image to show on the cover -->
<img></img>
<text>Here comes the text!</text>
</book>
where <book> and </book> are legal tags indicating, respectively, the start and the
end of the book, and <title> and </title> the beginning and the end of the title.
Empty elements, such as the image <img></img>, can be abridged as <img/>. Unlike
HTML, XML tags are case sensitive: <TITLE> and <title> define different elements.
We can visualize the structure of an XML document with a parse tree as in
Figure 4.1 for our first XML document.
Attributes
An element can have attributes, i.e., a set of properties attached to the element.
Let us complement our book example so that the <title> element has an alignment
whose possible values are flush left, right, or center, and a character style taken from
underlined, bold, or italics. Let us also indicate where <img> finds the image file. The
DTD, when it exists, specifies the possible attributes of these elements and the value
list among which the actual attribute value will be selected. The actual attributes of
an element are supplied as name–value pairs in the element start tag.
Let us name the alignment and style attributes align and style and set them in
boldface characters and centered, and let us store the image file name of the img
element in the src attribute. The markup in the XML document will look like:
<title align="center" style="bold">
Language Processing Cookbook
</title>
<author>Pierre Cagné</author>
<img src="pierre.jpg"/>
Entities
Python has a set of modules that enable a programmer to download the content of
a web page given its address, a URL. To get the Wikipedia page on Aristotle, we write:
import requests

url_en = 'https://en.wikipedia.org/wiki/Aristotle'
html_doc = requests.get(url_en).text
where we use the requests module, get() to open and read the page, and the text
attribute to access the HTML document.
Printing the html_doc variable shows the page with all its markup:
<!DOCTYPE html>
<html lang="en" dir="ltr" class="client-nojs">
<head>
<meta charset="UTF-8"/>
<title>Aristotle - Wikipedia, the free encyclopedia</title>
...
4.5.2 HTML
HTML is similar to XML with specific elements like <title> to markup a title,
<body> for the body of a page, <h1>, <h2>, ..., <h6>, for headings from the highest
level to the lowest one, <p> for a paragraph, etc. Although it is no longer the case,
HTML used to be defined with a DTD.
One of the most important features of HTML is its ability to define links to
any page of the Web through hyperlinks. We create such hyperlinks using the <a>
element (anchor) and its href attribute as in:
<a href="https://en.wikisource.org/wiki/Author:Aristotle">
Wikisource</a>
where the text inside the start and end <a> tags, Wikisource, will show in blue on
the page and will be sensitive to user interaction. In the context of Wikipedia, this
text in blue is called a label or, in the Web jargon, an anchor. If the user clicks on
it, s/he will be moved to the page stored in the href attribute: https://en.wikisource.
org/wiki/Author:Aristotle. This part is called the link or the target. In our case, it
will lead to Wikisource, a library of free texts, and here to the works of Aristotle
translated into English.
Before we can apply language processing components to a page collected from the
Web, we need to parse its HTML structure and extract the data. To carry this out, we
will use Beautiful Soup, a very popular HTML parser (https://www.crummy.com/
software/BeautifulSoup/).
import bs4
import requests

url_en = 'https://en.wikipedia.org/wiki/Aristotle'
html_doc = requests.get(url_en).text
parse_tree = bs4.BeautifulSoup(html_doc, 'html.parser')
The parse_tree variable contains the parsed HTML document from which we
can access its elements and their attributes. We access the title and its markup
through the title attribute of parse_tree (parse_tree.title) and the content of the
title with parse_tree.title.text:
parse_tree.title
# <title>Aristotle - Wikipedia, the free encyclopedia</title>
parse_tree.title.text
# Aristotle - Wikipedia, the free encyclopedia
The h2 headings contain the page's subtitles. We access the list of subtitles using the
find_all() method:
headings = parse_tree.find_all('h2')
[heading.text for heading in headings]
# ['Contents', 'Life', 'Thought', 'Loss and preservation of
#  his works', 'Legacy', 'List of works', 'Eponyms', 'See also',
#  'Notes and references', 'Further reading', 'External links',
#  'Navigation menu']
Finally, we can easily find all the links and the labels from a Web page with this
statement:
links = parse_tree.find_all(’a’, href=True)
where we collect all the anchors that have a href attribute. Then, we create the list
of labels:
[link.text for link in links]
which will return None if there is no link. Alternatively, we can find the links with a
dictionary notation:
[link['href'] for link in links]
This will raise a KeyError if there is no link. This should not happen here as we
specified href=True when we collected the links.
Web addresses (URLs) can either be absolute, i.e., containing the full address
including the host name, like https://en.wikipedia.org/wiki/Aristotle, or relative to
the start page, with just the file name, like /wiki/Organon. If we need to access the
latter Web page, we need to create an absolute address, https://en.wikipedia.org/
wiki/Organon, from the relative one. We can do this with the urljoin() function
and this program:
from urllib.parse import urljoin
url_en = 'https://en.wikipedia.org/wiki/Aristotle'
...
[urljoin(url_en, link['href']) for link in links]
# List of absolute addresses
1 https://home.unicode.org/
2 https://www.unicode.org/ucd/
3 https://icu.unicode.org/
HTML and XML markup standards are continuously evolving. Their specifications
are available from the World Wide Web Consortium.4 HTML and XML parsers
are available for most programming languages. Beautiful Soup is the most popular
one for Python and has excellent documentation.5 Finally, a good reference on
XML is Learning XML (Ray 2003).
4 https://www.w3.org/
5 https://www.crummy.com/software/BeautifulSoup/
Chapter 5
Python for Numerical Computations
Calculemus!
Leibniz’s precept
Machine learning is now essential to natural language processing. This field uses
mathematical models, where vector and matrix operations are the basic tools to
create or simply use algorithms. One of the compelling features of Python is its
numerical module, NumPy, which considerably facilitates numerical computations.
Using it, a vector sum or a matrix product is a statement that fits on one line. PyTorch
is another library of linear algebra tools equipped with additional machine-learning
capabilities that we will use extensively in this book.
In this chapter, we will review elementary NumPy data structures, operators,
and functions, as well as those of PyTorch. We will illustrate them on a corpus of
texts where we will count the characters and apply arithmetic operations and linear
functions. In Sect. 2.6.7, we used dictionaries. Using this data structure, we could
define functions to scale the counts or to sum them across all the texts of a corpus.
We will see that we can do this much more easily with NumPy arrays or PyTorch
tensors. First, we will represent the texts by vectors and the dataset with a matrix.
Then, we will use the matrix to compute the pairwise similarity between the texts.
In addition to the Python modules, we will also refer to their underlying
mathematical model: the vector space. I hope this will help the reader understand the broader
context of the chapter. This will not replace a serious course on linear algebra,
however, and, if needed, readers are invited to complement their knowledge with
good tutorials on the topic.
5.1 Dataset
Throughout this chapter, we will use the corpus of classics we collected in Sect. 2.14 to
exemplify the mathematical operations. To start with, we store the texts in a list of
lists:
titles = ['iliad', 'odyssey', 'eclogue', 'georgics', 'aeneid']
texts = []
for title in titles:
    texts += [classics[title]]
Using the Text class from Sect. 2.16, we compute the character counts:
cnt_dicts = []
for text in texts:
cnt_dicts += [Text(text).count_letters()]
and we extract the counts from the dictionaries that we store in lists:
cnt_lists = []
for cnt_dict in cnt_dicts:
cnt_lists += [list(map(lambda x: cnt_dict.get(x, 0),
alphabet))]
Table 5.1 shows the counts for the ten first letters.
5.2 Vectors
Vectors, matrices, and more generally tensors, are the fundamental mathematical
objects in statistical and machine learning. We will use them to represent, store, and
process datasets. Their corresponding data structure is a NumPy array or a PyTorch
tensor. We start with vectors, which are tuples of n numbers, for instance with n = 2,
3, or 26. We call these numbers the vector coordinates.
Table 5.1 The counts of the ten first alphabet letters in the texts of the dataset
Title     a       b     c       d       e       f       g       h       i       j
Iliad     51,016  8938  11,558  28,331  77,461  16,114  12,595  50,192  38,149  1624
Odyssey   37,627  6595  8580    20,736  59,777  10,449  9803    34,785  28,793  424
Eclogue   2716    577   722     1440    4363    846     806     2508    2250    22
Georgics  6841    1618  2016    4027    12,110  2424    2147    6987    6035    59
Aeneid    36,675  6867  10,023  23,862  55,367  11,618  9606    33,055  30,576  907
Using NumPy, we create vectors of letter counts for each text of our corpus with the
array() class. The input is a list and the output is an array object as with:
import numpy as np
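The creation of the count vectors is not shown here; a minimal sketch, assuming the cnt_lists variable from Sect. 5.1, where the first two lists hold the Iliad and the Odyssey counts:
iliad_cnt = np.array(cnt_lists[0])    # counts for the Iliad
odyssey_cnt = np.array(cnt_lists[1])  # counts for the Odyssey
odyssey_cnt
# array([37627, 6595, 8580, ...])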
Unlike Python lists, all the elements of a NumPy array have the same
data type to optimize computations, for instance 64-bit integers, 32-bit floats, or
Booleans. This type is stored in the dtype attribute, here a 64-bit integer:
odyssey_cnt.dtype  # dtype('int64')
We optionally give it as an argument when we create the object:
np.array([1, 2, 3], dtype='float64')
# array([1.0, 2.0, 3.0])
Otherwise, NumPy will try to infer it from the numbers in the list:
np.array([1, 2, 3]).dtype
# dtype('int64')
There are many numerical datatypes in both NumPy and PyTorch and it is
impossible to describe them all here. The reader should refer to the documentation.
The shape attribute gives us the number of coordinates of the vector, also called the
dimension of the vector space, here 26, the number of letters in the alphabet:
odyssey_cnt.shape # (26,)
We access vector coordinates in NumPy, read or write, with their indices and
the slice notation just like for Python lists. Indices start at 0, following most
programming languages, and contrary to the mathematical convention to start at
1.
vector = np.array([1, 2, 3, 4])
vector[1] # 2
vector[:1] # array([1])
vector[1:3] # array([2, 3])
5.2.5 Operations
3 * np.array([1, 2, 3])
# array([3, 6, 9])
Using our dataset, we compute the character counts for Homer’s works, their
difference, etc.:
iliad_cnt + odyssey_cnt # array([88643, 15533, 20138, ...])
iliad_cnt - odyssey_cnt # array([13389, 2343, 2978, ...])
iliad_cnt - 2 * odyssey_cnt # array([-24238, -4252, ...])
Although the notations used in the addition and multiplication seem very intuitive,
we must make sure that we are applying them to the proper data structure. Should
we try to use them with Python lists, this would result in concatenations:
[1, 2, 3] + [4, 5, 6]
# [1, 2, 3, 4, 5, 6]
3 * [1, 2, 3]
# [1, 2, 3, 1, 2, 3, 1, 2, 3]
5.2.7 PyTorch
PyTorch’s syntax is very similar to that of NumPy. We only outline the main
differences here.
PyTorch Tensors
The equivalents of NumPy arrays are called tensors in PyTorch. As with NumPy,
we create them from lists, but this time with tensor():
import torch
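A minimal sketch of the creation, assuming the same count lists as with NumPy:
odyssey_cnt_pt = torch.tensor(cnt_lists[1])
odyssey_cnt_pt.dtype  # torch.int64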
Size in PyTorch
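As a sketch of what this covers (assuming the tensor created above), PyTorch exposes the number of coordinates both through the shape attribute and the size() method:
odyssey_cnt_pt.shape   # torch.Size([26])
odyssey_cnt_pt.size()  # torch.Size([26])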
Indices in PyTorch
The notation is the same as NumPy, except that scalars are also tensors.
vector = torch.tensor([1, 2, 3, 4])
vector[1] # tensor(2)
vector[:1] # tensor([1])
vector[1:3] # tensor([2, 3])
NumPy/PyTorch Conversion
We convert a tensor to a NumPy array with the numpy() method, where the two
variables share the same memory, and vice versa:
tensor = torch.tensor([1, 2, 3])
np_array = tensor.numpy()
# array([1, 2, 3])
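The reverse conversion can be sketched with torch.from_numpy(), which also shares the memory of the source array:
np_array = np.array([1, 2, 3])
tensor = torch.from_numpy(np_array)
# tensor([1, 2, 3])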
PyTorch Device
A significant difference between PyTorch and NumPy is that we can specify the
processor that will carry out the computation using the device argument. The
default device is a central processing unit (CPU), but computations are much faster
with a graphics processing unit (GPU). There are multiple GPU makes that have
different programming interfaces. The most popular is the Compute Unified Device
Architecture (CUDA), from the Nvidia company. This is the de-facto standard in
machine learning. Metal Performance Shaders (MPS) is a competitor from Apple.
We access the tensor device with the device attribute. By default, a tensor is
created on the CPU:
tensor = torch.tensor([1, 2, 3])
tensor.device
# device(type='cpu')
torch.backends.mps.is_available()
# True
We set a device with this code block, where we fall back on the CPU if a GPU is
not available:
if torch.cuda.is_available():
    device = torch.device('cuda')
elif torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cpu')
device  # device(type='mps')
We can also create a tensor directly on this device with the device argument:
tensor = torch.tensor([1, 2, 3], device=device)
# tensor([1, 2, 3], device='mps:0')
We move a tensor to a device with the to() method. With these statements, we create
a tensor on the CPU and we move it onto the GPU:
tensor = torch.tensor([1, 2, 3])  # on the CPU
tensor.to(device)                 # on the GPU if available
# tensor([1, 2, 3], device='mps:0')
What we have just done can be modelled in mathematics by a vector space. In this
section, we define the vocabulary and we specify a few concepts. A vector space is
a set of elements called vectors. A vector space has a fixed dimension, n, and the
vectors are represented as tuples of ℝⁿ, for example:
• If n = 2, ℝ² vectors such as (2, 3),
• If n = 3, ℝ³ vectors such as (1, 2, 3),
• And more generally, n-dimensional vectors in ℝⁿ: (1, 2, 3, 4, . . .).
In our example, we used 26-dimensional vectors to hold the character counts of the
texts arranged by alphabetical order.
An n-dimensional vector space has two operations called:
1. The addition of two vectors, denoted +;
2. The multiplication of a vector by a scalar (a real number), denoted by a dot “·”.
Let u and v be two vectors:
u = (u1, u2, . . . , un),
v = (v1, v2, . . . , vn).
Their addition is defined elementwise as u + v = (u1 + v1, u2 + v2, . . . , un + vn), and
the multiplication by a scalar λ as λ · u = (λu1, λu2, . . . , λun).
NumPy has all the elementary mathematical functions we need in this book. When
the input value is an array, the NumPy functions apply to each individual element.
We say the functions are vectorized. Examples of mathematical functions include
the square root and the cosine. We set a precision of three digits following the
decimal point:
np.set_printoptions(precision=3)
This is in contrast to the Python functions that only apply to individual numbers
(and not to lists of numbers). The statement:
math.sqrt(iliad_cnt) # TypeError
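The NumPy counterpart, by contrast, applies to every coordinate; a minimal illustration, assuming the count vectors created earlier:
np.sqrt(iliad_cnt)
# array([225.867, 94.541, 107.508, ...])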
Python’s sum() function returns the sum of the items in a list. NumPy will return
the sum of all the elements in the array:
np.sum(odyssey_cnt) # 472937
Using the sum, we can compute the relative frequencies of the letters, first from
the count vectors, either as a multiplication of a scalar by a vector, λv, where λ is
the inverse of the sum and v the vector of counts:
iliad_dist = (1/np.sum(iliad_cnt)) * iliad_cnt
odyssey_dist = (1/np.sum(odyssey_cnt)) * odyssey_cnt
or as a division:
iliad_cnt / np.sum(iliad_cnt)
# array([0.081, 0.014, 0.018, 0.045, ...])
odyssey_cnt / np.sum(odyssey_cnt)
# array([0.08 , 0.014, 0.018, 0.044, ...])
With the exception of np.vectorize(), PyTorch has similar functions that we skip
in this section.
Let us now apply the dot product to the two vectors of letter frequencies extracted from the
Iliad and the Odyssey:
np.dot(iliad_dist, odyssey_dist)
# 0.06581149298284382
Alternatively with @:
iliad_dist @ odyssey_dist
# 0.06581149298284382
The dot product of a vector by itself defines a metric in the Euclidean vector space.
Its square root is called the norm:
‖u‖ = √(u · u).
Finally, using the operators and functions from the previous sections, we can
compute the generalized cosine of the angle between two vectors. We will apply it
to numerical representations of documents or words. This is a property that we will
frequently use to quantify their similarity. Given two vectors u and v, the cosine of
their angle is defined as their dot product divided by the product of their norms:
cos(u, v) = (u · v) / (‖u‖ ‖v‖).
We now apply this equation to the two vectors of letter frequencies representing
the Iliad and the Odyssey and we obtain the cosine:
(iliad_dist @ odyssey_dist) / (
np.linalg.norm(iliad_dist) *
np.linalg.norm(odyssey_dist))
# 0.9990787113863588
A cosine of 1 corresponds to a zero angle. The 0.999 value means that the letter
distributions of these two works are very close.
In mathematics, the dot product is a bilinear form that maps two vectors to a scalar:
f : ℝⁿ × ℝⁿ → ℝ
f(u, v) = z.
Given two vectors
u = (u1, u2, . . . , un),
v = (v1, v2, . . . , vn),
it is defined as
u · v = Σᵢ uᵢvᵢ.
When adding such a product to a vector space, we obtain a Euclidean vector space.
The elementwise, or Hadamard, product of two vectors is defined as:
u ⊙ v = (u1 v1, u2 v2, . . . , un vn).
The Hadamard product is not part of the vector space, but it is a convenient
operation.
In NumPy and PyTorch, the elementwise product operator is *:
np.array([1, 2, 3]) * np.array([4, 5, 6])
# array([ 4, 10, 18])
5.5 Matrices
In statistics, data is traditionally arranged in tables, where the rows represent the
observations or the objects and the columns the values of a specific attribute. In
Table 5.1, each row is associated with a specific text or document and, given a row,
the columns show the letter counts. This is the dataset format we follow in this book.
Skipping the first column in Table 5.1 and the first row giving the names of the
texts and the letters, we have a rectangular table with a uniform numeric data type
called a matrix. When associated with a dataset, we usually denote it X:
X = ⎡ 51016  8938 11558 28331 77461 16114 12595 50192 38149 1624 ⎤
    ⎢ 37627  6595  8580 20736 59777 10449  9803 34785 28793  424 ⎥
    ⎢  2716   577   722  1440  4363   846   806  2508  2250   22 ⎥
    ⎢  6841  1618  2016  4027 12110  2424  2147  6987  6035   59 ⎥
    ⎣ 36675  6867 10023 23862 55367 11618  9606 33055 30576  907 ⎦
Similarly to vectors, we create a matrix from the list of lists of counts, this time
holding the whole dataset:
hv_cnts = np.array(cnt_lists)
As with vectors, the data type is inferred from the input numbers and is a 64-bit
integer:
hv_cnts.dtype  # dtype('int64')
and the shape, this time, gives us the size of the matrix, here 5 × 26, as we have 5
texts represented by the counts of 26 letters:
hv_cnts.shape # (5, 26)
As with vectors, we read or assign the elements in an array with their indices, for
instance:
iliad_cnt[2] # 11558
hv_cnts[1, 2] # 8580
To access an element in a vector, we need one index, for a matrix, we need two
indices. More generally, we characterize multidimensional arrays, or tensors, by the
number of indices. This number is called the order of the tensor, where each index
refers to an axis of the array. In NumPy, the axes are numbered from 0. For a matrix,
axis 0 corresponds to the vertical axis along the rows, while axis 1 is the horizontal
one along the columns.
Note that there is a frequent confusion between the order, the rank, and the
dimension. The order is often called (improperly) the rank. In NumPy, the number
of indices of an array is called the number of dimensions and is stored in the ndim
attribute:
odyssey_cnt.ndim # 1
hv_cnts.ndim # 2
As with vectors, we add two matrices of identical sizes, for example matrices of size
(2, 2):
A = ⎡ a11 a12 ⎤ ,  B = ⎡ b11 b12 ⎤ ,
    ⎣ a21 a22 ⎦       ⎣ b21 b22 ⎦
for instance:
⎡ 1 2 ⎤ + ⎡ 5 6 ⎤ = ⎡  6  8 ⎤ .
⎣ 3 4 ⎦   ⎣ 7 8 ⎦   ⎣ 10 12 ⎦
The multiplication of a matrix by a scalar λ multiplies each of its elements:
λA = ⎡ λa11 λa12 ⎤ ,
     ⎣ λa21 λa22 ⎦
for instance:
0.5 ⎡ 1 2 ⎤ = ⎡ 0.5 1 ⎤ .
    ⎣ 3 4 ⎦   ⎣ 1.5 2 ⎦
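The PyTorch statements below use a tensor hv_cnts_pt whose creation is not shown; a minimal sketch, assuming the same cnt_lists as for hv_cnts:
hv_cnts_pt = torch.tensor(cnt_lists)  # PyTorch counterpart of hv_cnts (assumption)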
In PyTorch, we obtain the size of a specific dimension with the dim argument of the size() method:
hv_cnts_pt.size(dim=0) # 5
hv_cnts_pt.size(dim=1) # 26
As with other Python indices, we can use negative numbers and -1 denotes the
last dimension:
hv_cnts_pt.size(dim=-1) # 26
Both NumPy and PyTorch provide functions to create matrices of size (m, n) with
specific initial values, such as
• Filled with zeros: np.zeros((m,n)) and torch.zeros((m,n));
• Filled with ones: np.ones((m,n)) and torch.ones((m,n));
• Filled with random values from a uniform distribution: np.random.rand(m, n)
and torch.rand(m, n);
• Filled with random values from a normal distribution: np.random.randn(m, n)
and torch.randn(m, n).
We create square identity matrices, with ones on the diagonal and zeros elsewhere,
with np.eye(n) and torch.eye(n).
As with vectors, we can apply NumPy and PyTorch functions such as the square
root or the cosine to all the elements of a matrix. We can also apply them selectively
to a specific dimension, for instance, to compute the sums by column to have the
counts of a letter in the whole corpus or by row to have the total number of letters in
a text. For this, we must specify the value of the axis argument in NumPy, 0 being
along the rows, and hence giving the sum of each column, and 1 along the columns,
and giving the sum of each row:
np.sum(hv_cnts) # 1705964
np.sum(hv_cnts, axis=0) # array([ 134875, 24595, 32899, ...])
np.sum(hv_cnts, axis=1) # array([629980, 472937, 36313, 96739,
469995])
The transpose of a matrix is an operation where we swap the indices of the elements.
The transpose of X = [xij] is denoted X⊤ = [yij], where yij = xji. We transpose a
matrix in NumPy or PyTorch by adding the T suffix:
hv_cnts.T # array([[51016, 37627, 2716, 6841, 36675],
[ 8938, 6595, 577, 1618, 6867],
[11558, 8580, 722, 2016, 10023],
...]])
Note that the transpose of a NumPy vector, a tensor of order 1, is the same vector.
The transpose of iliad_cnt:
iliad_cnt.T # array([51016, 8938, 11558,...])
The elements of our row or column vectors have now two indices and we can
transpose them. The transpose of a row vector is then a column vector and vice-
versa.
We create a row vector by wrapping it in a list:
np.array([iliad_cnt]) # array([[51016, 8938, 11558, ...]])
np.array([iliad_cnt]).shape # (1, 26)
We can also change the shape of a vector or a matrix using reshape with the new
shape arguments as in:
iliad_cnt.reshape(1, 26)
to create a row vector, or simply let NumPy guess a missing dimension with a −1:
iliad_cnt.reshape(1, -1) # array([[51016, 8938, 11558, ...]])
In PyTorch, the unsqueeze() function adds one dimension at the specified dim index.
For instance, with a dim value of 0, the element xᵢ becomes x₀,ᵢ:
torch.unsqueeze(torch.tensor([1, 2, 3]), 0)
# tensor([[1, 2, 3]])
5.5.10 Broadcasting
So far, we have seen that we can add or subtract two matrices of equal sizes and
multiply or divide a matrix by a scalar. NumPy defines additional rules to handle
the addition and elementwise multiplication of matrices of different sizes. The rules
automatically duplicate certain elements so that the matrices are of equal sizes and
then apply the operation. This is called broadcasting.
In this section, we will apply broadcasting to our dataset. While we will only use
a few rules, the rest follows the same principles and is quite intuitive. Some cases
can be tricky however, and we refer the reader to the complete documentation if
needed.
In Sect. 5.3, we used the scalar multiplication (or division) to compute the letter
distribution in a text. We will replicate this for the dataset. This will involve two
matrices: the counts of the letters in the texts stored in the hv_cnts matrix and a
column vector containing the sums of the letters of each text.
Using broadcasting, the result of the multiplication of a column vector by a
matrix is:
⎡ λ1 ⎤ ⎡ x1,1 x1,2 x1,3 x1,4 . . . ⎤   ⎡ λ1 x1,1 λ1 x1,2 λ1 x1,3 λ1 x1,4 . . . ⎤
⎢ λ2 ⎥ ⎢ x2,1 x2,2 x2,3 x2,4 . . . ⎥   ⎢ λ2 x2,1 λ2 x2,2 λ2 x2,3 λ2 x2,4 . . . ⎥
⎢ λ3 ⎥ ⎢ x3,1 x3,2 x3,3 x3,4 . . . ⎥ = ⎢ λ3 x3,1 λ3 x3,2 λ3 x3,3 λ3 x3,4 . . . ⎥ ,
⎢ λ4 ⎥ ⎢ x4,1 x4,2 x4,3 x4,4 . . . ⎥   ⎢ λ4 x4,1 λ4 x4,2 λ4 x4,3 λ4 x4,4 . . . ⎥
⎣ λ5 ⎦ ⎣ x5,1 x5,2 x5,3 x5,4 . . . ⎦   ⎣ λ5 x5,1 λ5 x5,2 λ5 x5,3 λ5 x5,4 . . . ⎦
where NumPy expands the vector by duplicating its elements so that it matches the
matrix size. It then applies the elementwise multiplications. The arithmetic operators
can be the addition, subtraction, or division.
We computed the sum of counts in the previous section, where NumPy returned
a row vector. To apply the division to the rows, we need to convert these sums to a
column vector:
np.array([np.sum(hv_cnts, axis=1)]).T # array([[629980],
[472937],
[ 36313],
[ 96739],
[469995]])
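The division producing the matrix of letter distributions, hv_dist, used below is not shown; a minimal sketch with broadcasting:
hv_dist = hv_cnts / np.array([np.sum(hv_cnts, axis=1)]).T
# each row of hv_dist now sums to 1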
for example:
hv_dist * hv_dist
# array([[6.558e-03, 2.013e-04, 3.366e-04, ...],
[6.330e-03, 1.945e-04, 3.291e-04, ...],
[5.594e-03, 2.525e-04, 3.953e-04, ...], ...]])
With NumPy, the matrix-vector multiplication, and more generally the multi-
plication of matrices, uses an infix operator @. For the Iliad and the Odyssey, this
yields:
hv_dist[0, :].reshape(1, -1) @ hv_dist[1, :]
# array([0.066])
This is equivalent to:
hv_dist[0, :] @ hv_dist[1, :]
We will now compute the cosine of all the pairs of vectors representing the works
in the hv_dist matrix, i.e. the rows of the matrix. For this, we will first compute the
dot products of all the pairs, x · y, then the norms ‖x‖ and ‖y‖, the products of the
norms, ‖x‖ · ‖y‖, and finally the cosines, (x · y)/(‖x‖ · ‖y‖).
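Step 1, the pairwise dot products, is not shown here; a plausible sketch is the product of hv_dist with its transpose:
hv_dot = hv_dist @ hv_dist.T  # 5 x 5 matrix of pairwise dot products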
2. For the vector norms, ‖x‖ and ‖y‖, we can use np.linalg.norm(). Here we
will break down the computation with elementary operations. We will apply the
Hadamard product, X ⊙ X, to have the square of the coordinates, then sum along
the rows, and finally extract the square root:
hv_norm = np.sqrt(np.sum(hv_dist * hv_dist, axis=1))
# array([0.257, 0.257, 0.253, 0.257, 0.255])
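Step 3, the matrix of the norm products for all the pairs, is not shown either; one way to sketch it is with an outer product:
hv_norm_pairs = np.outer(hv_norm, hv_norm)  # 5 x 5 matrix of norm products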
4. We are now nearly done with the cosines. We only need to divide the matrix
elements by the norm products, ‖x‖ · ‖y‖:
hv_cos = hv_dot / hv_norm_pairs
# array([[1. , 0.999, 0.997, 0.996, 0.995],
[0.999, 1. , 0.997, 0.995, 0.994],
[0.997, 0.997, 1. , 0.996, 0.995],
[0.996, 0.995, 0.996, 1. , 0.998],
[0.995, 0.994, 0.995, 0.998, 1. ]])
For all the pairs, we have cosines close to 1 and thus angles close to 0. This
indicates that the letter distributions are very similar. This is especially true for
Homer’s works.
A linear function maps ℝ to ℝ:
x ↦ f(x) = ax,
and an affine function adds a constant term:
x ↦ f(x) = ax + b.
Linear functions satisfy the two properties:
f(x + y) = f(x) + f(y),
f(λx) = λf(x).
We can extend linear functions to vectors, for example here with a linear combina-
tion of .R2 vector coordinates:
ℝ² → ℝ²
x ↦ f(x) = y
⎡ x1 ⎤ ↦ ⎡ a11 x1 + a12 x2 ⎤ = ⎡ y1 ⎤ ,
⎣ x2 ⎦   ⎣ a21 x1 + a22 x2 ⎦   ⎣ y2 ⎦
which we write in matrix notation as x ↦ Ax = y.
An affine transformation adds a constant vector b, the intercept:
x ↦ Ax + b
⎡ x1 ⎤ ↦ ⎡ a11 a12 ⎤ ⎡ x1 ⎤ + ⎡ b1 ⎤ .
⎣ x2 ⎦   ⎣ a21 a22 ⎦ ⎣ x2 ⎦   ⎣ b2 ⎦
The matrix-vector product Ax = y corresponds to a system of linear equations:
a11 x1 + a12 x2 = y1
a21 x1 + a22 x2 = y2
For instance:
1 × x1 + 2 × x2 = y1
3 × x1 + 4 × x2 = y2
so that
⎡ 1 2 ⎤ ⎡ 5 ⎤ = ⎡ 17 ⎤ .
⎣ 3 4 ⎦ ⎣ 6 ⎦   ⎣ 39 ⎦
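We can check this product with NumPy; the following lines are a sketch, and the matrix A defined here is also the one used in the transpose example below:
A = np.array([[1, 2],
              [3, 4]])
A @ np.array([5, 6])
# array([17, 39])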
5.7.4 Transpose
We have seen that the transpose of a matrix A = [ai,j] is defined as A⊤ = [aj,i].
This is the matrix flipped with regard to its diagonal.
The transpose of a product is the product of the transposes in reverse order:
$$(AB)^\top = B^\top A^\top.$$
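The matrix A used in the code below is not defined in this excerpt; from the A.T output that follows, it is presumably:

A = np.array([[1, 2],
              [3, 4]])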
B = np.array([[5, 6],
              [7, 8]])
A.T
# array([[1, 3],
#        [2, 4]])
(A @ B).T
# array([[19, 43],
#        [22, 50]])
B.T @ A.T
# array([[19, 43],
#        [22, 50]])
To finish this section, let us have a look at vector rotation. From algebra courses,
we know that we can use a matrix to compute a rotation of angle $\theta$. For a two-
dimensional vector, the rotation matrix is:
$$R_\theta = \begin{bmatrix} \cos\theta & -\sin\theta\\ \sin\theta & \cos\theta \end{bmatrix}.$$
The composition of two functions is defined as:
$$(f \circ g)(x) = f(g(x)),$$
and the matrix of a composition of linear maps is the product of their matrices:
$$M_{f\circ g} = M_f M_g.$$
As we have seen, we compute .M(f ◦g) by computing the product of .Mf with each
column of .Mg .
In NumPy and PyTorch, the matrix product operator is @.
M_fg = M_f @ M_g
The composition of two rotations is the product of the individual rotations:
here $R_{\pi/4} R_{\pi/6} = R_{5\pi/12}$. This product is just one line with NumPy:
theta_45 = np.pi / 4
rot_mat_45 = np.array([[np.cos(theta_45), -np.sin(theta_45)],
                       [np.sin(theta_45), np.cos(theta_45)]])
theta_30 = np.pi / 6
rot_mat_30 = np.array([[np.cos(theta_30), -np.sin(theta_30)],
                       [np.sin(theta_30), np.cos(theta_30)]])
rot_mat_45 @ rot_mat_30
# array([[ 0.259, -0.966],
#        [ 0.966,  0.259]])
The inverse of a function f, when it exists, is the function g such that $(g\circ f)(x) = x$.
It is denoted:
$$g = f^{-1}.$$
For matrices, the inverse $M^{-1}$ of a square matrix M satisfies:
$$MM^{-1} = M^{-1}M = I,$$
where I is the identity matrix.
# array([[1.00000000e+00, 0.00000000e+00],
#        [5.55111512e-17, 1.00000000e+00]])
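The instruction that produced the output above is not shown in this excerpt; a minimal sketch that yields a comparable result, assuming the rot_mat_45 matrix from the previous example:

np.linalg.inv(rot_mat_45) @ rot_mat_45
# the identity matrix, up to floating-point rounding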
Neural networks have become very popular in natural language processing, and
more generally in machine learning. While their inspiration seems to come from
brain neurons, the concrete computer implementation uses matrices. In this section,
we outline the link between neural networks and the content of this chapter. In
Chap. 8, Neural Networks, we will explore neural networks in more depth.
Figure 5.1 shows a simple neural network consisting of an input layer with
three numeric values or nodes, then a hidden layer of four nodes, a second hidden
layer of two nodes, and an output node. The information is passed from one layer
to the next through weighted connections. At a given layer, each node receives a weighted
sum of the nodes of the preceding layer. This is exactly a matrix-vector
multiplication, where we store the weights as matrix elements.
[Fig. 5.1: network diagram with input nodes $x_1, x_2, x_3$, a first hidden layer $a_1^{(1)}, \ldots, a_4^{(1)}$, a second hidden layer $a_1^{(2)}, a_2^{(2)}$, and an output node $a_1^{(3)}$ producing 1/0]

Denoting $\mathbf{x} = (x_1, x_2, x_3)$ the input vector and $\mathbf{a}^{(1)} = (a_1^{(1)}, a_2^{(1)}, a_3^{(1)}, a_4^{(1)})$ the
vector of the first hidden layer, we model the computation in the first step as an
affine transformation: $\mathbf{a}^{(1)} = W^{(1)}\mathbf{x} + \mathbf{b}^{(1)}$.
In neural networks, the matrix elements, $W^{(1)}$ in the first layer, are called the
weights, and the intercept, here $\mathbf{b}^{(1)}$, the bias. Note that in Fig. 5.1, we did not
include biases. In the rest of this section, we will set them aside to simplify the
presentation.
It is probably more intuitive to keep the same order as the data flow: start with $\mathbf{x}$, and then chain
$W^{(1)}$, $W^{(2)}$, and $W^{(3)}$; this is why neural networks transpose the matrices.
Representing the $\mathbf{x}$ input as a column vector (a one-column matrix) and the first
network layer by the matrix $W^{(1)}$, we have:
$$(W^{(1)}\mathbf{x})^\top = \mathbf{x}^\top W^{(1)\top}.$$
This row layout matches that of a tabular dataset, as we have seen in Sect. 4.3, where the observations (samples) are
arranged by rows. We denote the whole dataset X and we compute the matrix
product with the same order of operands: $XW^\top$. For the network in Fig. 5.1, we
have three matrices, yielding the chained product $XW^{(1)\top}W^{(2)\top}W^{(3)\top}$ (with the biases set aside).
We can now check that PyTorch follows this structure. We create a layer function
with the Linear class, giving the matrix dimensions as arguments. We disable the bias
for the sake of simplicity.
layer1 = torch.nn.Linear(3, 4, bias=False)
The weights have a random initialization and we access them with the weight
attribute:
layer1.weight
# tensor([[ 0.2472, -0.4360,  0.0955],
#         [-0.4775, -0.2369,  0.0147],
#         [ 0.2489,  0.3770,  0.2392],
#         [-0.1870, -0.0463, -0.2020]])
Note that these figures will differ from run to run.
We create an input vector:
x = torch.tensor([1.0, 2.0, 3.0])
and we check the output by passing x to the layer1() function, or by using a matrix
product, $W^{(1)}\mathbf{x}$ or $\mathbf{x}W^{(1)\top}$. As x is a one-dimensional tensor, we do not need to transpose it:
layer1(x)
# tensor([-0.3382, -0.9072, 1.7203, -0.8855])
layer1.weight @ x
# tensor([-0.3382, -0.9072, 1.7203, -0.8855])
x @ layer1.weight.T
# tensor([-0.3382, -0.9072, 1.7203, -0.8855])
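The layer2 and layer3 functions used below are not defined in this excerpt; a minimal sketch, assuming the layer sizes of Fig. 5.1 and no biases:

layer2 = torch.nn.Linear(4, 2, bias=False)
layer3 = torch.nn.Linear(2, 1, bias=False)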
We check that the output from the function composition is the same as that of
the matrix multiplication:
layer3(layer2(layer1(x)))
# tensor([0.3210])
A difference between PyTorch and NumPy is that PyTorch tensors, i.e. scalars,
vectors, or matrices here, store a record of the functions that transform them so that
PyTorch can automatically compute their gradient. Without it, we could not find the optimal
network parameters that solve a problem. We will develop this in Chap. 7, Linear
and Logistic Regression, but before that, let us examine how to extract gradients
from PyTorch tensors.
As an example, let us consider the surface:
$$z = x^2 + xy + y^2.$$
Its partial derivatives are:
$$\frac{\partial z}{\partial x} = 2x + y, \qquad \frac{\partial z}{\partial y} = x + 2y,$$
and the gradient at the point (3, 4) is:
$$\nabla f(3, 4) = (2\times 3 + 4,\; 3 + 2\times 4) = (10, 11).$$
Let us create two tensors to hold the input. We specify that we want to record the
functions applied to these tensors with requires_grad. This will enable PyTorch to
compute the gradients automatically.
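The code that creates the tensors is not reproduced in this excerpt; a minimal sketch consistent with the outputs shown below:

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)
z = x**2 + x*y + y**2   # the computation PyTorch will track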
PyTorch builds a graph of all the functions involved in the tensor computation. In
the neural network terminology, this is called the forward pass. We extract the last
function that created this variable with grad_fn as for z:
z.grad_fn
# <AddBackward0 at 0x1a2a66fb0>
From this function, we can traverse the computational graph from the end to the
inputs with next_functions, as for the last node:
z.grad_fn.next_functions
# ((<AddBackward0 at 0x1a2a65780>, 0),
#  (<PowBackward0 at 0x1a2a66cb0>, 0))
We run the backward pass with z.backward(), so that we can get the gradient values with respect to the input variables:
x.grad, y.grad
# (tensor(10.), tensor(11.))
We can visualize computational graphs from tensors with the torchviz module
and its make_dot() function:
from torchviz import make_dot
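The call itself is not shown in this excerpt; a minimal sketch of a typical use that produces a graph such as the one in Fig. 5.2 (the output file name is our choice):

# make_dot() returns a graphviz Digraph that we can render to a file.
make_dot(z).render('computational_graph', format='pdf')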
Linear algebra is a very large field and there are countless references in many
languages. Any introduction will probably suffice to go deeper into this topic, provided
that it includes a good section on matrices.
Concerning the programming parts, there are good introductory books on
NumPy, for instance the Guide to NumPy (Oliphant 2015) by its creator. Books on
PyTorch are not as numerous as those on NumPy. Deep Learning with PyTorch (Stevens
et al. 2020) is a thorough description of its features by developers who ported the
initial implementation of Torch from Lua to Python.
Fig. 5.2 PyTorch computational graph: The gray blocks in the first row from the top correspond to
$xy$ (MulBackward) and $x^2$ (PowBackward), in the second row to $y^2$ (PowBackward) and $xy + x^2$
(AddBackward), and in the third one to $y^2 + (xy + x^2)$ (AddBackward)
NumPy and especially PyTorch are constantly evolving. Books may be rapidly
outdated and the best place to find crucial implementation details is the online
documentation: numpy.org and pytorch.org. Both are excellent and up-to-date. In
addition, the PyTorch tutorials are also very good.
Graphics processing units (GPUs), which we introduced in Sect. 5.2.7, can considerably
accelerate mathematical computations. They made possible the development
of large-scale machine-learning models. Unfortunately, they are also very
expensive. In this book, we assumed that the reader only had access to a mid-range
laptop and we developed all the programs so that they can run on a CPU-only
machine. Readers who would like to know more about the GPU API can refer to the
PyTorch documentation, notably to that of the torch.cuda package.1
1 https://pytorch.org/docs/stable/notes/cuda.html.
Chapter 6
Topics in Information Theory
and Machine Learning
Information theory underlies the design of codes. Claude Shannon probably started
the field with a seminal article (1948), in which he defined a measure of information:
the entropy. In this section, we introduce essential concepts in information theory:
entropy, optimal coding, cross-entropy, and perplexity. Entropy is a very versatile
measure of the average information content of symbol sequences and we will
explore how it can help us design efficient encodings.
6.1.1 Entropy
The information content of a symbol $x_i$ with probability $P(x_i)$ is defined as:
$$I(x_i) = -\log_2 P(x_i) = \log_2\frac{1}{P(x_i)},$$
and it is measured in bits. When the symbols have equal probabilities, they are said
to be equiprobable and
$$P(x_1) = P(x_2) = \ldots = P(x_N) = \frac{1}{N}.$$
The information content assumes that the symbols have an equal probability.
This is rarely the case in reality. Therefore this measure can be improved using the
concept of entropy, the average information content, which is defined as:
$$H(X) = -\sum_{x\in X} P(x)\log_2 P(x),$$
and for any random variable, we have the inequality .H (X) ≤ log2 N.
To evaluate the entropy of printed French, we computed the frequency of the
printable French characters in Gustave Flaubert’s novel Salammbô. Table 6.1 shows
the frequency of 26 unaccented letters, the 16 accented or specific letters, and the
blanks (spaces).
The entropy of the text restricted to the characters in Table 6.1 is defined as:
$$\begin{aligned} H(X) &= -\sum_{x\in X} P(x)\log_2 P(x)\\
&= -P(A)\log_2 P(A) - P(B)\log_2 P(B) - P(C)\log_2 P(C) - \ldots\\
&\quad - P(Z)\log_2 P(Z) - P(\text{À})\log_2 P(\text{À}) - P(\text{Â})\log_2 P(\text{Â}) - \ldots\\
&\quad - P(\text{Ü})\log_2 P(\text{Ü}) - P(\text{Ÿ})\log_2 P(\text{Ÿ}) - P(\text{blanks})\log_2 P(\text{blanks}). \end{aligned}$$
Table 6.1 Letter frequencies in the French novel Salammbô by Gustave Flaubert. The text has been
normalized in uppercase letters. The table does not show the frequencies of the punctuation signs or
digits
Letter Frequency Letter Frequency Letter Frequency Letter Frequency
A 42,439 L 30,960 W 1 Ë 6
B 5757 M 13,090 X 2206 Î 277
C 14,202 N 32,911 Y 1232 Ï 66
D 18,907 O 22,647 Z 413 Ô 397
E 71,186 P 13,161 À 1884 Œ 96
F 4993 Q 3964 Â 605 Ù 179
G 5148 R 33,555 Æ 9 Û 213
H 5293 S 46,753 Ç 452 Ü 0
I 33,627 T 35,084 É 7709 Ÿ 0
J 1220 U 29,268 È 2002 Blanks 103,496
K 92 V 6916 Ê 898 Total: 593,314
Once the text is normalized, we compute the frequencies with the Counter class
as in Sect. 2.16.3 and we divide them by the total number of characters to get the
relative frequencies:
def rel_freqs(corpus: str) -> dict[str, float]:
counts = Counter(corpus)
total = counts.total()
return {key: val/total
for key, val in counts.items()}
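The normalize() and entropy() functions used below are not reproduced in this excerpt; minimal sketches consistent with how they are called (the normalization here only uppercases the text, whereas the book's version also handles the other normalizations described above):

import math
from collections import Counter

def normalize(corpus: str) -> str:
    # Convert to uppercase; a fuller version would also handle digits,
    # duplicate spaces, etc., to match the counts of Table 6.1.
    return corpus.upper()

def entropy(freqs: dict[str, float]) -> float:
    # Average information content in bits.
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)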
We apply these three functions to the Salammbô novel in French. We store the
text in the corpus string and we obtain:
>>> corpus = normalize(corpus)
>>> freqs = rel_freqs(corpus)
>>> entropy(freqs)
4.370
The information content of the French character set is less than the 7 bits required
by equiprobable symbols. Although it gives no clue about an encoding algorithm, it
indicates that a more efficient code is theoretically possible. This is what we examine
now with Huffman coding, which is a general and simple method to build such a
code.
Huffman coding uses variable-length code units. Let us simplify the problem and
use only the eight symbols A, B, C, D, E, F , G, and H with the count frequencies
in Table 6.2.
The information content of equiprobable symbols is .log2 8 = 3 bits. Table 6.3
shows a possible code with constant-length units.
The idea of Huffman coding is to encode frequent symbols using short code
values and rare ones using longer units. This was also the idea of the Morse code,
which assigns a single signal to letter E: ., and four signals to letter X: -..-.
The first step builds a Huffman tree using the frequency counts. The symbols
and their frequencies are the leaves of the tree. We grow the tree recursively from
the leaves to the root. We merge the two symbols with the lowest frequencies into
a new node that we annotate with the sum of their frequencies. In Fig. 6.1, this new
node corresponds to the letters F and G with a combined frequency of 4993 + 5148
= 10,141 (Fig. 6.2). The second iteration merges B and H (Fig. 6.3); the third one,
(F, G) and .(B, H ) (Fig. 6.4), and so on (Figs. 6.5, 6.6, 6.7, and 6.8).
The second step of the algorithm generates the Huffman code by assigning a 0 to
the left branches and a 1 to the right branches (Table 6.4).
The average number of bits per symbol of this code is then:
$$0.25\times 2 + 0.03\times 5 + 0.08\times 4 + 0.11\times 4 + 0.42\times 1 + 0.03\times 5 + 0.03\times 5 + 0.03\times 5 = 2.35 \text{ bits}.$$
We can compute the entropy from the counts in Table 6.2. It is defined by the
expression:
$$-\left(\frac{42439}{167925}\log_2\frac{42439}{167925} + \frac{5757}{167925}\log_2\frac{5757}{167925} + \frac{14202}{167925}\log_2\frac{14202}{167925} + \ldots\right) = 2.31 \text{ bits}.$$
We can see that although the Huffman code reduces the average number of bits
from 3 to 2.35, it does not reach the limit defined by entropy, which is, in our
example, 2.31.
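As a quick check, a small sketch that recomputes this entropy, assuming the eight counts of Table 6.2 are the A–H counts of Table 6.1 (they indeed sum to 167,925):

counts = {'A': 42439, 'B': 5757, 'C': 14202, 'D': 18907,
          'E': 71186, 'F': 4993, 'G': 5148, 'H': 5293}
total = sum(counts.values())
-sum(c/total * math.log2(c/total) for c in counts.values())
# about 2.31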
6.1.4 Cross-Entropy
Let us now compare the letter frequencies between two parts of Salammbô,
then between Salammbô and another text in French or in English. The symbol
probabilities will certainly be different. Intuitively, the distributions of two parts
of the same novel are likely to be close, further apart between Salammbô and
another French text from the twenty-first century, and even further apart with a
text in English. This is the idea of cross-entropy, which compares two probability
distributions.
In the cross-entropy formula, one distribution is referred to as the model. It
corresponds to data on which the probabilities have been trained. Let us name
it M with the distribution .M(x1 ), M(x2 ), . . . , M(xN ). The other distribution, P ,
corresponds to the test data: .P (x1 ), P (x2 ), . . . , P (xN ). The cross-entropy of M on
P is defined as:
$$H(P, M) = -\sum_{x\in X} P(x)\log_2 M(x).$$
The difference
$$D_{KL}(P||M) = H(P, M) - H(P)$$
is a measure of the relevance of the model: the closer the cross-entropy is to the entropy, the better
the model.
To see how the probability distribution of Flaubert’s novel could fare on other
texts, we trained a model on the first fourteen chapters of Salammbô, and we applied
it to the last chapter of Salammbô (Chap. 15), to Victor Hugo’s Notre Dame de Paris,
both in French, and to Nineteen Eighty-Four by George Orwell in English.
In the definition of cross-entropy, the sum is over the set X of symbols in
M and P . However some of our test texts have characters that are not in the
Salammbô training set. For instance, Notre Dame de Paris contains Greek letters,
while Salammbô has only Latin characters. For such Greek letters, .M(x) = 0 and, as
.− log2 0 is infinite, this would result in an infinite cross-entropy. In practical cases,
we should avoid this situation and find a way to deal with unknown symbols. We
will discuss techniques to smooth distributions in Chap. 10, Word Sequences.
In this experiment, we restricted X to the symbols occurring in the training set.
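A minimal sketch of such a cross-entropy computation, reusing the relative frequency dictionaries produced by rel_freqs() and restricting the sum to the symbols of the model (the function name is our own):

def cross_entropy(p_freqs: dict[str, float],
                  m_freqs: dict[str, float]) -> float:
    # H(P, M): P from the test text, M from the model (training text).
    return -sum(p * math.log2(m_freqs[sym])
                for sym, p in p_freqs.items() if sym in m_freqs)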
Table 6.5 The entropies are measured on the test sets and the cross-entropies are measured with
Chapters 1–14 of Gustave Flaubert’s Salammbô taken as the model
Train/Test Entropy $H(P)$ Cross-entropy $H(P, M)$ Difference $H(P, M) - H(P)$
Salammbô, chapters 1–14 Training set (M) 4.37168 4.37168 0.00000
Salammbô, chapter 15 Test set (P ) 4.31338 4.32544 0.01206
Notre Dame de Paris Test set (P ) 4.42285 4.44187 0.01889
Nineteen eighty-four Test set (P ) 4.34982 4.79617 0.44635
Table 6.6 The perplexity and cross-perplexity of texts measured with Chapters 1–14 of Gustave
Flaubert’s Salammbô taken as the model
Train/Test Perplexity Cross-perplexity
Salammbô, chapters 1–14 Training set 20.70 20.70
Salammbô, chapter 15 Test set 19.88 20.05
Notre Dame de Paris Test set 21.45 21.73
Nineteen eighty-four Test set 20.39 27.78
Using this simplification, the data in Table 6.5 conform to our intuition. They
show that the first chapters of Salammbô are a better model of the last chapter of
Salammbô than of Notre Dame de Paris, and even better than of Nineteen Eighty-
Four.
Decision trees are useful devices to classify objects into a set of classes. In this
section, we describe what they are and see how entropy can help us learn, or
induce, decision trees automatically from a set of data. The induction algorithm
learns the tree from a training set, here the examples in Table 6.7. Once the tree is induced, it will be able to predict
the class of examples taken outside the training set.
Machine-learning techniques make it possible to build models that classify data,
like annotated corpora, without the chore of manually explicating the rules behind
this organization or classification. Because of the availability of massive volumes of
data, they have become extremely popular in all the fields of language processing.
They are now instrumental in most NLP applications and tasks, including text
classification, part-of-speech tagging, group detection, named entity recognition, and
translation, which we will describe in the next chapters of this book.
A decision tree is a tool to classify objects such as those in Table 6.7. The nodes
of a tree represent conditions on the attributes of an object, and a node has as many
branches as its corresponding attribute has values. An object is presented at the root
of the tree, and the values of its attributes are tested by the tree nodes from the root
down to a leaf. The leaves return a decision, which is the object class or the probabilities
of belonging to a class.
Figure 6.9 shows a decision tree that correctly classifies all the objects in the set
shown in Table 6.7 (Quinlan 1986).
[Fig. 6.9 diagram: the root tests Outlook (P: 9, N: 5); the sunny branch tests Humidity (P: 2, N: 3) with leaves N: 3 and P: 2, the overcast branch is a leaf with P: 4, and the rain branch tests Windy (P: 3, N: 2) with leaves N: 2 and P: 3]
Fig. 6.9 A decision tree classifying the objects in Table 6.7. Each node represents an attribute
with the number of objects in the classes P and N . At the start of the process, the collection has
nine objects in class P and five in class N . The classification is done by testing the attribute values
of each object in the nodes until a leaf is reached, where all the objects belong to one class, P or
N . After Quinlan (1986)
It is possible to design many trees that successfully classify the objects in Table 6.7.
The tree in Fig. 6.9 is interesting because it is efficient: a decision can be made with
a minimal number of tests.
An efficient decision tree can be induced from a set of examples, members
of mutually exclusive classes, using an entropy measure. We will describe the
induction algorithm using two classes of p positive and n negative examples,
although it can be generalized to any number of classes. As we saw earlier, each
example is defined by a finite set of attributes, SA.
At the root of the tree, the condition, and hence the attribute, must be the most
discriminating, that is, have branches gathering most positive examples while others
gather negative examples. A perfect attribute for the root would create a partition
with subsets containing only positive or negative examples. The decision would
then be made with one single test. The ID3 (Quinlan 1986) algorithm uses this idea
and the entropy to select the best attribute to be this root. Once we have the root, the
initial set is split into subsets according to the branching conditions that correspond
to the values of the root attribute. Then, the algorithm determines recursively the
next attributes of the resulting nodes.
ID3 defines the information gain of an attribute as the difference of entropy
before and after the decision. It measures its separating power: the more the gain, the
better the attribute. At the root, the entropy of the collection is constant. As defined
previously (Sect. 6.1.1), for a two-class set .X = {P , N } of respectively p positive
and n negative examples, it is:
$$H(X) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}.$$
Figure 6.10 shows this binary entropy function with $x = \frac{p}{p+n}$, for x ranging
from 0 to 1. The function attains its maximum of 1 at x = 0.5, when $p = n$ and there
are as many positive as negative examples in the set, and its minimum of 0 at x = 0
and x = 1, when $p = 0$ or $n = 0$ and the examples in the set are either all positive
or all negative.
An attribute A with v possible values .{A1 , A2 , . . . , Av } creates a partition of the
collection into v subsets, where each subset .Xi corresponds to one value of A and
contains .pi positive and .ni negative examples. The entropy of a subset is .H (Xi ) and
the weighted average of entropies of the partition created by A is:
$$E(A) = \sum_{i=1}^{v}\frac{p_i + n_i}{p+n}H(X_i).$$
The information gain of A is then $Gain(A) = H(X) - E(A)$. A perfect root attribute would split the collection into
subsets containing examples that are either all positive or all negative. In this case,
the entropy of the nodes below the root would be 0.
For the tree in Fig. 6.9, let us compute the information gain of attribute Outlook.
The entropy of the complete dataset is (Table 6.7):
$$H(X) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940.$$
Outlook has three values: sunny, overcast, and rain. The respective subsets created
by these values consist of the objects {1, 2, 8, 9, 11}, {3, 7, 12, 13}, and {4, 5, 6, 10, 14}, with the entropies:
$$\begin{aligned} sunny&: H(X_1) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.971,\\
overcast&: H(X_2) = 0,\\
rain&: H(X_3) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} = 0.971. \end{aligned}$$
Thus
$$E(Outlook) = \frac{5}{14}H(X_1) + \frac{4}{14}H(X_2) + \frac{5}{14}H(X_3) = 0.694.$$
$Gain(Outlook)$ is then $0.940 - 0.694 = 0.246$, which is the highest of the
four attributes. $Gain(Temperature)$, $Gain(Humidity)$, and $Gain(Windy)$ are
computed similarly.
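A small sketch that reproduces these numbers (the class counts follow Table 6.7; the function and variable names are our own):

import math

def binary_entropy(pos: int, neg: int) -> float:
    # Entropy of a subset with pos positive and neg negative examples.
    h = 0.0
    for k in (pos, neg):
        if k:
            p = k / (pos + neg)
            h -= p * math.log2(p)
    return h

# Outlook partition of Table 6.7: (positive, negative) counts per value.
subsets = {'sunny': (2, 3), 'overcast': (4, 0), 'rain': (3, 2)}
total_p, total_n = 9, 5
e_outlook = sum((p + n) / (total_p + total_n) * binary_entropy(p, n)
                for p, n in subsets.values())
gain = binary_entropy(total_p, total_n) - e_outlook
print(round(gain, 3))  # about 0.246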
The algorithm to build the decision tree is simple. The information gain is
computed on the dataset for all the attributes, and the attribute A with the highest gain
is selected to be the root of the tree. The dataset is then split into v subsets
$\{N_1, \ldots, N_v\}$, where the value of A for the objects in $N_i$ is $A_i$, and for each subset,
a corresponding node is created below the root. This process is repeated recursively
for each node of the tree with the subset it contains until all the objects of the node
are either positive or negative. For a training set of N instances, each having M
attributes, Quinlan (1986) showed that ID3's complexity to generate a decision tree
is $O(NM)$.
ID3 handles categorical attributes only. In a sequel to ID3, Quinlan (1993, p. 25)
proposed a simple technique to deal with numerical, or continuous, attributes and
find binary partitions for the corresponding nodes. Each partition is defined by a
threshold, where the first part will be the values less than this threshold and the
second one, the values greater. For each continuous attribute, the outline of the
algorithm is:
1. First, sort the values observed in the dataset. There is a finite number of values:
$\{v_1, v_2, \ldots, v_m\}$;
2. Then, for all the pairs of consecutive values $v_i$ and $v_{i+1}$, compute the gain
of the resulting split, $\{v_1, v_2, \ldots, v_i\}$ and $\{v_{i+1}, \ldots, v_m\}$, and determine the
pair, $(v_{i_{max}}, v_{i_{max}+1})$, which maximizes it. As threshold value, Quinlan (1993)
proposes the midpoint:
$$\frac{v_{i_{max}} + v_{i_{max}+1}}{2}.$$
Table 6.8 A representation of the categorical values in Table 6.7 as numerical vectors
Object Outlook (Sunny, Overcast, Rain) Temperature (Hot, Mild, Cool) Humidity (High, Normal) Windy (True, False) Class
1 1 0 0 1 0 0 1 0 0 1 N
2 1 0 0 1 0 0 1 0 1 0 N
3 0 1 0 1 0 0 1 0 0 1 P
4 0 0 1 0 1 0 1 0 0 1 P
5 0 0 1 0 0 1 0 1 0 1 P
6 0 0 1 0 0 1 0 1 1 0 N
7 0 1 0 0 0 1 0 1 1 0 P
8 1 0 0 0 1 0 1 0 0 1 N
9 1 0 0 0 0 1 0 1 0 1 P
10 0 0 1 0 1 0 0 1 0 1 P
11 1 0 0 0 1 0 0 1 1 0 P
12 0 1 0 0 1 0 1 0 1 0 P
13 0 1 0 1 0 0 0 1 0 1 P
14 0 0 1 0 1 0 1 0 1 0 N
The classical way to do this is to represent each attribute domain—the set of the
allowed or observed values of an attribute—as a vector of binary digits (Suits 1957).
Let us exemplify this with the Outlook attribute in Table 6.7:
• Outlook has three possible values: .{sunny, overcast, rain}. Its numerical
representation is then a three-dimensional vector, .(x1 , x2 , x3 ), whose axes are
tied respectively to sunny, overcast, and rain.
• To reflect the value of the attribute, we set the corresponding coordinate to 1
and the others to 0. This corresponds to a unit vector. Using the examples in
Table 6.7, the name–value pair .[Outlook = sunny] will be encoded as .(1, 0, 0),
.[Outlook = overcast] as .(0, 1, 0), and .[Outlook = rain] as .(0, 0, 1).
For a given attribute, the dimension of the vector will then be defined by the
number of its possible values, and each vector coordinate will be tied to one of the
possible values of the attribute.
So far, we have one unit vector for each attribute. To represent a complete object,
we will finally concatenate all these vectors into a larger one characterizing this
object. Table 6.8 shows the complete conversion of the dataset using vectors of
binary values.
This type of encoding is generally called one-hot encoding. This technique has
also the names dummy variables or indicator function.
If an attribute has from the beginning a numerical value, it does not need to be
converted.
We need first to convert this dataset into binary vectors as we saw in Sect. 6.3.
This can be done with the DictVectorizer class that transforms lists of dictionaries
representing the observations into vectors.
Let us first read the dataset assuming the file consists of values separated by
commas (CSV). As in Sect. 4.3, we use the csv module library and DictReader()
that creates a list of dictionaries from the dataset. The column names are given in
the fieldnames parameter:
import csv
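The reading code itself is not reproduced in this excerpt; a sketch continuing from the import above (the file name is our assumption and the column names follow Table 6.7):

with open('weather_dataset.csv') as f:
    reader = csv.DictReader(
        f, fieldnames=['outlook', 'temperature', 'humidity', 'windy', 'class'])
    dataset = list(reader)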
We create the X_dict and y_symbols tables from dataset, where we use a deep
copy to preserve the dataset:
import copy
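Again, the excerpt only shows the import; a sketch of the two tables, assuming the key names of the previous sketch:

X_dict = copy.deepcopy(dataset)
# Remove the class from each observation and keep it as the output symbol.
y_symbols = [obs.pop('class') for obs in X_dict]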
Now we have all our data in two separate tables. We need to convert X_dict into
a numeric matrix and we use DictVectorizer to carry this out:
from sklearn.feature_extraction import DictVectorizer
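The vectorization call is not shown either; a sketch that produces the matrix displayed below (sparse=False to obtain a dense array; the variable names are our own):

dict_vectorizer = DictVectorizer(sparse=False)
X = dict_vectorizer.fit_transform(X_dict)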
that returns:
array([[ 1., 0., 0., 0., 1., 0., 1., 0., 1., 0.],
[ 1., 0., 0., 0., 1., 0., 1., 0., 0., 1.],
[ 1., 0., 1., 0., 0., 0., 1., 0., 1., 0.],
[ 1., 0., 0., 1., 0., 0., 0., 1., 1., 0.],
[ 0., 1., 0., 1., 0., 1., 0., 0., 1., 0.],
[ 0., 1., 0., 1., 0., 1., 0., 0., 0., 1.],
[ 0., 1., 1., 0., 0., 1., 0., 0., 0., 1.],
[ 1., 0., 0., 0., 1., 0., 0., 1., 1., 0.],
[ 0., 1., 0., 0., 1., 1., 0., 0., 1., 0.],
[ 0., 1., 0., 1., 0., 0., 0., 1., 1., 0.],
[ 0., 1., 0., 0., 1., 0., 0., 1., 0., 1.],
[ 1., 0., 1., 0., 0., 0., 0., 1., 0., 1.],
[ 0., 1., 1., 0., 0., 0., 1., 0., 1., 0.],
[ 1., 0., 0., 1., 0., 0., 0., 1., 0., 1.]])
something very similar to Table 6.8. We set the sparse parameter to False to be able
to visualize the matrix. In most cases, it should be set to True to save memory space.
We can then use any scikit-learn classifier to train a model and predict classes.
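The classifier creation and training are not shown in this excerpt; a sketch with a decision tree, as suggested by the reference to its criterion parameter in Sect. 6.4.2 (the variable names are our own):

from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier()
classifier.fit(X, y_symbols)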
Once the model is trained, we can apply it to new observations. The next
instruction just reapplies it to the training set:
y_predicted = classifier.predict(X)
In our previous experiment, we trained and tested a model on the same dataset. This
is not a good practice: A model that would just memorize the data would also reach a
perfect accuracy. As we want to be able to generalize beyond the training set to new
data, the standard evaluation procedure is to estimate the performance on a distinct
and unseen test set. When we have only one set, as in Table 7.1, we can divide it into
two subsets: the training set and the test set (or holdout data). The split can be 90%
of the data for the training set and 10% for the test set, or 80–20.
In addition to the training and test sets, we often need a validation set also
called a development set. Creating a model usually requires many training runs
with different parameters and as many evaluations. As we saw, we should use the
test set only once, in the final evaluation, otherwise the risk is to select algorithm
parameters, for instance the criterion in the decision tree of Sect. 6.4.2, that are
optimal for this test set. The validation set is an auxiliary test set that will help
us select a model without touching the test set.
In the case of a small dataset like ours, a specific test set may lead to results
that would be quite different with another test set. Cross validation, or N -fold cross
validation, is a technique to mitigate such a bias. Instead of using one single test set,
cross validation uses multiple splits called the folds. In a fivefold cross validation,
the evaluation is carried out on five different test sets randomly sampled from the
dataset. In each fold, the rest of the dataset serves as training set. Figure 6.11 shows
a fivefold cross validation process, where we partitioned the dataset into five equal
size subsets. In each fold, one of the subsets is the test set (red) and the rest, the
training set (gray). The evaluation is then repeated five times with different test sets
and the final result is the mean of the results of the five different folds.
The number of folds depends on the size of the dataset and the computing
resources at hand: 5 and 10 being frequent values. At the extreme, a leave-one-
out cross-validation has as many folds as there are observations. At each fold, the
training set consists of all the observations except one, which is used as test set.
scikit-learn has built-in cross validation functions. The code below shows an
example of it with a fivefold cross validation and a score corresponding to the
accuracy:
from sklearn.model_selection import cross_val_score
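The call is not reproduced in this excerpt; a sketch continuing from the import above:

scores = cross_val_score(classifier, X, y_symbols, cv=5, scoring='accuracy')
print(scores.mean())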
In this run, we obtained a score of 0.7. Note that this figure may vary depending on
the way scikit-learn splits the dataset.
Accuracy, the proportion of correctly classified observations, although very intuitive,
is not always a good way to assess the performance of a model. Think, for
example, of a dataset consisting of 99% positive observations and of 1% negative
ones. A lazy classifier constantly choosing the positive class would have a 99%
accuracy. We will see in Chap. 7, Linear and Logistic Regression, more elaborate
metrics to evaluate classification results.
Chapter 7
Linear and Logistic Regression
J’ay même trouvé une chose estonnante, c’est qu’on peut representer par les Nombres, toutes
sortes de verités et consequences. [ . . . ] tous les raisonnemens se pourroient determiner
à la façon des nombres, et mêmes à l’egard de ceux où les circonstances données, ou
data, ne suffisent pas à la determination de la question, on pourroit neantmoins determiner
mathematiquement le degré de la probabilité.
“I even found an astonishing thing, it is that we can represent by the numbers, all kinds of
truths and consequences [ . . . ] all reasonings could be determined in the manner of numbers,
and even with regard to those where the given circumstances, or data, are not sufficient for the
determination of the question, one could nevertheless determine mathematically the degree of
the probability.”
G. W. Leibniz, Projet et essais pour avancer l’art d’inventer, 1688–1690, edition A VI 4
A, pp. 963–964. Translation: Google translate.
In this chapter, we will go on with the description of linear regression and linear
classifiers among the most popular ones: the perceptron and logistic regression. We
will notably outline the mathematical background and notation we need to use these
techniques properly. In addition to being used as a stand-alone technique, logistic
regression is a core component of most modern neural networks.
Decision trees are simple and efficient devices to design classifiers. Together with
the information gain, they enabled us to induce optimal trees from a set of examples
and to deal with nominal values such as sunny, hot, and high.
Linear classifiers are another set of techniques that have the same purpose. As
with decision trees, they produce a function splitting a set of objects into two or more
classes. This time, however, the objects will be represented by a vector of numerical
parameters. In the next sections, we examine linear classification methods in an n-
dimensional space, where the dimension of the vector space is equal to the number
of parameters used to characterize the objects.
Table 7.1 The frequency of A in the chapters of Salammbô in English and French. Letters have
been normalized in uppercase and duplicate spaces removed
French English
Chapter # Characters #A # Characters #A
Chapter 1 36,961 2503 35,680 2217
Chapter 2 43,621 2992 42,514 2761
Chapter 3 15,694 1042 15,162 990
Chapter 4 36,231 2487 35,298 2274
Chapter 5 29,945 2014 29,800 1865
Chapter 6 40,588 2805 40,255 2606
Chapter 7 75,255 5062 74,532 4805
Chapter 8 37,709 2643 37,464 2396
Chapter 9 30,899 2126 31,030 1993
Chapter 10 25,486 1784 24,843 1627
Chapter 11 37,497 2641 36,172 2375
Chapter 12 40,398 2766 39,552 2560
Chapter 13 74,105 5047 72,545 4597
Chapter 14 76,725 5312 75,352 4871
Chapter 15 18,317 1215 18,031 1119
Total 619,431 42,439 608,230 39,056
Before we try to discriminate between French and English, let us examine how we
can model the distribution of the letters in one language.
Figure 7.1 shows the plot of data in Table 7.1, where each point represents the
letter counts in one of the 15 chapters. The x-axis corresponds to the total count of
letters in the chapter, and the y-axis, the count of As. We can see from the figure that
the points in both languages can be fitted quite precisely to two straight lines. This
fitting process is called a linear regression, where a line equation is given by:
$$y = mx + b.$$
Fig. 7.1 Plot of the frequencies of A, y, versus the total character counts, x, in the 15 chapters of
Salammbô. Squares correspond to the English version and triangles to the French original
Fig. 7.2 Plot of .SSE(m, b) applied to the 15 chapters of the English version of Salammbô
The least squares method is probably the most common technique used to model
the fitting error and estimate m and b. This error, or loss function, is defined as the
sum of the squared errors (SSE) over all the q points (Legendre 1805):
$$SSE(m, b) = \sum_{i=1}^{q}(f(x_i) - y_i)^2 = \sum_{i=1}^{q}(mx_i + b - y_i)^2.$$
Ideally, all the points would be aligned and this sum would be zero. This is rarely
the case in practice, and we fall back to an approximation that minimizes it.
Figure 7.2 shows the plot of .SSE(m, b) applied to the 15 chapters of the
English version of Salammbô. Using a logarithmic scale, the surface shows a visible
minimum somewhere between 0.6 and 0.8 for m and close to 0 for b. Let us now
compute precisely these values.
We know from differential calculus that we reach the minimum of .SSE(m, b)
when its partial derivatives over m and b are zero:
$$\frac{\partial SSE(m, b)}{\partial m} = \frac{\partial}{\partial m}\sum_{i=1}^{q}(mx_i + b - y_i)^2 = 2\sum_{i=1}^{q} x_i(mx_i + b - y_i) = 0,$$
$$\frac{\partial SSE(m, b)}{\partial b} = \frac{\partial}{\partial b}\sum_{i=1}^{q}(mx_i + b - y_i)^2 = 2\sum_{i=1}^{q}(mx_i + b - y_i) = 0.$$
We then obtain:
$$m = \frac{\sum_{i=1}^{q} x_i y_i - q\bar{x}\bar{y}}{\sum_{i=1}^{q} x_i^2 - q\bar{x}^2} \quad\text{and}\quad b = \bar{y} - m\bar{x},$$
with
$$\bar{x} = \frac{1}{q}\sum_{i=1}^{q} x_i \quad\text{and}\quad \bar{y} = \frac{1}{q}\sum_{i=1}^{q} y_i.$$
Using these formulas, we find the two regression lines for French and English:
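The line equations themselves are not reproduced in this excerpt; a short sketch that computes the least-squares coefficients from the counts of Table 7.1 with NumPy (np.polyfit is our stand-in here, equivalent to the closed-form formulas above):

import numpy as np

# Character counts and counts of A per chapter (Table 7.1).
x_fr = np.array([36961, 43621, 15694, 36231, 29945, 40588, 75255,
                 37709, 30899, 25486, 37497, 40398, 74105, 76725, 18317])
y_fr = np.array([2503, 2992, 1042, 2487, 2014, 2805, 5062,
                 2643, 2126, 1784, 2641, 2766, 5047, 5312, 1215])
x_en = np.array([35680, 42514, 15162, 35298, 29800, 40255, 74532,
                 37464, 31030, 24843, 36172, 39552, 72545, 75352, 18031])
y_en = np.array([2217, 2761, 990, 2274, 1865, 2606, 4805,
                 2396, 1993, 1627, 2375, 2560, 4597, 4871, 1119])

m_fr, b_fr = np.polyfit(x_fr, y_fr, deg=1)  # least-squares fit of degree 1
m_en, b_en = np.polyfit(x_en, y_en, deg=1)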
As an alternative loss function to the least squares, we can minimize the sum of the
absolute errors (SAE) (Boscovich 1770, Livre V, note):
$$SAE(m, b) = \sum_{i=1}^{q}|f(x_i) - y_i|.$$
The corresponding minimum value is called the least absolute deviation (LAD).
Solving methods to find this minimum use linear programming. Their description
falls outside the scope of this book.
Up to now, we have formulated the regression problem with two parameters: the
letter count and the count of As. In most practical cases, we will have a much larger
set. To describe algorithms applicable to any number of parameters, we need to
extend our notation to a general n-dimensional space. Let us introduce it now.
In an n-dimensional space, it is probably easier to describe linear regression as a
prediction technique: given input parameters in the form of a feature vector, predict
the output value. In the Salammbô example, the input would be the number of letters
in a chapter, and the output, the number of As.
$$y = \mathbf{w}\cdot\mathbf{x} = \sum_{i=0}^{n-1} w_i x_i,$$
and, for the whole dataset in matrix form:
$$\hat{\mathbf{y}} = X\mathbf{w}.$$
For the French dataset, the complete matrix and vectors are:
$$X = \begin{bmatrix} 1 & 36961\\ 1 & 43621\\ 1 & 15694\\ 1 & 36231\\ 1 & 29945\\ 1 & 40588\\ 1 & 75255\\ 1 & 37709\\ 1 & 30899\\ 1 & 25486\\ 1 & 37497\\ 1 & 40398\\ 1 & 74105\\ 1 & 76725\\ 1 & 18317 \end{bmatrix};\quad
\mathbf{w} = \begin{bmatrix} 8.7253\\ 0.0683 \end{bmatrix};\quad
\hat{\mathbf{y}} = \begin{bmatrix} 2533.22\\ 2988.11\\ 1080.65\\ 2483.36\\ 2054.02\\ 2780.95\\ 5148.76\\ 2584.31\\ 2119.18\\ 1749.46\\ 2569.83\\ 2767.97\\ 5070.21\\ 5249.16\\ 1259.81 \end{bmatrix};\quad
\mathbf{y} = \begin{bmatrix} 2503\\ 2992\\ 1042\\ 2487\\ 2014\\ 2805\\ 5062\\ 2643\\ 2126\\ 1784\\ 2641\\ 2766\\ 5047\\ 5312\\ 1215 \end{bmatrix};$$
and the squared errors, $(\hat{y}_i - y_i)^2$, are:
$$\mathbf{se} = \begin{bmatrix} 913.26\\ 15.14\\ 1493.86\\ 13.25\\ 1601.31\\ 578.40\\ 7527.51\\ 3444.53\\ 46.57\\ 1193.04\\ 5065.18\\ 3.8920\\ 538.909\\ 3948.29\\ 2007.53 \end{bmatrix}.$$
Using partial derivatives, we have been able to find an analytical solution to the
regression line. We will now introduce the gradient descent, a generic optimization
method that uses a series of successive approximations instead. We will apply this
technique to solve the least squares as well as the classification problems we will
see in the next sections.
$$y = f(w_0, w_1, w_2, \ldots, w_n) = f(\mathbf{w}),$$
and the descent builds a sequence of weight vectors that decreases the value of the function at each step:
$$f(\mathbf{w}_1) > f(\mathbf{w}_2) > \ldots > f(\mathbf{w}_k) > f(\mathbf{w}_{k+1}) > \ldots > \min.$$
Now given a point, an initial weight vector, .w, how can we find the next point of
the iteration? The steps in the gradient descent are usually small and we can define
the points in the neighborhood of .w by .w + v, where .v is a vector of .Rn and .||v||
is small. So the problem of gradient descent can be reformulated as: given $\mathbf{w}$, find $\mathbf{v}$
subject to $f(\mathbf{w}) > f(\mathbf{w} + \mathbf{v})$.
As .||v|| is small, we can approximate .f (w + v) using a Taylor expansion limited
to the first derivatives:
$$f(\mathbf{w} + \mathbf{v}) \approx f(\mathbf{w}) + \mathbf{v}\cdot\nabla f(\mathbf{w}),$$
where the gradient is the vector of the partial derivatives:
$$\nabla f(w_0, w_1, w_2, \ldots, w_n) = \left(\frac{\partial f}{\partial w_0}, \frac{\partial f}{\partial w_1}, \frac{\partial f}{\partial w_2}, \ldots, \frac{\partial f}{\partial w_n}\right).$$
To decrease f, we choose $\mathbf{v}$ in the direction opposite to the gradient, $\mathbf{v} = -\alpha\nabla f(\mathbf{w})$. We have then:
$$\mathbf{w}_{k+1} = \mathbf{w}_k - \alpha_k\nabla f(\mathbf{w}_k)$$
and find a step sequence to the minimum, where .αk is a small positive number called
the step size or learning rate. This number can be constant over all the descent or
change at each step. The descent converges when the gradient becomes zero. This
convergence is generally faster if the learning rate decreases or adapts to the gradient
values over the iterations. In practice, we terminate the descent when .||∇f (w)|| is
less than a predefined threshold, or is not decreasing, or has reached a maximum
number of iterations.
Symmetrically to the steepest descent, we have a steepest ascent when
$\mathbf{v} = \alpha\nabla f(\mathbf{w})$.
When the function represents a loss that we will minimize, as with the sum of
squared errors, we will denote it .L(w) or .Loss(w) and apply a descent. When it is
a quantity to maximize, we will denote it .𝓁(w) and apply an ascent.
For a dataset, DS, we find the minimum of the sum of squared errors and the
coefficients of the regression equation through a walk down the surface using the
recurrence relation above. Let us compute the gradient in a two-dimensional space
first and then generalize it to multidimensional space.
In a Two-Dimensional Space
To make the generalization easier, let us rename the straight line coefficients .(b, m)
in .y = mx + b as .(w0 , w1 ). We want then to find the regression line:
$$\hat{y} = w_0 + w_1 x_1,$$
given a dataset DS of q examples: .DS = {(1, xi,1 , yi )|i : 1..q}, where the error is
defined as:
$$SSE(w_0, w_1) = \sum_{i=1}^{q}(\hat{y}_i - y_i)^2 = \sum_{i=1}^{q}(w_0 + w_1 x_{i,1} - y_i)^2.$$
Its partial derivatives are:
$$\frac{\partial SSE(w_0, w_1)}{\partial w_0} = 2\sum_{i=1}^{q}(w_0 + w_1 x_{i,1} - y_i),$$
$$\frac{\partial SSE(w_0, w_1)}{\partial w_1} = 2\sum_{i=1}^{q} x_{i,1}(w_0 + w_1 x_{i,1} - y_i).$$
From this gradient, we can now compute the iteration step. With q examples and
a learning rate of $\frac{\alpha}{2q}$, inversely proportional to the number of examples, we have:
$$w_0 \leftarrow w_0 - \frac{\alpha}{q}\sum_{i=1}^{q}(w_0 + w_1 x_{i,1} - y_i),$$
$$w_1 \leftarrow w_1 - \frac{\alpha}{q}\sum_{i=1}^{q} x_{i,1}(w_0 + w_1 x_{i,1} - y_i).$$
In the iteration above, we compute the gradient as a sum over all the examples
before we carry out one update of the weights. This technique is called the batch
gradient descent. An alternate technique is to go through DS and compute an
update with each example:
$$w_0 \leftarrow w_0 - \alpha(w_0 + w_1 x_{i,1} - y_i),$$
$$w_1 \leftarrow w_1 - \alpha\, x_{i,1}(w_0 + w_1 x_{i,1} - y_i).$$
The examples are usually selected randomly from DS. This is called the stochastic
gradient descent or online learning.
For large datasets, a batch gradient descent would be impractical as it would
take too much memory. The stochastic variant does not have this limitation and often
converges faster; however, it is more unstable. To make the convergence more
regular, most modern machine-learning toolkits use minibatches instead, where they
compute the gradient by small subsets of 4 to 256 inputs. This technique is called
the minibatch gradient descent.
The duration of the descent is measured in epochs, where an epoch is the period
corresponding to one iteration over the complete dataset: the q examples.
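As an illustration, a compact sketch of the batch update rule above on synthetic data (the function and variable names are our own; the features should be scaled, as discussed in Sect. 8.5.1, for the descent to be stable on raw counts):

import numpy as np

def batch_gradient_descent(x, y, alpha=0.5, epochs=1000):
    # Fit y ~ w0 + w1*x with the batch update rule of this section.
    q = len(x)
    w0, w1 = 0.0, 0.0
    for _ in range(epochs):
        y_hat = w0 + w1 * x
        w0 -= alpha / q * np.sum(y_hat - y)
        w1 -= alpha / q * np.sum(x * (y_hat - y))
    return w0, w1

# Synthetic example: points close to y = 2x + 1.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
y = 2 * x + 1 + rng.normal(0, 0.05, 50)
print(batch_gradient_descent(x, y))   # approximately (1.0, 2.0)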
N-Dimensional Space
$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_n x_n,$$
given a dataset DS of q examples: .DS = {(1, xi,1 , xi,2 , . . . , xi,n , yi )|i : 1..q},
where the error is defined as:
$$SSE(w_0, w_1, \ldots, w_n) = \sum_{i=1}^{q}(\hat{y}_i - y_i)^2 = \sum_{i=1}^{q}(w_0 + w_1 x_{i,1} + w_2 x_{i,2} + \ldots + w_n x_{i,n} - y_i)^2.$$
Setting $x_{i,0} = 1$, we can rewrite this as:
$$SSE(w_0, w_1, \ldots, w_n) = \sum_{i=1}^{q}(w_0 x_{i,0} + w_1 x_{i,1} + w_2 x_{i,2} + \ldots + w_n x_{i,n} - y_i)^2.$$
The partial derivative with respect to $w_j$ is:
$$\frac{\partial SSE}{\partial w_j} = 2\sum_{i=1}^{q} x_{i,j}(w_0 x_{i,0} + w_1 x_{i,1} + w_2 x_{i,2} + \ldots + w_n x_{i,n} - y_i).$$
In the batch version, the iteration step considers all the examples in DS:
$$w_j \leftarrow w_j - \frac{\alpha}{q}\sum_{i=1}^{q} x_{i,j}(w_0 x_{i,0} + w_1 x_{i,1} + w_2 x_{i,2} + \ldots + w_n x_{i,n} - y_i).$$
In the stochastic version, the vector form of the update for one example is:
$$\mathbf{w} \leftarrow \mathbf{w} - \alpha(\hat{y}_i - y_i)\,\mathbf{x}_i,$$
where $\hat{y}_i = \mathbf{x}_i\cdot\mathbf{w}$.
7.6 Regularization
Using the matrix formulation in Sect. 7.3.1, linear regression consists in finding the weight vector
$\mathbf{w}$ that minimizes the sum of squared errors between $X\mathbf{w}$ and $\mathbf{y}$. Ideally, we would solve:
$$X\mathbf{w} = \mathbf{y}.$$
If X were square and invertible, the solution would simply be:
$$\mathbf{w} = X^{-1}\mathbf{y}.$$
In general, X is not square and has no inverse. Multiplying both sides by the pseudoinverse
$$(X^\top X)^{-1}X^\top$$
yields the least-squares solution:
$$\mathbf{w} = (X^\top X)^{-1}X^\top\mathbf{y}.$$
7.6.2 Inverting X ⊺ X
The matrix $X^\top X$ is not always invertible, or its inversion can be numerically unstable. A common remedy is to add a small positive term $\lambda$ to its diagonal:
$$\mathbf{w} = (X^\top X + \lambda I)^{-1}X^\top\mathbf{y}.$$
This operation is called a regularization and is equivalent to adding the term $\lambda||\mathbf{w}||^2$
to the sum of squared errors (SSE). It is also used in classification.
7.6.3 Regularization
A regularized loss adds the penalty term $\lambda L_q$ to the original loss, where
$$L_q = \sum_{i=1}^{n}|w_i|^q.$$
The most frequent choices are
$$L_2 = \sum_{i=1}^{n} w_i^2 \quad\text{and}\quad L_1 = \sum_{i=1}^{n}|w_i|.$$
7.7.1 An Example
We will now use the dataset in Table 7.1 to describe classification techniques that
split the texts into French or English. If we examine it closely, Fig. 7.1 shows that we
can draw a straight line between the two regression lines to separate the two classes.
This is the idea of linear classification. From a data representation in a Euclidean
space, classification will consist in finding a line:
$$w_0 + w_1 x + w_2 y = 0$$
that splits the points into those for which
$$w_0 + w_1 x + w_2 y > 0$$
and those for which
$$w_0 + w_1 x + w_2 y < 0.$$
These inequalities mean that the points belonging to one class of the dataset are on
one side of the separating line and the others are on the other side.
In Table 7.1 and Fig. 7.1, the chapters in French have a steeper slope than the
corresponding ones in English. The points representing the French chapters will
then be above the separating line. Let us write the inequalities that reflect this and
set $w_2$ to 1 to normalize them. The line we are looking for will have the property
that every French chapter lies above it and every English chapter below it,
where x is the total count of letters in a chapter and y, the count of As. In total,
we will have 30 inequalities, 15 for French and 15 for English shown in Table 7.2.
Any weight vector .w = (w0 , w1 ) that satisfies all of them will define a classifier
correctly separating the chapters into two classes: French or English.
Let us represent graphically the inequalities in Table 7.2 and solve the system
in the two-dimensional space defined by .w0 and .w1 . Figure 7.3 shows a plot
with the two first chapters, where .w1 is the abscissa and .w0 , the ordinate. Each
inequality defines a half-plane that restricts the set of possible weight values. The
four inequalities delimit the solution region in white, where the two upper lines are
constraints applied by the two chapters in French and the two below by their English
translations.
Figure 7.4 shows the plot for all the chapters. The remaining inequalities shrink
even more the polygonal region of possible values. The point coordinates .(w1 , w0 )
in this region, as, for example, .(0.066, 0) or .(0.067, −20), will satisfy all the
inequalities and correctly separate the 30 observations into two classes: 15 chapters
in French and 15 in English.
Fig. 7.3 A graphical representation of the inequality system restricted to the two first chapters in
French, .f1 and .f2 , and in English, .e1 and .e2 . We can use any point coordinates in the white region
as parameters of the line to separate these two chapters
Fig. 7.4 A graphical representation of the inequality system with all the chapters. The point
coordinates in the white polygonal region correspond to weights vectors .(w1 , w0 ) defining a
separating line for all the chapters
In an n-dimensional space, the two classes are separated by the hyperplane $w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_n x_n = 0$, with the two half-spaces
$$w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_n x_n > 0 \quad\text{and}\quad w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_n x_n < 0,$$
which we can write compactly with a dot product, setting $x_0 = 1$:
$$\mathbf{w}\cdot\mathbf{x} = \sum_{i=0}^{n} w_i x_i.$$
It is not always the case that a line can perfectly separate the two classes of a dataset.
Let us return to our dataset in Table 7.1 and restrict ourselves to the three shortest
chapters: the 3rd, 10th, and 15th. Figure 7.5, left, shows the plot of these three
chapters from the counts collected in the actual texts. A thin line can divide the
chapters into two classes. Now let us imagine that in another dataset, Chapter 10 in
French has 18,317 letters and 1115 As instead of 18,317 and 1215, respectively.
Figure 7.5, right, shows this plot. This time, no line can pass between the two
classes, and the dataset is said to be not linearly separable.
Although we cannot draw a line that divides the two classes, there are
workarounds to cope with not linearly separable data that we will explain in the
next section.
Fig. 7.5 Left part: A thin line can separate the three chapters into French and English text. The
two classes are linearly separable. Right part: We cannot draw a line between the two classes. They
are not linearly separable
Having 75,255 characters in this chapter, the regression line will predict 5149
occurrences of As (there are 5062 in reality).
The output of a classification is a finite set of values. When there are two values,
we have a binary classification. Given the number of characters and the number of
As in a text, classification will predict the language: French or English. For instance,
having the pair (75,255, 5062), the classifier will predict French.
In the next sections, we will examine three categories of linear classifiers from
among the most popular and efficient ones: perceptrons, logistic regression, and
neural networks. For the sake of simplicity, we will first restrict our presentation to
a binary classification with two classes. However, linear classifiers can generalize
to handle a multinomial classification, i.e. three classes or more. This is the most
frequent case in practice and we will then see how to apply logistic regression to
multinomial cases.
7.8 Perceptron
Given a dataset like the one in Table 7.1, where each object is characterized by the
feature vector .x and a class, P or N, the perceptron algorithm (Rosenblatt 1958) is
a simple method to find a hyperplane splitting the space into positive and negative
half-spaces separating the objects. The perceptron uses a sort of gradient descent
to iteratively adjust weights .(w0 , w1 , w2 , . . . , wn ) representing the hyperplane until
all the objects belonging to P have the property .w · x ≥ 0, while those belonging to
N have a negative dot product.
The prediction uses the step function:
$$H(\mathbf{w}\cdot\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{w}\cdot\mathbf{x} \geq 0,\\ 0 & \text{otherwise,} \end{cases}$$
so that the predicted class of an object $\mathbf{x}_i$ is:
$$\hat{y}(\mathbf{x}_i) = H(\mathbf{w}\cdot\mathbf{x}_i) = H(w_0 + w_1 x_{i,1} + w_2 x_{i,2} + \ldots + w_n x_{i,n}).$$
We use .xi,0 = 1 to simplify the equations and the set .{0, 1} corresponds to the
classes {English, French} in Table 7.1.
Let us denote .wk the weight vector at step k. The perceptron algorithm starts the
iteration with a weight vector .w0 chosen randomly or set to .0 and then applies the
dot product .wk · xi one object at a time for all the members of the dataset, .i : 1..q:
• If the object is correctly classified, the perceptron algorithm keeps the weights
unchanged;
• If the object is misclassified, the algorithm attempts to correct the error by
adjusting .wk using a gradient descent:
$$\mathbf{w}_{k+1} = \mathbf{w}_k - \alpha(\hat{y}_i - y_i)\,\mathbf{x}_i.$$
Let us spell out the update rules in a two-dimensional space. We have the feature
vectors and weight vectors defined as: .xi = (1, xi,1 , xi,2 ) and .w = (w0 , w1 , w2 ).
With the stochastic gradient descent, we carry out the updates using the relations:
$$\begin{aligned} w_0 &\leftarrow w_0 - (\hat{y}_i - y_i)\cdot 1,\\
w_1 &\leftarrow w_1 - (\hat{y}_i - y_i)\, x_{i,1},\\
w_2 &\leftarrow w_2 - (\hat{y}_i - y_i)\, x_{i,2}. \end{aligned}$$
To find a hyperplane, the objects (i.e., the points) must be separable. This is rarely
the case in practice, and we often need to refine the stop conditions. We will stop the
learning procedure when the number of misclassified examples is below a certain
threshold or we have exceeded a fixed number of iterations.
The perceptron will converge faster if, for each iteration, we select the objects
randomly from the dataset.
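A compact sketch of this algorithm (the function name is our own; on the Salammbô data, the features should be scaled and the margin is small, so many epochs may be needed):

import numpy as np

def perceptron(X, y, alpha=1.0, epochs=100, seed=0):
    # Stochastic perceptron updates with the step-function prediction.
    rng = np.random.default_rng(seed)
    X1 = np.c_[np.ones(len(X)), X]          # prepend x_0 = 1
    w = np.zeros(X1.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X1)):  # random order of the examples
            y_hat = 1.0 if w @ X1[i] >= 0 else 0.0
            w -= alpha * (y_hat - y[i]) * X1[i]
    return w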
$$f(x) = \frac{1}{1 + e^{-x}}.$$
Fig. 7.6 The logistic curve: $f(x) = \frac{1}{1 + e^{-x}}$
mortality rate is close to 0 for lower values of x (the drug dosage), then increases,
and reaches a mortality rate of 1 for higher values of x.
Berkson used one feature, the dosage x, to estimate the mortality rate, and he
derived the probability model:
$$P(y = 1|x) = \frac{1}{1 + e^{-w_0 - w_1 x}},$$
where y denotes the class, either survival or death, with the respective labels 0 and
1, and .(w0 , w1 ) are weight coefficients that are fit using the maximum likelihood
method.
Using this assumption, we can write a general probability model for feature
vectors .x of any dimension:
$$P(y = 1|\mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}\cdot\mathbf{x}}},$$
and
$$P(y = 0|\mathbf{x}) = \frac{e^{-\mathbf{w}\cdot\mathbf{x}}}{1 + e^{-\mathbf{w}\cdot\mathbf{x}}}.$$
The logarithm of the odds is then a linear function of the features:
$$\ln\frac{P(y = 1|\mathbf{x})}{P(y = 0|\mathbf{x})} = \ln\frac{P(y = 1|\mathbf{x})}{1 - P(y = 1|\mathbf{x})} = \mathbf{w}\cdot\mathbf{x}.$$
To build a functional classifier, we now need to fit the weight vector $\mathbf{w}$; the
maximum likelihood is a classical way to do this. Given a dataset, $DS =
\{(1, x_{i,1}, x_{i,2}, \ldots, x_{i,n}, y_i)\,|\,i: 1..q\}$, containing a partition in two classes, P ($y = 1$)
and N ($y = 0$), and a weight vector $\mathbf{w}$, the likelihood to have the classification
observed in this dataset is:
$$\prod_{\mathbf{x}_i\in P} P(y_i = 1|\mathbf{x}_i)\prod_{\mathbf{x}_i\in N} P(y_i = 0|\mathbf{x}_i).$$
We can rewrite the product using $y_i$ as powers of the probabilities, as $y_i = 0$ when
$\mathbf{x}_i\in N$ and $y_i = 1$ when $\mathbf{x}_i\in P$:
$$\prod_{i=1}^{q} P(y_i = 1|\mathbf{x}_i)^{y_i}\,\bigl(1 - P(y_i = 1|\mathbf{x}_i)\bigr)^{1 - y_i}.$$
To maximize this term, it is more convenient to work with sums rather than with
products, and we take its logarithm, the log-likelihood:
$$\hat{\mathbf{w}} = \arg\max_{\mathbf{w}}\sum_{(\mathbf{x}_i, y_i)\in DS} y_i\ln P(y_i = 1|\mathbf{x}_i) + (1 - y_i)\ln\bigl(1 - P(y_i = 1|\mathbf{x}_i)\bigr).$$
Replacing the probabilities with the logistic function, this is:
$$\hat{\mathbf{w}} = \arg\max_{\mathbf{w}}\sum_{(\mathbf{x}_i, y_i)\in DS} y_i\ln\frac{1}{1 + e^{-\mathbf{w}\cdot\mathbf{x}_i}} + (1 - y_i)\ln\frac{e^{-\mathbf{w}\cdot\mathbf{x}_i}}{1 + e^{-\mathbf{w}\cdot\mathbf{x}_i}}.$$
In contrast to linear regression that uses least mean squares, here we fit a logistic
curve so that it maximizes the likelihood of the classification—partition—observed
in the training set.
We can use a gradient ascent to compute the maximum of the log-likelihood. This
method is analogous to the gradient descent that we saw in Sect. 7.5; we simply move upward
instead. A Taylor expansion of the log-likelihood gives us: $\ell(\mathbf{w} + \mathbf{v}) = \ell(\mathbf{w}) + \mathbf{v}\cdot\nabla\ell(\mathbf{w}) + \ldots$
When $\mathbf{v}$ is collinear with the gradient, the increase is the largest, and the iteration
$$\mathbf{w}_{k+1} = \mathbf{w}_k + \alpha\nabla\ell(\mathbf{w}_k)$$
walks up the surface toward the maximum.
. . . and Descent
Alternatively, we could try to find a minimum for the negative log-likelihood (NLL).
We call the corresponding function the logistic loss, log-loss, or binary cross-
entropy. For one observation, it is defined as:
$$L(\hat{y}, y) = -y\ln\hat{y} - (1 - y)\ln(1 - \hat{y}),$$
where $\hat{y} = \frac{1}{1 + e^{-\mathbf{w}\cdot\mathbf{x}}}$.
Averaged over the q observations of the dataset, the loss is:
$$BCELoss(\hat{\mathbf{y}}, \mathbf{y}) = -\frac{1}{q}\sum_{i=1}^{q} y_i\ln\hat{y}_i + (1 - y_i)\ln(1 - \hat{y}_i),$$
where $\hat{y}_i = \frac{1}{1 + e^{-\mathbf{w}\cdot\mathbf{x}_i}}$.
To compute the gradient $\nabla_{\mathbf{w}} BCELoss(\hat{\mathbf{y}}, \mathbf{y})$, we will consider the loss for one
point $(\mathbf{x}_i, y_i)$. The gradient of the loss is defined by the partial derivatives $\frac{\partial L(\hat{y}_i, y_i)}{\partial w_j}$.
Using the chain rule, we have:
$$\frac{\partial L(\hat{y}_i, y_i)}{\partial w_j} = \frac{dL(\hat{y}_i, y_i)}{d\hat{y}_i}\cdot\frac{\partial\hat{y}_i}{\partial w_j}.$$
Let us compute separately the two terms to the right of this equality, first $\frac{dL(\hat{y}_i, y_i)}{d\hat{y}_i}$
and then $\frac{\partial\hat{y}_i}{\partial w_j}$:
:
1. For the first term:
$$\frac{dL(\hat{y}_i, y_i)}{d\hat{y}_i} = \frac{d}{d\hat{y}_i}\bigl(-y_i\ln\hat{y}_i - (1 - y_i)\ln(1 - \hat{y}_i)\bigr) = -\frac{y_i}{\hat{y}_i} + \frac{1 - y_i}{1 - \hat{y}_i} = \frac{\hat{y}_i - y_i}{\hat{y}_i(1 - \hat{y}_i)}.$$
2. For the second term, using the chain rule again, we have:
$$\frac{\partial\hat{y}_i}{\partial w_j} = \frac{d\hat{y}_i}{d\,\mathbf{w}\cdot\mathbf{x}_i}\cdot\frac{\partial\,\mathbf{w}\cdot\mathbf{x}_i}{\partial w_j}.$$
The derivative of the logistic function gives:
$$\frac{d\hat{y}_i}{d\,\mathbf{w}\cdot\mathbf{x}_i} = \hat{y}_i(1 - \hat{y}_i),$$
and $\frac{\partial\,\mathbf{w}\cdot\mathbf{x}_i}{\partial w_j} = x_{i,j}$. Multiplying the two terms, the partial derivative of the loss is:
$$\frac{\partial L(\hat{y}_i, y_i)}{\partial w_j} = (\hat{y}_i - y_i)\, x_{i,j}.$$
Weight Updates
Using the gradient values, we can now compute the weight updates at each step
of the iteration. As with linear regression, we can use a stochastic or a batch
method. For .DS = {(1, xi,1 , xi,2 , . . . , xi,n , yi )|i : 1..q}, the updates of .w =
(w0 , w1 , . . . , wn ) are:
• With the stochastic gradient descent, for one example $(\mathbf{x}_i, y_i)$:
$$w_j^{(k+1)} = w_j^{(k)} - \alpha\left(\frac{1}{1 + e^{-\mathbf{w}_k\cdot\mathbf{x}_i}} - y_i\right) x_{i,j},$$
or, in vector form:
$$\mathbf{w}_{k+1} = \mathbf{w}_k - \alpha(\hat{y}_i - y_i)\,\mathbf{x}_i.$$
We stop the descent when the gradient is less than a predefined threshold or after a
certain number of epochs.
In this chapter so far, when fitting our models, we used constant learning rates in
the update rules. The value of such learning rates has a considerable influence on
the final results. A low rate will make the descent converge slowly while a high rate
may overshoot the minimum. One possible method to find an optimal value is to
carry out convergence experiments on a part of the dataset with different rates. It is
a common practice to vary the $\alpha$ values between 0.1 and $10^{-5}$, and look at the loss
function with respect to the epochs.
Another way to optimize gradient descent is to use an adaptive learning rate
that changes with the epochs. Such optimizers include Momentum (Qian 1999),
RMSProp (Hinton 2012), Adam (Kingma and Ba 2014), and NAdam (Dozat 2016).
We examine here the Momentum and RMSProp optimizers:
Qian (1999) noticed that when the loss surface is a long and narrow valley, the descent
trajectory oscillates between the ridges, making it particularly slow. He proposed
to redefine the update rule so that it cancels the oscillations and follows a more
direct descent.
He added a momentum term representing the accumulated past gradients to the update
term $\alpha\nabla_{\mathbf{w}}L(\mathbf{w})$. The new update term is:
$$\Delta_k = \rho\Delta_{k-1} + \alpha\nabla_{\mathbf{w}}L(\mathbf{w}_k),$$
where $\rho$ is the momentum parameter, for instance 0.9, with the initialization $\Delta_0 = \alpha\nabla_{\mathbf{w}}L(\mathbf{w}_0)$.
The weights are then updated with:
$$\mathbf{w}_{k+1} = \mathbf{w}_k - \Delta_k.$$
7.10.2 RMSprop
RMSprop starts from the update rule of the gradient descent applied to the loss
function as we defined it in Sect. 7.5:
$$\mathbf{w}_{k+1} = \mathbf{w}_k - \alpha_k\nabla L(\mathbf{w}_k).$$
In this section, we will apply logistic regression to our small Salammbô dataset with
the scikit-learn toolkit. We already used scikit-learn in Sect. 6.4 and we saw that its
base numerical representation was NumPy arrays (Chap. 5, Python for Numerical
Computations). We will load and format the dataset in NumPy, fit a model, predict
classes, and evaluate the performances with the scikit-learn API.
All the feature values in the dataset in Table 7.1 are numeric and creating .X and .y in
a NumPy array format is straightforward. We can do it manually as it is a very small
dataset. We just need to decide on a convention on how to represent the classes: We
assign 0 to English and 1 to French. We create the arrays with np.array() and the
lists of values as arguments:
import numpy as np
X = np.array(
[[35680, 2217], [42514, 2761], [15162, 990], [35298, 2274],
[29800, 1865], [40255, 2606], [74532, 4805], [37464, 2396],
[31030, 1993], [24843, 1627], [36172, 2375], [39552, 2560],
[72545, 4597], [75352, 4871], [18031, 1119], [36961, 2503],
[43621, 2992], [15694, 1042], [36231, 2487], [29945, 2014],
[40588, 2805], [75255, 5062], [37709, 2643], [30899, 2126],
[25486, 1784], [37497, 2641], [40398, 2766], [74105, 5047],
[76725, 5312], [18317, 1215]
])
y = np.array(
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
In the programs in this chapter, we will use the X values as is. However, a common
practice is to scale them so that they range from 0 to 1 or from –1 to +1. This
will considerably improve the stability of the gradient descent. We will see these
techniques in the next chapter (Sect. 8.5.1) as well as the corresponding scikit-learn
functions.
In most applications, the dataset will be larger and we will have to load it from a
file. There are many possible formats and loader applications. We give two examples
here with tab-separated values (TSV) and svmlight.
In Sect. 4.3, we already used the TSV format and pandas. For our dataset, a TSV
file will consist of three columns containing the number of characters, the number
of As, and the class:
35680 2217 0
42514 2761 0
15162 990 0
...
36961 2503 1
43621 2992 1
15694 1042 1
...
dataset_pd = pd.read_csv(’../salammbo/salammbo_a_binary.tsv’,
sep=’\t’,
names=[’cnt_chars’, ’cnt_a’, ’class’])
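From this DataFrame, we can then rebuild the X and y NumPy arrays, for instance with:

X = dataset_pd[['cnt_chars', 'cnt_a']].to_numpy()
y = dataset_pd['class'].to_numpy()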
svmlight is an older format, but still widely used to distribute numerical datasets.
Each row has the structure:
<class-label> <feature-idx>:<value> <feature-idx>:<value> ...
For the dataset in Table 7.1, the corresponding file consists of the following lines:
0 1:35680 2:2217
0 1:42514 2:2761
0 1:15162 2:990
...
1 1:36961 2:2503
1 1:43621 2:2992
1 1:15694 2:1042
...
When a feature value is 0, we do not need to store it. This means that svmlight is
well suited when the data is sparse.
We load our dataset with this piece of code:
from sklearn.datasets import load_svmlight_file
X, y = load_svmlight_file(file)
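Note that load_svmlight_file returns the features as a SciPy sparse matrix; if we need a dense NumPy array, for instance to inspect the values, we can convert it, as in this sketch:

from sklearn.datasets import load_svmlight_file

# file is the path to the svmlight-formatted dataset
X_sparse, y = load_svmlight_file(file)
X = X_sparse.toarray()   # dense NumPy array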
Once we have our dataset ready in NumPy arrays, we select and fit a model, here
logistic regression with default parameters, with the lines:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X, y)
Depending on the size of the dataset, the model can take a while to train. Here
it is instantaneous. When the fitting is done, we can predict the class of new
observations:
classifier.predict([X[-1]]) # 1
classifier.predict(np.array([[35680, 2217]])) # 0
The next instruction reapplies the model to the whole training set:
y_predicted = classifier.predict(X)
A more thorough evaluation would use a test set or cross-validation. This is what
we will do in Sect. 7.12, but first let us examine the model.
We saw in Sect. 7.9 that a logistic curve models the probabilities of a prediction: We obtain them with predict_proba(), which returns two values: $P(0|x)$ and $P(1|x)$:
classifier.predict_proba([X[2]]) # [[0.9913024, 0.0086976]]
classifier.predict_proba([X[-1]]) # [[0.0180183, 0.9819817]]
The model itself consists of a weight vector $(w_1, w_2)$ and an intercept $w_0$, respectively coef_ and intercept_:
classifier.coef_ # [[-0.03372363 0.51169867]]
and
classifier.intercept_ # [-4.51879339e-05]
Predicting the probability of a class is just the application of the logistic function to the dot product $w \cdot x$. For this, we create the weight and feature vectors, respectively $w = (w_0, w_1, w_2)$ and $x = (1, x_1, x_2)$, here for X[-1]:
w = np.append(classifier.intercept_, classifier.coef_)
x = np.append([1.0], X[-1])
and we apply the logistic function:
1/(1 + np.exp(-w @ x)) # 0.9819817031873619
that returns the same value as with predict_proba.
Note that as the initial weight vectors are randomly initialized, the model
parameters will probably differ between two fitting experiments. This implies that
the results and figures shown in this book will certainly be slightly different in your
own experiments.
In Sect. 7.9.1, we saw that minimizing the binary cross-entropy loss
$\mathrm{BCELoss}(\hat{y}, y) = -\frac{1}{q} \sum_{i=1}^{q} \left( y_i \ln \hat{y}_i + (1 - y_i) \ln(1 - \hat{y}_i) \right)$
enabled us to fit the model. This may seem abstract and, given a model and a dataset,
how do we compute it concretely?
In the equation, each pair $(\hat{y}_i, y_i)$ represents an observation, where $y_i$ is the class and $\hat{y}_i$ the probability it belongs to class 1. The value of $y_i$ is either 0 or 1 and $\hat{y}_i$ is a number ranging from 0 to 1, for instance:
• The third observation, X[2], belongs to class 0 and we have $\hat{y} = 0.0086976$;
• The last observation, X[-1], belongs to class 1 and we have $\hat{y} = 0.9819817$.
We plug these values in the equation to obtain the loss for X[2] and X[-1].
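A minimal sketch of this computation, using the probabilities returned by predict_proba() above:

import numpy as np

p_2 = classifier.predict_proba([X[2]])[0, 1]      # P(1|x), about 0.0086976
p_last = classifier.predict_proba([X[-1]])[0, 1]  # P(1|x), about 0.9819817

loss_2 = -(0 * np.log(p_2) + 1 * np.log(1 - p_2))           # y = 0, about 0.0087
loss_last = -(1 * np.log(p_last) + 0 * np.log(1 - p_last))  # y = 1, about 0.0182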
To apply this equation to the full dataset, we extract the probabilities of class 1 with:
classifier.predict_proba(X)[:, 1]
scikit-learn has a dedicated built-in function for this and, in routine programming, we will probably prefer to use it:
from sklearn import metrics
metrics.log_loss(y, classifier.predict_proba(X))
# 0.00206177
In the linear regression algorithm in Sect. 7.3, we used a loss consisting of the sum
or mean of squared errors (MSE), while in logistic regression, we used binary cross
entropy and the probability logarithms. Binary cross entropy is a consequence of
our decision to reach the maximum likelihood in Sect. 7.9.1.
In fact, we could also measure the prediction loss as the distance between the predicted probability and the value of the true class, either 0 or 1. Figure 7.7 shows these losses relative to class 1 and compares the binary cross entropy loss defined as
$\mathrm{BCELoss}(\hat{y}) = -\log \hat{y}$
with the squared error loss defined as
$\mathrm{MSELoss}(\hat{y}) = (1 - \hat{y})^2.$
For the observations X[2] and X[-1], the squared errors are:
$\mathrm{MSELoss}(\hat{y}_2, y_2) = (0 - 0.0086976)^2 = 7.56 \cdot 10^{-5},$
$\mathrm{MSELoss}(\hat{y}_{29}, y_{29}) = (1 - 0.9819817)^2 = 3.25 \cdot 10^{-4},$
and, as the figures show, this would be a bad idea. The squared error applied to
probabilities leads to much smaller differences between the truth and the prediction. This makes it more difficult for a gradient descent to find a minimum. Figure 7.7
shows that the squared error squashed the loss range, while the negative logarithm
strongly penalizes probability errors. A system that would predict with certainty
class 0 when the true class is 1 would get an infinite binary cross entropy loss.
Although Berkson (1944) used squared errors in his paper on logistic regression,
we should avoid this loss and always prefer cross entropy.
1 https://www.gutenberg.org/.
7.12 Evaluation of Classification Systems
We now extend the experiment with a third language and add the letter counts of a German version of Salammbô:
X_de = np.array(
[[37599, 1771], [44565, 2116], [16156, 715], [37697, 1804],
[29800, 1865], [42606, 2146], [78242, 3813], [40341, 1955],
[31030, 1993], [26676, 1346], [39250, 1902], [41780, 2106],
[72545, 4597], [79195, 3988], [19020, 928]
])
We add this array to X with vstack and we create a new .y vector where class 2
is German:
X = np.vstack((X, X_de))
y = np.array(
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
We create a new classifier, we fit it, and we predict the three classes of the training
set with the same statements as with the binary classification:
cls_de = LogisticRegression()
cls_de.fit(X, y)
y_hat = cls_de.predict(X)
# array([2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2])
7.12.1 Accuracy
The prediction this time is not as perfect as with English and French. We can evaluate it more precisely with scikit-learn built-in functions using an N-fold cross-validation with $N = 5$ and first with accuracy: the proportion of observations the classifier predicted correctly.
We use a scikit-learn function again that shuffles the dataset, splits it N times into
training and test sets, fits the models, and evaluates them. The function stratifies the
data, meaning that the splits keep the original percentage of observations for each
class. Each split uses about 80% of the data for the training set and the rest for the
test set.
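A possible way to run such an evaluation is scikit-learn's cross_val_score with a stratified, shuffled five-fold split; a minimal sketch, where the random seed is our own choice:

from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv, scoring='accuracy')
print(scores, scores.mean())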
$\mathrm{Recall} = \frac{|A|}{|A \cup C|}.$
For a given class, this would correspond to how many observations have been
predicted correctly to be in this class with respect to their true number in the dataset.
Precision is the accuracy of what has been returned. It measures how much of
the information is actually correct. It is defined as the number of correct documents
returned divided by the total number of documents returned.
$\mathrm{Precision} = \frac{|A|}{|A \cup B|}.$
For a given class, this would translate as: Out of the observations predicted to be in
this class, how many are correct?
Recall and precision are combined into the F -measure, which is defined as the
harmonic mean of both numbers:
$F = \frac{2}{\frac{1}{P} + \frac{1}{R}} = \frac{2PR}{P + R}.$
The F -measure is a composite metric that reflects the general performance of a
system. It does not privilege precision at the expense of recall, or vice versa. An
arithmetic mean would have made it very easy to reach 50% using, for example, very selective rules with a precision of 100% and a low recall.
It is however possible, using a $\beta$-coefficient, to give an extra weight to either recall, $\beta > 1$, or precision, $\beta < 1$:
$F = \frac{(\beta^2 + 1)PR}{\beta^2 P + R}.$
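In scikit-learn, these measures are available, for instance, as metrics.f1_score and metrics.fbeta_score; a minimal sketch on the predictions above, where the value of beta is our own choice:

from sklearn import metrics

metrics.f1_score(y, y_hat, average=None)             # per-class F1 scores
metrics.fbeta_score(y, y_hat, beta=2, average=None)  # beta > 1 weights recall more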
Finally, a fallout figure is also sometimes used that measures the proportion of
irrelevant documents that have been selected.
$\mathrm{Fallout} = \frac{|B|}{|B \cup D|}.$
Coming back to our Salammbô experiment and the prediction on the training set,
we print the scores for the three classes with scikit-learn’s classification report:
print(metrics.classification_report(y, y_hat))
              precision    recall  f1-score   support

           0       0.81      0.87      0.84        15
           1       1.00      1.00      1.00        15
           2       0.86      0.80      0.83        15

    accuracy                           0.89        45
   macro avg       0.89      0.89      0.89        45
weighted avg       0.89      0.89      0.89        45
where the macro average is the arithmetic mean of the F-1 scores. This macro
average is frequently used as it synthesizes the performance of classification systems
in just one number.
2 https://github.com/google/cld3.
Chapter 8
Neural Networks
Neural networks form a last family of numerical classifiers that, compared to the
other techniques we have seen, have a more flexible and extendible architecture.
Typically, neural networks are composed of layers, where each layer contains a set
of nodes, the neurons. The input layer corresponds to the input features, where each
feature is represented by a node, and the output layer produces the classification
result.
Beyond this simple outline, there are scores of ways to build and configure a
neural net in practice. In this chapter, we will focus on the simplest architecture,
feed-forward, that consists of a sequence of layers. In a given layer, each node
receives information from all the nodes of the preceding layer, processes this
information, and passes it through to all the nodes of the next layer; see Fig. 8.1.
Over the years, neural networks have become one of the most efficient machine-
learning devices. In this chapter, we will describe how we can reformulate the
perceptron and logistic regression as neural networks, and then see how we can
extend the networks with multiple layers. In the next chapters, we will introduce
other types of neural networks.
The structure of neural networks was initially inspired by that of the brain and its
components, the nerve cells. Literature in the field often uses a biological vocabulary
to describe the network components: The connections between neurons are then
called synapses and the information flowing between two nodes, a scalar number, is
multiplied by a weight called the synaptic weight.
Fig. 8.1 A feed-forward network with three input nodes, a first hidden layer of four nodes, a second hidden layer of two nodes, and one output node producing 1/0
It is easier nonetheless to represent a network with matrices and the following
mathematical notation:
• As in the previous sections, let us denote $x$ the input vector representing one observation;
• Let us denote $a_i^{(j)}$ a node at layer $(j)$. In Fig. 8.1, for instance, the first hidden layer consists of four nodes: $a_1^{(1)}$, $a_2^{(1)}$, $a_3^{(1)}$, and $a_4^{(1)}$. The nodes of a hidden layer form a vector that we denote $a^{(j)}$, for instance $a^{(1)}$;
• Let us call $w_i^{(j)}$ the vector representing the weights from the incoming synapses at node $a_i^{(j)}$. For instance, $a_2^{(1)}$ has three incoming connections with weights represented by $w_2^{(1)}$;
• Given a layer $(j)$, we can store all the incoming weights in a matrix that we call $W^{(j)}$. This matrix has as many rows as there are hidden nodes in the layer. In the case of Fig. 8.1, we stack the $w_1^{(1)}$, $w_2^{(1)}$, $w_3^{(1)}$, and $w_4^{(1)}$ vectors as rows of the matrix;
• The size of these matrices will be $4 \times 3$ for the first hidden layer, $W^{(1)}$, $2 \times 4$ for the second one, $W^{(2)}$, and finally, $1 \times 2$ for the third layer, $W^{(3)}$.
At a given layer, the computation carried out in a neuron has two main steps:
1. Combine the information coming from the neurons of the preceding layer; this is simply done through the dot product with the synaptic weights. In Fig. 8.1, at node $a_1^{(1)}$, the linear combination is $w_1^{(1)} \cdot x$. We compute the input values of all the nodes $a_i^{(1)}$ by multiplying the matrix $W^{(1)}$ by $x$:
$W^{(1)} x.$
We usually add a bias vector $b^{(1)}$ to this product:
$W^{(1)} x + b^{(1)},$
2. Apply an activation function $f^{(1)}$ to this result. We apply the activation function to all the coordinates of the vector. We also say that we map the function over the vector.
3. For the $i$th hidden layer, we apply the same computation as between the input and the first layer and we have the relation:
$a^{(i)} = f^{(i)}(W^{(i)} a^{(i-1)} + b^{(i)}).$
The weight matrices $W^{(j)}$ and bias vectors $b^{(j)}$ making up the network consist of trainable parameters that we fit using a dataset; see Sect. 8.2.2.
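As an illustration, here is a minimal NumPy sketch of this forward pass for the architecture of Fig. 8.1, where the random weights, the zero biases, and the logistic activation are our own choices:

import numpy as np

def logistic(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                       # one observation with three features
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
W3, b3 = rng.normal(size=(1, 2)), np.zeros(1)

a1 = logistic(W1 @ x + b1)      # first hidden layer
a2 = logistic(W2 @ a1 + b2)     # second hidden layer
y_hat = logistic(W3 @ a2 + b3)  # output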
Fig. 8.3 The perceptron as a one-layer network: a weighted sum $\sum_i w_i x_i$ of the inputs followed by a threshold producing 1/0
Fig. 8.4 Logistic regression as a one-layer network: a weighted sum $\sum_i w_i x_i$ of the inputs followed by a logistic activation
The perceptron from Sect. 7.8 is the simplest form of neural networks. It has a single
layer, where the activation is the Heaviside function. Figure 8.3 shows a graphical
representation of it.
We can also reformulate logistic regression from Sect. 7.9 as a neural network.
It also has a single layer, where the activation function is the logistic function; see
Fig. 8.4.
From the examples above, we can build more complex networks by just adding
layers between the input features and the layer connected to the output. A frequent
Fig. 8.5 Network with hidden layers and two activation functions: reLU and logistic regression
design is to use the reLU function in the hidden layers, except the last one, where
we use the logistic function; see Fig. 8.5.
Training a neural network model is then identical to what we have seen for the
perceptron or logistic regression. Given a network topology and activation functions,
we apply a training algorithm to a dataset that finds (fits) optimal synaptic weights,
i.e. the weight matrices and bias vectors of the network: .W (j ) and .b(j ) . To carry
this out, the training algorithm uses backpropagation: A technique that computes
the error gradient iteratively, starting from the last layer, and backpropagates it, one
layer at a time, until it reaches the first layer. The update rule from Sect. 7.5 will then enable us to adjust the weights and biases.
8.3 Backpropagation
8.3.1 Presentation
In the forward pass, given an input $x$, the first layer computes:
$a^{(1)} = f^{(1)}(W^{(1)} x),$
where $f^{(1)}$ is the activation function at layer 1. For the second layer:
$a^{(2)} = f^{(2)}(W^{(2)} a^{(1)}),$
until we reach layer L, the last layer, and output the prediction:
$\hat{y} = a^{(L)} = f^{(L)}(W^{(L)} a^{(L-1)}).$
For the example in Fig. 8.5 and using two linear activations, i.e. no activation, and a final logistic function, we have:
$\hat{y} = f^{(3)}(W^{(3)} W^{(2)} W^{(1)} x) = \frac{1}{1 + e^{-W^{(3)} W^{(2)} W^{(1)} x}}.$
As in Sect. 7.5, to fit the matrices, we can compute the partial derivatives with respect to all the weights, i.e. the gradient, and apply the gradient descent algorithm until we find a minimal loss. Computing the products of all the network matrices and flattening the $w_{ij}^{(l)}$ coefficients into one single vector $w$, we would have the following recurrence relation at step $k$ of the descent:
$w_{k+1} = w_k - \alpha_k \nabla \mathrm{Loss}(w_k).$
This relation gives us a weight update rule that would enable us to fit the network.
Nonetheless, although theoretically possible, it would be difficult to apply this
technique in practice as modern neural networks have sometimes billions of
parameters.
Coming back to our example in Fig. 8.5 and using the logistic loss, i.e. the binary cross-entropy (Sect. 7.9), the gradient coordinates would be:
$\frac{\partial \mathrm{Loss}(\hat{y}, y)}{\partial w_{ij}^{(l)}}$
for all the weights $w_{ij}^{(l)}$. The complete computation of this expression is left as an exercise to the reader. Again this would yield an update rule to fit our weight parameters, even if this method is impractical in real cases.
Instead of computing the gradient for the whole network in one shot, we will
decompose the problem and compute it one layer at a time. For this, we will follow
Le Cun (1987) and proceed in two steps:
1. We will first consider the input and the values in the hidden layers, before and after activation, respectively $z^{(l)}$ and $a^{(l)}$, with the equality $a^{(l)} = f^{(l)}(z^{(l)})$. It is easier to start with the gradient with respect to these input and hidden nodes as the number of parameters in $a^{(l)}$ is much lower than that of the weights in $W^{(l)}$.
2. Then, we will see that these gradients can serve as intermediate steps to compute the gradients with respect to the weights.
For convenience, we will denote $a^{(L)}$ the variable $\hat{y}$ to have an easier iteration. We have:
$\hat{y} = a^{(L)} = f^{(L)}(z^{(L)}) = f^{(L)}(W^{(L)} a^{(L-1)}).$
Remember though that row and column vectors are matrices (see Sect. 5.5.8).
As we said, we will proceed backward starting from the output and moving to the
input. Let us first consider the last hidden layer and compute the partial derivatives of the loss, $\mathrm{Loss}(\hat{y}, y)$, with respect to $z_i^{(L)}$. Using the chain rule, we have:
$\frac{\partial \mathrm{Loss}(\hat{y}, y)}{\partial z_i^{(L)}} = \frac{\partial \mathrm{Loss}(\hat{y}, y)}{\partial a_i^{(L)}} \cdot f'^{(L)}(z_i^{(L)}).$
Computing all the partial derivatives, we obtain the gradient of the loss with respect to $z^{(L)}$:
$\nabla_{z^{(L)}} \mathrm{Loss}(a^{(L)}, y) = \frac{\partial \mathrm{Loss}(a^{(L)}, y)}{\partial z^{(L)}} = \frac{\partial \mathrm{Loss}(a^{(L)}, y)}{\partial a^{(L)}} \odot f'^{(L)}(z^{(L)}) = \nabla_{a^{(L)}} \mathrm{Loss}(a^{(L)}, y) \odot f'^{(L)}(z^{(L)}).$
We now move one layer backward. We have $z^{(L)} = W^{(L)} f^{(L-1)}(z^{(L-1)})$, where each coordinate $z_i^{(L)}$ of $z^{(L)}$ is a scalar function for which we can compute a gradient with respect to $z^{(L-1)}$:
$\nabla_{z^{(L-1)}} z_i^{(L)} = \frac{\partial z_i^{(L)}}{\partial z^{(L-1)}}.$
In total, we have $I^{(L)}$ gradients if $I^{(L)}$ is the number of hidden nodes at layer $L$. These gradients are vectors that we can arrange in a matrix called a Jacobian. By convention, the gradients will correspond to the rows of the matrix and each cell $m_{ij}$ will have the value $\frac{\partial z_i^{(L)}}{\partial z_j^{(L-1)}}$:
$J_{z^{(L-1)}}(z^{(L)}) = \frac{\partial z^{(L)}}{\partial z^{(L-1)}} = \begin{bmatrix} \nabla_{z^{(L-1)}} z_1^{(L)} \\ \nabla_{z^{(L-1)}} z_2^{(L)} \\ \vdots \\ \nabla_{z^{(L-1)}} z_{I^{(L)}}^{(L)} \end{bmatrix} = \begin{bmatrix} \frac{\partial z_1^{(L)}}{\partial z^{(L-1)}} \\ \frac{\partial z_2^{(L)}}{\partial z^{(L-1)}} \\ \vdots \\ \frac{\partial z_{I^{(L)}}^{(L)}}{\partial z^{(L-1)}} \end{bmatrix}.$
Let us now compute the derivatives of the $z$ vector at layer $L$ with respect to a variable in the previous layer: $z_j^{(L-1)}$. It corresponds to the values in the column at index $j$:
$\frac{\partial z_i^{(L)}}{\partial z_j^{(L-1)}} = w_{ij}^{(L)} \, f'^{(L-1)}(z_j^{(L-1)}).$
Considering all the columns and computing the partial derivatives with respect to all the $z_j^{(L-1)}$ variables, we can build the complete Jacobian:
$J_{z^{(L-1)}}(z^{(L)}) = f'^{(L-1)}(z^{(L-1)}) \odot W^{(L)}.$
With the Hadamard product, we multiply each element of $f'^{(L-1)}(z^{(L-1)})$, a row vector, with the column of $W^{(L)}$ of respective index.
We can show that this relation applies for any pair of adjacent layers $l$ and $l - 1$ in the network:
$J_{z^{(l-1)}}(z^{(l)}) = \frac{\partial z^{(l)}}{\partial z^{(l-1)}} = f'^{(l-1)}(z^{(l-1)}) \odot W^{(l)}.$
We can now use the chain rule for composed functions to compute $\nabla_x \mathrm{Loss}(\hat{y}, y)$ as the product of the intermediate gradients with respect to the hidden cells:
$\nabla_x \mathrm{Loss}(\hat{y}, y) = \nabla_x \mathrm{Loss}(f^{(L)}(W^{(L)} \cdots f^{(2)}(W^{(2)} f^{(1)}(W^{(1)} x)) \cdots), y)$
$= \frac{\partial \mathrm{Loss}(\hat{y}, y)}{\partial z^{(L)}} \frac{\partial z^{(L)}}{\partial z^{(L-1)}} \frac{\partial z^{(L-1)}}{\partial z^{(L-2)}} \cdots \frac{\partial z^{(2)}}{\partial z^{(1)}} \frac{\partial z^{(1)}}{\partial x}$
$= \nabla_{a^{(L)}} \mathrm{Loss}(a^{(L)}, y) \odot f'^{(L)}(z^{(L)}) \, \left(f'^{(L-1)}(z^{(L-1)}) \odot W^{(L)}\right) \cdots \left(f'^{(1)}(z^{(1)}) \odot W^{(2)}\right) W^{(1)}.$
This last term shows that we can recursively compute, backpropagate, the gradient with respect to the hidden values from the output until we reach the input. Note that $\nabla_{z^{(l)}} \mathrm{Loss}(\hat{y}, y)$ is sometimes denoted $\delta^{(l)}$.
Using our example in Fig. 8.5, two linear activations instead of ReLU, $f^{(1)}$ and $f^{(2)}$, a final logistic function, $f^{(3)}$, and a logistic loss, we have $f'^{(1)} = f'^{(2)} = 1$, $f'^{(3)} = f^{(3)}(1 - f^{(3)})$, and $\frac{dL(\hat{y}, y)}{d\hat{y}} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})}$ (see Sect. 7.9.1).
The gradient of the loss with respect to $x$ is:
$\nabla_x \mathrm{Loss}(\hat{y}, y) = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \cdot \hat{y}(1 - \hat{y}) \cdot W^{(3)} W^{(2)} W^{(1)} = (\hat{y} - y)\, W^{(3)} W^{(2)} W^{(1)}.$
Now that we have the gradient with respect to the input and hidden layers, let us define the backpropagation algorithm, where we compute the gradient with respect to $W^{(l)}$, $l$ being the index of any layer. From the chain rule, for the last layer, $L$, we have:
$\nabla_{W^{(L)}} \mathrm{Loss}(\hat{y}, y) = \frac{\partial \mathrm{Loss}(\hat{y}, y)}{\partial z^{(L)}} \frac{\partial z^{(L)}}{\partial W^{(L)}},$
where the first factor is the gradient $\nabla_{z^{(L)}} \mathrm{Loss}(\hat{y}, y)$ that we computed above. The partial derivatives of $z^{(L)}$ with respect to $W^{(L)}$ simply consist of the transpose of $a^{(L-1)}$. Then, we have:
$\frac{\partial z^{(L)}}{\partial W^{(L)}} = {a^{(L-1)}}^\top.$
We can now compute the gradient of the loss with respect to $W^{(L)}$:
$\nabla_{W^{(L)}} \mathrm{Loss}(\hat{y}, y) = \nabla_{z^{(L)}} \mathrm{Loss}(\hat{y}, y)\, {a^{(L-1)}}^\top,$
which enables us to apply the update rule to all the weights of the last layer using the $a^{(L-1)}$, $a^{(L)}$, and $z^{(L)}$ values we have stored in the forward pass.
With our example in Fig. 8.5, this corresponds to:
$\nabla_{W^{(3)}} \mathrm{Loss}(\hat{y}, y) = (\hat{y} - y)\, {a^{(2)}}^\top.$
Coming back to the general case, we can now proceed backward using the chain rule for any $l$:
$\nabla_{W^{(l)}} \mathrm{Loss}(\hat{y}, y) = \frac{\partial \mathrm{Loss}(\hat{y}, y)}{\partial z^{(l)}} \frac{\partial z^{(l)}}{\partial W^{(l)}},$
with
$\frac{\partial \mathrm{Loss}(\hat{y}, y)}{\partial z^{(l)}} = \nabla_{z^{(l)}} \mathrm{Loss}(\hat{y}, y),$
the gradient we backpropagated down to layer $l$, and
$\frac{\partial z^{(l)}}{\partial W^{(l)}} = {a^{(l-1)}}^\top.$
Finally, we have
$\nabla_{W^{(l)}} \mathrm{Loss}(\hat{y}, y) = \nabla_{z^{(l)}} \mathrm{Loss}(\hat{y}, y)\, {a^{(l-1)}}^\top,$
that gives us the value of the gradient of the weights at index $l$. This gradient enables us to apply the update rule to the weight matrices.
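As an illustration of these update rules, here is a minimal NumPy sketch of one forward and backward pass for a network like that of Fig. 8.5, with two linear hidden layers, a logistic output, the binary cross-entropy loss, and no biases; the layer sizes and values are our own choices:

import numpy as np

def logistic(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(1)
x = rng.normal(size=3)          # input
y = 1.0                         # true class
W1 = rng.normal(size=(4, 3))    # linear layer 1
W2 = rng.normal(size=(2, 4))    # linear layer 2
W3 = rng.normal(size=(1, 2))    # output layer, logistic activation
alpha = 0.1

# Forward pass (linear activations, so a = z)
a1 = W1 @ x
a2 = W2 @ a1
y_hat = logistic(W3 @ a2)[0]

# Backward pass: delta(l) is the gradient of the loss w.r.t. z(l)
delta3 = np.array([y_hat - y])   # BCE + logistic output
delta2 = W3.T @ delta3           # linear activation: f' = 1
delta1 = W2.T @ delta2

# Gradients w.r.t. the weight matrices and the update rule
W3 -= alpha * np.outer(delta3, a2)
W2 -= alpha * np.outer(delta2, a1)
W1 -= alpha * np.outer(delta1, x)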
It is easy to convert the input so that it fits the dataset format. We just have to transpose the matrix product:
$(Wx)^\top = x^\top W^\top = \hat{y}^\top.$
$x^\top$ is now a one-row matrix and we can apply the product to the whole $X$ dataset by stacking all the input rows vertically. This product yields the predicted outputs:
$XW^\top = \hat{y}.$
We will now apply the concepts we have learned to the Salammbô dataset. For
this programming part, we will start with Keras (Chollet 2021), as it offers a smooth transition from scikit-learn, and then we will use PyTorch (Paszke et al. 2019) for
the rest of this book. Both are comprehensive libraries of neural network algorithms.
Keras is in fact a high-level programming interface on top of either Tensorflow
(Abadi et al. 2016), PyTorch, or JAX.
Before we fit a model, we will normalize the data. This part is common to Keras
and PyTorch. We will then build identical networks with both libraries so that we
can compare their programming interfaces.
We store the Salammbô dataset, .X and .y, in NumPy arrays using the same structure
as in Sect. 7.11.1. The algorithms in Keras and PyTorch are quite sensitive to
differences in numeric ranges between the features. Prior to fitting a model, a
common practice is to standardize the columns, here the counts for each character by
subtracting the mean from the counts and dividing them by the standard deviation:
$x_{i,j}^{std} = \frac{x_{i,j} - \bar{x}_{\cdot,j}}{\sigma_{x_{\cdot,j}}}.$
For letter A, the second column in the .X matrix, we compute the mean and
standard deviation of the counts with these two statements:
mean = np.mean(X[:,1])
std = np.std(X[:,1])
where the results are 2716.5 and 1236.21. Applying a standardization replaces the value of $x_{15,1}$, 2503, with $-0.1727$.
The chapters in Salammbô have different sizes: The largest chapter is five times
the length of the shorter. To mitigate the count differences, we can also apply a
normalization of the rows, i.e. divide the chapter vectors by their norm, before the
standardization. As a result, all the chapter rows will have a unit norm:
$x_{i,j}^{norm} = \frac{x_{i,j}}{\sqrt{\sum_{k=0}^{n-1} x_{i,k}^2}}.$
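A minimal NumPy sketch of these two operations:

import numpy as np

X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)            # unit-norm rows
X_scaled = (X_norm - X_norm.mean(axis=0)) / X_norm.std(axis=0)   # standardized columns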
Instead of computing the mean and standard deviation ourselves, we will rely on
scikit-learn and its built-in classes, Normalizer and StandardScaler, to normalize
and standardize an array. These classes have two main methods: fit() and
transform(). We first use fit() to determine the parameters of the normalization
or standardization and then transform() to apply them to the dataset. We normally
use fit() once and transform() as many times as we need to standardize the data.
We can combine fit() and transform() in the fit_transform() sequence:
from sklearn.preprocessing import StandardScaler, Normalizer
X_norm = Normalizer().fit_transform(X)
X_scaled = StandardScaler().fit_transform(X_norm)
8.5.2 Keras
Building the Network
Building a feed-forward network is easy with Keras. We just need to describe its
structure in terms of layers. The fully connected nodes correspond to Dense layers
that we will assemble with the Sequential class. To implement logistic regression,
we then use a Sequential model with one Dense() layer in the sequence. In the
Dense() object, we specify the output dimension as first parameter, 1, for English
(0) or French (1), then, using a named argument, the activation function, a sigmoid,
activation=’sigmoid’. To have a reproducible model, we also set a random seed.
Creating our network is as simple as that:
import numpy as np
import keras_core as keras
np.random.seed(0)
model = keras.Sequential([
keras.layers.Dense(1, activation=’sigmoid’)
])
Before we can train the model, we need to specify the loss function (the quantity to
minimize) and the algorithm to compute the weights. We do this with the compile()
method and two arguments: loss and optimizer, where we use, respectively,
binary_crossentropy for logistic regression and sgd for stochastic gradient descent.
The fitting process will report the loss for each epoch. We also tell it to report the
accuracy with the metrics argument.
model.compile(loss=’binary_crossentropy’,
optimizer=’sgd’,
metrics=[’accuracy’])
The optimizer has often a big influence on the results of the descent. Keras
supports the most common algorithms such as RMSprop, rmsprop, or Adam, adam,
that we described in Sect. 7.10. Here, they would produce the same results as sgd.
Once the model is compiled, we train it with the fit() method, where we pass
the scaled dataset, the number of epochs, and the batch size. For a batch gradient
descent, batch_size would be set to the size of the dataset. Here we apply a
stochastic descent and we set batch_size to 1. In most applications, we would use
minibatches (Sect. 7.5), where the batch size should have higher values, such as 8,
16, 32, or 64.
history = model.fit(X_scaled, y, epochs=20, batch_size=1)
The returned variable history.history is a dictionary with two keys: the loss and
the accuracy. We plot the loss with these statements:
import matplotlib.pyplot as plt
plt.scatter(range(len(history.history[’loss’])),
history.history[’loss’], c=’b’, marker=’x’)
plt.title(’Loss’)
plt.xlabel(’Epochs’)
plt.ylabel(’BCE’)
plt.show()
and we just change loss with accuracy to plot the accuracy (Fig. 8.6).
We can see that with such a small dataset, the loss decreases steadily and
the accuracy reaches one immediately. A more realistic experiment would use
validation data. We would include them in the fitting code with these arguments:
history = model.fit(X_train, y_train,
epochs=20,
batch_size=1,
validation_data=(X_val, y_val))
Predicting Classes
Once trained, we can apply the model to a dataset with the predict() method, here
to the training set again.
y_pred_proba = model.predict(X_scaled)
We predict class 1 when the probability is greater than a threshold of 0.5 and 0,
when it is less:
def predict_class(y_pred_proba):
y_pred = np.zeros(y_pred_proba.shape[0])
for i in range(y_pred_proba.shape[0]):
if y_pred_proba[i][0] >= 0.5:
y_pred[i] = 1
return y_pred
Fig. 8.6 The training loss and accuracy over the epochs
In this small example, where we train and apply a model on the same dataset, we
reach an accuracy of 100%.
predict_class(y_pred_proba)
# array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., ...
# ... 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
We can also evaluate the model using Keras evaluate() and report the binary cross-
entropy and the accuracy:
# evaluate the model
scores = model.evaluate(X_scaled, y)
# [0.13300223648548126, 1.0]
Remember that if you run this program on your machine, as the initialization is
random, the figures will be different.
Model Parameters
The model parameters consist of the matrix weights, here a column vector, and the
bias that we obtain with model.get_weights(). As the input has two parameters, the
number of letters and the number of As, our vector has two coordinates. The bias
has only one coordinate:
>>> model.get_weights()
[array([[-0.7931856],
[ 1.1318369]], dtype=float32),
array([0.00644173], dtype=float32)]
So far, we used a single layer. If we want to add hidden layers, for instance to insert
a layer of five nodes, we just declare them in a list that we pass to the Sequential
class. As mentioned in Sect. 8.2.2, intermediate layers should use the relu activation
function. We do not need to specify the input size of intermediate layers; Keras
automatically infers it:
model = keras.Sequential([
    keras.layers.Dense(5, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])
Training and running this new network will also result in a 100% accuracy.
This last example shows how to build a multilayer feedforward network.
However, this architecture is not realistic here given the size of the dataset: The
high number of nodes relative to the size certainly creates an overfit.
8.5.3 PyTorch
After scikit-learn and Keras, PyTorch is a third machine-learning library. Like Keras, it is intended for deep-learning applications, but their application programming interfaces (APIs) are quite different. With PyTorch, the programmer has to manage lower-level details, which makes it more flexible, but also more difficult to program.
PyTorch does not use NumPy arrays. Instead, it has an equivalent data structure
called tensors that we saw in Chap. 5, Python for Numerical Computations. As they
are not identical, we need to convert our dataset in this format. In addition, because
of loss output format of binary cross-entropy in PyTorch, the .y vector must be a
column vector:
import numpy as np
import torch
from torch import nn

Y = y.reshape((-1, 1))
X_scaled = torch.from_numpy(X_scaled).float()
Y = torch.from_numpy(Y).float()
As with Keras, we define the model as a sequence of layers, here a single linear layer:
model = nn.Sequential(
    nn.Linear(2, 1)
)
Linear is a fully connected layer, where the arguments are the input and output dimensions. A Linear object consists of parameters representing a $W$ matrix and a $b$ bias. In Sect. 5.8, we saw that the application of this matrix to an input $X$ corresponds to the product $XW^\top + b$.
Once we have defined the model architecture, we specify the parameters of the
gradient descent: The loss to optimize and the optimization algorithm. As with
Keras, this is the binary cross-entropy and the stochastic gradient descent.
We first create the loss function with the statement:
loss_fn = nn.BCEWithLogitsLoss() # binary cross entropy loss
It incorporates the logistic function and this explains its long name here:
BCEWithLogitsLoss(). As input, this loss function uses the output of the linear
layer of model. Such an output is called a logit, which is the inverse of the logistic
function. Although this term is a bit convoluted, it is standard in the field. As an
analogy, we could think of the sine function and call the input arcsine instead of
angle.
A true binary cross-entropy loss, as we defined it in Sect. 7.9.1, also exists in
PyTorch under the name BCELoss(), but it is not recommended. The reason is that
a naive computation of the logistic function . 1+e1 −x is unstable even with relatively
small numbers like .−1000 for which math.exp(1000) throws an overflow error. The
cross-entropy from logits uses a technique called the log-sum-exp trick that factors
terms and makes the computation possible.
We then select an optimizer, here stochastic gradient descent, where we give the
parameters to optimize model.parameters(), corresponding here to .W and .b, and
the learning rate, lr:
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
In version 2.0, PyTorch introduced an optional compilation of the model code that should make its execution faster. If we want to compile the model, we just add one statement before we train it:
model = torch.compile(model)
Note however that the speedup will depend on the underlying execution platform
and this is still a new feature that may be unstable.
Now that we have a model, a loss, and an optimizer, we can fit the parameters. As
opposed to scikit-learn and Keras, we have to write a loop to iterate explicitly over
the epochs of the gradient descent. For a batch descent, we have the sequence:
model.train() # sets PyTorch in the train mode
bce_loss = []
for epoch in range(30):
Y_pred = model(X_scaled)
loss = loss_fn(Y_pred, Y)
bce_loss += [loss.item()]
optimizer.zero_grad() # resets the gradients
loss.backward() # gradient backpropagation
optimizer.step() # weight updates
where we first set the model in the training mode with model.train(), then model()
applies the forward pass to the input and loss_fn() computes the loss between the
predicted and true values. The two next statements backpropagate the gradient as
in Sect. 8.3. By default, PyTorch accumulates the gradients and zero_grad() resets
their values to zero. Finally, step() updates the weights as in Sect. 7.5. We record
the loss evolution so that we can plot it as in Fig. 8.6.
Predicting Classes
Once trained, we can apply the model to an input, here the training set X_scaled,
to predict classes. This is called the inference or evaluation mode. Before we start
the prediction, we tell PyTorch we are in this mode with model.eval(). This will
skip some operations carried out in the training step. Then, we compute the matrix
product and output the logits:
model.eval()
y_pred_logits = model(X_scaled) # applies the model
We need then to apply a logistic function to obtain the probabilities and predict the
class:
y_pred_proba = torch.sigmoid(y_pred_logits)
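To turn these probabilities into classes, we can threshold them at 0.5, for instance:

y_pred = (y_pred_proba >= 0.5).float()  # 1.0 when P(1|x) >= 0.5, else 0.0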
Model Parameters
The model has a structure nearly identical to that of Keras. We obtain it with
>>> list(model.parameters())
[Parameter containing:
tensor([[-0.6147, 0.1383]], requires_grad=True),
Parameter containing:
tensor([0.2068], requires_grad=True)]
where model.parameters() returns a generator that we convert into a list. We can also print the parameters as an ordered dictionary, a dictionary where the keys are ordered:
>>> model.state_dict()
We add hidden layers similarly to Keras, but we have to specify the complete size
of the matrices and the activation, here ReLU(), in the sequence:
model = nn.Sequential(
nn.Linear(2, 5),
nn.ReLU(),
nn.Linear(5, 1)
)
We may want to name the layers in the network above. We use this statement then:
from collections import OrderedDict
model = nn.Sequential(OrderedDict([
(’W1’, nn.Linear(2, 5)), # W1x + b1
(’reLU’, nn.ReLU()), # reLU(W1x + b1)
(’W2’, nn.Linear(5, 1)) # W2reLU(W1x + b1) + b2
]))
In Sect. 5.8, we saw that applying this network to the $X$ dataset corresponds to the matrix product:
$\mathrm{reLU}(X W_1^\top + b_1) W_2^\top + b_2.$
As with our first PyTorch model, we omit the last logistic activation and we
return the logits. The training loop and the prediction are identical to those of our
first model.
PyTorch Dataloaders
We used a loop to iterate over the epochs in the batch descent. If we want to apply a
stochastic descent or use minibatches, we have to write an inner loop as this one for
a batch size of 4. In this inner loop, the sequence of statements is the same:
BATCH_SIZE = 4
model.train()
for epoch in range(50):
for i in range(0, X_scaled.size(dim=0), BATCH_SIZE):
Y_batch_pred = model(X_scaled[i:i + BATCH_SIZE])
loss = loss_fn(Y_batch_pred, Y[i:i + BATCH_SIZE])
optimizer.zero_grad()
loss.backward()
optimizer.step()
To handle these batches, PyTorch provides two utility classes, TensorDataset and DataLoader. The combination of both will enable us to manage large quantities of data and iterate over them. We will here introduce their simplest features.
from torch.utils.data import TensorDataset, DataLoader

dataset = TensorDataset(X_scaled, Y)
dataloader = DataLoader(dataset,
                        batch_size=BATCH_SIZE,
                        shuffle=True)
The inner loop to apply the descent with minibatches is now easier to write:
model.train()
for epoch in range(50):
for X_scaled_batch, Y_batch in dataloader:
Y_batch_pred = model(X_scaled_batch)
loss = loss_fn(Y_batch_pred, Y_batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()
We created our first PyTorch feed-forward networks with the Sequential class. This
is the most simple way if the network is just a pipeline of elementary components.
This is not always the case and we will see here another kind of implementation,
where we will encapsulate and create all the elementary components we need in a
dedicated class and tell explicitly how to apply them in the forward pass.
PyTorch uses the nn.Module base class for all its neural networks. To create a new
model, we subclass it and implement two methods: __init__() and forward(). For
the logistic regression model:
1. In __init__(), we create the layers of our network, here one Linear layer;
2. The forward() method computes the matrix product .XW + b from the .X input,
as described in Sect. 8.4.
class Model(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, 1)

    def forward(self, x):
        # Returns the logits; the logistic function is part of the loss
        return self.fc1(x)
The model so far is just a class. To run it, we create an instance of it, choose
a loss function, here binary cross entropy, and select an optimizer. We create the
model object with:
input_dim = X_scaled.size(dim=1)
model = Model(input_dim)
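As with the Sequential version, a minimal sketch of the loss and optimizer for this model, reusing the same choices as in the previous sections:

loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)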
To insert a hidden layer of five nodes in our previous model, we simply add a Linear
layer in the class. In the forward() method, we insert a relu() after the first layer
and we apply the second one:
class Model(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, 5)
        self.fc2 = nn.Linear(5, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))   # hidden layer with a reLU activation
        return self.fc2(x)            # output logits
In Sects. 8.5.2 and 8.5.3, we used the logistic function to carry out a classification
with two languages and binary cross entropy as a loss. Now, how can we do if
instead of a binary output, French or English, we have three or more languages to
detect: Latin, Greek, German, Russian, etc., possibly all the languages in wikipedia?
The answer is that we will need to define two more general functions to quantify
the loss and to estimate the probability of a vector .x to belong to a certain class, here
i, .P (y = i|x). This generalization of logistic regression is called multinomial or
multiclass logistic regression and, fortunately, it will not change much the structure
of our programs.
In Sect. 7.9, we saw that binary logistic regression minimizes the binary cross
entropy loss. Categorical cross entropy or simply cross entropy is an extension of it
that covers multiple classes, where we compute the loss for each class and we sum
or average all the losses. The mean is defined as:
$\mathrm{CELoss}(\hat{y}, y) = -\frac{1}{q} \sum_{j=1}^{q} y_j \cdot \ln \hat{y}_j,$
where $y_j$ corresponds to the true class of the $j$th observation, represented as a one-hot vector (Sect. 6.3), $\hat{y}_j$ is the probability distribution of the predicted classes, and $q$, the number of observations.
Supposing we have three languages in our dataset, English, French, and German,
in this order, we first associate them to an index: 0, 1, 2 and then convert these
indices to one-hot vectors. y will have three possible values:
1. .y = (1, 0, 0) for English,
2. .y = (0, 1, 0) for French, and
3. .y = (0, 0, 1) for German.
Given an observation $x$, the classifier will output a three-dimensional vector estimating the probabilities to belong to a certain class. We compute it with the softmax function defined as:
$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}},$
whose coordinates sum to one:
$\sum_{i=1}^{C} \sigma(z)_i = 1.$
The probability for input vector $x$ to have $i$ as language, for instance French or German, is then simply the $i$th coordinate of the softmax distribution:
$P(y = i|x) = \frac{e^{w_i \cdot x}}{\sum_{j=1}^{C} e^{w_j \cdot x}}.$
A training procedure based on gradient descent determines the weight values $w_j$ attached to each language $j$.
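As a small numerical illustration, here is a minimal NumPy sketch of the softmax function, where the logit values are made up:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

softmax(np.array([2.0, 1.0, 0.1]))
# approximately array([0.659, 0.242, 0.099])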
If our network has more than one layer, we apply the softmax function to the last
layer. All the previous layers normally use the relu activation.
1 The softmax function is in fact a renaming of the much older Boltzmann distribution.
The Keras functions for multiclass networks are nearly identical to those
of binary networks. We just change the loss from binary_crossentropy to
categorical_crossentropy and the activation of the last layer from sigmoid to
softmax. However, Keras does not handle the output labels like scikit-learn and
we need to convert them into a Keras-compatible format. There are two options:
one-hot vectors or integer indices:
1. We can represent the class of an observation, $y$, as a one-hot vector, just as in Sect. 8.6 with, for example, $y = (1, 0, 0)$ for English. Given an observation $x$, the classifier outputs the probabilities to belong to a certain class, for instance $\hat{y} = (0.2, 0.7, 0.1)$. We predict the class by picking the index with the maximal value, $\arg\max_i \hat{y}$, here $i = 1$, French, with a probability of 0.7. As the index of the true language is 0 in this example, the loss is $-\ln 0.2$.
2. One-hot vectors are equivalent to indices and we can also encode y with a
number, for instance 0, 1, and 2 for three classes. In this case, we replace the
loss argument with sparse_categorical_crossentropy.
We will now write a program to exemplify these concepts. We will use the same
dataset as with scikit-learn in Sect. 7.11.8 with letter counts of versions of Salammbô
in French, English, and German. We store them in an X matrix and, in the y vector,
the category of these counts:
y = np.array(
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
Before we fit a model, we scale X as in Sect. 8.5.1 and we store the result in
X_scaled. We convert y in one-hot vectors using the utility function to_categorical:
Y_cat = keras.utils.to_categorical(y)
The model architecture is the same as in Sect. 8.5.2 except the arguments of the
last layer, where we change the number of output classes to 3 and the activation
function from sigmoid to softmax. For one layer, we have:
model = keras.Sequential([
keras.layers.Dense(3, activation=’softmax’)
])
We can now compile and fit the model. This part is also nearly identical to the
code in Sect. 8.5.2: We just need to change the loss to categorical_crossentropy as
we have more than two classes:
model.compile(loss=’categorical_crossentropy’,
optimizer=’sgd’,
metrics=[’accuracy’])
model.fit(X_scaled, Y_cat, epochs=50, batch_size=1)
Once we have fitted the model, we can predict the class of new observations. In
the code below, we reapply the model to the training set:
Y_pred_proba = model.predict(X_scaled)
Each row represents the probability estimates of an observation. To extract the class, we pick the index of the highest value. In the first row, this is index 0, representing English, with a probability of 0.692. In the last one, this is index 2, German, with a probability of 0.991.
We extract the classes from a matrix of probability estimates with the
np.argmax() function:
y_pred = np.argmax(Y_pred_proba, axis=-1)
returning .ŷ:
np.array(
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2])
Compared with $y$, $\hat{y}$ contains three errors as the model predicted wrongly three German samples. This yields an accuracy of 93%.
Although accuracy is intuitive, the gradient descent only tries to minimize the loss. For one observation, $(x_i, y_i)$, where $y_i$ is the class represented as a one-hot vector, this corresponds to the dot product:
$-y_i \cdot \ln \hat{y}_i,$
as we saw in Sect. 8.6. For the first row, we compute its value with the line:
-[1, 0, 0] @ np.log([0.692, 0.075, 0.233]) # 0.368
For the whole dataset, we extract the probabilities of the true classes of the
observations, compute their logarithms and their mean:
-np.mean(np.log(Y_pred_proba[range(0, len(y)), y]))
We obtain a loss of 0.32. We can also compute the loss and the accuracy with
model.evaluate(X_scaled, Y_cat)
The creation and standardization of the dataset is the same as with Keras. We need
though to convert the data to PyTorch tensors:
X_scaled = torch.from_numpy(X_scaled).float()
y = torch.from_numpy(y).long()
We create a model with one hidden layer of five nodes by adding a linear module
followed by a reLU activation:
from collections import OrderedDict
model = nn.Sequential(OrderedDict([
(’W1’, nn.Linear(input_dim, 5)),
(’reLU’, nn.ReLU()),
(’W2’, nn.Linear(5, 3))
]))
By default, PyTorch cross entropy reduces the loss of each mini-batch with its mean.
In the next steps, we will compute the loss mean for the whole dataset. It is then
preferable to sum the loss of each mini-batch, sum all the mini-batches, and then
compute the mean.
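The loss, the optimizer, and the data loader are created as in the previous sections; a minimal sketch with the summed cross entropy just described, where the batch size is our own choice:

from torch import nn
from torch.utils.data import TensorDataset, DataLoader

loss_fn = nn.CrossEntropyLoss(reduction='sum')  # sum the losses in each mini-batch
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
dataloader = DataLoader(TensorDataset(X_scaled, y),
                        batch_size=4, shuffle=True)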
Fig. 8.7 The training loss over the epochs for the logistic regression model
The fitting loop is also similar and we record the loss. See Fig. 8.7.
model.train()
ce_loss = []
for epoch in range(150):
loss_train = 0
for X_scaled_batch, y_batch in dataloader:
y_batch_pred = model(X_scaled_batch)
loss = loss_fn(y_batch_pred, y_batch)
loss_train += loss.item()
optimizer.zero_grad()
loss.backward()
optimizer.step()
ce_loss += [loss_train/len(y)]
with torch.no_grad():
Y_pred_logits = model(X_scaled)
We obtain the result below for the logistic regression model. It is not normalized as
the last layer has no activation:
tensor([[ 1.2929, -1.4799, 0.4301],
[ 0.9736, 0.2044, -0.9725],
[ 0.9331, 0.4184, -1.1506],
...]])
Logits are difficult to interpret and we apply a softmax function to the rows:
with torch.no_grad():
Y_pred_proba = torch.softmax(model(X_scaled), dim=-1)
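We can then extract the predicted classes by picking the index of the highest probability in each row, for instance with:

y_pred = torch.argmax(Y_pred_proba, dim=-1)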
The training loop of PyTorch that we first introduced in Sect. 8.5.3 is more complex
and more difficult to understand than the fitting functions of both Keras and
scikit-learn. In addition, the connection with gradient descent (Sect. 7.5) and the
mathematical equations of backpropagation (Sect. 8.3) is not obvious at first sight.
We will walk through this loop with the 5-hidden-node network from the previous section, where we used a learning rate of 0.01, and we will see how PyTorch implements all this.
Let us examine the model after we have run training epochs and suppose that we
are at step k. We will focus on the first layer .W (1) consisting of five nodes. All the
other parameters would show the same behavior. We print its weight matrix with:
model.W1.weight
yielding:
tensor([[-0.8995, 0.3332],
[ 0.9948, -1.4310],
[-0.2389, 0.5856],
[-1.9805, 2.0227],
[-0.6628, 0.2297]], requires_grad=True)
This represents the state of the model at a certain iteration, here step k. Using
this model, we can predict the targets and measure the difference with the true
values with the cross entropy loss, here reduced as a sum. This corresponds to the
statement:
loss = loss_fn(model(X_scaled), y)
# tensor(10.3754, grad_fn=<NllLossBackward0>)
The tensors store gradients when requires_grad is true. We print the gradient of .W 1
with:
model.W1.weight.grad
yielding:
tensor([[-0.0000, 0.0000],
[-0.0000, 0.0000],
[ 0.0028, -0.0114],
[-0.0000, 0.0000],
[ 0.0039, -0.0158]])
This gradient comes from an earlier operation. We saw that PyTorch accumulates
them. Before we backpropagate the gradients of the current loss, we need to clear
the old ones with:
optimizer.zero_grad()
After this statement, printing model.W1.weight.grad prints nothing. We can now backpropagate the gradients safely to all the layers,
starting from the last one with:
loss.backward()
$W^{(1)}$ has a new gradient corresponding to the current loss:
>>> model.W1.weight.grad
tensor([[ 0.1993, -0.1766],
[-0.0467, 0.0465],
[ 0.0617, -0.0510],
[ 0.5730, -0.4875],
[ 0.0860, -0.0710]])
Finally, optimizer.step() applies the update rule from Sect. 7.5 to all the parameters:
$w_{k+1} = w_k - \alpha_k \nabla f(w_k).$
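Assuming the optimizer is plain SGD with a learning rate of 0.01 and no momentum, we can check this update numerically: after optimizer.step(), the new weights equal the old ones minus the learning rate times the gradient. A sketch:

old_weight = model.W1.weight.detach().clone()
grad = model.W1.weight.grad.clone()
optimizer.step()
print(torch.allclose(model.W1.weight, old_weight - 0.01 * grad))
# True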
Neural networks and gradient descent are described in a countless number of books.
As in the previous chapter, Murphy (2022), James et al. (2021), and Goodfellow
et al. (2016) are good starting points. In many applications, the parameter values
of gradient descent are very important to the success of the training procedures.
Among these parameters, the optimizer has a considerable influence on the result.
Ruder (2017) is a good survey of available gradient descent algorithms.
A number of machine-learning toolkits are available from the Internet. Scikit-
learn2 (Pedregosa et al. 2011) is a comprehensive collection of machine-learning
and data analysis algorithms with an API in Python. Keras3 (Chollet 2021) is
a neural network library also in Python. PyTorch4 is another one, which is a
conversion in Python of Torch, an older library written in Lua. R5 is another set
of statistical and machine-learning functions with a script language.
We used scikit-learn, Keras and PyTorch in this chapter. Chollet (2021) is
an excellent and pedagogical tutorial on Keras and machine learning in general.
Stevens et al. (2020) is a book by programmers who participated in the development
of PyTorch. Raschka et al. (2022) is a third one that covers both PyTorch and scikit-
learn.
2 https://scikit-learn.org/.
3 https://keras.io/.
4 https://pytorch.org/.
5 https://www.r-project.org/.
Chapter 9
Counting and Indexing Words
Many language processing techniques rely on words and sentences. When this is the
case and when the input data is a stream of characters, we must first segment it, i.e.,
identify the words and sentences in it, before we can apply any further operation to
the text. We call this step text segmentation or tokenization. A tokenizer may also
be preceded or include a cleaning step to remove formatting instructions, such as
XML tags, if any, and a normalization with NFC or NFKC, see Sect. 4.1.3.
Originally, early European scripts had no symbols to mark segment boundaries
inside a text. Ancient Greeks and Romans wrote their inscriptions as continuous
strings of characters flowing from left to right and right to left without punctuation
or spaces. The lapis niger, one of the oldest remains of the Latin language, is an
example of this writing style, also called boustrophedon (Fig. 9.1).
As the absence of segmentation marks made texts difficult to read, especially
when engraved on a stone, Romans inserted dots to delimit the words and thus
improve their legibility. This process created the graphic word as we know it: a
sequence of letters between two specific signs. Later white spaces replaced the dots
as word boundaries and Middle Ages scholars introduced a set of punctuation signs:
commas, full stops, question and exclamation marks, colons, and semicolons, to
delimit phrases and sentences.
The definition of what a word is, although apparently obvious, is in fact surprisingly
difficult. A naïve description could start from its historical origin: a sequence of
alphabetic characters delimited by two white spaces. This is an approximation. In
addition to white spaces, words can end with commas, question marks, periods, etc.
Words can also include dashes and apostrophes that, depending on the context, have
a different meaning.
Word boundaries vary according to the language and orthographic conventions.
Compare these different spellings: news stand, news-stand, and newsstand. Al-
though the latter one is considered more correct, the two other forms are also
frequent. Compare also the convention in German to bind together adjacent nouns as
in Gesundheitsreform, as opposed to English that would more often separate them,
as in health reform. Compare finally the ambiguity of punctuation marks, as in the
French word aujourd’hui, ‘today’, which forms a single word, and l’article, ‘the
article’, where the sequence of an article and a noun must be separated before any
further processing.
In corpus processing, text elements are generally called tokens. Tokens include
words and also punctuation, numbers, abbreviations, or any other similar type of
string. Tokens may mix characters and symbols as:
• Numbers: 9,812.345 (English and French from the eighteenth–nineteenth cen-
tury), 9 812,345 (current French and German) 9.812,345 (French from the
nineteenth–early twentieth century);
• Dates: 01/02/2003 (French and British English), 02/01/2003 (US English),
2003/02/01 (Swedish);
• Abbreviations and acronyms: km/h, m.p.h., S.N.C.F.;
• Nomenclatures: A1-B45, /home/pierre/book.tex;
• Destinations: Paris–New York, Las Palmas–Stockholm, Rio de Janeiro–Frankfurt
am Main;
• Telephone numbers: (0046) 46 222 96 40;
• Tables;
• Formulas: $E = mc^2$.
As for the words, the definition of what is a sentence is also tricky. A naïve
definition would be a sequence of words ended by a period. Unfortunately, periods
are also ambiguous. They occur in numbers and terminate abbreviations, as in etc.
or Mr., which makes sentence isolation equally complex. In the next sections, we
examine techniques to break a text into words and sentences, and to count the words.
Tokenization breaks a character stream, that is, a text file or a keyboard input, into
tokens—separated words—and sentences. In Python, it results in a list of strings.
For this paragraph, such a list looks like:
[[’Tokenization’, ’breaks’, ’a’, ’character’, ’stream’, ’,’,
’that’, ’is’, ’,’, ’a’, ’text’, ’file’, ’or’, ’a’, ’keyboard’,
’input’, ’,’, ’into’, ’tokens’, ’-’, ’separated’, ’words’, ’-’,
’and’, ’sentences’, ’.’], [’In’, ’Python’, ’,’, ’it’, ’results’,
’in’, ’a’, ’list’, ’of’, ’strings’, ’.’], [’For’, ’this’,
’paragraph’, ’,’, ’such’, ’a’, ’list’, ’looks’, ’like’, ’:’]]
A basic format to output or store tokenized texts is to print one word per line and
have a blank line to separate sentences as in:
In
Python
,
it
results
in
a
list
of
strings
.
For
this
paragraph
,
such
a
list
looks
like
:
As our patterns will include Unicode classes, we import the regex module:
import regex as re
We will use the letters, \p{L}, numbers, \p{N}, punctuation, \p{P}, and symbols,
\p{S}; see Table 4.6 for a complete list of the classes.
We can first define the words as sequences of contiguous letters with the \p{L}+
pattern. The tokenization is straightforward in Python with the re.findall()
function that returns a list of words:
>>> re.findall(r’\p{L}+’, text)
This regex is very simple, but it ignored the punctuation as well as the other
character sequences. If we want to extract the words and the rest, excluding the
spaces, we need to add another pattern to specify the nonword tokens:
r’\p{L}+|[^\s\p{L}]+’
This disjunction creates a token from a letter sequence (\p{L}+) as well as from a
sequence of other characters ([^\s\p{L}]+):
>>> re.findall(r’\p{L}+|[^\s\p{L}]+’, text)
tokenizes the punctuation separately and creates a single token for each punctuation
sign.
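A small end-to-end sketch of both patterns, where the sample sentence is our own:

import regex as re

sample = "Tell me, O muse, of that ingenious hero."
re.findall(r'\p{L}+', sample)
# ['Tell', 'me', 'O', 'muse', 'of', 'that', 'ingenious', 'hero']
re.findall(r'\p{L}+|[^\s\p{L}]+', sample)
# ['Tell', 'me', ',', 'O', 'muse', ',', 'of', 'that', 'ingenious', 'hero', '.']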
The second technique matches the word delimiters to tokenize a text. In its simplest
form, we just split the text when we encounter a sequence of white spaces and we
return the tokens between the delimiters in a list.
Python has a split() function that can just do that. It has two variants:
• str.split(separator), where the separator is a constant string. If there is
no argument, the default separator is a sequence of white spaces, including
tabulations and new lines;
• re.split(pattern, string), where the separator is a regular expression, pattern,
that breaks up the string variable as many times as pattern matches in string.
We will use re.split() as it is more flexible and the pattern
pattern1 = r’\s+’
we obtain:
[’Tell’, ’me,’, ’O’, ’muse,’, ’of’, ’that’, ’ingenious’,
’hero’, ’who’, ’travelled’, ’far’, ..., ’of’, ’Troy.’]
To separate the punctuation from the words, we first apply re.sub() with a second pattern, pattern2, that matches the punctuation signs and surrounds each match, \1, with spaces. We then tokenize the text according to white spaces as in the previous segmentation. Altogether, we have this function:
re.split(
pattern1,
re.sub(pattern2, r’ \1 ’, text))
This re.split() produces empty strings. We filter them out from our token list
with the statement:
filter(None, token_list)
where None acts as the identity function. As we have seen, this second technique
is a bit more convoluted than the direct identification of words. Most of the time, the
first technique is to be preferred.
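For completeness, here is a runnable sketch of this second technique, where the sample sentence and the punctuation pattern pattern2 are our own choices, not necessarily those used in this section:

import regex as re

sample = "Tell me, O muse, of that ingenious hero."
pattern1 = r'\s+'
pattern2 = r'([\p{P}\p{S}])'   # hypothetical: capture each punctuation or symbol

tokens = list(filter(None, re.split(pattern1,
                                    re.sub(pattern2, r' \1 ', sample))))
# ['Tell', 'me', ',', 'O', 'muse', ',', 'of', 'that', 'ingenious', 'hero', '.']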
The tokenizing programs we have created so far are not yet perfect. Decimal
numbers, for example, would not be properly processed. They would match the
point of decimal numbers such as 3.14 and create the tokens 3 and 14. The
apostrophe inside words is another ambiguous sign. The tokenization of auxiliary
and negation contractions in English would then need a morphological analysis.
Improving the tokenizers requires more complex regular expressions that would
take into account word forms such as those in Table 9.1.
In French, apostrophes corresponding to the elided e have a regular behavior as
in
Si j’aime et d’aventure .→ si j’ aime et d’ aventure
but there are words like aujourd’hui, ‘today’, that correspond to a single entity and
are not tokenized. This would also require a more elaborate regular expression.
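As an illustration, here is one possible refinement that keeps decimal numbers together and detaches French elisions; the pattern is our own and, as noted above, it still does not handle forms like aujourd'hui:

import regex as re

pattern = r"\p{N}+(?:[.,]\p{N}+)*|\p{L}+['’]?|[^\s\p{L}\p{N}]+"
re.findall(pattern, "Si j'aime 3.14 et d'aventure.")
# ['Si', "j'", 'aime', '3.14', 'et', "d'", 'aventure', '.']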
So far, we have carried out tokenization using rules that we have explicitly defined
and implemented using regular expressions or Python. A second option is to use
classifiers such as logistic regression (Sect. 7.9) and to train a tokenizer from a
corpus. Given an input queue of characters, we then formulate tokenization as
a binary classification: is the current character the end of a token or not? If the
classifier predicts a token end, we insert a new line.
Before we can train our classifier, we need a corpus and an annotation to mark
the token boundaries. Let us use the OpenNLP format as an example. The Apache
OpenNLP library is an open-source toolkit for natural language processing. It
features a classifier-based tokenizer and has defined an annotation for it (Apache
OpenNLP Development Community 2012). A training corpus consists of a list of
sentences with one sentence per line, where the white spaces are unambiguous token
boundaries. The other token boundaries are marked with the <SPLIT> tag, as in these
two sentences:
Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the
board as a nonexecutive director Nov. 29<SPLIT>.
Mr. Vinken is chairman of Elsevier N.V.<SPLIT>, the Dutch
publishing group<SPLIT>.
Note that in the example above, the sentences are too long to fit the width of
the page and we inserted additional line breaks and leading spaces to denote
a continuing sentence. In the corpus file, every new line corresponds to a new
sentence.
Once we have an annotation, we need to define the features we will use for the
classifier. We already used features in Sect. 7.1 in the form of letter frequencies to
classify the language of a text. For the tokenization, we will follow Reynar (1998,
pp. 69–70), who describes a simple feature set consisting of four features:
• The current character,
• The pair formed of the previous and current characters,
• The next character,
• The pair formed of the two next characters.
As examples, Table 9.2 shows the features extracted from three characters in the
sentences above: the second n in Pierre Vinken, the d in old, and the dot in Nov.
Table 9.2 The features extracted from the second n in Pierre Vinken, the d in old, and the dot in
Nov.. The two classes to learn are inside token and token end

Context   Current char.  Previous pair  Next char.  Next pair  Class         Action
Vinken,   n              en             ,           ", "       Token end     New line
old,      d              ld             ,           ", "       Token end     New line
Nov.      v              ov             .           ". "       Inside token  Nothing
From these features, the classifier will create a model and discriminate between the
two classes: inside token and token end.
Before we can learn the classifiers, we need a corpus annotated with the <SPLIT>
tags. We can create one by tokenizing a large text manually—a tedious task—or by
reconstructing a nontokenized text from an already tokenized text. See, for example,
the Penn Treebank (Marcus et al. 1993) for English or Universal Dependencies
(Nivre et al. 2017) for more than 150 languages, including English.
We extract a training dataset from the corpus by reading all the characters and
extracting for each character their four features and their class. We then train the
classifier, for instance, using logistic regression, to create a model. Finally, given a
nontokenized text, we apply the classifier and the model to each character of the text
to decide if it is inside a token or if it is a token end.
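To make this concrete, here is a minimal sketch under stated assumptions: extract_samples() is a hypothetical helper that derives Reynar's four features and the class of every character from a <SPLIT>-annotated sentence, and scikit-learn's DictVectorizer and LogisticRegression stand in for the feature encoding and the classifier:

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def extract_samples(annotated_line):
    # Rebuild the raw character stream and record the token ends
    text = ''
    ends = set()
    i = 0
    while i < len(annotated_line):
        if annotated_line.startswith('<SPLIT>', i):
            ends.add(len(text) - 1)
            i += len('<SPLIT>')
        else:
            if annotated_line[i] == ' ':
                ends.add(len(text) - 1)
            text += annotated_line[i]
            i += 1
    # Reynar's four features and the class for each character
    X, y = [], []
    for j in range(len(text)):
        X.append({'current': text[j],
                  'prev_pair': text[max(j - 1, 0):j + 1],
                  'next': text[j + 1:j + 2],
                  'next_pair': text[j + 1:j + 3]})
        y.append('token end' if j in ends else 'inside token')
    return X, y

X_dict, y = extract_samples(
    'Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the '
    'board as a nonexecutive director Nov. 29<SPLIT>.')
vectorizer = DictVectorizer()
X = vectorizer.fit_transform(X_dict)
clf = LogisticRegression(max_iter=1000).fit(X, y)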
Sentences usually end with a period, and we will use this sign to recognize
boundaries. However, this is an ambiguous symbol that can also be a decimal point
or appear in abbreviations or ellipses. To disambiguate it, we now introduce two
main lines of techniques, identical to those we used for tokenization: rules and
classifiers.
Although in this chapter we describe sentence segmentation after tokenization,
most practical systems apply them as a pipeline, with sentence segmentation as the
first step, followed by tokenization.
We will consider that a period sign either corresponds to a sentence end, a decimal
point, or a dot in an abbreviation. Most of the time, we can recognize these three
cases by examining a limited number of characters to the right and to the left of the
sign. The objective of disambiguation rules is then to describe for each case what
can be the left and right context of a period.
The disambiguation is easier to implement as a two-pass search: the first pass
recognizes decimal numbers or abbreviations and annotates them with a special
marking. The second one runs the detector on the resulting text. In this second pass,
we also include the question and exclamation marks as sentence boundary markers.
We can generalize this strategy to improve the sentence segmentation with
specific rules recognizing dates, percentages, or nomenclatures that can be run as
different processing stages. However, there will remain cases where the program
fails, notably with abbreviations.
Starting from the simplest rule to identify sentence boundaries, namely that a period
corresponds to a full stop, Grefenstette and Tapanainen (1994) experimented with
a set of increasingly complex regular expressions to carry out segmentation. They
evaluated them on the Brown corpus (Francis and Kucera 1982).
About 7% of the sentences in the Brown corpus contain at least one period that
is not a full stop. Using their first rule, Grefenstette and Tapanainen could correctly
recognize 93.20% of the sentences. As a second step, they designed the set of regular
expressions in Table 9.3 to recognize numbers and remove decimal points from the
list of full stops. This raised the proportion of correctly segmented sentences to
93.78%.
Regular expressions in Table 9.3 are designed for English text. French and
German decimal numbers would have a different form as they use a comma as
decimal point and a period or a space as a thousand separator:
([0-9]+(\.| )?)*[0-9](,[0-9]+)
Table 9.4 Regular expressions to recognize abbreviations and performance breakdown. The
Correct column indicates the number of correctly recognized instances, Errors indicates the
number of errors introduced by the regular expression, and Full stop indicates abbreviations
ending a sentence where the period is a full stop at the same time. After Grefenstette and
Tapanainen (1994)
Regex Correct Errors Full stop
[A-Za-z]\. 1327 52 14
[A-Za-z]\.([A-Za-z0-9]\.)+ 570 0 66
[A-Z][bcdfghj-np-tvxz]+\. 1938 44 26
Totals 3835 96 106
Table 9.5 The features extracted from Nov. and 29. in the example sentences in Sect. 9.2.4. The
two classes to learn are inside sentence and end of sentence
Context Prefix Suffix Previous word Next word Prefix abbrev. Class
Nov. Nov nil director 29. Yes Inside sentence
29. 29 nil Nov. Mr. No End of sentence
Table 9.5 shows the features for the periods in Nov. and 29. in the example
sentences in Sect. 9.2.4. The first four features are straightforward to extract. We
need a list of abbreviations for the rest. We can build this list automatically using
the method described in Sect. 9.3.4.
Reynar and Ratnaparkhi (1997) used logistic regression to train their classifica-
tion models and discriminate between the two classes: inside sentence and end of
sentence.
The first step of lexical statistics consists in extracting the list of word types, or simply
types, i.e., the distinct words, from a corpus, along with their frequencies. Within
the context of lexical statistics, word types are opposed to word tokens, the sequence
of running words of the corpus. The excerpt from George Orwell’s Nineteen Eighty-
Four:
War is peace
Freedom is slavery
Ignorance is strength
has nine tokens and seven types. The type-to-token ratio is often used as an
elementary measure of a text’s density.
Extracting and counting words is straightforward and very fast with Python. We can
obtain them with the following algorithm:
1. Tokenize the text file;
2. Count the words and store them in a dictionary;
3. Possibly, sort the words according to their alphabetical order or their frequency.
The resulting program is similar to the letter count in Sect. 2.6.9. Let us use
functions to implement the different steps. For the first step, we apply a tokenizer
to the text (Sect. 9.2) and we produce a list of words as output. Then, the counting
function uses a dictionary with the words as keys and their frequency as value. It
scans the words list and increments the frequency of the words as they occur.
The program reads a file, sets the characters to lower case, and calls the tokenizer
and the counter. It sorts the words alphabetically with sorted(dict.keys()) and
prints them with their frequency. The complete program is:
import sys
# The regex module supports the \p{L} Unicode class
import regex as re

def tokenize(text):
    words = re.findall(r'\p{L}+', text)
    return words

def count_words(words):
    frequency = {}
    for word in words:
        if word in frequency:
            frequency[word] += 1
        else:
            frequency[word] = 1
    return frequency

if __name__ == '__main__':
    text = sys.stdin.read().lower()
    words = tokenize(text)
    word_freqs = count_words(words)
    for word in sorted(word_freqs.keys()):
        print(word, '\t', word_freqs[word])
If we want to sort the words by frequency, we saw in Sect. 2.6.9 that we have
to assign the key argument the value word_freqs.get in sorted(). We also set the
reverse argument to True:
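# A minimal sketch of the statement described above
for word in sorted(word_freqs, key=word_freqs.get, reverse=True):
    print(word, '\t', word_freqs[word])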
Our program is now ready and we can apply it to a corpus. Let us try it with
Homer’s Iliad, where an English version translated by Samuel Butler is available at
the URL https://classics.mit.edu/Homer/iliad.mb.txt. We download the file with the
statements:
import requests
text_copyright = requests.get(
    'http://classics.mit.edu/Homer/iliad.mb.txt').text
The full text contains copyright information that we want to exclude from the
counts. It consists of a header and a footer separated from the main text by a dashed
line. We extract the narrative with the regex:
text = re.search(r'^-+$(.+)^-+$',
                 text_copyright, re.M | re.S).group(1).strip()
Running the program on this text and limiting the output to the 10 most frequent words,
we obtain:
the 9948
and 6624
of 5606
to 3329
he 2905
his 2537
in 2242
him 1868
you 1810
a 1807
We can also count the words with the Counter class from Python's collections
module:
from collections import Counter
words = tokenize(text)
word_freqs = Counter(words)
The n most common words are the same as with our own dictionary:
>>> word_freqs.most_common(5)
[('the', 9948),
 ('and', 6624),
 ('of', 5606),
 ('to', 3329),
 ('he', 2905)]
Finally, let us use Unix tools to count words. In his famous column, Programming
Pearls, Bentley et al. (1986) posed the following problem:
Given a text file and an integer k, print the k most common words in the file (and the number
of their occurrences) in decreasing frequency.
Bentley received two solutions for it: one from Donald Knuth, the prestigious
inventor of TEX, and the second in the form of a comment from Doug McIlroy,
the developer of Unix pipelines. While Knuth sent an 8-page program, McIlroy
proposed a compelling Unix shell script of six lines.1 We reproduce it here (slightly
modified):
1. tr -cs 'A-Za-z' '\n' <input_file |
Tokenize the text in input_file with one word per line. This transliteration
command is the most complex of the six.
(a) tr has two arguments search_list and replacement_list. It replaces all the
occurrences of the characters in search_list by the corresponding character
in replacement_list: The first character in search_list by the first one in
replacement_list and so on. For instance, the instruction tr 'ABC' 'abc'
replaces the occurrences of A, B, and C by a, b, and c, respectively.
(b) tr has a few options:
• d deletes any characters of the search list that have no corresponding
character in the replacement list;
• c translates characters that belong to the complement of the search list.
After complementing the search list, if the replacement list does not have the
same length, the last character of this replacement list is repeated so that
we have equal lengths;
• s reduces—squeezes, squashes—sequences of characters translated to an
identical character to a single instance.
For instance, the command
tr -d 'AEIOUaeiou'
deletes all the vowels from its input. In our script, the command tr -cs 'A-Za-z' '\n'
replaces all the nonletters with a new line, and the contiguous sequences of
translated new lines are reduced to a single one.
The tr output is passed to the next command.
1 Stephen Bourne, the author of the Unix Bourne shell, proposed a similar script; see Bourne (1982,
pp. 196–197).
2. tr 'A-Z' 'a-z' |
Translate the uppercase characters into lowercase letters and pass the output to
the next command.
3. sort |
Sort the words. The identical words will be grouped together in adjacent lines.
4. uniq -c |
Remove repeated lines. The identical adjacent lines will be replaced with one
single line. Each unique line in the output will be preceded by the count of its
duplicates in the input file (-c).
5. sort -rn |
Sort in the reverse (-r) numeric (-n) order. The most frequent words will be
sorted first.
6. head -5
Print the first five lines of the file (the five most frequent words).
The first two tr commands do not take into account possible accented characters.
To correct it, we just need to modify the character list and include accents.
Nonetheless, we can apply the script as it is to English texts. On the novel Nineteen
Eighty-Four (Orwell 1949), the output is:
6518 the
3491 of
2576 a
2442 and
2348 to
The advent of the Web in the mid-1990s made it possible to retrieve automatically
billions of documents from words or phrases they contained. Companies providing
such a service quickly became among the most popular sites on the internet, Google
and Bing being the most notable ones today.
Web search systems or engines are based on “spiders” or “crawlers” that visit
internet addresses, follow links they encounter, and collect all the pages they
traverse. Crawlers can amass billions of pages every month.
All the pages the crawlers download are tokenized and undergo a full text indexing.
To carry out this first step, an indexer extracts all the words of the documents in the
collection and builds a dictionary. It then links each word in the dictionary to the list
of documents where this word occurs. Such a list is called a postings list, where
each posting in the list contains a document identifier and the word’s positions in
Table 9.6 An inverted index. Each word in the dictionary is linked to a postings list that gives
all the documents in the collection where this word occurs and its positions in a document. Here,
the position is the word index in the document. In the examples, a word occurs at most once in a
document. This can be easily generalized to multiple occurrences
Words         Postings lists
America       (D1, 7)
Chrysler      (D1, 1) → (D2, 1)
In            (D1, 5) → (D2, 5)
Investments   (D1, 4) → (D2, 4)
Latin         (D1, 6)
Major         (D2, 3)
Mexico        (D2, 6)
New           (D1, 3)
Plans         (D1, 2) → (D2, 2)
the corresponding document. The resulting data structure is called an inverted index
and Table 9.6 shows an example of it with the two documents:
D1: Chrysler plans new investments in Latin America.
D2: Chrysler plans major investments in Mexico.
An inverted index is pretty much like a book index except that it considers all
the words. When a user asks for a specific word, the search system answers with the
pages that contain it. See Baeza-Yates and Ribeiro-Neto (2011) and Manning et al.
(2008) for more complete descriptions.
To represent the inverted index, we will use a dictionary, where the words in the
collection are the keys and the postings lists, the values. In addition, we will augment
the postings with the positions of the word in each document. We will also represent
the postings lists as dictionaries, where the keys will be the document identifiers and
the values, the lists of positions:
{
index[word1]: {doc_1: [pos1, pos2, ...], ..., doc_n: [pos1, ...]},
index[word2]: {doc_1: [pos1, pos2, ...], ..., doc_n: [pos1, ...]},
...
}
Let us first write two functions to index a single document. We extract the
words with the tokenizer in Sect. 9.2.1, but instead of findall(), we use finditer()
to return the match objects. We will use these match objects to extract the word
positions.
def tokenize(text):
"""
Uses the letters to break the text into words.
Returns a list of match objects
"""
words = re.finditer(r’\p{L}+’, text)
return words
Once the text is tokenized, we can build the index. We define the positions as the
number of characters from the start of the file and we store them in a list:
def text_to_idx(words):
"""
Builds an index from a list of match objects
"""
word_idx = {}
for word in words:
try:
word_idx[word.group()].append(word.start())
except:
word_idx[word.group()] = [word.start()]
return word_idx
Using these two functions, we can index documents, for instance A Tale of Two
Cities by Dickens:2
>>> text = open('A Tale of Two Cities.txt').read().lower().strip()
>>> index = text_to_idx(tokenize(text))
Finally, we gather a collection of books by Dickens and build the index of all
books in the collection with a loop over the list of files:
master_index = {}
for file in corpus_files:
    text = open(file).read().lower().strip()
    words = tokenize(text)
    idx = text_to_idx(words)
    for word in idx:
        if word in master_index:
            master_index[word][file] = idx[word]
        else:
            master_index[word] = {}
            master_index[word][file] = idx[word]
Applying this program results in a master index from which we can find all the
positions of a word, such as vendor, in the documents that contain it:
>>> master_index['vendor']
{'Dombey and Son.txt': [1080291],
 'A Tale of Two Cities.txt': [218582, 218631, 219234, 635168],
 'The Pickwick Papers.txt': [28715],
 'Bleak House.txt': [1474429],
 'Oliver Twist.txt': [788457]}
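As an aside, a small sketch shows how such an index can answer a query with the documents that contain all the query words, by intersecting the keys of the postings; the function name documents_with_all and the query word bank are illustrative:

def documents_with_all(query_words, master_index):
    # One set of document names per query word, then their intersection
    doc_sets = [set(master_index.get(word, {})) for word in query_words]
    return set.intersection(*doc_sets)

documents_with_all(['vendor', 'bank'], master_index)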
Once indexed, search engines compare, categorize, and rank documents using
statistical or popularity models. The vector space model (Salton 1988) is a widely
used representation to carry this out. The idea is to represent the documents in a
vector space whose axes are the words. Documents are then vectors in a space of
words. As the word order plays no role in the representation, it is often called a
bag-of-word model.
Let us first suppose that the document coordinates are the occurrence counts of
each word: a document D is then represented by the vector of its word counts,
d = (C(w_1, D), C(w_2, D), ..., C(w_m, D)).
Table 9.7 shows the document vectors representing the examples in Sect. 9.5.1, and
Table 9.8 shows a general matrix representing a collection of documents, where
each cell (w_i, D_j) contains the frequency of w_i in document D_j.

Table 9.7 The vectors representing the two documents in Sect. 9.5.1. The words have been
normalized in lowercase letters

D# \ Words  America  Chrysler  In  Investments  Latin  Major  Mexico  New  Plans
1           1        1         1   1            1      0      0       1    1
2           0        1         1   1            0      1      1       0    1

Table 9.8 The word by document matrix. Each cell (w_i, D_j) contains the frequency of w_i in
document D_j

D# \ Words  w_1  w_2  w_3  ...  w_m
Using the vector space model, we can measure the similarity between two
documents, D and Q, by the angle they form in the vector space. In practice, it is
easier to compute the cosine of the angle between q, representing Q, and d, defined
as:

\cos(q, d) = \frac{q \cdot d}{\|q\| \cdot \|d\|}
           = \frac{\sum_{i=1}^{n} q_i d_i}{\sqrt{\sum_{i=1}^{n} q_i^2}\,\sqrt{\sum_{i=1}^{n} d_i^2}}.
The cosine values will range from 0, meaning very different documents, to 1,
very similar or identical documents.
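As an illustration, here is a small numpy sketch of this cosine (the function name is an assumption), applied to the two document vectors of Table 9.7; the value is approximately 0.68:

import numpy as np

def cosine(q, d):
    return np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d))

d1 = np.array([1, 1, 1, 1, 1, 0, 0, 1, 1])  # D1 in Table 9.7
d2 = np.array([0, 1, 1, 1, 0, 1, 1, 0, 1])  # D2 in Table 9.7
cosine(d1, d2)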
In fact, most of the time, the rough word counts that are used as coordinates in the
vectors are replaced by a more elaborate term: the term frequency times the inverse
document frequency, better known as tf–idf or tf × idf (Salton 1988). To examine
how it works, let us take the phrase internet in Somalia as an example.
A document that contains many internet words is probably more relevant than
a document that has only one. The frequency of a term i in a document j reflects
this. It is a kind of a “mass” relevance. For each vector, the term frequencies tf_{i,j}
are often normalized by the sum of the frequencies of all the terms in the document
and defined as:

tf_{i,j} = \frac{t_{i,j}}{\sum_{i} t_{i,j}},

where t_{i,j} is the number of occurrences of term i in document j.
However, since internet is a very common word, it is not specific. The number
of documents that contain it must downplay its importance. This is the role of the
inverse document frequency (Spärck Jones 1972):

idf_i = \log\frac{N}{n_i},

where N is the total number of documents in the collection and n_i the number of
documents that contain the term i.
In this section, we gave one definition of tf–idf . In fact, this formula can vary
depending on the application. Salton and Buckley (1987) reported 287 variants of it
and compared their respective merits. BM25 and BM25F (Zaragoza et al. 2004) are
extensions of tf–idf that take into account the document length.
The user may query a search engine with a couple of words or a phrase. Most
systems will then answer with the pages that contain all the words and any of
the words of the question. Some questions return hundreds or even thousands of
valid documents. Ranking a document consists in projecting the space to that of
the question words using the cosine. With this model, higher cosines will indicate
better relevance. In addition to tf × idf, search systems may employ heuristics such
as giving more weight to the words in the title of a page (Mauldin and Leavitt 1994).
Google’s PageRank algorithm (Brin and Page 1998) uses a different technique
that takes into account the page popularity. PageRank considers the “backlinks”, the
links pointing to a page. The idea is that a page with many backlinks is likely to be a
page of interest. Each backlink has a specific weight, which corresponds to the rank
of the page it comes from. The page rank is simply defined as the sum of the ranks
of all its backlinks. The importance of a page is spread through its forward links and
contributes to the popularity of the pages it points to. The weight of each of these
forward links is the page rank divided by the count of the outgoing links. The ranks
are propagated in a document collection until they converge.
9.6.1 Corpora
Using manually-categorized corpora, like the Reuters corpus, and the vector space
model, we can apply supervised machine-learning techniques to train classifiers
(see Sect. 7.1). The training procedure uses a bag-of-word representation of the
documents, either with Boolean features, term frequencies, or .tf × idf as input,
and their classes as output.
Logistic regression again is a simple, yet efficient technique to carry out text
classification. LibShortText (Yu et al. 2013), for example, is an open source library
that includes logistic regression and different types of preprocessing and feature
representations.
In this section, we will build a simple text categorizer using scikit-learn and its
built-in modules. As corpus, we will use Homer’s Iliad and Odyssey, where each
work consists of 24 books. We will consider that each book forms a document.
The first step is to load each book as a string and store it in a list that we call
homer_corpus. We also store the work names, either iliad or odyssey, as strings in a
list called homer_titles.
The statement homer_corpus[0][:60] returns the first 60 characters of the first book
in the list:
Book I\n\nTHE GODS IN COUNCIL--MINERVA’S VISIT TO ITHACA--THE
Before we train a model, we split the corpus into training and test sets using the
train_test_split() function from scikit-learn. As the corpus is small, we set the
size of the test set to be 20%:
from sklearn.model_selection import train_test_split
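A plausible completion of the split, assuming the homer_corpus and homer_titles lists defined above:

X_train, X_test, y_train, y_test = train_test_split(
    homer_corpus, homer_titles, test_size=0.2)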
We then convert the training documents into vectors of token counts and apply a
tf–idf transformation:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
The resulting matrix, X_train_tfidf, has a shape of (38, 8833), meaning that we
have 38 books, the documents, about 80% of the corpus, and that each book is
represented by a row of 8833 word features.
We train a model as in Sect. 7.11.1:
from sklearn.linear_model import LogisticRegression
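A plausible completion of the training step; the variable name clf matches the prediction code below:

clf = LogisticRegression().fit(X_train_tfidf, y_train)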
Before applying a prediction, we must convert the test documents into numerical
vectors:
X_test_counts = count_vect.transform(X_test)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
Then we predict the test set and measure the accuracy score:
from sklearn.metrics import accuracy_score
y_test_hat = clf.predict(X_test_tfidf)
accuracy_score(y_test, y_test_hat)
3 https://www.nltk.org/.
4 https://spacy.io/.
5 https://opennlp.apache.org/.
vector database and use approximate comparison algorithms such as Faiss (Johnson
et al. 2019).
Lucene6 is a popular open-source indexer written in Java with a Python API. It
is used in scores of web sites such as Twitter and Wikipedia to carry out search and
information retrieval.
6 https://lucene.apache.org/.
Chapter 10
Word Sequences
Thus we find that when an event has occurred any number of times in succession, the
probability that it will happen again the next time is equal to this number increased by one,
divided by the same number increased by two. If, for example, we place the oldest epoch of
history at five thousand years, or 1826213 days, and the Sun having risen constantly in this
interval at each twenty-four-hour revolution, the odds are 1826214 to one that it will rise
again tomorrow.
Pierre-Simon Laplace. Essai philosophique sur les probabilités. 1840.
See explanations in Sect. 10.4.2.
We saw in Chap. 3, Corpus Processing Tools that words have specific contexts of
use. Pairs of words like strong and tea or powerful and computer are not random
associations but the result of a preference. A native speaker will use them naturally,
while a learner will have to learn them from books—dictionaries—where they are
explicitly listed. Similarly, the words rider and writer sound much alike in American
English, but they are likely to occur with different surrounding words. Hence,
hearing an ambiguous phonetic sequence, a listener will discard the improbable
rider of books or writer of horses and prefer writer of books or rider of horses
(Church and Mercer 1993).
In lexicography, extracting recurrent pairs of words—collocations—is critical to
finding the possible contexts of a word and citing real examples of its use. In speech
recognition, the statistical estimate of a word sequence—also called a language
model—is a key part of the recognition process. The language model component
of a speech recognition system enables the system to predict the next word given a
sequence of previous words: the writer of books, novels, poetry, etc., rather than of
the writer of hooks, nobles, poultry.
10.2 N-Grams
Collocations and language models use the frequency of pairs of adjacent words:
bigrams, for example, how many “of the” there are in this text; of word triples:
trigrams; and more generally of fixed sequences of n words: n-grams. In lexical
statistics, single words are called unigrams.
Jelinek (1990) exemplified corpus statistics and trigrams with the sentence
We need to resolve all of the important issues within the next two days
We count bigrams just as we did with unigrams. The only difference is that we use
pairs of adjacent words instead of words. We extract these pairs, .(wi , wi+1 ), with the
slice notation: words[i:i+2] that produces a list of two strings. As with unigrams,
we use the bigrams as the keys of a dictionary and their frequencies as the values. We
need then to make sure that we have the right data type for this. Python dictionaries
only accept immutable structures as keys (see Sect. 2.6.7) and we hence convert our
bigrams to tuples. We create the bigram list with a list comprehension:
bigrams = [tuple(words[idx:idx + 2])
for idx in range(len(words) - 1)]
The rest of the program is nearly identical to that of Sect. 9.4.2 for unigrams. As
input, we use the same list of words and Counter to count the items in the list:
words = tokenize(text.lower())
bigrams = [tuple(words[idx:idx + 2])
for idx in range(len(words) - 1)]
bigram_freqs = Counter(bigrams)
The bigram count can easily be generalized to n-grams with the statement:
ngrams = [tuple(words[idx:idx + n])
for idx in range(len(words) - n + 1)]
Setting n = 3, we obtain the five most frequent trigrams from the Iliad with:
>>> Counter(ngrams).most_common(5)
With the Unix tools from Sect. 9.4.4, it is also easy to extend the counts from
unigrams to bigrams. We need first to create a file, where each line contains a
bigram: the words at index i and .i + 1 on the same line separated with a blank.
We use the Unix commands:
1. tr -cs 'A-Za-z' '\n' < input_file > token_file
Tokenize the input and create a file with the unigrams.
2. tail +2 < token_file > next_token_file
Create a second unigram file starting at the second word of the first tokenized file
(+2).
3. paste token_file next_token_file > bigrams
Merge the lines (the tokens) pairwise. Each line of bigrams contains the words at
index i and .i + 1 separated with a tabulation.
4. And we count the bigrams as in the previous script.
We observed in Table 10.1 that some word sequences are more likely than others.
Using a statistical model, we can quantify these observations. The model will enable
us to assign a probability to a word sequence as well as to predict the next word to
follow the sequence.
Let S = w_1, w_2, \ldots, w_i, \ldots, w_n be a word sequence. Given a training corpus, an
intuitive estimate of the probability of the sequence, P(S), is the relative frequency
of the string w_1, w_2, \ldots, w_i, \ldots, w_n in the corpus. This estimate is called the
maximum likelihood estimate (MLE):

P_{MLE}(S) = \frac{C(w_1, \ldots, w_n)}{N}.

Using the chain rule, we can decompose P(S) into a product of conditional
probabilities:

P(S) = P(w_1, \ldots, w_n)
     = P(w_1)P(w_2|w_1)P(w_3|w_1, w_2) \ldots P(w_n|w_1, \ldots, w_{n-1})
     = \prod_{i=1}^{n} P(w_i|w_1, \ldots, w_{i-1}).
The probability P(It, was, a, bright, cold, day, in, April) from Nineteen
Eighty-Four by George Orwell corresponds then to the probability of having It to
begin the sentence, then was knowing that we have It before, then a knowing that
we have It was before, and so on, until the end of the sentence: a product of
conditional probabilities.
To estimate P(S), we need to know unigram, bigram, trigram (so far, so good), but
also 4-gram, 5-gram, and even 8-gram statistics. Of course, no corpus is big enough
to produce them. A practical solution is then to limit the n-gram length to 2 or 3,
and thus to approximate P(S) with bigrams:

P(S) \approx P(w_1) \prod_{i=2}^{n} P(w_i|w_{i-1}),

or trigrams:

P(S) \approx P(w_1) P(w_2|w_1) \prod_{i=3}^{n} P(w_i|w_{i-2}, w_{i-1}).

The maximum likelihood estimate of a trigram probability is:

P_{MLE}(w_i|w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}, w_{i-1}, w_i)}{C(w_{i-2}, w_{i-1})},

and, more generally, for n-grams:

P_{MLE}(w_{i+n}|w_{i+1}, \ldots, w_{i+n-1})
  = \frac{C(w_{i+1}, \ldots, w_{i+n})}{\sum_{w} C(w_{i+1}, \ldots, w_{i+n-1}, w)}
  = \frac{C(w_{i+1}, \ldots, w_{i+n})}{C(w_{i+1}, \ldots, w_{i+n-1})}.
As the probabilities we obtain are usually very low, it is safer to represent them
as a sum of logarithms in practical applications. For the bigrams, we will then use:

\log P(S) \approx \log P(w_1) + \sum_{i=2}^{n} \log P(w_i|w_{i-1}),

or the negative log-likelihood:

NLL(P(S)) \approx -\log P(w_1) - \sum_{i=2}^{n} \log P(w_i|w_{i-1}).
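As an illustration, here is a minimal sketch of the bigram log probability of a tokenized sentence, given unigram and bigram Counter objects; the function name is an assumption and it supposes that all the words and bigrams have been seen (unseen n-grams require the smoothing techniques of Sect. 10.4):

import math

def bigram_log_prob(sentence, unigram_freqs, bigram_freqs):
    # log P(w1) + sum of log P(wi|wi-1), all estimated with MLE
    n_words = sum(unigram_freqs.values())
    log_p = math.log(unigram_freqs[sentence[0]] / n_words)
    for w1, w2 in zip(sentence[:-1], sentence[1:]):
        log_p += math.log(bigram_freqs[(w1, w2)] / unigram_freqs[w1])
    return log_p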
Before computing the probability of a word sequence, we must train the language
model. Like in our machine-learning experiments, the corpus used to derive the n-
gram frequencies is classically called the training set, and the corpus on which we
apply the model, the test set. Both sets should be distinct. If we apply a language
model to a word sequence, which is part of the training corpus, its probability will
be biased to a higher value, and thus will be inaccurate. The training and test sets
can be balanced or not, depending on whether we want them to be specific to a task
or more general.
For some models, we need to optimize parameters in order to obtain the best
results. Again, it would bias the results if at the same time, we carry out the
optimization on the test set and run the evaluation on it. For this reason some
models need a separate validation set, also called development set, to fine-tune
their parameters.
In some cases, especially with small corpora, a specific division between training
and test sets may have a strong influence on the results. It is then preferable to
apply the training and testing procedure several times with different sets and average
the results. The method is to randomly divide the corpus into two sets. We learn
the parameters from the training set, apply the model to the test set, and repeat
the process with a new random division, for instance, ten times. This method is
called cross-validation, or tenfold cross-validation if we repeat it ten times. Cross-
validation smoothes the impact of a specific partition of the corpus; see Sect. 6.4.3.
Most corpora use some sort of markup language. The most common markers of N-
gram models are the sentence delimiters <s> to mark the start of a sentence and </s>
at its end. For example:
<s> It was a bright cold day in April </s>
The Vocabulary
We have defined language models that use a predetermined and finite set of
words. This is never the case in reality, and the models will have to handle out-
of-vocabulary (OOV) words. Training corpora are typically of millions, or even
billions, of words. However, whatever the size of a corpus, it will never have a
complete coverage of the vocabulary. Some words that are unseen in the training
corpus are likely to occur in the test set. In addition, frequencies of rare words will
not be reliable.
There are two main types of methods to deal with OOV words:
• The first method assumes a closed vocabulary. All the words both in the training
and the test sets are known in advance. Depending on the language model
settings, any word outside the vocabulary will be discarded or cause an error.
This method is used in some applications, like voice control of devices.
• The open vocabulary makes provisions for new words to occur with a specific
symbol, <UNK>, called the unknown token. All the OOV words are mapped to
<UNK>, both in the training and test sets.
The vocabulary itself can come from an external dictionary. It can also be
extracted directly from the training set. In this case, it is common to exclude the
rare words, notably those seen only once—the hapax legomena. The vocabulary
will then consist of the most frequent types of the corpus, for example, the 20,000
most frequent types. The other words, unseen or with a frequency lower than a cutoff
value, 1, 2, or up to 5, will be mapped to <UNK>.
We first normalized the text: we created a file with one sentence per line. We
inserted automatically the delimiters <s> and </s>. We removed the punctuation,
parentheses, quotes, stars, dashes, tabulations, and double white spaces. We set all
the words in lowercase letters. We counted the words, and we produced a file with
the unigram and bigram counts.
The training corpus has 115,212 words; 8635 types, including 3928 hapax
legomena; and 49,524 bigrams, where 37,365 bigrams have a frequency of 1.
Table 10.2 shows the unigram and bigram frequencies for the words of the test
sentence.
Table 10.2 Frequencies of unigrams and bigrams. We excluded the <s> symbols from the word
counts
w_i          C(w_i)  #words   P_MLE(w_i)     w_{i-1}, w_i        C(w_{i-1}, w_i)  C(w_{i-1})  P_MLE(w_i|w_{i-1})
<s>          7072    –        –              –                   –                –           –
a            2482    108,140  0.023          <s> a               133              7072        0.019
good         53      108,140  0.00049        a good              14               2482        0.006
deal         5       108,140  4.62 × 10^-5   good deal           0                53          0.0
of           3310    108,140  0.031          deal of             1                5           0.2
the          6248    108,140  0.058          of the              742              3310        0.224
literature   7       108,140  6.47 × 10^-5   the literature      1                6248        0.00016
of           3310    108,140  0.031          literature of       3                7           0.429
the          6248    108,140  0.058          of the              742              3310        0.224
past         99      108,140  0.00092        the past            70               6248        0.011
was          2211    108,140  0.020          past was            4                99          0.040
indeed       17      108,140  0.00016        was indeed          0                2211        0.0
already      64      108,140  0.00059        indeed already      0                17          0.0
being        80      108,140  0.00074        already being       0                64          0.0
transformed  1       108,140  9.25 × 10^-6   being transformed   0                80          0.0
in           1759    108,140  0.016          transformed in      0                1           0.0
this         264     108,140  0.0024         in this             14               1759        0.008
way          122     108,140  0.0011         this way            3                264         0.011
</s>         7072    108,140  0.065          way </s>            18               122         0.148
All the words of the sentence have been seen in the training corpus, and we can
compute a probability estimate of it from the unigram relative frequencies in
Table 10.2. As P(<s>) is a constant that would scale all the sentences by the same
factor, whether we use unigrams or bigrams, we excluded it from the P(S)
computation. The bigram estimate multiplies the conditional probabilities
P_MLE(w_i|w_{i-1}) in Table 10.2 and, as some bigrams such as good deal are
unseen, it has a zero probability. This is due to sparse data: the fact that the corpus
is not big enough to have all the bigrams covered with a realistic estimate. We shall
see in the next section how to handle them.
The approach using the maximum likelihood estimation has an obvious disad-
vantage because of the unavoidably limited size of the training corpora. Given
a vocabulary of 20,000 types, the potential number of bigrams is 20,000^2 =
400,000,000, and with trigrams, it amounts to the astronomic figure of 20,000^3 =
8,000,000,000,000. No corpus yet has the size to cover the corresponding word
combinations.
Among the set of potential n-grams, some are almost impossible, except as
random sequences generated by machines; others are simply unseen in the corpus.
This phenomenon is referred to as sparse data, and the maximum likelihood
estimator gives no hint on how to estimate their probability.
In this section, we introduce smoothing techniques to estimate probabilities of
unseen n-grams. As the sum of probabilities of all the n-grams of a given length
is 1, smoothing techniques also have to rearrange the probabilities of the observed
n-grams. Smoothing allocates a part of the probability mass to the unseen n-grams;
as a counterpart, it shifts, or discounts, this mass from the other n-grams.
Laplace’s rule (Laplace 1820, p. 17) is probably the oldest published method to cope
with sparse data. It just consists in adding one to all the counts. For this reason, some
authors also call it the add-one method.
Laplace wanted to estimate the probability of the sun to rise tomorrow and he
imagined this rule: he set both event counts, rise and not rise, arbitrarily to one,
and he incremented them with the corresponding observations. From the beginning
of time, humans had seen the sun rise every day. Laplace derived the frequency of
this event from what he believed to be the oldest epoch of history: 5000 years or
1,826,213 days. As nobody observed the sun not rising, he obtained the chance for
the sun to rise tomorrow of 1,826,214 to 1.
Laplace's rule states that the frequency of unseen n-grams is equal to 1 and the
general estimate of a bigram probability is:

P_{Laplace}(w_i|w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{\sum_{w} (C(w_{i-1}, w) + 1)}
                         = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + |V|},

where |V| is the cardinality of the vocabulary, i.e. the number of word types. The
denominator correction is necessary to have the probability sum equal to 1.
With Laplace's rule, we can use bigrams to compute the sentence probability from
the smoothed estimates in Table 10.3.
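A minimal sketch of this rule, assuming unigram and bigram Counter objects and a function name of our own:

def laplace_bigram_prob(w1, w2, unigram_freqs, bigram_freqs):
    # |V|: the number of word types
    vocab_size = len(unigram_freqs)
    return ((bigram_freqs[(w1, w2)] + 1) /
            (unigram_freqs[w1] + vocab_size))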
The Good–Turing estimation (Good 1953) is one of the most efficient smoothing
methods. As with Laplace’s rule, it reestimates the counts of the n-grams observed
in the corpus by discounting them, and it shifts the probability mass it has shaved
to the unseen bigrams. The discount factor is variable, however, and depends on
the number of times an n-gram has occurred in the corpus. There will be a specific
discount value for n-grams seen once, another one for those seen twice, a third one
for those seen three times, and so on.
Let us denote N_c the number of n-grams that occurred exactly c times in the
corpus: N_0 is the number of unseen n-grams, N_1 the number of n-grams seen once,
N_2 the number of n-grams seen twice, and so on. If we consider bigrams, the value
of N_0 is the number of possible bigrams generated by the vocabulary minus the
number of bigrams observed in the corpus. The Good–Turing method reestimates
the count c of an n-gram as:

c^* = (c + 1) \frac{E(N_{c+1})}{E(N_c)},

where E(x) denotes the expectation of the random variable x. This formula is
usually approximated as:

c^* = (c + 1) \frac{N_{c+1}}{N_c}.
To understand how this formula was designed, let us take the example of the
unseen bigrams with c = 0. Let us suppose that we draw a sequence of bigrams to
build our training corpus, and the last bigram we have drawn was unseen before.
From this moment, there is one occurrence of it in the training corpus and the
count of bigrams in the same case is N_1. Using the maximum likelihood estimation,
the probability to draw such an unseen bigram is then the count of bigrams seen
once divided by the total count of the bigrams seen so far: N_1/N. We obtain the
probability to draw one specific unseen bigram by dividing this term by the count of
unseen bigrams:

\frac{N_1}{N} \times \frac{1}{N_0}.

Hence, the Good–Turing reestimated count of an unseen n-gram is c^* = \frac{N_1}{N_0}.
Similarly, we would have c^* = \frac{2 N_2}{N_1} for an n-gram seen once in the training corpus.
The three chapters in Nineteen Eighty-Four contain 37,365 unique bigrams
and 5820 bigrams seen twice. Its vocabulary of 8635 words generates 8635^2 =
74,563,225 bigrams, of which 74,513,701 are unseen. The Good–Turing method
reestimates the frequency of each unseen bigram to 37,365/74,513,701 = 0.0005,
and that of unique bigrams to 2 × (5820/37,365) = 0.31. Table 10.4 shows the
complete reestimated frequencies for the n-grams up to 9.
In practice, only high values of N_c are reliable, and they correspond to low values
of c. In addition, above a certain threshold, most frequencies of frequencies will be
equal to zero. Therefore, the Good–Turing estimation is applied for c < k, where k
is a constant set to 5, 6, . . . , or 10. The other counts are not reestimated. See Katz (1987)
for the details.
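A small sketch of this reestimation, assuming a Counter of n-gram frequencies and the threshold k described above (the function name is ours):

from collections import Counter

def good_turing_counts(ngram_freqs, k=10):
    # N_c: the number of n-grams occurring exactly c times
    freqs_of_freqs = Counter(ngram_freqs.values())
    c_star = {}
    for c in range(1, k):
        if freqs_of_freqs[c] > 0:
            c_star[c] = (c + 1) * freqs_of_freqs[c + 1] / freqs_of_freqs[c]
    return c_star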
The probability of an n-gram is given by the formula:

P_{GT}(w_1, \ldots, w_n) = \frac{c^*(w_1, \ldots, w_n)}{N},

where c^* is the reestimated count of w_1 \ldots w_n, and N the original count of n-grams
in the corpus. The conditional frequency is:

P_{GT}(w_n|w_1, \ldots, w_{n-1}) = \frac{c^*(w_1, \ldots, w_n)}{C(w_1, \ldots, w_{n-1})}.
Table 10.5 shows the conditional probabilities, where only frequencies less than 10
have been reestimated. The sentence probability using bigrams is 2.56 × 10^-50. This
is better than with Laplace's rule, but as the corpus is very small, still greater than
the unigram probability.
Table 10.5 The conditional frequencies using the Good–Turing method. We have not reestimated
the frequencies when they are greater than 9

w_{i-1}, w_i   C(w_{i-1}, w_i)   c^*(w_{i-1}, w_i)   C(w_{i-1})   P_GT(w_i|w_{i-1})
unseen n-gram will be specific to the words it contains. In this section, we introduce
two techniques: the linear interpolation and Katz’s back-off model.
Linear interpolation, also called deleted interpolation (Jelinek and Mercer 1980),
combines linearly the maximum likelihood estimates from length 1 to n. For
trigrams, it corresponds to:

P_{Interp}(w_i|w_{i-2}, w_{i-1}) = \lambda_3 P_{MLE}(w_i|w_{i-2}, w_{i-1})
                                 + \lambda_2 P_{MLE}(w_i|w_{i-1})
                                 + \lambda_1 P_{MLE}(w_i),

where 0 \le \lambda_i \le 1 and \sum_{i=1}^{3} \lambda_i = 1.
The values can be constant and set by hand, for instance, .λ3 = 0.6, .λ2 = 0.3,
and .λ1 = 0.1. They can also be trained and optimized from a corpus (Jelinek 1997).
Table 10.6 shows the interpolated probabilities of bigrams with \lambda_2 = 0.7 and
\lambda_1 = 0.3. The sentence probability using these interpolations is 9.46 × 10^-45.
Table 10.6 Interpolated probabilities of bigrams using the formula \lambda_2 P_{MLE}(w_i|w_{i-1}) +
\lambda_1 P_{MLE}(w_i), with \lambda_2 = 0.7 and \lambda_1 = 0.3. The total number of words is 108,140

w_{i-1}, w_i   C(w_{i-1}, w_i)   C(w_{i-1})   P_MLE(w_i|w_{i-1})   P_MLE(w_i)   P_Interp(w_i|w_{i-1})
We can now understand why the bigram we the is ranked so high in Table 10.1
after we are and we will. Although it can occur in English, as in the American
constitution, We the people. . . , it is not a very frequent combination. In fact, the
estimation has been obtained with an interpolation where the term \lambda_1 P_{MLE}(the)
boosted the bigram to the top because of the high frequency of the.
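A minimal sketch of the bigram interpolation with the λ values above; the function name and its Counter arguments are assumptions:

def interpolated_prob(w1, w2, unigram_freqs, bigram_freqs,
                      lambda2=0.7, lambda1=0.3):
    n_words = sum(unigram_freqs.values())
    p_bigram = 0.0
    if unigram_freqs[w1] > 0:
        p_bigram = bigram_freqs[(w1, w2)] / unigram_freqs[w1]
    p_unigram = unigram_freqs[w2] / n_words
    return lambda2 * p_bigram + lambda1 * p_unigram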
10.5.2 Back-Off
The idea of the back-off model is to use the frequency of the longest available n-
grams, and if no n-gram is available, to back off to the (n − 1)-grams, and then to
(n − 2)-grams, and so on. If n equals 3, we first try trigrams, then bigrams, and finally
unigrams. For a bigram language model, the back-off probability can be expressed
as:

P_{Backoff}(w_i|w_{i-1}) =
  \begin{cases}
    P(w_i|w_{i-1}), & \text{if } C(w_{i-1}, w_i) \neq 0,\\
    \alpha P(w_i),  & \text{otherwise.}
  \end{cases}
So far, this model does not tell us how to estimate the n-gram probabilities to the
right of the formula. A first idea would be to use the maximum likelihood estimate
for bigrams and unigrams. With \alpha = 1, this corresponds to:

P_{Backoff}(w_i|w_{i-1}) =
  \begin{cases}
    P_{MLE}(w_i|w_{i-1}) = \dfrac{C(w_{i-1}, w_i)}{C(w_{i-1})}, & \text{if } C(w_{i-1}, w_i) \neq 0,\\
    P_{MLE}(w_i) = \dfrac{C(w_i)}{\#words},                     & \text{otherwise,}
  \end{cases}
and Table 10.7 shows the probability estimates we can derive from our small corpus.
They yield a sentence probability of 2.11 × 10^-40 for our example.
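A minimal sketch of this simple back-off with α = 1, under the same Counter assumptions as before:

def backoff_prob(w1, w2, unigram_freqs, bigram_freqs):
    n_words = sum(unigram_freqs.values())
    if bigram_freqs[(w1, w2)] != 0:
        return bigram_freqs[(w1, w2)] / unigram_freqs[w1]
    return unigram_freqs[w2] / n_words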
This back-off technique is relatively easy to implement and Brants et al. (2007)
applied it to 5-grams on a corpus of three trillion tokens with a back-off factor
\alpha = 0.4, using a recursive definition of this scheme.
However, the result is not a probability as the sum of all the probabilities,
\sum_{w_i} P(w_i|w_{i-1}), can be greater than 1. In the next section, we describe Katz's
(1987) back-off model that provides an efficient and elegant solution to this problem.
<s> 7072 –
<s> a 133 2482 0.019
a good 14 53 0.006
good deal 0 Backoff 5 4.62 × 10^-5
\tilde{P}(w_i|w_{i-1}) = \frac{c^*(w_{i-1}, w_i)}{C(w_{i-1})},
for instance, with the values in Tables 10.4 and 10.5 for our sentence. We then assign
the remaining probability mass to the unigrams.
To compute \alpha, we add the two terms of Katz's back-off model: the discounted
probabilities of the observed bigrams and, for the unseen bigrams, the weighted
unigram probabilities:

\sum_{w_i} P_{Katz}(w_i|w_{i-1}) = \sum_{w_i, C(w_{i-1}, w_i) > 0} \tilde{P}(w_i|w_{i-1})
  + \alpha \sum_{w_i, C(w_{i-1}, w_i) = 0} P_{MLE}(w_i) = 1.
The Kneser–Ney model (Ney et al. 1994) is our last method to smooth probability
estimates. P_{KN}(w_i|w_{i-1}) consists of two terms:
1. The first one uses the maximum likelihood to estimate the bigram probabilities
when they exist minus a discount to make room for bigrams unseen in the training
corpus. We ensure that the first term is always positive by taking the maximum
of the discounted value and 0:

\frac{\max(0, C(w_{i-1}, w_i) - \delta)}{C(w_{i-1})}, \quad \text{with } 0 < \delta < 1.
2. The second term, P_{KN}(w_i), measures the percentage of unique bigrams with the
second word in it with respect to the number of unique bigrams. We compute it
from the set of bigrams seen at least once in the corpus
and we extract a subset of it, where the second word is w_i: \{(w_k, w_i)\}. P_{KN}(w_i)
is defined as the ratio of their cardinalities:

P_{KN}(w_i) = \frac{|\{(w_k, w_i) : C(w_k, w_i) > 0\}|}{|\{(w_k, w_{k+1}) : C(w_k, w_{k+1}) > 0\}|}.
The complete Kneser–Ney estimate combines these two terms:

P_{KN}(w_i|w_{i-1}) = \frac{\max(0, C(w_{i-1}, w_i) - \delta)}{C(w_{i-1})} + \lambda_{w_{i-1}} P_{KN}(w_i).
The absolute discounting value \delta is a constant, for instance 0.1, that we can optimize
using a validation set. We must compute the \lambda_{w_{i-1}} values for all the words so that
the P_{KN}(w_i|w_{i-1}) probabilities sum to one.
The Kneser–Ney smoothing model has shown the best performance in language
modeling, especially for small corpora. See Sect. 10.7 on evaluation.
To evaluate a language model on a text, we can compute the average negative log
probability per word:

H(L) = -\frac{1}{n} \log_2 P(w_1, \ldots, w_n).
We have seen that trigrams are better predictors than bigrams, which are better
than unigrams. This means that the probability of a very long sequence computed
with a bigram model will normally be higher than with a unigram one. The log
measure will then be lower.
Intuitively, this means that the .H (L) measure will be a quality marker for
a language model where lower numbers will correspond to better models. This
intuition has mathematical foundations, as we will see in the two next sections.
For a language L, the per-word entropy considers all the sequences of length n:

H(L) = -\frac{1}{n} \sum_{w_1, \ldots, w_n \in L} P(w_1, \ldots, w_n) \log_2 P(w_1, \ldots, w_n),

and, under certain conditions, the entropy rate is the limit:

H(L) = \lim_{n \to \infty} -\frac{1}{n} \sum_{w_1, \ldots, w_n \in L} P(w_1, \ldots, w_n) \log_2 P(w_1, \ldots, w_n)
     = \lim_{n \to \infty} -\frac{1}{n} \log_2 P(w_1, \ldots, w_n),

which means that we can compute H(L) from a very long sequence, ideally infinite,
instead of summing over all the sequences of a definite length.
We can also use the cross entropy, which is measured between a text, called the
language and governed by an unknown probability P, and a language model M. The
cross entropy of P with respect to M is defined as:
H(P, M) = -\frac{1}{n} \sum_{w_1, \ldots, w_n \in L} P(w_1, \ldots, w_n) \log_2 M(w_1, \ldots, w_n).
As for the entropy rate, it has been proven that, under certain conditions:

H(P, M) = \lim_{n \to \infty} -\frac{1}{n} \sum_{w_1, \ldots, w_n \in L} P(w_1, \ldots, w_n) \log_2 M(w_1, \ldots, w_n)
        = \lim_{n \to \infty} -\frac{1}{n} \log_2 M(w_1, \ldots, w_n).
10.7.4 Perplexity

The perplexity of a language model M on a text governed by P is defined from the
cross entropy as:

PP(P, M) = 2^{H(P, M)}.
Given a sequence of words, a language model lets us predict the most probable next
word, x_i.
By adding the predicted word to the existing sequence and repeating this operation,
we will be able to generate text. This will be pretty dull nonetheless: given a starting
word, possibly the start-of-sentence symbol <s>, and selecting the next word with
the highest probability, we would always generate the same sequence.
Another option is to select the next word following the multinomial distribution
of our language model. That is, for example, if a word w has three observed
followers in the corpus, .wa , .wb , and .wc with .P (wa |w) = 0.5, .P (wb |w) = 0.3,
and .P (wc |w) = 0.2, we will select .wa 50% of the time, .wb 30%, and .wc 20%. We
will then generate text with the same statistical properties as our training corpus.
To select an outcome from a distribution in the form of a list of real values, for
instance [0.5, 0.3, 0.2], we can use:
np.random.multinomial(1, distribution)
which will return a one-hot vector: [1, 0, 0], 50% of the time, [0, 1, 0], 30%,
and [0, 0, 1], 20%. We can then extract the word index with np.argmax():
np.argmax(np.random.multinomial(1, distribution))
Let us take an example with the Iliad corpus set in lowercase and use bigrams,
P(x_i|x_{i-1}), to simplify.
We first tokenize the text that we store in the words list. We count the unigrams,
bigrams, and we compute the conditional probabilities with:
unigram_freqs = Counter(words)
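A plausible completion of these statements, where cond_probs is the dictionary of conditional probabilities P(w2|w1) = C(w1, w2)/C(w1) used by bigram_dist() below:

bigram_freqs = Counter(
    [tuple(words[idx:idx + 2]) for idx in range(len(words) - 1)])
cond_probs = {bigram: freq / unigram_freqs[bigram[0]]
              for bigram, freq in bigram_freqs.items()}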
Starting from the last word in the sequence, say Hector, we estimate
P(x|hector)
from the corpus. For a given word, we extract the conditional probabilities with:
def bigram_dist(word, cond_probs):
    bigram_cprobs = sorted(
        [(k, v) for k, v in cond_probs.items()
         if k[0] == word],
        key=lambda tup: tup[1], reverse=True)
    return bigram_cprobs
We have 184 bigrams (hector, x), for which the five highest probabilities are:

P(and|hector) = 0.117,
P(son|hector) = 0.052,
P(s|hector) = 0.048,
P(was|hector) = 0.031,
P(in|hector) = 0.023.
The generation is then easy. We start with a word, here Hector. We then run a
loop, where we select the next word according to the multinomial distribution and
replace the current word with the next word:
print(start_word, end=' ')
current_word = start_word
for i in range(50):
    bigram_cprobs = bigram_dist(current_word, cond_probs)
    dist = [bigram_cprob[1] for bigram_cprob in bigram_cprobs]
    selected_idx = np.argmax(np.random.multinomial(1, dist))
    next_bigram_cprob = bigram_cprobs[selected_idx]
    current_word = next_bigram_cprob[0][1]
    print(current_word, end=' ')
This loop creates a text that will change each time we run it:
hector shouted to rush without a brave show but flow for his tail from their cave even for he
killed in despair as luck will override fate of all gods who were thus beaten wild in all good
was dead in spite him and called you are freshly come close to . . .
In the previous section, we generated a text with a bigram distribution that is the
same as that of the training corpus, i.e., the Iliad. We can transform this distribution
to make the generation more deterministic or more random.
Chollet (2021, pp. 369–373) proposed a transformation of the distribution over the
second word of the bigram using a power function that has this property. He called
the inverse of the power the temperature, T, referring to the temperature in the
Boltzmann distribution:

f_T(x) = x^{1/T}.
As input, we take the original bigram distribution, transform it, and normalize it so
that the probabilities sum to 1.
def power_transform(distribution, T=0.5):
    new_dist = np.power(distribution, 1/T)
    return new_dist / np.sum(new_dist)
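As a usage sketch, we can apply the transformation to the example distribution [0.5, 0.3, 0.2] of the previous section before sampling; the temperature value is arbitrary:

import numpy as np

dist = [0.5, 0.3, 0.2]
new_dist = power_transform(dist, T=0.5)  # approximately [0.66, 0.24, 0.11]
selected_idx = np.argmax(np.random.multinomial(1, new_dist))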
Figure 10.1 shows the power function with different temperatures. Given that the
probabilities range from 0 to 1, a temperature of 1 yields the original distribution;
a low temperature squashes low probabilities and accentuates higher probabilities.
This makes the predictions more deterministic. On the contrary, a high temperature
equalizes the distribution and makes the predictions more random.
(Fig. 10.1 plots f_T(x) = x^{1/T} for different temperatures, e.g. T = 3, f(x) = x^{1/3}.)
10.9 Collocations
Collocations are recurrent combinations of words. Palmer (1933), one of the first to
study them comprehensively, defined them as:
succession[s] of two or more words that must be learnt as an integral whole and not pieced
together from its component parts
1 Some authors now use the term pointwise mutual information to mean mutual information.
Neither Fano (1961, pp. 27–28) nor Church and Hanks (1990) used this term and we kept the
original one.
Table 10.8 Collocates of surgery extracted from the Bank of English using the mutual informa-
tion test. Note the misspelled word pioneeing
Word Frequency Bigram word + surgery Mutual information
Arthroscopic 3 3 11.822
Pioneeing 3 3 11.822
Reconstructive 14 11 11.474
Refractive 6 4 11.237
Rhinoplasty 5 3 11.085
I(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i) P(w_j)},

which will be positive, if the two words occur more frequently together than
separately, equal to zero, if they are independent, in this case P(w_i, w_j) =
P(w_i)P(w_j), and negative, if they occur less frequently together than separately.
Using the maximum likelihood estimate, this corresponds to:

I(w_i, w_j) = \log_2 \frac{N^2}{N - 1} \cdot \frac{C(w_i, w_j)}{C(w_i) C(w_j)}
            \approx \log_2 \frac{N \cdot C(w_i, w_j)}{C(w_i) C(w_j)},

where C(w_i) and C(w_j) are, respectively, the frequencies of word w_i and word
w_j in the corpus, C(w_i, w_j) is the frequency of bigram w_i, w_j, and N is the total
number of words in the corpus.
t-Scores
Given two words, the t-score (Church and Mercer 1993) compares the hypothesis
that the words form a collocation with the null hypothesis that posits that the
cooccurrence is only governed by chance, that is P(w_i, w_j) = P(w_i) × P(w_j).
Table 10.9 Collocates of set extracted from the Bank of English using the t-score

Word  Frequency  Bigram set + word  t-score
up    134,882    5512               67.980
a     1,228,514  7296               35.839
to    1,375,856  7688               33.592
off   52,036     888                23.780
out   123,831    1252               23.320
The t-score computes the difference between the two hypotheses, respectively,
mean(P(w_i, w_j)) and mean(P(w_i)) mean(P(w_j)), and divides it by an estimate
of the standard deviation. The hypothesis that w_i and w_j are a collocation gives us
a mean of \frac{C(w_i, w_j)}{N}; with the null hypothesis, the mean product is
\frac{C(w_i)}{N} \times \frac{C(w_j)}{N}; and using a binomial assumption, the variance
is approximated to \frac{C(w_i, w_j)}{N^2}. We have then:

t(w_i, w_j) = \frac{C(w_i, w_j) - \frac{1}{N} C(w_i) C(w_j)}{\sqrt{C(w_i, w_j)}}.
Table 10.9 shows collocates of set extracted from the Bank of English using the
t-score. High t-scores show recurrent combinations of grammatical or very frequent
words such as of the, and the, etc. Church and Mercer (1993) hint at the threshold
value of 2 or more.
Likelihood Ratio
Dunning (1993) criticized the t-score test and proposed an alternative measure based
on binomial distributions and likelihood ratios. Assuming that the words have a
binomial distribution, we can express the probability of having k counts of a word
w in a sequence of N words knowing that w’s probability is p as:
f(k; N, p) = \binom{N}{k} p^k (1 - p)^{N-k},
where

\binom{N}{k} = \frac{N!}{k!(N - k)!}.

The formula reflects the probability of having k counts of a word w, p^k, and N − k
counts of not having w, (1 − p)^{N−k}. The binomial coefficient \binom{N}{k} corresponds
to the number of different ways of distributing k occurrences of the word w in a
sequence of N words.
In the case of collocations, rather than measuring the distribution of single words,
we want to evaluate the likelihood of the wi wj bigram distribution. To do this, we
can reformulate the binomial formula considering the word preceding wj , which
can either be wi or a different word that we denote ¬wi .
Let n1 be the count of wi and k1 , the count of the bigram wi wj in the word
sequence (the corpus). Let n2 be the count of ¬wi , and k2 , the count of the bigram
¬wi wj , where ¬wi wj denotes a bigram in which the first word is not wi and
the second word is wj . Let p1 be the probability of wj knowing that we have wi
preceding it, and p2 be the probability of wj knowing that we have ¬wi before it.
The binomial distribution of observing the pairs wi wj and ¬wi wj in our sequence
is:
f(k_1; n_1, p_1) f(k_2; n_2, p_2) = \binom{n_1}{k_1} p_1^{k_1} (1 - p_1)^{n_1 - k_1}
                                    \binom{n_2}{k_2} p_2^{k_2} (1 - p_2)^{n_2 - k_2}.
The log-likelihood ratio comparing the hypothesis that the bigram words are
dependent, H_dep, with the hypothesis that they are independent, H_ind, is then:

-2 \log \lambda = 2 \log \frac{H_{dep}}{H_{ind}}
              = 2 \log \frac{f(k_1; n_1, p_1) f(k_2; n_2, p_2)}{f(k_1; n_1, p) f(k_2; n_2, p)}.
Table 10.10 A contingency table containing bigram counts, where ¬w_i w_j represents bigrams in
which the first word is not w_i and the second word is w_j. N is the number of words in the corpus

        w_i                                    ¬w_i
w_j     C(w_i, w_j)                            C(¬w_i, w_j) = C(w_j) − C(w_i, w_j)
¬w_j    C(w_i, ¬w_j) = C(w_i) − C(w_i, w_j)    C(¬w_i, ¬w_j) = N − C(w_i, w_j)
Using the counts in Table 10.10 and the maximum likelihood estimate, we have:

$$p = P(w_j) = \frac{C(w_j)}{N}, \quad p_1 = P(w_j|w_i) = \frac{C(w_i, w_j)}{C(w_i)}, \quad \text{and} \quad p_2 = P(w_j|\neg w_i) = \frac{C(w_j) - C(w_i, w_j)}{N - C(w_i)}.$$
The three measurements, mutual information, t-scores, and likelihood ratio, use
unigram and bigram statistics. To compute them, we first tokenize the text, and count
words and bigrams using the functions we have described in Sects. 9.4.2 and 10.2.1.
We also need the number of words in the corpus. This corresponds to the size of the
words list: len(words).
Mutual Information
The mutual information function iterates over the bigrams in freq_bigrams and applies the mutual information formula:

import math

def mutual_info(words, freq_unigrams, freq_bigrams):
    mi = {}
    factor = len(words) * len(words) / (len(words) - 1)
    for bigram in freq_bigrams:
        mi[bigram] = (
            math.log(factor * freq_bigrams[bigram] /
                     (freq_unigrams[bigram[0]] *
                      freq_unigrams[bigram[1]]), 2))
    return mi
To run the computation, we first tokenize the text and collect the unigram and
bigram frequencies. We then call mutual_info() and print the results:
words = tokenize(text.lower())
word_freqs = Counter(words)
bigrams = [tuple(words[idx:idx + 2])
           for idx in range(len(words) - 1)]
bigram_freqs = Counter(bigrams)
mi = mutual_info(words, word_freqs, bigram_freqs)
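To inspect the strongest associations, we can, for instance, print the ten bigrams with the highest scores (an illustrative snippet):

for bigram, score in sorted(mi.items(), key=lambda x: -x[1])[:10]:
    print(bigram, score)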
t-Scores
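Following the same pattern as mutual_info(), a possible t-score function applies the t-score formula above to each bigram (the function name and its parameters are illustrative):

def t_scores(words, word_freqs, bigram_freqs):
    ts = {}
    N = len(words)
    for bigram in bigram_freqs:
        ts[bigram] = ((bigram_freqs[bigram] -
                       word_freqs[bigram[0]] *
                       word_freqs[bigram[1]] / N) /
                      math.sqrt(bigram_freqs[bigram]))
    return ts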
Likelihood Ratio
To compute Dunning's likelihood ratio, we use a helper function, log_f(), that computes the logarithm of the binomial term, k log p + (n − k) log(1 − p), with the convention 0 log 0 = 0:

def log_f(k, n, p):
    # k log p + (n - k) log(1 - p), with the convention 0 log 0 = 0
    res = k * math.log(p) if k > 0 else 0.0
    if n - k > 0:
        res += (n - k) * math.log(1 - p)
    return res

The likelihood_ratio() function then applies Dunning's formula to each bigram, where k1 = C(wi, wj), n1 = C(wi), k2 = C(wj) − C(wi, wj), and n2 = N − C(wi):

def likelihood_ratio(words, word_freqs, bigram_freqs):
    lr = {}
    N = len(words)
    for bigram in bigram_freqs:
        k1 = bigram_freqs[bigram]
        n1 = word_freqs[bigram[0]]
        k2 = word_freqs[bigram[1]] - k1
        n2 = N - n1
        p = word_freqs[bigram[1]] / N
        p1, p2 = k1 / n1, k2 / n2
        lr[bigram] = 2.0 * (log_f(k1, n1, p1) + log_f(k2, n2, p2) -
                            log_f(k1, n1, p) - log_f(k2, n2, p))
    return lr
Language models and statistical techniques were applied first to speech recognition,
lexicography, and later to most domains of natural language processing. For a
historical turning point in their popularity, see the special issues of Computational
Linguistics (1993, 1 and 2).
Jurafsky and Martin (2008) as well as Manning and Schütze (1999) are good
references on language models in general. Chen and Goodman (1998) give additional details on language modeling techniques and Dunning (1993) on χ² tests and likelihood ratios to improve collocation detection.
The Natural Language Toolkit (NLTK) provides modules for n-gram computation, collocations, as well as language models. The KenLM Language Model
Toolkit (Heafield 2011) is a fast implementation of the Kneser-Ney algorithm. See
also Norvig (2009) for beautiful Python programs building on language models to
segment words, decipher codes, or check word spelling.
Chapter 11
Dense Vector Representations
In Sect. 6.3, we used one-hot encoding to represent words and, in Sect. 9.5.3, bags
of words to represent documents. When applied to any significant corpus, both
techniques often result in very large, sparse matrices, i.e. containing many zeros.
In this chapter, we will examine how we can build dense vector representations ranging from two to a few hundred dimensions instead. In the context of natural language processing, we call these dense vectors embeddings.
As a first corpus, we will consider the chapters of the Salammbô novel and their translations in English. This setting is both realistic and fairly compact. In Sect. 7.1, we already used the counts of the letter A to discriminate between English and French. Figure 11.1 shows the counts of all the characters, set to lower case, broken down by chapter; 30 chapters in total.
As shown by this table, we can represent the chapters (the documents) by vectors of character counts instead of words. We could call this representation a bag of characters. In Figure 11.1, there are as many as 40 characters: the 26 unaccented letters from a to z and 14 French accented letters: à, â, æ, ç, è, é, ê, ë, î, ï, ô, œ, ù, and û. This means that we represent each of our 30 chapters by a vector of 40 dimensions, where the chapter coordinates are the counts.
Conversely, we can represent the characters by their counts in the chapters. We just need to store Fig. 11.1 in a (30 × 40) matrix and transpose it. This results in the (40 × 30) matrix shown in Fig. 11.2, where the 40 rows are the vector representations of the characters.
Fig. 11.1 Character counts per chapter, where the fr and en suffixes designate the language, either
French or English. In total, there are as many as 40 characters: the 26 unaccented letters from a to
z and 14 French accented letters: à, â, é, etc.
Let us apply this decomposition to the data in Fig. 11.1. Let us denote X the m × n matrix of the letter counts per chapter, in our case, m = 30 and n = 40. From the
French English
01 02 03 04 05 ... 12 13 14 15 01 02 03 04 05 ... 12 13 14 15
a 2503 2992 1042 2487 2014 ... 2766 5047 5312 1215 2217 2761 990 2274 1865 ... 2560 4597 4871 1119
b 365 391 152 303 268 ... 373 725 689 173 451 551 183 454 400 ... 489 987 948 229
c 857 1006 326 864 645 ... 935 1730 1754 402 729 777 271 736 553 ... 757 1462 1439 335
d 1151 1388 489 1137 949 ... 1237 2273 2149 582 1316 1548 557 1315 1135 ... 1566 2689 2799 683
e 4312 4993 1785 4158 3394 ... 4618 8678 8870 2195 3967 4543 1570 3814 3210 ... 4331 7963 8179 1994
f 264 319 136 314 223 ... 329 648 628 150 596 685 279 595 515 ... 677 1254 1335 323
g 349 360 122 331 215 ... 350 566 630 134 662 769 253 559 525 ... 650 1201 1140 281
h 295 350 126 287 242 ... 349 642 673 148 2060 2530 875 1978 1693 ... 2348 4278 4534 1108
i 1945 2345 784 2028 1617 ... 2273 3940 4278 969 1823 2163 783 1835 1482 ... 2033 3634 3829 912
j 65 81 41 57 67 ... 65 140 143 27 22 13 4 22 7 ... 28 39 36 9
k 4 6 7 3 3 ... 2 22 2 6 200 284 82 198 153 ... 234 432 427 112
l 1946 2128 816 1796 1513 ... 1955 3746 3780 950 1204 1319 520 1073 949 ... 1102 2281 2218 579
m 726 823 397 722 651 ... 812 1597 1610 387 656 829 333 690 571 ... 746 1493 1534 351
n 1896 2308 778 1958 1547 ... 2285 3984 4255 906 1851 2218 816 1771 1468 ... 2125 3774 4053 924
o 1372 1560 612 1318 1053 ... 1419 2736 2713 697 1897 2237 828 1865 1586 ... 2105 3911 3989 1004
p 789 977 315 773 672 ... 865 1550 1599 417 525 606 194 514 517 ... 581 1099 1019 305
q 248 281 102 274 166 ... 272 425 512 103 19 21 13 33 17 ... 32 49 36 9
r 1948 2376 792 2000 1601 ... 2276 4081 4271 985 1764 2019 711 1726 1357 ... 1939 3577 3689 863
s 2996 3454 1174 2792 2192 ... 3131 5599 5770 1395 1942 2411 864 1918 1646 ... 2152 3894 3946 997
t 1938 2411 856 2031 1736 ... 2274 4387 4467 1037 2547 3083 1048 2704 2178 ... 3046 5540 5858 1330
u 1792 2069 707 1734 1396 ... 1923 3480 3697 893 704 861 298 745 663 ... 750 1379 1490 310
v 414 499 147 422 315 ... 455 767 914 206 258 295 94 245 194 ... 278 437 539 108
w 0 0 0 0 1 ... 0 0 0 0 653 769 254 663 568 ... 721 1374 1377 330
x 129 175 42 138 83 ... 149 288 283 63 29 37 8 60 26 ... 35 77 90 14
y 94 89 31 81 67 ... 98 119 145 36 401 475 145 467 330 ... 418 673 856 150
z 20 23 7 27 18 ... 37 41 41 3 18 31 15 19 33 ... 40 49 49 9
à 128 136 39 110 90 ... 129 209 224 48 0 0 0 0 0 ... 0 0 0 0
â 36 50 9 43 67 ... 33 55 75 20 0 0 0 0 0 ... 0 0 0 0
æ 0 1 0 0 0 ... 0 3 0 2 0 0 0 0 0 ... 0 0 0 0
ç 35 28 10 22 24 ... 23 61 56 17 0 0 0 0 0 ... 0 0 0 0
è 102 147 49 138 112 ... 151 237 260 58 0 0 0 0 0 ... 0 0 0 0
é 423 513 194 424 367 ... 480 940 1019 221 0 0 0 0 0 ... 0 0 0 0
ê 43 68 24 36 44 ... 60 126 94 32 0 0 0 0 0 ... 0 0 0 0
ë 1 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 ... 0 0 0 0
î 17 20 12 15 11 ... 13 32 28 12 0 0 0 0 0 ... 0 0 0 0
ï 2 0 0 2 8 ... 3 5 2 0 0 0 0 0 0 ... 0 0 0 0
ô 20 20 27 15 23 ... 15 37 45 24 0 0 0 0 0 ... 0 0 0 0
ù 14 9 4 6 18 ... 11 24 21 7 0 0 0 0 0 ... 0 0 0 0
û 7 9 7 4 15 ... 14 30 21 11 0 0 0 0 0 ... 0 0 0 0
œ 5 5 2 8 7 ... 0 13 12 6 0 0 0 0 0 ... 0 0 0 0
Fig. 11.2 Character counts per chapter in French, left part, and English, right part. In total, 30
chapters
works of Beltrami (1873), Jordan (1874), and Eckart and Young (1936), we know we can rewrite X as:

$$X = U \Sigma V^{\top},$$

where U and V are orthogonal matrices and Σ is a diagonal matrix containing the singular values.
The NumPy, PyTorch, and scikit-learn libraries have SVD functions that we can
call without bothering about the mathematical details. To store .X, depending on
the toolkit, we use a NumPy array as in Sect. 6.3 or a PyTorch tensor. Prior to the
decomposition, we usually standardize the counts for each character by subtracting
the mean from the counts and dividing them by the standard deviation. As with
neural networks in Sect. 8.5.1, it is sometimes beneficial to apply a normalization
before the standardization as in the next statements:
from sklearn.preprocessing import StandardScaler, Normalizer
X_norm = Normalizer().fit_transform(X)
X_scaled = StandardScaler().fit_transform(X_norm)
To carry out the singular value decomposition, in the code below, we use NumPy
and the linalg.svd function. The statement and the returned values follow the
mathematical formulation. If X is the input matrix corresponding to the scaled and
possibly normalized data, we have:
import numpy as np
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Us = U @ np.diag(s)
where s contains the singular values of Σ. We do not need to compute the full squared U matrix as the values of Σ outside its diagonal are zero and we set full_matrices=False. We compute the new coordinates of the chapters, corresponding to the rows of Us, by multiplying U and the diagonal matrix of the singular values, np.diag(s).
We will now apply the dimensionality reduction to the data in Fig. 11.1 and we will proceed in two steps:
1. We will first restrict the dataset to the French chapters and to two rows: the frequencies of the letter A by chapter and the total counts of characters (#A + #B + #C + ...), as shown in Table 7.1. This small set has only two dimensions and is thus easy to visualize;
2. We will then proceed with the counts of each character in Fig. 11.1 to have a
complete example.
For the first experiment, we store the data in Table 7.1, left part, in a matrix, Xoriginal, and standardize it into Xscaled. With this very small dataset, we set aside the normalization.
$$X_{original} = \begin{bmatrix}
36961 & 2503 \\
43621 & 2992 \\
15694 & 1042 \\
36231 & 2487 \\
29945 & 2014 \\
40588 & 2805 \\
75255 & 5062 \\
37709 & 2643 \\
30899 & 2126 \\
25486 & 1784 \\
37497 & 2641 \\
40398 & 2766 \\
74105 & 5047 \\
76725 & 5312 \\
18317 & 1215
\end{bmatrix}; \quad
X_{scaled} = \begin{bmatrix}
-0.232 & -0.2556 \\
0.1245 & 0.1275 \\
-1.3706 & -1.4001 \\
-0.2711 & -0.2681 \\
-0.6076 & -0.6386 \\
-0.0379 & -0.019 \\
1.818 & 1.749 \\
-0.192 & -0.1459 \\
-0.5566 & -0.5509 \\
-0.8464 & -0.8188 \\
-0.2033 & -0.1475 \\
-0.048 & -0.0496 \\
1.7565 & 1.7373 \\
1.8967 & 1.9449 \\
-1.2302 & -1.2645
\end{bmatrix}$$
The first column vector in V corresponds to the direction of the regression line given in Sect. 7.3, while the second one is orthogonal to it; Σ contains the singular
values; and UΣ contains the coordinates of the chapters in the new system defined by V. We project the points on an axis, for instance the first one, by simply setting all the other values of Σ to 0, here the second one:

$$\Sigma = \begin{bmatrix} 5.4764 & 0\\ 0 & 0 \end{bmatrix}.$$

Fig. 11.3 Left pane: The standardized dataset with the singular vectors; Middle pane: The rotated dataset, note the change of scale; Right pane: The projection on the first (horizontal) singular vector
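In code, with the U, s, and Vt returned by np.linalg.svd() above, this projection amounts to zeroing all the singular values but the first one (a small illustrative snippet, not the original program):

# Keep only the first singular value and recompute the coordinates
s_proj = s.copy()
s_proj[1:] = 0.0
X_proj = U @ np.diag(s_proj)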
Figure 11.3 shows the results of the analysis on three panes. The left pane shows the original standardized dataset; the middle pane, the rotated dataset, corresponding to UΣ; and the right pane, the projection on the first singular vector. This corresponds to a reduction from two dimensions to one and is equivalent to a projection on the regression line.
Now that we have seen how two-dimensional data are rotated and projected, let us apply the decomposition to all the characters in Fig. 11.1. This decomposition results in a vector of 30 singular values ranging from 30.17 to 1.34 · 10⁻¹⁴ that we can relate to the amount of information brought by the corresponding direction. Projecting the data on the plane associated with the two highest values reduces the dimensionality from 30 to 2, while keeping a good deal of the variance between the points. To carry this out, we just set all the singular values in Σ to 0 except the first two. Figure 11.4 shows the projected chapters corresponding to the rows in the truncated UΣ matrix.
The result is striking: On the left pane of the figure, the chapters form two relatively compact clusters, to the left and to the right on the x axis, corresponding to the English and French versions. In addition, although far from perfect, we can apply a translation with a vector that we could call vFrench_to_English to predict the approximate position of an English chapter from the position of the corresponding French chapter.
On the right pane of the figure, we have the cumulative sum of the singular values: 39.37 for the first two and 94.90 for all of them.
In the NLP literature, we find two names for the dimensionality reduction technique
we have just seen: SVD and principal component analysis (PCA). They often mean
the same thing though there are a few differences. PCA is a sort of ready-to-use
application of SVD:
• It usually includes a standardization, or at least a centering;
• the output dimensionality is a parameter of the program; and
• it only returns the .U Σ or U matrices.
We used NumPy to compute the singular value decompositions. PyTorch has
a similar function: torch.linalg.svd. scikit-learn provides two classes for this, PCA
and TruncatedSVD. Both programs have an argument to set the number of dimensions
to keep and return a truncated matrix. The difference between them is that PCA centers the columns, while TruncatedSVD applies to raw matrices. Finally, fbpca is Facebook's very fast implementation of PCA. By default, it also centers the columns.
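As an illustration, reducing the scaled matrix to two dimensions with scikit-learn could be written as follows (a minimal sketch; n_components is the output dimensionality mentioned above):

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)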
We can extend the technique we described for characters in the previous section to
words. The matrix structure will be the same, but the rows will correspond to the
words in the corpus, and the columns, to documents, or more simply a context of a
few words, for instance a paragraph. The matrix of word-document pairs .(wi , Dj )
in Table 11.1 is just the transpose of Table 9.8. Each matrix element measures the
association strength between word .wi and document .Dj .
In Fig. 11.1, the matrix elements are simply the raw counts of the pairs. Deerwester et al. (1990) used a more elaborate formula similar to tf × idf, defined in Sect. 9.5.3. It consists of the product of a local weight, computed from a document, like the term frequency tf, and a global one, computed from the collection, like the inverted document frequency idf.
Dumais (1991) describes variants of the weights, where the best scheme to compute the coefficient xij corresponding to term i in document j is given by a local weight of log(tf_ij + 1), where tf_ij is the frequency of term i in document j, and a global weight of

$$1 - \sum_j \frac{p_{ij} \log(p_{ij})}{\log(ndocs)},$$
Table 11.1 The word-by-document matrix. Each cell (wi, Dj) contains an association score between wi and Dj

Words\D   D1   D2   D3   ...   Dn
where p_ij = tf_ij / gf_i, gf_i is the global frequency of term i, that is the total number of times it occurs in the collection, and ndocs, the total number of documents.
Once we have filled the matrix, we can apply an SVD to reduce the dimensionality and represent the words, or transpose the matrix beforehand to represent the documents.
Deerwester et al. (1990) used this method to index the documents of a collection,
where they reduced the dimension of the vector space to 100, i.e. each document is
represented by a 100-dimensional vector. They called it latent semantic indexing
(LSI). We can store the resulting documents in a vector database that will enable us
to speed up document queries and comparisons.
Similarly to LSI, Benzécri (1981a) and Benzécri and Morfin (1981) defined a correspondence analysis method that applies the χ² metric to pairs of words or word-document pairs and a specific normalization of the matrices.
Using this context, a very simple replacement of the association scores X(wi, Dj) in Table 11.1 is to use the counts C(wi, wk), wk ∈ Cj, k ≠ i, computed over all the contexts of wi in the corpus as in Table 11.2. The size of this matrix is n × n, where
n is the number of unique words in the corpus. A PCA will enable us to truncate the
columns of .U Σ to 50, 100, 300, or 500 dimensions and create word embeddings
from its rows.
After this description of cooccurrence matrices, we will now see how to build
them from a corpus and reduce the dimensionality of their rows with a PCA.
Table 11.2 The word-by-context matrix. Counts of bigrams (wi, wj), where wj occurs in a window of size 2K centered on wi called the context

Words\C   w1   w2   w3   ...   wn
11.5.1 Preprocessing
Before we can build the matrix, we need to preprocess our corpus: We tokenize it
with:
import regex as re

text = open('corpus.txt', encoding='utf8').read()
text = text.lower()
words = re.findall(r'\p{L}+', text)
We then replace the tokens in the list with their numerical index:
words_idx = [word2idx[word] for word in words]
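Here, word2idx maps each unique word to an integer index; one way to build it, together with its inverse idx2word used further down (the exact construction is an assumption):

unique_words = sorted(set(words))
word2idx = {word: idx for idx, word in enumerate(unique_words)}
idx2word = {idx: word for word, idx in word2idx.items()}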
Once we have a list of word indices, we can build the cooccurrence matrix, where
for each word in the corpus, we will count the words in its context. These counts
quantify the association between two words. However, a word that is 10 words apart
from the focus is probably less significant than an adjacent one. To take this into
account, we will weight the contribution of each context word by .1/d, d being the
distance to the focus word. For each word, we store the current weighted counts in
a dictionary and we update them with those of new left and right contexts with the
update_counts() function.
def update_counts(start_dict, lc, rc):
    # Weight each context word by 1/d, d being its distance to the focus word
    for i, word in enumerate(rc, start=1):
        if word in start_dict:
            start_dict[word] += 1.0/i
        else:
            start_dict[word] = 1.0/i
    for i, word in enumerate(lc[::-1], start=1):
        if word in start_dict:
            start_dict[word] += 1.0/i
        else:
            start_dict[word] = 1.0/i
    return start_dict
We store all the cooccurrence counts in a dictionary, C_dict, where the keys
are the unique words of the corpus represented by their indices. For a given key,
the value is another dictionary with all the words found in its contexts and their
weighted counts. To compute the counts, we traverse the list of indices representing
the corpus, where at each index, if the key is not already present, we create a
dictionary; we extract the left and right contexts; and we update this dictionary with
the weighted counts of the contexts.
from tqdm import tqdm

def build_C(words_idx: list[int],
            K: int) -> dict[int, dict[int, float]]:
    C_dict = dict()
    for i, word in tqdm(enumerate(words_idx)):
        if word not in C_dict:
            C_dict[word] = dict()
        lc = words_idx[i - K:i]
        rc = words_idx[i + 1:i + K + 1]
        C_dict[word] = update_counts(C_dict[word],
                                     lc, rc)
    return C_dict
We now have the data needed in Table 11.2 and we can reduce its dimensionality. Before that, as our numerical libraries can only apply a PCA to a matrix, we
must convert our cooccurrence dictionary to a NumPy array. For this, we create a
square matrix of zeros of the size of the vocabulary and we assign it the values in
the dictionary:
def build_matrix(C_dict):
    cooc_mat = np.zeros((len(C_dict), len(C_dict)))
    for k, coocs in C_dict.items():
        for c, cnt in coocs.items():
            cooc_mat[k, c] = float(cnt)
    return cooc_mat
We compute the U, Σ, and V matrices with Facebook's PCA, fbpca, as it is very fast. In the next statement, we set the reduced dimension of U's row vectors to 50:

import fbpca

(U, s, Va) = fbpca.pca(cooc_mat, 50)

We then store the coordinates UΣ in a pandas DataFrame:

import pandas as pd

df = pd.DataFrame(
    U @ np.diag(s),
    index=[idx2word[i] for i in range(len(idx2word))])
that will index the rows with the words. We save the vectors in a text file with:
df.to_csv(’cooc50d.txt’, sep=’ ’, header=False)
The file consists of rows, where the first item of a row is the word string followed
by its embedding: A vector of 50 coordinates separated by spaces.
Now that we have counted the cooccurrences and subsequently applied a dimensionality reduction, how can we evaluate the results? We saw in Fig. 11.4 that the chapters formed two clusters and that a chapter in French was closer to the other chapters in French than to those in English. We will follow this idea with a few words from our corpus and determine which words are the most similar to them. If the similar words match our expectations, then we will have a sort of qualitative assessment of the method.
We usually measure the similarity between two embeddings u and v with the cosine similarity:

$$\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{||\mathbf{u}|| \cdot ||\mathbf{v}||},$$
ranging from −1 (most dissimilar) to 1 (most similar), or with the cosine distance ranging from 0 (closest) to 2 (most distant):

$$1 - \cos(\mathbf{u}, \mathbf{v}) = 1 - \frac{\mathbf{u} \cdot \mathbf{v}}{||\mathbf{u}|| \cdot ||\mathbf{v}||}.$$
11.6.2 Programming
We now write a function that computes the list of the N vectors most similar to u in E and returns their indices. Note that u and E must be PyTorch tensors and that we convert a NumPy array to a tensor with the function torch.from_numpy().
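A possible implementation using PyTorch's cosine similarity (the name similar_to() and its exact body are illustrative, not necessarily the original program):

import torch
import torch.nn.functional as F

def similar_to(u, E, N=3):
    # Cosine similarity between u and every row vector of E
    sims = F.cosine_similarity(u.unsqueeze(0), E, dim=1)
    # Indices of the N most similar vectors
    return torch.topk(sims, N).indices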
Taking Homer’s Iliad and Odyssey as corpus, a context of four words, the
three most similar words to he, she, ulysses, penelope, achaeans, and trojans are
respectively, for the cooccurrence matrix:
he [’she’, ’it’, ’they’]
she [’he’, ’they’, ’ulysses’]
ulysses [’achilles’, ’hector’, ’telemachus’]
penelope [’telemachus’, ’ulysses’, ’juno’]
achaeans [’danaans’, ’argives’, ’trojans’]
trojans [’achaeans’, ’danaans’, ’argives’]
Note that as Facebook’s PCA uses a randomized algorithm, these results may vary
from run to run.
We already experimented with semantic associations in Sect. 10.9 and we saw that
mutual information could produce relevant pairs. In this section, we will convert
our cooccurrence matrix to a matrix of mutual information values. Recall that the
definition of mutual information from Fano (1961) is:
$$I(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i)P(w_j)},$$
and we have already the counts in Table 11.2 or in cooc_mat from the program in
Sect. 11.5.
We will first estimate the probabilities from the counts. We have to consider that
each word or word pair appears many more times in the table than in the corpus.
1. To normalize the C(wi, wj) counts, we have to divide them by the sum of all the elements in the matrix. This is just a division by np.sum(cooc_mat);
2. For wi, the focus word, its total count is the sum of the elements in row i. P(wi) is then this sum divided by the total count;
3. Finally, for wj, the context word, its total count is the sum of the elements in column j. P(wj) is then this sum divided by the total count.
With these counts, we can compute $\frac{P(w_i, w_j)}{P(w_i)P(w_j)}$ from the cooccurrence matrix.
The next step is to compute the logarithm of the cells. Given the matrix, this is not
possible as some pairs .(wi , wj ) have a count of 0, leading to an infinitely negative
value. To solve this problem, Bullinaria and Levy (2007) proposed to set the mutual
information of a pair to zero when it is unseen and, to be consistent with the other
pairs, to set pairs with a negative mutual information to zero too. This means that
we keep the positive values, corresponding to associations that are more frequent
than chance, and we set the rest to zeros.
Finally, we can compute the positive mutual information matrix with this
function:
def build_pmi_mat(cooc_mat):
    pair_cnt = np.sum(cooc_mat)
    wi_rel_freq = np.sum(cooc_mat, axis=1) / pair_cnt
    wc_rel_freq = np.sum(cooc_mat, axis=0) / pair_cnt
    # Ratio P(wi, wj)/(P(wi)P(wj)) for all the cells
    mi_mat = (cooc_mat / pair_cnt /
              np.outer(wi_rel_freq, wc_rel_freq))
    mi_mat = np.log2(mi_mat,
                     out=np.zeros_like(mi_mat),
                     where=(mi_mat != 0))
    mi_mat[mi_mat < 0.0] = 0.0
    return mi_mat
where the where argument of np.log2() tells it not to apply the function to zero elements and out to replace them with 0 instead. We compute the positive mutual information matrix with:

pmi_mat = build_pmi_mat(cooc_mat)

The results of the similarity computation of Sect. 11.6 on this matrix seem a bit less relevant than with the previous method.
11.8 GloVe
11.8.1 Model
In Sect. 11.6, we assessed the meaning similarity between the embeddings of two words, wi and wj, with their cosine. GloVe uses the dot product of the embeddings instead so that it matches how many times wj occurs in the vicinity of wi. More precisely, GloVe models the dot product of the embeddings of a focus word, wi, and a word in its context, wj, plus two biases, as being equal to the logarithm of their weighted cooccurrence count:

$$E_L(w_i) \cdot E_R(w_j) + b_L(w_i) + b_R(w_j) = \log C(w_i, w_j),$$

where EL(wi) and bL(wi) are the focus word embedding and its bias and ER(wj) and bR(wj), those of the context word.
11.8.2 Loss
and the sum or mean of the squared errors for the whole corpus. We come then to:
V
L=
. (EL (wi ) · ER (wj ) + bL (wi ) + bR (wj ) − log C(wi , wj ))2 .
i,j =1
We saw in Sect. 11.7 with mutual information that the logarithm could pose a
problem as Table 11.2 contained zeros. GloVe solves it by setting the loss to zero
when .C(wi , wj ) = 0. For this, it defines the function:
$$f(x) = \begin{cases} \left(\dfrac{x}{x_{max}}\right)^{\alpha}, & \text{if } x < x_{max},\\ 1, & \text{otherwise}, \end{cases}$$
shown in Fig. 11.6 and multiplies it with the squared error. We have thus a zero loss
when the counts are zero as .f (0) = 0 and by convention .0 · log 0 = 0. In addition,
Pennington et al. (2014) set .xmax to 100 and .α to .3/4.
We now have the final definition of GloVe's loss:

$$L = \sum_{i,j=1}^{V} f(C(w_i, w_j)) \left(E_L(w_i) \cdot E_R(w_j) + b_L(w_i) + b_R(w_j) - \log C(w_i, w_j)\right)^2.$$
Such a loss enables us to fit the embeddings with a gradient descent and we now
describe how to do this with PyTorch.
11.8.3 Embeddings
The corpus preprocessing step to count the cooccurrences, .C(wi , wj ), is exactly the
same as in Sect. 11.5. Then, we will store the embedding vectors representing the
words in a lookup table as rows of a given dimension. Before we train them, they
are just random values.
In PyTorch, we create an embedding table for 9768 words, the vocabulary size,
where each word has an embedding dimension of five with the statement:
>>> embedding_dim = 5
>>> embedding = nn.Embedding(vocab_size, embedding_dim)
>>> embedding.weight[:5]
tensor([[-1.1903, 0.6513, -0.0738, -0.8198, 1.0269],
[ 1.7192, 1.9402, 0.8532, 0.0069, 0.1495],
[ 0.0781, -0.3445, -0.4184, 2.8871, -0.2273],
[-2.2183, -1.3953, 0.6825, 0.6301, 1.1582],
[ 0.9579, 0.0194, 0.3023, -0.3885, 0.4006]],
grad_fn=<SliceBackward0>)
Once we have trained the embeddings, we will save them and possibly reuse
them in other applications. We load trained embeddings with the from_pretrained()
method:
>>> rand_matrix = torch.rand((vocab_size, embedding_dim))
>>> embedding_rand = nn.Embedding.from_pretrained(
rand_matrix,
freeze=False)
We can make these embeddings trainable or not by setting the freeze argument to False or True.
To model the data for the training loop, we will use PyTorch datasets and dataloaders
as in Sect. 8.5.3. We will represent the dataset as a TensorDataset with two
arguments:
1. An input tensor containing the pair indices corresponding to .(wi , wj ) and
2. An output tensor with the counts representing .C(wi , wj ).
We convert the original data structure holding the counts, C_dict, see Sect. 11.5,
to these tensors with the function cooc_cnts2Xy():
def cooc_cnts2Xy(C_dict: dict[int, dict[int, float]]):
    (C_pairs, C_freqs) = ([], [])
    for word_l, context in C_dict.items():
        for word_r, freq in context.items():
            C_pairs += [[word_l, word_r]]
            C_freqs += [[freq]]
    C_pairs = torch.LongTensor(C_pairs)
    C_freqs = torch.FloatTensor(C_freqs)
    return C_pairs, C_freqs
and we fit the parameters with a training loop identical to that in Sect. 8.5.3.
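For reference, a minimal GloVe module consistent with the attribute names glove.E_l and glove.E_r used below could look as follows (a sketch, assuming the biases are stored as one-dimensional embeddings; not necessarily the original class):

import torch
import torch.nn as nn

class Glove(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.E_l = nn.Embedding(vocab_size, embedding_dim)
        self.E_r = nn.Embedding(vocab_size, embedding_dim)
        self.b_l = nn.Embedding(vocab_size, 1)
        self.b_r = nn.Embedding(vocab_size, 1)

    def forward(self, X):
        # X is a batch of (focus, context) index pairs
        dot = (self.E_l(X[:, 0]) * self.E_r(X[:, 1])).sum(dim=1, keepdim=True)
        return dot + self.b_l(X[:, 0]) + self.b_r(X[:, 1])

glove = Glove(vocab_size, embedding_dim)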
Once fitted, our model contains two embedding tables: one for the focus words and
the other for the context words. Pennington et al. (2014) proposed to add them to
get the GloVe embeddings:
E = glove.E_l.weight + glove.E_r.weight
We can store these vectors in a pandas DataFrame as in Sect. 11.5.4. We must though
detach the tensors from their computational graph and convert them to NumPy with
the method: E.detach().numpy(). We then save them in a file as we did in this
section.
Using the cosine similarity of Sect. 11.6, the same corpus, and the same list of words,
we obtain:
he [’but’, ’was’, ’him’]
she [’he’, ’her’, ’then’]
ulysses [’telemachus’, ’achilles’, ’said’]
penelope [’telemachus’, ’juno’, ’answered’]
trojans [’achaeans’, ’danaans’, ’the’]
achaeans [’trojans’, ’suitors’, ’danaans’]
11.9.1 word2vec
After the PCA and GloVe, word2vec (Mikolov et al. 2013a) is another technique to
obtain embeddings. It uses neural networks and comes in two forms: CBOW and
skipgrams. The goal of CBOW is to guess a missing word given its surrounding
context as in the example below, taken from the beginning of the Iliad with one word blanked out:

Sing, O ______, the anger of Achilles son of Peleus,

A reader would have to fill in the blank with the word goddess. This setup is then identical to fill-in-the-blank questionnaires or cloze tests (Taylor 1953).
In machine learning, this task corresponds to a prediction. The x input, called the context, is a word sequence deprived of one word. This missing word, or target, is the y answer. We can easily create a dataset from any text. Using our example above and sequences of five words, we generate the X contexts by removing the word in the middle and we build the y vector with the words to predict:
$$X = \begin{bmatrix}
\text{sing} & \text{o} & \text{the} & \text{anger}\\
\text{o} & \text{goddess} & \text{anger} & \text{of}\\
\text{goddess} & \text{the} & \text{of} & \text{achilles}\\
\text{the} & \text{anger} & \text{achilles} & \text{son}\\
\text{anger} & \text{of} & \text{son} & \text{of}\\
\text{of} & \text{achilles} & \text{of} & \text{peleus}
\end{bmatrix}; \quad
y = \begin{bmatrix}
\text{goddess}\\ \text{the}\\ \text{anger}\\ \text{of}\\ \text{achilles}\\ \text{son}
\end{bmatrix}$$
While the previous techniques used cooccurrences, word2vec starts from an input
of one-hot encoded words, in the form of indices, and carries out the dimensionality
reduction inside a neural network. To compute the embeddings, we first train a
model to predict the words from their contexts and then extract the model’s input
weights that will correspond to the CBOW embeddings.
Given a training corpus, the first step extracts the vocabulary of its words.
The network input consists of the context words encoded by their indices in
the vocabulary followed by an embedding layer. This layer is just a lookup
mechanism, as we saw in Sect. 11.8.3, that replaces the word indices with their
dense representations: Vectors of the embedding dimension. Practically, a dense
representation is a trainable vector of 10–300 dimensions. The next layer computes
the sum or the mean of the embedding vectors resulting into one single vector.
Finally, a linear layer and a softmax function predict the target word. This last layer
has as many weights as we have dimensions in the embeddings; see Fig. 11.7.
Fig. 11.7 The CBOW network: the one-hot encoded context words (sing, o, the, anger) go through an embedding layer, a sum (or mean) layer, and a softmax output that predicts the target word (goddess)
We initialize randomly the dense vector representations of the words. The vector
parameters are then learned by the fitting procedure. With this technique, the word
embeddings correspond to the weights of the first layer of the network.
In Fig. 11.7, the input consists of four context words or batches of four-word contexts. We first have to format our data so that it fits this input. We use the same statements as in Sect. 11.5.1
to tokenize the words and create the indices. We apply them on a corpus made of a
concatenation of the Iliad and the Odyssey.
Then given a context size c_size of 5 and left and right contexts w_size of 2, we
create the list of contexts in X and of words to predict in y with this function:
def create_Xy(words):
    (X, y) = ([], [])
    c_size = 2 * w_size + 1
    for i in range(len(words) - c_size + 1):
        X.append(words[i: i + w_size] +
                 words[i + w_size + 1: i + c_size])
        y.append(words[i + w_size])
    return X, y
We apply this function to the list of words and we obtain the same values as in the
description of CBOW for X:
>>> X, y = create_Xy(words)
>>> X[2:5]
[[’sing’, ’o’, ’the’, ’anger’],
[’o’, ’goddess’, ’anger’, ’of’],
[’goddess’, ’the’, ’of’, ’achilles’]]
and .y:
>>> y[2:5]
[’goddess’, ’the’, ’anger’]
and we convert X and .y to PyTorch tensors, the input and target values:
X = torch.LongTensor(X)
y = torch.LongTensor(y)
and .y:
>>> y[2:5]
tensor([3697, 8548, 358])
Embeddings
The first layer in Fig. 11.7 consists of an embedding layer. This is just a lookup
table, as we saw in Sect. 11.8.3, where we store the embedding vectors as rows of a
given dimension. Before we train them, they are just random values.
We extract the embeddings of a batch of three contexts with the statement:
>>> embedding(X[2:5])
>>> embedding(X[2:5]).size()
torch.Size([3, 4, 5])
where the first axis corresponds to the batch size, here 3, from X[2] to X[4], the
second one to the size of the input, here four words making the context, and the
third one, the dimension of the embedding vectors: 5.
In Fig. 11.7, we sum or compute the mean of the embeddings making up the
context. We simply add sum() or mean() to the tensor with the dimension where we
apply the operation, here the four context words:
>>> embedding(X[2:5]).mean(dim=1)
tensor([[ 0.0108, -0.6135, 0.1026, 0.0504, 0.3282],
[ 0.7513, 0.1777, -0.0256, -0.1262, 0.7771],
[ 0.2507, 0.1822, -0.1035, 0.1529, 0.4330]],
grad_fn=<MeanBackward1>)
Embedding Bags
PyTorch combines the embedding lookup and the sum or mean in a single module, nn.EmbeddingBag; the statement nn.EmbeddingBag(vocab_size, embedding_dim) creates a new embedding matrix, with the mean as its default aggregation mode. To have the same content as our first embeddings, we load our first embedding matrix:

embedding_bag = nn.EmbeddingBag.from_pretrained(
    embedding.state_dict()['weight'])
The result is then the same as with mean() in the previous section:
>>> embedding_bag(X[2:5])
tensor([[ 0.0108, -0.6135, 0.1026, 0.0504, 0.3282],
[ 0.7513, 0.1777, -0.0256, -0.1262, 0.7771],
[ 0.2507, 0.1822, -0.1035, 0.1529, 0.4330]])
The Network
It is now easy to program the architecture in Fig. 11.7 with PyTorch. We just declare
the sequence with the tensor dimensions:
embedding_dim = 50
model = nn.Sequential(
    nn.EmbeddingBag(vocab_size, embedding_dim),
    nn.Linear(embedding_dim, vocab_size))
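To train this network, we can use the cross-entropy loss, as the model outputs one score per word of the vocabulary; a minimal setup, assuming the training loop of Sect. 8.5.3 (the optimizer choice and learning rate below are illustrative):

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)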
CBOW Embeddings
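Once the model is fitted, the CBOW embeddings are the weights of its first layer, the embedding bag. We can read them with, for instance (a sketch; the exact statement is an assumption consistent with the sequential model above):

cbow_embeddings = model[0].weight.detach().numpy()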
Semantic Similarity
Once our model is trained, we can compute the cosine similarity of a few words
using the function in Sect. 11.6. Using a batch size of 1024, the mean() function
to aggregate the vectors, and five epochs, the five most similar words for: he, she,
ulysses, penelope, achaeans, and trojans are respectively:
he [’she’, ’they’, ’i’]
she [’he’, ’vulcan’, ’they’]
ulysses [’telemachus’, ’antinous’, ’juno’]
penelope [’helen’, ’telemachus’, ’juno’]
achaeans [’danaans’, ’trojans’, ’argives’]
trojans [’danaans’, ’achaeans’, ’argives’]
which all correspond to semantically related terms. These lists may vary depending
on the network initialization.
11.9.4 Skipgrams
The Model
In the skipgram task, we try to predict the context of a given word, i.e. the words
surrounding it. For instance, taking the word goddess in:
Sing, O goddess, the anger of Achilles son of Peleus,
and a window of two words to the left and to the right of it, we would predict: Sing,
O, the, and anger.
For this, the skipgram model tries to maximize the average log-probability of the
context. For a corpus consisting of a sequence of T words, .w1 , w2 , . . . , wT , this
yields:
$$\frac{1}{T} \sum_{t=c+1}^{T-c} \; \sum_{\substack{j=-c,\\ j \neq 0}}^{c} \log P(w_{t+j} \mid w_t),$$
where c is the context size. Using our example and the word goddess, this
would correspond to the sum of logarithms of: .P (sing|goddess), .P (o|goddess),
.P (the|goddess), and .P (anger|goddess).
Given an input word, wi, for instance goddess, we compute the probability P(wo|wi) of word wo to be in its context using the dot product of their respective embeddings, v_wi for the center word and v'_wo for the context. This dot product, v_wi · v'_wo, reflects the proximity of the two embeddings. We then use the softmax function to convert it into a probability:

$$P(w_o \mid w_i) = \frac{\exp(v_{w_i} \cdot v'_{w_o})}{\sum_{w \in V} \exp(v_{w_i} \cdot v'_{w})},$$

where V is the vocabulary.
The skipgram initial setting results in a model with a huge number of parameters.
To reduce it, Mikolov et al. (2013b) proposed a binary classification, where they
extracted pairs consisting of a word and a context word, forming the positive class,
as well as pairs, where the second word is outside the context, the negative class. As
classes, we have then .y = 1, when the words are cooccurring, and .y = 0, when not.
We represent the words with their embeddings and we compute the dot product of
vector pairs consisting of the input word and a word in the context. For the negative
class, we use a pair consisting of the input word and a word drawn randomly from
the corpus.
Using our goddess example again and a window of two words to the left and to
the right of it, we have the positive pairs:
$$X = \begin{bmatrix}
\text{goddess} & \text{sing}\\
\text{goddess} & \text{o}\\
\text{goddess} & \text{the}\\
\text{goddess} & \text{anger}
\end{bmatrix}; \quad
y = \begin{bmatrix}1\\1\\1\\1\end{bmatrix}$$
For each positive pair, Mikolov et al. (2013b) proposed to draw randomly k
negative pairs, for instance labours, to, end, before, and, etc.
$$X = \begin{bmatrix}
\text{goddess} & \text{labours}\\
\text{goddess} & \text{to}\\
\text{goddess} & \text{end}\\
\text{goddess} & \text{before}\\
\text{goddess} & \dots
\end{bmatrix}; \quad
y = \begin{bmatrix}0\\0\\0\\0\\\dots\end{bmatrix}$$
The dot product ranges from .−∞ to .∞. To have a logarithmic loss compatible
with a neural net architecture, we use the logistic curve as activation function. See
the loss below.
The Network
From these definitions, we can program a network. Contrary to the CBOW model,
the architecture is not a sequence and we have to define a new class as in Sect. 8.5.3.
We have a pair consisting of an input word, i_word, and a context word that
corresponds to the output o_word in the network. Both words go through their
respective embedding layers that we define in the __init__ method. In the forward
method, we split the X input in two columns and we compute the dot product of the
pairwise vectors:
class Skipgram(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_embedding = nn.Embedding(vocab_size,
                                        embedding_dim)
        self.o_embedding = nn.Embedding(vocab_size,
                                        embedding_dim)

    def forward(self, X):
        # Split the X input in two columns: input word and context word
        i_embs = self.i_embedding(X[:, 0])
        o_embs = self.o_embedding(X[:, 1])
        # Dot product of the pairwise vectors
        return (i_embs * o_embs).sum(dim=1)
The Loss
To train the model, we need a loss. Mikolov et al. (2013b) defined it from the
network output, the dot product of the input and output embeddings, as:
$$-\log \sigma(v_{w_i} \cdot v'_{w_o}) - \sum_{n=1}^{k} \log \sigma(-v_{w_i} \cdot v'_{w_n}),$$
where .σ is the logistic function, .wi is the center word, .wo , a word in the context,
and .wn , a word outside it. This loss reflects the property that words in the context
should be similar to the center word, while the other words should be dissimilar.
There is no such built-in loss in PyTorch, so we need to define it. In Sect. 11.8.5,
we wrote a function to compute a custom loss. Here, we will see another way to do
it with a class that applies to a batch with the same signature: loss_fn(y_pred, y)
and returns the mean or the sum for the batch.
The y_pred vector contains the predictions for the positive and negative classes. We compute two intermediate loss vectors: for class 1, log σ(v_wi · v'_wo); for class 0, log σ(−v_wi · v'_wn). We then extract the values relevant to the positive class by a pointwise multiplication with y:
We add these two vectors and we compute the mean to have the loss. This
corresponds to the class:
class NegSamplingLoss(nn.Module):
    def __init__(self):
        super().__init__()
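A forward() method consistent with this description could be the following (a sketch continuing the class above; it assumes y_pred contains the raw dot products and y the 0/1 labels):

    def forward(self, y_pred, y):
        # log sigma(x) for the positive pairs, log sigma(-x) for the negative ones
        pos = torch.log(torch.sigmoid(y_pred)) * y
        neg = torch.log(torch.sigmoid(-y_pred)) * (1.0 - y)
        return -torch.mean(pos + neg)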
We create the loss, the model, and the optimizer with the statements:
model = Skipgram()
loss_fn = NegSamplingLoss()
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001)
Dataset Preparation
As with CBOW, we tokenize the corpus and we extract the .(wi , wo ) pairs by
traversing the list. Then we have to build the negative pairs: .(wi , wn ). We then draw
randomly from their distribution. We can do this with the function:
random.choices(word_idx, weights=frequencies, k=N)
Rather than the raw counts, word2vec draws the negative words from the unigram distribution raised to a power, for which Mikolov et al. (2013b) used 3/4:

$$U(w) = \frac{C(w)^{power}}{\sum_i C(w_i)^{power}}.$$
We apply a power transform to the initial counts or distribution with the function:
def power_transform(dist, power):
    dist_pow = {k: math.pow(v, power)
                for k, v in dist.items()}
    total = sum(dist_pow.values())
    dist_pow = {k: v/total
                for k, v in dist_pow.items()}
    return dist_pow
with t ≈ 0.003.
Semantic Similarity
When trained on Homer’s Iliad and Odyssey on five epochs, the three most similar
words to those in our list are:
he [’she’, ’they’, ’it’]
she [’he’, ’her’, ’him’]
ulysses [’achilles’, ’telemachus’, ’hector’]
penelope [’speak’, ’nestor’, ’telemachus’]
achaeans [’trojans’, ’argives’, ’two’]
trojans [’achaeans’, ’argives’, ’cyprian’]
Fig. 11.8 The language detector architecture: the character, bigram, and trigram embeddings are averaged, concatenated, and passed to a softmax layer that outputs the language
3. Encode the characters, character bigrams and trigrams of a text using embedding
tables and compute their mean. This corresponds to the left part of Fig. 11.8;
4. Create a logistic regression model that takes a concatenation of the embedding
means and outputs the language code; see the middle and right part of Fig. 11.8;
5. Train this model, i.e the embedding tables and the logistic layer, and report the
results.
11.10.1 Corpus
The first step before we train a model is to find a dataset. We will use that of Tatoeba, a collaborative site, where users add or translate short texts and annotate them with language tags. At the end of 2023, the Tatoeba corpus reached nearly 12 million texts in more than 400 languages. It is available in one file called sentences.csv. Many applications use Tatoeba to train machine-learning models
including translation and language detection.
The dataset is structured this way: There is one text per line, where each line consists of a unique identifier, the language code, and the text:

sentence identifier [tab] language code [tab] text [cr]
The fields are separated by tabulations and ended by a carriage return. The language
codes follow an ISO standard: eng for English, fra for French, cmn for Chinese, deu
for German, etc. Here are a few lines from the dataset:
The Tatoeba dataset is very large and, at the same time, some languages only have a
handful of texts. We will extract a working dataset from it with languages that have
at least 20,000 sentences. We will then downsample this data so that we have an
equal number of sentences for each language. We will finally split it into training,
validation, and test sets.
We first read the sentences with a generator:
def file_reader(file):
    with open(file, encoding='utf8', errors='ignore') as f:
        for line in f:
            row = line.strip()
            yield tuple(row.split('\t'))
line_generator = file_reader(’sentences.csv’)
and we count the texts per language to enable us to keep the languages with enough
training data:
>>> lang_freqs = Counter(map(lambda x: x[1], line_generator))
>>> lang_freqs.most_common(3)
[(’eng’, 1854349), (’rus’, 1027631), (’ita’, 867469)]
We select the languages with more than 20,000 samples with the statements:
SENT_PER_LANG = 20000
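A dictionary comprehension can carry out this selection; selected_langs, mapping each kept language to its count, is used in the next steps (the exact statement is an assumption):

selected_langs = {lang: cnt for lang, cnt in lang_freqs.items()
                  if cnt >= SENT_PER_LANG}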
Classifiers tend to bias their choice toward the most populated classes. To counter this, we will balance the corpus and we will keep about 20,000 texts for each language. This corresponds to a fraction of the Tatoeba corpus that depends on the language. English has 1,854,349 samples and we will select 20,000/1,854,349 ≈ 1.1% of them; Russian has 1,027,631 and we will select 20,000/1,027,631 ≈ 1.9%, etc.
To choose the samples, we will use a uniform generator of random numbers
between 0 and 1. We will scan the sentences of our dataset and, for each of them,
draw a random number. If we want to select 10% of the sentences of a given
language, we will return the sentences for which the random number is less than
0.1, for 20%, when it is less than 0.2, etc. In our case, we compute the percentages
so that for each language we have 20,000 texts.
lang_percentage = dict()
for lang, cnt in selected_langs.items():
    lang_percentage[lang] = SENT_PER_LANG/cnt
We can now extract the sentences of our working corpus and shuffle it:
import random

line_generator = file_reader('sentences.csv')
working_corpus = []
for lang_tuple in line_generator:
    lang = lang_tuple[1]
    if (lang in lang_percentage and
            random.random() < lang_percentage[lang]):
        working_corpus += [lang_tuple]
random.shuffle(working_corpus)
We split it into training, validation, and test sets with the ratios: 80%, 10%, and
10%:
TRAIN_PERCENT = 0.8
VAL_PERCENT = 0.1
TEST_PERCENT = 0.1
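One way to carry out the split on the shuffled working_corpus (the statements below are illustrative):

n_train = int(TRAIN_PERCENT * len(working_corpus))
n_val = int(VAL_PERCENT * len(working_corpus))
training_set = working_corpus[:n_train]
validation_set = working_corpus[n_train:n_train + n_val]
test_set = working_corpus[n_train + n_val:]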
Google’s CLD3 uses characters as input, as well as character bigrams and trigrams,
each with its own embedding table. We extract such n-grams from a string with a
given n value with this function:
def ngrams(sentence: str,
           n: int = 1) -> list[str]:
    ngram_l = []
    for i in range(len(sentence) - n + 1):
        ngram_l += [sentence[i:i+n]]
    return ngram_l
and we collect all the n-grams from .n = 1 up to 3 with this function:
def all_ngrams(sentence: str,
               max_ngram: int = 3,
               lc=True) -> list[list[str]]:
    if lc:
        sentence = sentence.lower()
    all_ngram_list = []
    for i in range(1, max_ngram + 1):
        all_ngram_list += [ngrams(sentence, n=i)]
    return all_ngram_list
Applied to the training set, this yields more than 5000 unique characters, 100,000 unique bigrams, and 340,000 trigrams, as there are many non-Latin alphabets. In Sect. 4.1.2, we saw that Unicode has a coding capacity of 21 bits, resulting in a ceiling of more than 1,000,000 different characters. This makes it impossible to use one-hot encoding in such an application.
We will use a hash function to reduce these numbers with a technique sometimes
called the hashing trick. Hash functions map a data object to a fixed length number
such as Python’s built-in hash function:
>>> hash(’a’)
7642859311653074128
>>> hash(’ac’)
9165824848673741814
>>> hash(’abc’)
29577857799253695
We can reduce the hash code range of the n-grams by dividing them by a fixed
integer, the modulus, for instance 100, and taking the remainder or modulo, resulting
then in at most 100 different codes.
>>> MOD_CHARS = 100
>>> hash(’a’) % MOD_CHARS
28
This hashing technique has many advantages here as we can skip the n-gram
indexing: We just need to apply the hash function to any n-gram. In addition, there
is no unknown value as it will always return an integer lower than the modulus.
Nonetheless, hashing will also create encoding conflicts as the coding capacity is
reduced from Unicode’s 1 million characters to the modulus. In our example, as we
have about 5000 different characters, this means that on average, with a modulus of
100, 50 characters will have the same hash value. We will see that this will not harm
the classification results.
As modulus for the characters, bigrams, and trigrams, we will use the prime
numbers MODS = [2053, 4099, 4099] as they usually have better distributional
properties.
A problem with Python’s hashing function is that it does not always return the
same values when executed on different machines. We replace it with the MD5
standard and this function instead (Klang 2023):
import hashlib
def reproducible_hash(string: str) -> int:
    h = hashlib.md5(string.encode('utf-8'),
                    usedforsecurity=False)
    return int.from_bytes(h.digest()[0:8], 'big', signed=True)
where hashlib.md5() creates a hash encoder and digest() computes the code in the
form of bytes. We then return an integer from it.
We can now convert a n-gram to a numerical code. We write a function to apply
this to a list of strings, where the modulus is an argument:
def hash_str_list(text: list[str],
                  char_mod: int) -> list[int]:
    values = map(lambda x: x % char_mod,
                 map(reproducible_hash, text))
    return list(values)
We will now create our input and output matrices. We will only write the code for
the training set as this is the same for the two other sets. First, we create indices for
the languages:
idx2lang = dict(enumerate(sorted(selected_langs)))
lang2idx = {v: k for k, v in idx2lang.items()}
Then we create a tensor of the .y output:
y_train = torch.LongTensor(
    [lang2idx[tuple_lang[1]]
     for tuple_lang in training_set])
Now we can look at the X matrix. We first create a function that, given a text, creates the n-grams and converts them into hash codes. It returns three tensors of indices:
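A possible body for this function, here called hash_all_ngrams() as in its use below, reusing all_ngrams(), hash_str_list(), and the MODS list (the exact implementation is an assumption):

def hash_all_ngrams(text: str):
    ngram_lists = all_ngrams(text, max_ngram=3)
    return tuple(torch.LongTensor(hash_str_list(ngram_list, mod))
                 for ngram_list, mod in zip(ngram_lists, MODS))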
To build the X matrix, we just call this hash function on all the texts of the training
set. We carry this out with the build_X() function:
def build_X(dataset):
    X = []
    for lang_tuple in tqdm(dataset):
        x = hash_all_ngrams(lang_tuple[2])
        X += [x]
    return X
and we have
X_train = build_X(training_set)
X_train is a list of triples, where each triple represents one text and consists of three
tensors of indices for the characters, bigrams, and trigrams.
With such offsets, we can create a batch of samples that do not need a rectangular
table of indices.
We can now create a function that generates a bag for a batch of hash codes.
Given a list of tensors of indices as input, we concatenate the tensors, we compute
their lengths, and then their offsets with this function:
def bag_generator(X_idx_l: list):
    X_idx = torch.cat(X_idx_l, dim=-1)
    bag_lengths = [X_idx_l[i].size(dim=0)
                   for i in range(len(X_idx_l))]
    X_offsets = [sum(bag_lengths[:i])
                 for i in range(len(bag_lengths))]
    return X_idx, torch.LongTensor(X_offsets)
A batch in the X table, for instance X[:5], is a stack of triples: the characters, bigrams, and trigrams. We have to generate a separate bag for each of them, as sketched in the function after this paragraph. The inside zip extracts the three columns, corresponding respectively to the characters, the bigrams, and the trigrams, and applies the bag generator to them. We then have three bags made of a pair of the indices and the offsets. The second zip creates two lists: one for the indices and the other for the offsets.
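A sketch of this step as a helper function (the name batch_bags() is illustrative; X_batch is a list of the triples built by build_X()):

def batch_bags(X_batch):
    # One bag (indices, offsets) per n-gram order: characters, bigrams, trigrams
    bags = [bag_generator(list(column)) for column in zip(*X_batch)]
    idx_list, offset_list = zip(*bags)
    return list(idx_list), list(offset_list)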
11.10.7 Model
The model follows the language detector architecture, where we first look up the
embeddings for the characters, bigrams, and trigrams, compute their respective
means with an embedding bag, concatenate these means, and then pass the vector
to a linear layer:
class LangDetector(nn.Module):
    def __init__(self, MODS, EMBEDDING_DIM, n_lang):
        super().__init__()
        (MAX_CHARS, MAX_BIGRAMS, MAX_TRIGRAMS) = MODS
        self.emb_chars = nn.EmbeddingBag(MAX_CHARS,
                                         EMBEDDING_DIM)
        self.emb_bigr = nn.EmbeddingBag(MAX_BIGRAMS,
                                        EMBEDDING_DIM)
        self.emb_trig = nn.EmbeddingBag(MAX_TRIGRAMS,
                                        EMBEDDING_DIM)
        # Linear layer applied to the concatenation of the three means
        # (the attribute name fc is illustrative)
        self.fc = nn.Linear(3 * EMBEDDING_DIM, n_lang)
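The forward pass computes the three mean embeddings from the index and offset tensors, concatenates them, and applies the linear layer; a possible method continuing the class above (a sketch consistent with the description, not necessarily the original code):

    def forward(self, X_idx, X_offsets):
        # Mean embeddings of the characters, bigrams, and trigrams
        chars = self.emb_chars(X_idx[0], X_offsets[0])
        bigrams = self.emb_bigr(X_idx[1], X_offsets[1])
        trigrams = self.emb_trig(X_idx[2], X_offsets[2])
        # Concatenate the three means and predict the language scores
        return self.fc(torch.cat((chars, bigrams, trigrams), dim=1))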
The training loop is identical to those we have already written. The only difficulty is the call to the bag generator, and we detail it here. We skip the piece of code to measure the performance on the validation set.
loss_train_history = []
acc_train_history = []

    acc_train /= len(n_indices)
    acc_train_history += [acc_train.item()]
    loss_train /= len(n_indices)
    loss_train_history += [loss_train]

Fig. 11.9 The training loss and accuracy over the epochs
11.10.9 Evaluation
The curves in Fig. 11.9 show the training and validation losses and accuracies over
the epochs. As we can see, training and validation figures start diverging after two
epochs, hinting at an overfit as soon as epoch 3. Running the fitting loop again and
stopping after two epochs, we obtain a macro average of 0.9723 for the validation
set and of 0.9735 for the test set.
GloVe and word2vec rely on iterative training procedures and can handle larger corpora than singular value decomposition.
As the power of computers and graphical processing units (GPU) increased
dramatically, word embeddings found new applications. Most notably, replacing
input words with pre-trained embeddings is an efficient way to mitigate sparse data.
A big advantage is that we can train these embeddings on large raw corpora, where
they capture semantic regularities. This explains why they are now ubiquitous in
classification and other NLP models.
Finally, we created pedagogical programs to understand their inner workings.
The original programs of word2vec, GloVe, and fastText are open source and
optimized. They are probably better choices for production applications. In addition,
the authors trained them on large corpora and published their vectors. This makes
it possible to reuse them as pretrained parameters in neural network models. We
will see an application of this with GloVe vectors in Chap. 14, Part-of-Speech and
Sequence Annotation.
Chapter 12
Words, Parts of Speech, and Morphology
12.1 Words
In the previous chapters, when we processed words, we did not make any distinction regarding their intrinsic character. As words have different grammatical properties, classical linguists designed classes to gather those sharing common properties. They called these classes parts of speech (POS). The concept of part of speech dates back to the philosophy and teaching of classical antiquity. Plato made a distinction between the verb and the noun. After him, the word classification further evolved, and parts of speech grew in number until Dionysius Thrax fixed and
formulated them in a form that we still use today. Aelius Donatus popularized the
list of the eight parts of speech: noun, pronoun, verb, adverb, participle, conjunction,
preposition, and interjection, in his work Ars grammatica, a reference reading in the
Middle Ages.
The word parsing comes from the Latin phrase partes orationis ‘parts of speech’.
It corresponds to the identification of the words’ parts of speech in a sentence. In
natural language processing, POS tagging is the automatic annotation of words with
grammatical categories, also called POS tags. Parts of speech are also sometimes
called lexical categories.
Most European languages have inherited the Greek and Latin part-of-speech
classification with a few adaptations. The word categories as they are taught today
roughly coincide in English, French, and German in spite of some inconsistencies.
This is not new. To manage the nonexistence of articles in Latin, Latin grammarians tried to fit the Greek article into the Latin pronoun category.
The definition of the parts of speech is sometimes arbitrary and has been
a matter of debate. From Dionysius Thrax, tradition has defined the parts of
speech using morphological and grammatical properties. We shall adopt essentially
this viewpoint here. However, words of a certain part of speech share semantic
properties, and some grammars contain statements like a noun denotes a thing and
a verb an action.
Parts of speech can be clustered into two main classes: the closed class and the
open class. Closed class words are relatively stable over time and have a functional
role. They include words such as articles, like English the, French le, or German
der, which change very slowly. Among the closed class, there are the determiners,
the pronouns, the prepositions, the conjunctions, and the auxiliary and modal verbs
(Table 12.1).
Open class words form the bulk of a vocabulary. They appear or disappear with
the evolution of the language. If a new word is created, say a hedgedog, a cross
between a hedgehog and a Yorkshire terrier, it will belong to an open class category:
here a noun. The main categories of the open class are the nouns, the adjectives, the
verbs, and the adverbs (Table 12.2). We can add interjection to this list. Interjections
are words such as ouch, ha, oh, and so on, that express sudden surprise, pain, or
pleasure.
Basic categories can be further refined, that is subcategorized. Nouns, for instance,
can be split into singular nouns and plural nouns. In French and German, nouns
can also be split according to their gender: masculine and feminine for French, and
masculine, feminine, and neuter for German.
Genders do not correspond in these languages and can shape different visions
of the world. Sun is a masculine entity in French—le soleil—and a feminine one in
German—die Sonne. In contrast, moon is a feminine entity in French—la lune—and
a masculine one in German—der Mond.
Additional properties that can further specify main categories are often called the
grammatical features. Grammatical features vary among European languages and
include notably the number, gender, person, case, and tense. Each feature has a set
of possible values; for instance, the number can be singular or plural.
Grammatical features differ according to the part of speech. In English,
a verb has a tense, a noun has a number, and an adjective has neither tense nor
number. In French and German, adjectives have a number but no tense. The feature
list of a word defines its part of speech together with its role in the sentence.
12.1.3 Two Significant Parts of Speech: The Noun and the Verb
The Noun
Nouns are divided into proper and common nouns. Proper nouns are names of
persons, people, countries, companies, and trademarks, such as England, Robert,
Citroën. Common nouns are the rest of the nouns. Common nouns are often used to
qualify persons, things, and ideas.
A noun definition referring to semantics is a disputable approximation, however.
More surely, nouns have certain grammatical features, namely the number, gender,
and case (Table 12.3). A noun group is marked with these features, and other words
of the group, that is, determiners, adjectives, must agree with the features they share.
While number and gender are probably obvious, case might be a bit obscure for
those who do not know languages such as Latin, Russian, or German. Case is a function
marker that inflects words such as nouns or adjectives. In German for example, there
are four cases: nominative, accusative, genitive, and dative. The nominative case
corresponds to the subject function, the accusative case to the direct object function,
and the dative case to the indirect object function. Genitive denotes a possession
relation. These cases are still marked in English and French for pronouns.
In addition to these features, the English language makes a distinction between
nouns that can have a plural: count nouns, and nouns that cannot: mass nouns. Milk,
water, air are examples of mass nouns.
Verbs
Semantically, verbs often describe an action, an event, a state, etc. More positively,
and as for the nouns, verbs in European languages are marked by their morphology.
This morphology, that is the form variation depending on the grammatical features,
is quite elaborate in a language like French, notably due to the tense system. Verbs
can be basically classified into three main types: auxiliaries, modals, and main verbs.
Auxiliaries are helper verbs such as be and have that enable us to build some
of the main verb tenses. Modal verbs are verbs immediately followed by another
verb in the infinitive. They usually indicate a modality, a possibility. Modal verbs
are more specific to English and German. In French, semiauxiliaries correspond to
a similar category.
Main verbs are all the other verbs. Traditionally, main verbs are categorized
according to their complement’s function:
• Copula or link verb—verbs linking a subject to an (adjective) complement.
Copulas include verbs of being such as be, être, sein when not used as
auxiliaries, and other verbs such as seem, sembler, scheinen.
• Intransitive—verbs taking no object.
• Transitive—verbs taking an object.
• Ditransitive—verbs taking two objects.
Verbs have more features than other parts of speech. First, the verb group
shares certain features of the noun (Table 12.4). These features must agree with
corresponding ones of the verb’s subject.
Verbs have also specific features, namely the tense, the mode, and the voice:
Tense locates the verb, and the sentence, in time. Tense systems are elaborate in
English, French, and German, and do not correspond. Tenses are constructed
using form variations or auxiliaries. Tenses are a source of significant form
variation in French;
Mood enables the speaker to present or to conceive of the action in various ways;
Voice characterizes the sequence of syntactic groups. Active voice corresponds
to the “subject, verb, object” sequence. The reverse sequence corresponds
to the passive voice. This voice is possible only for transitive verbs. Some
constructions in French and German use a reflexive pronoun. They correspond to
the pronominal voice.
12.2 Standardized Part-of-Speech Tagsets and Grammatical Features
While the basic parts of speech (determiners, nouns, pronouns, adjectives, verbs, auxiliaries, adverbs, conjunctions, and prepositions) are relatively well defined, there
is a debate on how to standardize them for a computational analysis. One issue is
the level of detail. Some tagsets feature a dozen tags, some over a hundred. Another
issue that is linked to the latter is that of subcategories. How many classes for verbs?
Only one, or should we create auxiliaries, modal, gerund, intransitive, transitive
verbs, etc.?
The debate becomes even more complicated when we consider multiple lan-
guages. In French and German, the main parts of speech can be divided into
subclasses depending on their gender, case, and number. In English, these divisions
are useless. Although it is often possible to map tagsets from one language to
another, there is no indisputable universal scheme, even within the same language.
Fortunately, with the collection and annotation of multilingual corpora, practical
standards have emerged, although the discussion is not over. We will examine two
annotation schemes: The Universal Part-of-Speech Tagset (UPOS) (Petrov et al.
2012) for parts of speech and an extension of MULTEXT (Ide and Véronis 1995;
Monachini and Calzolari 1996) for grammatical features.
to tag texts in seven European languages using statistical methods. They also added
features (subcategories) specific to each language (Table 12.5).
MULTEXT (Ide and Véronis 1995; Monachini and Calzolari 1996) was a
multinational initiative that aimed at providing an annotation scheme for all the
Western and Eastern European languages. For the parts of speech, MULTEXT
merely perpetuated the traditional categories and assigned them a code. The
universal part-of-speech tagset (UPOS) (Petrov et al. 2012) is almost identical,
but includes mappings with other tagsets used in older corpora. This ensured its
popularity and UPOS is now widely adopted; see Table 12.6.
a French one:
A user can extend the coding scheme and add attributes if the application requires
it. A noun could be tagged with some semantic features such as country names,
currencies, etc.
Finally, it is easy to see that the sequence of attributes, such as Nc-s-, is just a
duplicate of the list of keys and values, N[type=common number=singular]. In
addition, this sequence needs to be ordered. We can remove the redundancy and just
use the UPOS codes and a set of key/value pairs separated by vertical bars. We can
encode the three examples above as:
NOUN number=singular
NOUN gender=masculine|number=singular
NOUN gender=neuter|number=singular|case=nominative
where we write the pairs in any order. When associated with UPOS, we do not need
a noun type as the UPOS noun is a common noun.
This format for grammatical features is adopted by CoNLL-U and the Universal
Dependencies corpora. It has become a sort of de facto standard; see the two next
sections.
12.3 The CoNLL Format
Table 12.9 Annotation of the Spanish sentence: La reestructuración de los otros bancos checos
se está acompañando por la reducción del personal ‘The restructuring of Czech banks is
accompanied by the reduction of personnel’ (Palomar et al. 2004) using the CoNLL-U format
ID FORM LEMMA UPOS FEATS
1 La el DET Definite=Def|Gender=Fem
|Number=Sing|PronType=Art
2 reestructuración reestructuración NOUN Gender=Fem|Number=Sing
3 de de ADP AdpType=Prep
4 los el DET Definite=Def|Gender=Masc
|Number=Plur |PronType=Art
5 otros otro DET Gender=Masc|Number=Plur
|PronType=Ind
6 bancos banco NOUN Gender=Masc|Number=Plur
7 checos checo ADJ Gender=Masc|Number=Plur
8 se se PRON Case=Acc|Person=3|PrepCase=Npr
|PronType=Prs|Reflex=Yes
9 está estar AUX Mood=Ind|Number=Sing|Person=3
|Tense=Pres|VerbForm=Fin
10 acompañando acompañar VERB VerbForm=Ger
11 por por ADP AdpType=Prep
12 la el DET Definite=Def|Gender=Fem
|Number=Sing|PronType=Art
13 reducción reducción NOUN Gender=Fem|Number=Sing
14 del del ADP AdpType=Preppron
15 personal personal NOUN Gender=Masc|Number=Sing
16 . . PUNCT PunctType=Peri
where
• The ID column is the word index in the sentence;
• The FORM column corresponds to the word;
• The LEMMA column contains the lemma and the phrase los otros bancos
starting at index 4 is lemmatized as el otro banco;
• The UPOS column corresponds to the part of speech: los as well as otros are
determiners; bancos is a noun;
• Finally, the FEATS column corresponds to the grammatical features that are
listed as an unordered set separated by vertical bars. The word bancos ‘banks’
has a masculine gender (Gender=Masc) and a plural number (Number=Plur).
The columns are delimited by a tabulation character, and the sentences by a blank
line. Each sentence is preceded by comments starting with a # character giving the
sentence identifier as well as a raw text version of it:
# sent_id = 3LB-CAST-104_C-5-s10
# text = La reestructuración de los otros bancos checos se \
está acompañando por la reducción del personal.
# orig_file_sentence 001#29
Tables 12.10 and 12.11 show a similar annotation with respectively a sentence
in English from the EWT corpus and a sentence from the French FTB corpus
(Abeillé et al. 2003; Abeillé and Clément 2003) converted to the CoNLL-U format
by Silveira et al. (2014) and Candito et al. (2009).
The Universal Dependencies (UD) repository (de Marneffe et al. 2021) consists of
more than 200 annotated corpora in more than 100 languages. They all adopt the
CoNLL-U format that we described in Sects. 12.2 and 12.3 for the parts of speech
as well as for the grammatical features. As it is quite large and multilingual, it is a
valuable source of training data to build morphological parsers and part-of-speech
taggers. We describe here a Python reader to load a corpus and extract the forms,
lemmas, and features.
Table 12.10 Annotation of the English sentence: Or you can visit temples or shrines in Okinawa.
from the EWT corpus following the CoNLL-U format (Silveira et al. 2014)
ID FORM LEMMA UPOS FEATS
1 Or or CCONJ _
2 you you PRON Case=Nom|Person=2|PronType=Prs
3 can can AUX VerbForm=Fin
4 visit visit VERB VerbForm=Inf
5 temples temple NOUN Number=Plur
6 or or CCONJ _
7 shrines shrine NOUN Number=Plur
8 in in ADP _
9 Okinawa Okinawa PROPN Number=Sing
10 . . PUNCT _
Table 12.11 Annotation of the French sentence: À cette époque, on avait dénombré cent quarante
candidats ‘At that time, we had counted one hundred and forty candidates’ (Abeillé et al. 2003;
Abeillé and Clément 2003) following the CoNLL-U format
ID FORM LEMMA UPOS FEATS
1 À à ADP _
2 cette ce DET Gender=Fem|Number=Sing|PronType=Dem
3 époque époque NOUN Gender=Fem|Number=Sing
4 , , PUNCT _
5 on il PRON Gender=Masc|Number=Sing|Person=3
6 avait avoir AUX Mood=Ind|Number=Sing|Person=3
|Tense=Imp|VerbForm=Fin
7 dénombré dénombrer VERB Gender=Masc|Number=Sing
|Tense=Past|VerbForm=Part
8 cent cent NUM NumType=Card
9 quarante quarante NUM NumType=Card
10 candidats candidat NOUN Gender=Masc|Number=Plur
11 . . PUNCT _
The UD corpora are available from GitHub1 and we can collect them using the
techniques described in Sect. 2.14. Then to extract the fields of a CoNLL corpus,
we need to parse its structure. We give here an example program consisting of two
classes for the CoNLL-U format. We can easily adapt it to the slight format changes
used by the different CoNLL tasks. The first class, Token, models the row of a
CoNLL corpus: An annotated word. We just create a subclass of a dictionary:
class Token(dict):
pass
1 https://github.com/UniversalDependencies.
We create a token object by passing a dictionary representing the row and all its
columns, as for instance for the first row in Table 12.9:
tok = Token({’ID’: ’1’, ’FORM’: ’La’, ’LEMMA’: ’el’,
’UPOS’: ’DET’, ’FEATS’:
’Definite=Def|Gender=Fem|Number=Sing|PronType=Art’})
The second class, CoNLLDictorizer, transforms the corpus content into a list
of sentences, where each sentence is a list of Tokens. We follow the transformer
structure of scikit-learn to write this class to integrate it more easily with other
modules of this library. A transformer has two methods, fit() to learn some
parameters, and transform() to apply the transformation. Here, we will only use
the latter. When we create a corpus object, we pass the names of the columns, the
sentence separator, and the column separator, and we store them in the object. Then
transform() splits the corpus into sentences, each sentence is split into rows, and
each row is split into columns. We create a Token of each row.
class CoNLLDictorizer:
def fit(self):
pass
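A minimal sketch of the missing __init__() and transform() methods, consistent with the description above and with the calls CoNLLDictorizer(col_names) and transform(train_sentences) in Chap. 14, could complete the class as follows; the default separators are assumptions:

    def __init__(self, column_names, sent_sep='\n\n', col_sep='\t'):
        self.column_names = column_names
        self.sent_sep = sent_sep
        self.col_sep = col_sep

    def transform(self, corpus):
        # Split the corpus into sentences, each sentence into rows,
        # and each row into columns stored in a Token
        sentences = corpus.strip().split(self.sent_sep)
        return [self._split_in_words(sentence) for sentence in sentences]

    def _split_in_words(self, sentence):
        # Skip the comment lines starting with a # character
        rows = [row for row in sentence.splitlines()
                if row and not row.startswith('#')]
        return [Token(dict(zip(self.column_names, row.split(self.col_sep))))
                for row in rows]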
12.5 Lexicons
A lexicon is a list of words, and in this context, lexical entries are also called
the lexemes. Lexicons often cover a particular domain. Some focus on a whole
language, like English, French, or German, while some specialize in specific areas
such as proper names, technology, science, and finance. In some applications,
lexicons try to be as exhaustive as is humanly possible. This is the case of internet
crawlers, which index all the words of all the web pages they can find. Computerized
lexicons are now embedded in many popular applications such as in spelling
checkers, thesauruses, or definition dictionaries of word processors. They are also
often the first building block of most language processing programs.
Several options can be taken when building a computerized lexicon. They range
from a collection of words—a word list—to words carefully annotated with their
pronunciation, morphology, and syntactic and semantic labels. Words can also be
related together using semantic relationships and definitions.
A key point in lexicon building is that many words are ambiguous both
syntactically and semantically. Therefore, each word may have as many entries as it
has syntactic or semantic readings. Table 12.12 shows words that have two or more
parts of speech and senses.
Computerized lexicons are now available from industry and from sources on the
Internet in English and many other languages. Most notable ones in English include
word lists derived from the Longman Dictionary of Contemporary English (Procter
1978) and the Oxford Advanced Learner’s Dictionary (Hornby 1974). Table 12.13
shows the first lines of letter A of an electronic version of the OALD.
Table 12.13 The first lines of the Oxford Advanced Learner’s Dictionary
Word Pronunciation Syntactic tag Syllable count or verb pattern (for verbs)
a @ S-* 1
a EI Ki$ 1
a fortiori eI ,fOtI’OraI Pu$ 5
a posteriori eI ,p0sterI’OraI OA$,Pu$ 6
a priori eI ,praI’OraI OA$, Pu$ 4
a’s Eiz Kj$ 1
ab initio &b I’nISI@U Pu$ 5
abaci ’&b@saI Kj$ 3
aback @’b&k Pu% 2
abacus ’&b@k@s K7% 3
abacuses ’&b@k@sIz Kj% 4
abaft @’bAft Pu$,T-$ 2
abandon @’b&nd@n H0%,L@% 36A,14
abandoned @’b&nd@nd Hc%,Hd%,OA% 36A,14
abandoning @’b&nd@nIN Hb% 46A,14
abandonment @’b&nd@nm@nt L@% 4
abandons @’b&nd@nz Ha% 36A,14
abase @’beIs H2% 26B
abased @’beIst Hc%,Hd% 26B
abasement @’beIsm@nt L@% 3
Table 12.14 An excerpt from BDLex. Digits encode accents on letters. The syntactical tags of the
verbs correspond to their conjugation type taken from the Bescherelle reference
Entry Part of speech Lemma Syntactic tag
a2 Prep a2 Prep_00_00;
abaisser Verbe abaisser Verbe_01_060_**;
abandon Nom abandon Nom_Mn_01;
abandonner Verbe abandonner Verbe_01_060_**;
abattre Verbe abattre Verbe_01_550_**;
abbe1 Nom abbe1 Nom_gn_90;
abdiquer Verbe abdiquer Verbe_01_060_**;
abeille Nom abeille Nom_Fn_81;
abi3mer Verbe abi3mer Verbe_01_060_**;
abolition Nom abolition Nom_Fn_81;
abondance Nom abondance Nom_Fn_81;
abondant Adj abondant Adj_gn_01;
abonnement Nom abonnement Nom_Mn_01;
abord Nom abord Nom_Mn_01;
aborder Verbe aborder Verbe_01_060_**;
aboutir Verbe aboutir Verbe_00_190_**;
aboyer Verbe aboyer Verbe_01_170_**;
abre1ger Verbe abre1ger Verbe_01_140_**;
abre1viation Nom abre1viation Nom_Fn_81;
abri Nom abri Nom_Mn_01;
abriter Verbe abriter Verbe_01_060_**;
dictionary from 1905 (Bohbot et al. 2018) and Diderot’s Encyclopédie from 1751–
1772 (Guilbaud 2017).
Letter trees (de la Briandais 1959) or tries (pronounce try ees) are useful data
structures to store large lexicons and to search words quickly. The idea behind a
trie is to store the words as trees of characters and to share branches as far as the
letters of two words are identical. Tries can be seen as finite-state automata, and
Fig. 12.1 shows a graphical representation of a trie encoding the words bin, dark,
dawn, tab, table, tables, and tablet.
In Python, we can represent this trie as embedded lists, where each branch is a
list. The first element of a branch is the root letter: the first letter of all the subwords
that correspond to the branch. The leaves of the trie are the lexical entries, here the
words themselves. Of course, these entries could contain more information, such as
the part of speech, the pronunciation, etc.
Fig. 12.1 A letter tree encoding the words bin, dark, dawn, tab, table, tables, and tablet
[
[’b’, [’i’, [’n’, ’bin’]]],
[’d’, [’a’, [’r’, [’k’, ’dark’]],
[’w’, [’n’, ’dawn’]]]],
[’t’, [’a’, [’b’, ’tab’,
[’l’, [’e’, ’table’,
[’s’, ’tables’],
[’t’, ’tablet’]]]]]]
]
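With this representation, looking up a word amounts to descending the branches letter by letter. A possible sketch (assuming the embedded list above is stored in a trie variable) is:

def lookup(trie, word):
    # A trie is a list of branches; a branch is [letter, child, child, ...],
    # where a child is either a sub-branch (a list) or a lexical entry (a string)
    branches = trie
    for letter in word:
        match = next((branch for branch in branches
                      if isinstance(branch, list) and branch[0] == letter),
                     None)
        if match is None:
            return None
        branches = match[1:]
    # After the last letter, the entry, if any, is a string among the children
    return next((child for child in branches if isinstance(child, str)), None)

For instance, lookup(trie, 'table') returns 'table', while lookup(trie, 'tabl') returns None.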
12.6 Morphology
12.6.1 Morphemes
Fig. 12.2 Concatenative morphology where prefixes and suffixes are concatenated to the stem
Fig. 12.3 Embedding of the stem into the grammatical morphemes in the German verb sangst (second-person preterit of singen). After Simone (2007, p. 144)
Fig. 12.4 Embedding of the stem into the grammatical morphemes in the German verb gesungen (past participle of singen). After Simone (2007, p. 144)
12.6.2 Morphs
Morphology can be classified into inflection, the form variation of a word according
to syntactic features such as gender, number, person, tense, etc., and derivation,
the creation of a new word—a new meaning—by concatenating a word with a
specific affix. A last form of construction is the composition (compounding) of two
words to give a new one, for instance, part of speech, can opener, pomme de terre.
Composition is more obvious in German, where such new words are not separated
with a space, for example, Führerschein. In English and French, some words are
formed in this way, such as bedroom, or are separated with a hyphen, centre-ville.
However, the exact determination of other compounded words—separated with a
space—can be quite tricky.
Inflection
Derivation
generated as simply, because the word does not exist or sounds weird. In addition,
some affixes cannot be mapped to clear semantic features.
Compounding is a feature of German, Dutch, and the Scandinavian languages.
It resembles the English noun sequences with the difference that nouns are not
separated with a white space.
Morphological Processing
Table 12.22 Open class word morphology, where * denotes zero or more elements and ? denotes
an optional element
English and French prefix* stem suffix* inflection?
German inflection? prefix* stem* suffix* inflection?
French Marche
1. Une marche dans la forêt        1. marche: noun, singular, feminine
2. Il marche dans la cour          2. marcher: verb, present, third-person singular
German Lauf
1. Der Lauf der Zeit               1. Der Lauf: noun, sing, masc
2. Lauf schnell!                   2. laufen: verb, imperative, singular
be appended to the word (Table 12.22). As we saw earlier, these rules are general
principles of concatenative morphology that have exceptions.
Ambiguity
function in the sentence. As we saw in the introduction, this process has been done
by generations of pupils dating as far back as the schools of ancient Greece and the
Roman Empire.
Paper lexicons do not include all the words of a language but only lemmas. Each
lemma is fitted with a morphological class to relate it to a model of inflection
or possible exceptions. A French verb will be given a class of conjugation or its
exception pattern—one among a hundred. English or German verbs will be marked
as regular or strong and in this latter case will be given their irregular forms. Then,
a reader can apply morphological rules to produce all the lexical forms of the
language.
Automatic morphological processing tries to mimic this human behavior. Never-
theless, it has not been so widely implemented in English as in other languages.
Programmers have often preferred to pack all the English words into a single
dictionary instead of implementing a parser to do the job. This strategy is possible
for European languages because morphology is finite: there is a finite number of
noun forms, adjective forms, or verb forms. It is clumsy, however, to extend it to
languages other than English because it considerably inflates the size of dictionaries.
Statistics from Xerox (Table 12.24) show that techniques available for storing
English words are very costly for many other languages. It is not a surprise
that the most widespread morphological parser—KIMMO—was originally built
for Finnish, one of the most inflection-rich languages. In addition, while English
inflection is tractable by means of storing all the forms in a lexicon, it is often
necessary to resort to a morphological parser to deal with forms such as: computer,
computerize, computerization, recomputerize (Antworth 1994), which cannot all be
foreseen by lexicographers.
Lexical:  d i s e n t a n g l e +Verb +PastBoth +123SP
Surface:  d i s e n t a n g l 0 0     e         d

Lexical:  h a p p y +Adj +Comp
Surface:  h a p p i e    r

Lexical:  G r u n d +Noun +Masc +Pl +NomAccGen
Surface:  G r ü n d 0     0     0   e
morphs. For example, the Xerox parser (Beesley and Karttunen 2003) output for
disentangled, happier, and Gründe is:
disentangle+Verb+PastBoth+123SP
happy+Adj+Comp
Grund+Noun+Masc+Pl+NomAccGen
where the feature +Verb denotes a verb, +PastBoth, either past tense or past
participle, and +123SP any person, singular or plural; +Adj denotes an adjective
and +Comp, a comparative; +Noun denotes a noun, +Masc masculine, +Pl, plural,
and +NomAccGen either nominative, accusative, or genitive. (All these forms are
ambiguous, and the Xerox parser shows more than one interpretation per form.)
Given these new lexical forms, the parser has to align the feature symbols with
letters or null symbols. The principles do not change, however (Fig. 12.5).
Morphological FSTs encode the lexicon and express all the legal transitions. Arcs
are labeled with pairs of symbols representing letters of the surface form—the
word—and the lexical form—the set of morphs.
Table 12.27 shows the future tense of regular French verb chanter ‘sing’, where
suffixes are specific to each person and number, but are shared by all the verbs of the
so-called first group. The first group accounts for the large majority of French verbs.
Table 12.28 shows the aligned forms and Fig. 12.7 the corresponding transducer.
The arcs are annotated by the input/output pairs, where the left symbol corresponds
to the lexical form and the right one to the surface form. When the lexical and
surface characters are equal, as in c:c, we just use a single symbol in the arc.
Fig. 12.8 A finite-state transducer describing the future tense of French verbs of the first group
This transducer can be generalized to any regular French verb of the first group
by removing the stem part and inserting a self-looping transition on the first state
(Fig. 12.8).
The transducer in Fig. 12.8 also parses and generates forms that do not exist.
For instance, we can forge an imaginary French verb *palimoter that still can be
conjugated by the transducer. Conversely, the transducer will successfully parse the
Table 12.29 Future tense of Italian verb cantare and Spanish and Portuguese verbs cantar, ‘sing’
Language Number.\Person First Second Third
Italian Singular canterò canterai canterà
Plural canteremo canterete canteranno
Spanish Singular cantaré cantarás cantará
Plural cantaremos cantaréis cantarán
Portuguese Singular cantarei cantarás cantará
Plural cantaremos cantareis cantarão
The transducer we created for the conjugation of French verbs can be easily
transposed to other Romance languages such as Italian, Spanish, or Portuguese, as
shown in Table 12.29.
12.7.6 Ambiguity
In the transducer for future tense, there is no ambiguity. That is, a surface form has
only one lexical form with a unique final state. This is not the case with the present
tense (Table 12.30), and
(je) chante ‘I sing’
(il) chante ‘he sings’
have the same surface form but correspond, respectively, to the first- and third-
person singular.
This corresponds to the transducer in Fig. 12.9, where final states 5 and 7 are the
same.
Fig. 12.9 A finite-state transducer encoding the present tense of verbs of the first group
Dionysius Thrax fixed the parts of speech for Greek in the second century BCE.
They have not changed since and his grammar is still interesting to read, see Lallot
(1998). A short and readable introduction in French to the history of parts of speech
is Ducrot and Schaeffer (1995).
Accounts on finite-state morphology can be found in Sproat (1992) and Ritchie
et al. (1992). Roche and Schabes (1997) is a useful book that describes fundamental
algorithms and applications of finite-state machines in language processing, espe-
cially for French. Kornai (1999) covers other aspects and languages. Kiraz (2001) focuses on the morphology of Semitic languages: Syriac, Arabic, and Hebrew. Beesley and Karttunen (2003) is an extensive description of the two-level model in relation to the historical Xerox tools.
General-purpose finite-state transducer toolkits are available online. They include the FSA utilities (van Noord and Gerdemann 2001), the FSM library (Mohri
et al. 1998) and its follower, OpenFst.2
Although the lemmatizers in the chapter used transducers and rules, it is possible
to formulate lemmatization with classifiers that we can train on annotated corpora.
See Chrupała (2006) for an interesting account on these techniques and Björkelund
et al. (2010) for an implementation.
2 https://www.openfst.org/.
Chapter 13
Subword Segmentation
In Chap. 12, Words, Parts of Speech, and Morphology, we used rules to split the
words into morphemes. Figures in Table 12.24 show that this considerably reduces the size of the lexicon. Nonetheless, writing a morphological parser requires linguistic knowledge that is sometimes difficult to formulate and implement. In this chapter, we will deal with automatic techniques to extract subwords from a corpus, often matching morphemes, and enabling us to set a size limit on the lexicon.
We will first examine an algorithm that identifies morphemes using elementary
statistics on the character distribution without any knowledge of linguistic rules.
Then, we will study three other automatic methods, also based on statistics, to
derive subwords, namely byte-pair encoding (Gage, 1994; Sennrich et al., 2016),
WordPiece (Schuster & Nakajima, 2012), and Unigram/SentencePiece (Kudo, 2018;
Kudo & Richardson, 2018).
Schuster and Nakajima (2012) applied their lexical model in voice search for
Japanese and Korean. Their vocabulary consisted of 200,000 subwords shared
between the two languages. Sennrich et al. (2016) showed that BPE could improve
automatic translation while Wu et al. (2016) is another example of subword
tokenization in translation.
Table 13.1 The most frequent morphemes
English  -e -s -ed -ing -al -ation -ly -ic -ent
French   -s -e -es -ent -er -ds -re -ation -ique
German   -en -e -te -ten -er -es -lich -el
Turkish  -m -in -lar -ler -dan -den -inl -ml
Swahili  -wa -ia -u -eni -o -isha -ana -we
         wa- m- ku- ali- ni- aka- ki- vi-
1. For the first step, let us begin with the word prefixes. The algorithm generates
all the word prefixes starting with one character and, for each prefix, computes
the distribution of the characters following it. If the number of characters in
this distribution is greater than half the alphabet size and the most frequent
next character represents less than 50% of the distribution, then the prefix is a
morpheme. We apply the same algorithm to the word suffixes by reversing the
words.
2. Next, we complement the list of morphemes by splitting each word into a prefix
and a suffix. For a given prefix, if more than half of the suffixes belong to
the morphemes discovered in step 1, we consider the rest of the suffixes as
morphemes too. We also apply this rule to a given suffix to find the prefixed
morphemes. Both steps use a cutoff frequency; a sketch of step 1 is given below.
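The following lines are a sketch of step 1 for the prefixes; they assume the corpus vocabulary is given as a list of words and leave out the frequency cutoff:

from collections import Counter

def prefix_morphemes(words, alphabet):
    # A prefix is kept as a morpheme when the characters that follow it
    # are both numerous (more than half the alphabet) and evenly
    # distributed (the most frequent one below 50%)
    morphemes = set()
    max_len = max(len(word) for word in words)
    for prefix_len in range(1, max_len):
        successors = {}           # prefix -> Counter of the next characters
        for word in words:
            if len(word) > prefix_len:
                prefix = word[:prefix_len]
                successors.setdefault(prefix, Counter())[word[prefix_len]] += 1
        for prefix, dist in successors.items():
            total = sum(dist.values())
            if (len(dist) > len(alphabet) / 2
                    and max(dist.values()) / total < 0.5):
                morphemes.add(prefix)
    return morphemes

Applying the same function to the reversed words yields the suffix morphemes.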
Table 13.1 shows the most frequent morphemes Déjean (1998) found in a few
languages.
Using these lists, we segment a word with the longest match algorithm. We
implement it with a regular expression consisting of a disjunction of morphemes. To
extract the suffixes, we need to reverse the corresponding strings. In the following
code, we build a list of the English morphemes in Table 13.1 and we reverse them:
suffix_morphemes = [’-e’, ’-s’, ’-ed’, ’-ing’,
’-al’, ’-ation’, ’-ly’, ’-ic’, ’-ent’]
rev_morphemes = [’^’ + suffix[1:][::-1] for
suffix in suffix_morphemes]
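One possible way to apply these reversed morphemes with a longest match, which we can try on celebration, is the following sketch:

import regex as re

def segment(word, rev_morphemes):
    # Try the longest reversed morphemes first to obtain a longest match
    pattern = '|'.join(sorted(rev_morphemes, key=len, reverse=True))
    match = re.match(pattern, word[::-1])
    if match:
        cut = len(match.group())        # length of the matched suffix
        return [word[:-cut], word[len(word) - cut:]]
    return [word]

>>> segment('celebration', rev_morphemes)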
[’celebr’, ’ation’]
13.2 Byte-Pair Encoding
While Déjean (1998) designed the previous method with natural language pro-
cessing in mind, Gage (1994) created byte-pair encoding (BPE) as a compression
algorithm. His program reads a data sequence in the form of bytes and replaces the
most frequent adjacent pair of bytes with a single byte not in the original data. This
process repeats recursively and stores each pair it found with its replacement in a
table. We can restore the original data from the table.
Sennrich et al. (2016) adapted the original BPE algorithm to automatically build a lexicon of subwords from a corpus. Each subword consists of a single character or a sequence of characters, possibly a whole word, and the size of the lexicon is fixed in advance. The main steps of the algorithm are:
1. Split the corpus into individual characters. These characters will be the initial
subwords and will make up the start vocabulary:
2. Then:
(a) Extract the most frequent adjacent pair from the corpus;
(b) Merge the pair and add the corresponding subword to the vocabulary;
(c) Replace all the occurrences of the pair in the corpus with the new subword;
(d) Repeat this process until we have reached the desired vocabulary size.
13.2.2 Pretokenization
BPE normally does not cross the whitespaces. This means that we can speed up the
learning process with a pretokenization of the corpus and a word count.
The simplest pretokenization uses the whitespaces as delimiters, where we match
the words, the numbers, and the rest, excluding the whitespaces, with this regular
expression as in Sect. 9.2.1:
pattern = r’\p{L}+|\p{N}+|[^\s\p{L}\p{N}]+’
To apply it to a corpus, we first read it as a string and store it in the text variable.
We then pretokenize it with the following code, where we keep the word positions,
as in Sect. 9.5.2:
import regex as re
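# The original statements are not shown in this extract; a sketch keeping
# each match and its span with re.finditer() could be:
words = [(match.group(), match.span())
         for match in re.finditer(pattern, text)]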
Applying this code to a corpus consisting of the Iliad and the Odyssey, we obtain
the words:
[(’BOOK’, (0, 4)), (’I’, (5, 6)), (’The’, (10, 13)),
(’quarrel’, (14, 21)), (’between’, (22, 29)),
(’Agamemnon’, (30, 39)), (’and’, (40, 43)),
(’Achilles’, (44, 52)), ...]
The word positions would enable us to restore the original text. Nonetheless,
in the rest of this section, we set them aside to simplify the code and we define a
pretokenize() function with findall() instead:
def pretokenize(pattern, text):
return re.findall(pattern, text)
In the rest of the implementation, we will use a class to encapsulate the BPE
functions. So far, the class only contains the pretokenization and we define the
pretokenization pattern when we create a BPE object:
class BPE():
def __init__(self):
self.pattern = r’\p{L}+|\p{N}+|[^\s\p{L}\p{N}]+’
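The pretokenize() method, called as self.pretokenize(text) in the following sections, is not shown here; a minimal version would simply wrap findall():

    def pretokenize(self, text):
        return re.findall(self.pattern, text)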
We will add the other functions we need along with their description in the next
sections.
The initial vocabulary consists of the individual characters from which we build
iteratively the subwords. For this, we create a dictionary, words_bpe, where we
associate the words with their subwords, starting with these characters. As key, we
use the word string and as value, a dictionary with the word frequency, freq, and the
list of subwords in construction, swords.
We count the words from the pretokenized text with the Counter class:
>>> from collections import Counter
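>>> # The original statements are not shown in this extract; a sketch
>>> # counting the pretokenized words and listing the five most frequent:
>>> word_cnts = Counter(pretokenize(pattern, text))
>>> word_cnts.most_common(5)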
yielding:
[(’,’, 19518), (’the’, 15311), (’and’, 11521),
(’of’, 8677), (’.’, 6870)]
Using word_cnts, we create a new function to initialize the subwords that we add
to the class. We also extract the set of characters from the corpus that we assign to
self.vocab:
def _bpe_init(self, text):
words = self.pretokenize(text)
word_cnts = Counter(words)
self.words_bpe = {
word: {’freq’: freq,
’swords’: list(word)}
for word, freq in word_cnts.items()}
self.vocab = list(
set([char for word in self.words_bpe
for char in self.words_bpe[word][’swords’]]))
Running:
>>> bpe = BPE()
>>> bpe._bpe_init(text)
we have the dictionary entry:
>>> bpe.words_bpe[’her’]
{’freq’: 1147, ’swords’: [’h’, ’e’, ’r’]}
Once we have passed this initial step, we can implement the vocabulary
construction. This is a loop, where at a given iteration, we update the swords value
by merging the most frequent pair of subwords.
We count the adjacent pair of symbols, either characters or subwords, with this
function:
def _count_bigrams(self):
self.pair_cnts = Counter()
for word_dict in self.words_bpe.values():
swords = tuple(word_dict[’swords’])
freq = word_dict[’freq’]
for i in range(len(swords) - 1):
self.pair_cnts[swords[i:i + 2]] += freq
At the first iteration, we have:
>>> bpe = BPE()
>>> bpe._bpe_init(text)
>>> bpe._count_bigrams()
>>> max(bpe.pair_cnts, key=bpe.pair_cnts.get)
(’h’, ’e’)
We merge the pair (’h’, ’e’) in all the subwords of words_bpe so that
[’h’, ’e’, ’r’]
becomes:
[’he’, ’r’]
We apply this operation with the function below, where the pair and the subwords
are lists. We traverse the subwords and we extend a new list with its items or the pair
if it matches two adjacent items:
def _merge_pair(self, pair, swords):
pair_str = ’’.join(pair)
i = 0
temp = []
while i < len(swords) - 1:
if pair == swords[i:i + 2]:
temp += [pair_str]
i += 2
else:
temp += [swords[i]]
i += 1
if i == len(swords) - 1:
temp += [swords[i]]
swords = temp
return swords
We can now build the list of merges with a function that gets the most frequent
adjacent pair of subwords, merges it in all the words_bpe dictionaries, and repeats
the process until we have reached the predetermined vocabulary size. Let us call this
function fit(). When it terminates, we add the pairs to the initial set of characters
with _build_vocab():
def fit(self, text):
self._bpe_init(text)
self.merge_ops = []
for _ in range(self.merge_cnt):
self._count_bigrams()
self.best_pair = max(self.pair_cnts,
key=self.pair_cnts.get)
merge_op = list(self.best_pair)
self.merge_ops.append(merge_op)
for word_dict in self.words_bpe.values():
word_dict[’swords’] = self._merge_pair(
merge_op,
word_dict[’swords’])
self._build_vocab()
def _build_vocab(self):
swords = list(map(lambda x: ’’.join(x), self.merge_ops))
self.vocab += swords
Before we can run this function, we need to modify the __init__() function to give
the number of merges:
def __init__(self, merge_cnt=200):
self.pattern = r’\p{L}+|\p{N}+|[^\s\p{L}\p{N}]+’
self.merge_cnt = merge_cnt
With the default of 200 merges and our corpus of Homer’s works, the first five merges in bpe.merge_ops[:5] are:
[[’h’, ’e’], [’t’, ’he’], [’a’, ’n’], [’i’, ’n’], [’o’, ’u’]]
13.2.7 Encoding
Once we have derived a list of merge operations, we can apply these operations to
the characters of a word in the same order we created them:
def encode(self, word):
swords = list(word)
for op in self.merge_ops:
swords = self._merge_pair(op, swords)
return swords
13.2.8 Tokenizing
To tokenize a text, we first apply a pretokenization. We then encode all its words.
To speed up the merges, we use a cache:
def tokenize(self, text):
tokenized_text = []
cache = {}
words = self.pretokenize(text)
for word in words:
if word not in cache:
cache[word] = self.encode(word)
subwords = cache[word]
tokenized_text += subwords
return tokenized_text
We can now tokenize a whole text, for instance this quote from Virgil:1
Exiled from home am I; while, Tityrus, you
Sit careless in the shade
bpe = BPE()
bpe.fit(text)
ecloges_str = """Sit careless in the shade"""
>>> bpe.tokenize(ecloges_str)
[’S’, ’it’, ’c’, ’are’, ’le’, ’s’, ’s’, ’in’, ’the’, ’s’,
’had’, ’e’]
In the tokenized string of the last paragraph, we have lost track of the visual word
separators. This makes it difficult to read and can be a bit confusing. GPT2 (Radford
et al., 2019) provides a solution to this with a pretokenization that creates initial
tokens with a prefixed whitespace and then replaces the whitespaces with the Ġ
character.
This idea is easy to implement. We just need to modify the pretokenization
pattern so that it matches one possible leading whitespace:
pattern = r’ ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+’
1 Nos patriam fugimus; tu, Tityre, lentus in umbra, Virgil, Eclogues, Rhoades, James, translator.
We run the program again on the corpus with this pretokenization and we obtain
new merge rules. We then apply these rules to tokenize Virgil’s verse yielding:
[’S’, ’it’, ’Ġc’, ’a’, ’re’, ’l’, ’ess’, ’Ġin’, ’Ġthe’,
’Ġsh’, ’ad’, ’e’]
So far, we have used the Unicode characters as our initial vocabulary. As Unicode
has about 150,000 characters, this can lead to a very large lexicon if we want to train
BPE on any kind of text, in any language. That is why the GPT-2 tokenizer uses the
original byte-level encoding of Gage (1994) and restricts the initial symbol set to
the 256 possible bytes.
To make the bytes printable, the GPT-2 pretokenization replaces all the ASCII
and Latin-1 control or nonprintable characters with a codepoint shifted by 256.
The codepoint ranges of these characters are [0, 32] and [127, 160], plus the soft-
hyphen, 173, see Tables 4.1 and 4.3. In total, there are 68 characters that we number
from 0 to 67. We shift them with the function chr(n + 256), where n is their rank, 0
to 67. For the whitespace used in the previous section, we have:
>>> chr(ord(’ ’) + 256)
’Ġ’
While this technique results in a smaller lexicon, it impairs the legibility of non-
ASCII characters, and thus of languages outside English. For instance, the UTF-8
codes of letters like é or ä consist of two bytes and are rendered as 'é' and 'ä'. Indic
or Chinese characters lead to three or four such bytes.
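As an illustration of this shift, a sketch of a mapping from the 256 byte values to printable characters, consistent with the ranges above (the function name is ours, not GPT-2's), is:

def byte_to_printable():
    # Bytes rendered as themselves: printable ASCII and Latin-1,
    # minus the soft hyphen (173)
    keep = set(range(33, 127)) | set(range(161, 173)) | set(range(174, 256))
    mapping = {}
    n = 0
    for b in range(256):
        if b in keep:
            mapping[b] = chr(b)
        else:
            mapping[b] = chr(256 + n)    # the nth nonprintable byte
            n += 1
    return mapping

>>> byte_to_printable()[ord(' ')]
'Ġ'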
13.3 The WordPiece Tokenizer
WordPiece selects, at each step, the pair of adjacent symbols x and y whose merge most increases the likelihood of a unigram language model over the corpus. With C(xy) the number of occurrences of the pair in the corpus of symbols x_1, x_2, ..., x_N, the log likelihood before the merge is

\sum_{\substack{i=1 \\ (x_i, x_{i+1}) \neq (x, y)}}^{N} \log P(x_i) + C(xy)\,(\log P(x) + \log P(y))

and, after merging x and y into the new symbol xy, it becomes

\sum_{\substack{i=1 \\ (x_i, x_{i+1}) \neq (x, y)}}^{N} \log P(x_i) + C(xy)\,\log P(xy).
To build the lexicon, we proceed iteratively as with BPE and, at each step, we select the pair that most improves the criterion above.
We count the pretokenized words and we store them in a data structure identical
to that of BPE. As with BPE, we keep track of the start of a word, this time with the
'▁' (U+2581) character.
The WordPiece code to pretokenize a text and count the words is very similar to
that of BPE. We also extract the set of characters from the corpus. This forms the
initial vocabulary:
class WordPiece():
def __init__(self, merge_cnt=200):
self.pattern = r’\p{P}|[^\s\p{P}]+’
self.merge_cnt = merge_cnt
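The _wp_init() method, analogous to _bpe_init() in Sect. 13.2, is not shown here. A sketch that prefixes each pretokenized word with the '▁' marker, and assumes Counter and the regex module are imported as before, might be:

    def pretokenize(self, text):
        return re.findall(self.pattern, text)

    def _wp_init(self, text):
        words = ['▁' + word for word in self.pretokenize(text)]
        word_cnts = Counter(words)
        self.words_wp = {
            word: {'freq': freq,
                   'swords': list(word)}
            for word, freq in word_cnts.items()}
        self.vocab = list(
            set([char for word in self.words_wp
                 for char in self.words_wp[word]['swords']]))

Running:

>>> wp = WordPiece()
>>> wp._wp_init(text)
>>> wp.words_wp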
this yields:
{'▁BOOK': {'freq': 24, 'swords': ['▁', 'B', 'O', 'O', 'K']},
 '▁I': {'freq': 3202, 'swords': ['▁', 'I']},
 '▁The': {'freq': 548, 'swords': ['▁', 'T', 'h', 'e']},
 '▁quarrel': {'freq': 28, 'swords': ['▁', 'q', 'u', 'a',
                                     'r', 'r', 'e', 'l']},
...
Given a pair of adjacent symbols, we can now calculate the model likelihood before
they are merged and after it. We do this for all the pairs:
def _calc_pair_gains(self):
sword_cnts = Counter()
self.pair_gains = Counter()
for word_dict in self.words_wp.values():
subwords = tuple(word_dict[’swords’])
freq = word_dict[’freq’]
for i in range(len(subwords) - 1):
sword_cnts[subwords[i]] += freq
self.pair_gains[subwords[i:i + 2]] += freq
sword_cnts[subwords[len(subwords) - 1]] += freq
pair_cnt = sum(self.pair_gains.values())
sword_cnt = sum(sword_cnts.values())
for pair in self.pair_gains:
self.pair_gains[pair] *= (
log(self.pair_gains[pair]/pair_cnt)
- log(sword_cnts[pair[0]]/sword_cnt)
- log(sword_cnts[pair[1]]/sword_cnt))
We rank the pairs by decreasing order of improvement. For Homer’s Iliad and
Odyssey, this gives:
>>> wp = WordPiece()
>>> wp._wp_init(text)
>>> wp._calc_pair_gains()
>>> sorted(wp.pair_gains, key=wp.pair_gains.get, reverse=True)
We select the best pair from this list and merge all its occurrences in the subwords
of self.words_wp. We use the same function as with BPE:
_merge_pair(self, pair, swords)
We repeat this operation in a loop, where, at each iteration, we select the pair that
improves most the language model and we merge it in all the words. We stop the
loop when we have reached the predefined number of subwords, here 200:
def fit(self, text):
self._wp_init(text)
self.merge_ops = []
for _ in range(self.merge_cnt):
self._calc_pair_gains()
self.best_pair = max(self.pair_gains,
key=self.pair_gains.get)
merge_op = list(self.best_pair)
self.merge_ops.append(merge_op)
for word_dict in self.words_wp.values():
word_dict[’swords’] = self._merge_pair(merge_op,
word_dict[’swords’])
self._build_vocab()
def _build_vocab(self):
swords = list(map(lambda x: ’’.join(x), self.merge_ops))
self.vocab += swords
Once we have extracted the subwords, we add them to the initial symbols to build
the final vocabulary: self.vocab.
For Homer’s works, the first merges are:
>>> wp = WordPiece()
>>> wp.fit(text)
>>> wp.merge_ops
[[’t’, ’h’], [’th’, ’e’], [’a’, ’n’], [’an’, ’d’], ...]
13.3.4 Encoding
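The construction of the longest-match regular expression, sword_regex, used below is not shown here. A sketch, where we sort the vocabulary by decreasing length so that the regex engine prefers the longest subwords, might be:

    def _build_sword_regex(self):
        # Alternation of the vocabulary subwords, longest first;
        # the unknown token matches the output shown below
        swords = sorted(self.vocab, key=len, reverse=True)
        self.sword_regex = '|'.join(map(re.escape, swords))
        self.unk_word = '[UNK]'

We would call such a method at the end of fit(), once the vocabulary is complete.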
The match can fail to segment a word if it contains characters not in the
vocabulary. For example, in the corpus we used, there is no é and the segmentation
of touché results in:
>>> re.findall(wp.sword_regex, '▁touché')
['▁to', 'u', 'ch']
We can now write a complete encoding function that checks that the segmentation
does not lose any character:
def encode(self, word):
subwords = re.findall(self.sword_regex, word)
if ’’.join(subwords) != word:
# some subwords are not in the vocabulary
subwords = [self.unk_word]
return subwords
We have
>>> wp.encode('▁Therefore')
['▁The', 're', 'fore']
>>> wp.encode('▁touché')
['[UNK]']
13.3.5 Tokenization
def bert_wp(swords):
    i = 0
    while i < len(swords) - 1:
        if swords[i] == '▁':
            swords = swords[:i] + \
                [''.join([swords[i], swords[i + 1]])] + swords[i + 2:]
        i += 1
    return [sword[1:] if sword[0] == '▁' else '##' + sword
            for sword in swords]
and we have:
>>> bert_wp(swords)
[’S’, ’##it’, ’c’, ’##ar’, ’##e’, ’##le’, ’##s’, ’##s’,
’in’, ’the’, ’sh’, ’##ad’, ’##e’, ’!’]
It is often the case that we can segment a word in multiple ways. For example,
given the vocabulary V = {t, h, e, th, he, the}, the word the has four possible
tokenizations:
[’t’, ’h’, ’e’]
[’th’, ’e’]
[’t’, ’he’]
[’the’]
BPE and WordPiece always produce the same results as they use a deterministic
algorithm, either by applying a list of merge operations or by finding the longest
matches. In this section, we will describe an algorithm that maximizes a unigram
language model (Kudo, 2018). This means that, for a given word, out of all the
possible segmentations, the tokenizer keeps the one that has the maximal likelihood.
As with BPE, we start with a pretokenization into words. Then, given a vocabulary V of subwords and a word w, the unigram tokenization of w is the subword sequence sw_1, sw_2, ..., sw_n, with each sw_i ∈ V, that maximizes

\prod_{i=1}^{n} P(sw_i)
For the word the and the vocabulary above, the tokenizer computes the probabilities of the four segmentations:
1. P(t) × P(h) × P(e),
2. P(th) × P(e),
3. P(t) × P(he),
4. P(the),
and selects the highest probability. These possible tokenizations correspond to the
transitions shown in Fig. 13.1.
To prevent an arithmetic underflow, in our implementation, as in Sect. 10.3,
we will replace the product with a negative sum of logarithms, the negative log-
likelihood (NLL):
\sum_{i=1}^{n} -\log P(sw_i)
For this description, we can create our initial unigram class with a pretokenization
that will use this pattern:
words = re.findall(r’\p{P}|[^\s\p{P}]+’, text)
class Unigram():
def __init__(self, uni_probs):
self.uni_probs = uni_probs
self.pattern = r’\p{P}|[^\s\p{P}]+’
The rest of the class is the same as before. To derive the initial subword probabilities, we first fit a BPE model and tokenize the corpus with it:
bpe = BPE()
bpe.fit(text)
tokens = bpe.tokenize(text)
We then count the subword tokens and create a dictionary with the negative logarithms of their relative frequencies. We use this function:
def calc_nll(tokens):
token_cnts = Counter(tokens)
total_cnt = token_cnts.total()
uni_probs = {token: -log(cnt/total_cnt) for
token, cnt in token_cnts.items()}
return uni_probs
These values will suffice to run a first iteration of the unigram segmentation. We will
describe the next steps of expectation-maximization in Sect. 13.4.4 when we have
tokenized a text with the initial frequencies.
Now that we have devised a way to build an initial vocabulary, let us dive into the
word tokenization. We will examine two techniques: one with a brute-force search of
all the possible segmentations and a more frugal method with the Viterbi algorithm.
Brute-Force Search
Given a word, a brute-force algorithm will generate all the possible sequences of
substrings. For this, we model the splits with Booleans that we insert between the
characters. In Fig. 13.1, the intercharacter positions would correspond to the nodes between the start and the end nodes, i.e., q1 and q2. We then split the string between two characters if the Boolean value is 1.
By varying all the values, we enumerate all the possible splits. Table 13.2 shows the subwords of the word the with the four possible values of (q1, q2).
Let us now write the split_word() function that splits a string according to
splitpoints given as a string of binary digits. We just go through the split points
and create a new list if the value is 1:
@staticmethod
def split_word(string, splitpoints):
subwords = []
prev_sp = 0
for i, sp in enumerate(splitpoints, start=1):
if sp == ’1’:
subword = string[prev_sp:i]
prev_sp = i
subwords.append(subword)
subword = string[prev_sp:]
subwords.append(subword)
return subwords
Let us incorporate this function as a static method in the Unigram class. For the word there, we model the split points as a string of four Booleans, for example '0110':
>>> Unigram.split_word(’there’, ’0110’)
[’th’, ’e’, ’re’]
To produce the strings of binary digits, such as ’0110’, we will use Python’s
formatted string literals. These literals consist of an f prefix and a content enclosed
in curly braces, f’{number:formatting}’. The content is a number with formatting
specifications. The format specifier is a mini-language and has many options. Here
we will use the literal: f’{number:0{number_of_bits}b}’ telling to format number on
the specified number of binary digits and padded with zeros.
For a word with n characters, we have 2^{n-1} sequences of binary digits corresponding to the possible split points. This is a variant of the powerset, here applied to subsequences. To create the segmentations, we generate all the numbers between 0 and 2^{n-1} encoded on n − 1 bits and we call Unigram.split_word():
word = ’there’
cnt_sp = len(word) - 1
candidates = []
for i in range(2**cnt_sp):
splitpoints = f’{i:0{cnt_sp}b}’
candidates += [Unigram.split_word(word, splitpoints)]
candidates
resulting in
[[’there’],
[’ther’, ’e’],
[’the’, ’re’],
[’the’, ’r’, ’e’],
...
[’t’, ’h’, ’e’, ’r’, ’e’]]
Now that we have these sequences, we can compute the maximal likelihood
of a word segmentation, or here the minimal sum of logarithms. Our brute-force
tokenization generates all the segmentations, evaluates their likelihoods, and returns
the most likely one. We limit the length of a word to 20 characters.
def encode(self, word):
cnt_sp = len(word) - 1
if cnt_sp > 20:
return list(word)
candidates = []
for i in range(2**cnt_sp):
splitpoints = f’{i:0{cnt_sp}b}’
candidates += [Unigram.split_word(word, splitpoints)]
return min(
[(cand,
sum(map(lambda x: self.uni_probs.get(x, 1000), cand))
) for cand in candidates],
key=lambda x: x[1])
When a segment is not in the vocabulary, we assign it a very unlikely value: 1000.
We then apply the tokenization to a text with the method:
def tokenize(self, text):
cache = {}
nll_text = 0.0
tokenized_text = []
words = self.pretokenize(text)
for word in words:
if word not in cache:
cache[word] = self.encode(word)
subwords, nll = cache[word]
tokenized_text += subwords
nll_text += nll
return tokenized_text, nll_text
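As a possible usage, assuming tokens holds the BPE subwords of the corpus computed above, we can create a Unigram object from their negative log probabilities and tokenize a string:

uni_probs_bpe = calc_nll(tokens)
unigram = Unigram(uni_probs_bpe)
subwords, nll = unigram.tokenize('Sit careless in the shade')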
Viterbi Algorithm
Our second algorithm will start with an empty list and incrementally add subwords.
It will evaluate the likelihood of incomplete sequences and eliminate those that have
no chance of maximizing the likelihood.
Consider there, for instance. After two characters, we can compare two segmentations: ['t', 'h'] and ['th']. Trained on Homer’s works with BPE, the negative log likelihoods of t, h, and th are respectively 5.11, 5.25, and 5.57. At this point in the segmentation, there is no way 5.11 + 5.25 = 10.36 can be better than 5.57. We can then discard ['t', 'h']. We will store the intermediate result that, at index 2, is the best segmentation ['th'] and reuse it when we proceed further
in the word. This is the idea of the Viterbi algorithm (Viterbi, 1967). Using Viterbi’s
words, we will call the intermediate result at index i, a survivor.
As data structure, given a word of n characters, we will use two lists: one to store the surviving subwords at index i, swords, and one to store the minimal negative log likelihoods, min_nlls. We initialize the first list with the input word truncated at each index i, word[:i], i.e. longer and longer prefixes, and the second with the negative log likelihoods of these subwords.
Following our example, we implement the algorithm with two loops. The first
one goes through the characters of the word using index i; the second loop with
index j goes through the characters from 1 up to i and computes the likelihood of
the word[j:i] subword. If the resulting segmentation is better, we update the NLL
of the word at index i − 1. We use here a dynamic programming technique: instead
of recomputing the negative log likelihoods, we look them up in this list.
Eventually, when we have reached the end of the word, we extract the subwords.
For this, we start from the end, where we have the last subword. We remove the
incomplete sequences leading to this last subword. We repeat this process until we
have reached the beginning of the list.
def encode(self, word):
n = len(word)
swords = [word[:i] for i in range(1, n + 1)]
min_nlls = [self.uni_probs.get(sword, 1000.0) for sword in swords]
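    # The rest of encode() is not reproduced in this extract; the lines below
    # are a sketch of the two loops and of the backtracking step described
    # in the text, not the book's exact code.
    for i in range(2, n + 1):
        for j in range(1, i):
            nll = min_nlls[j - 1] + self.uni_probs.get(word[j:i], 1000.0)
            if nll < min_nlls[i - 1]:
                min_nlls[i - 1] = nll
                swords[i - 1] = word[j:i]    # best last subword of word[:i]
    # Backtrack from the end of the word to recover the subword sequence
    subwords = []
    i = n - 1
    while i >= 0:
        subwords.insert(0, swords[i])
        i -= len(swords[i])
    return subwords, min_nlls[n - 1]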
In Sect. 13.4.2, we outlined how to estimate the probabilities. Now that we have run
a first segmentation with BPE estimates (uni_probs_old), we can recompute new
estimates (uni_probs_new). This corresponds to the function below.
def em(text, uni_probs_old):
cache = {}
tokens = []
unigram = Unigram(uni_probs_old)
words = unigram.pretokenize(text)
for word in words:
if word not in cache:
cache[word] = unigram.encode(word)[0]
tokens += cache[word]
uni_probs_new = calc_nll(tokens)
return uni_probs_new
We can then reapply a segmentation and repeat this procedure a few times, here 5,
or until the estimates have converged, i.e. do not change between two tokenizations.
uni_probs = dict(uni_probs_bpe)
for _ in range(5):
    uni_probs = em(text, uni_probs)
Finally, so far, we do not know the optimal vocabulary of subwords for a unigram
segmentation. In the previous sections, we have assumed that we could obtain it
with BPE. This is nonetheless an approximation only. For a given vocabulary size,
we would need to generate and test all the possible subword combinations to find it.
In practice, this is impossible.
As a feasible implementation, Kudo (2018) proposed to start with a superset of
the final vocabulary that we obtain with another method, here BPE. This superset, V ,
is the seed vocabulary and its size should be reasonably large to be sure it contains
the optimal subset. Then, we remove the words from V by order of least significance
for the quality of the language model until we reach the desired size. We always keep
subwords of length 1 to avoid out-of-vocabulary words.
We implement the elimination incrementally: For all the subwords sw_i ∈ V, we compute the NLL value on a corpus with a vocabulary deprived of this subword, V \ {sw_i}. We measure the contribution of sw_i to the performance of the language model with the loss it causes, that is, the difference between this NLL value and the NLL value obtained with the complete vocabulary V.
SentencePiece (Kudo & Richardson, 2018) is the last subword tokenizer we describe; it uses either BPE or a unigram language model. The main difference with the previous algorithms is that there is no pretokenization. This is especially useful for languages like Chinese and Japanese, where there is no space between the words. SentencePiece treats the white spaces as ordinary characters.
As preprocessing, SentencePiece replaces all the white spaces with '▁' (U+2581). We can then train it on raw corpora. The decoding of a tokenized text is very easy. Kudo and Richardson (2018) just replace the '▁' characters with spaces with this statement:
detok = ''.join(tokens).replace('▁', ' ')
The Hugging Face company has reimplemented a suite of tokenizers with the most
common algorithms. They are easy to use and very fast. They come with pre-trained
models and we can also train them on specific corpora.
The Hugging Face tokenizers share a common API in Python. A tokenizer
consists of a pipeline of classes:
1. Normalizer, which normalizes the text, for instance in lowercase, by removing
accents, or normalizing Unicode;
2. PreTokenizer, for instance a white space tokenizer;
3. Model, the algorithm to train from a corpus, BPE for instance;
4. Decoder, the program to map the tokens back to a text once we have a trained model;
5. PostProcessor, to add some special tokens such as start or end.
We can create a class implementing this sequence or reuse an existing one trained
on a corpus. In the next section, we describe a pretrained BPE and then how to build
a tokenizer from scratch.
For this experiment, we will use the GPT-2 tokenizer. We create the model with the
statements:
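A possible way to obtain a pretrained GPT-2 tokenizer with the Hugging Face tokenizers library is, for instance (a sketch; the exact statements may differ):

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('gpt2')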
In the rest of this section, we show the tokenizer results with this quote:
ecloges_str = """Exiled from home am I; while, Tityrus, you
Sit careless in the shade"""
We can also create a tokenizer from scratch and train it on a corpus. Here is a
minimal configuration for BPE:
from tokenizers import Tokenizer, decoders, models,\
pre_tokenizers, processors, trainers
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel()
trainer = trainers.BpeTrainer(vocab_size=500)
We specify the vocabulary size to the trainer and we train it on Homer’s Iliad and
Odyssey stored in the text string. Before, we split the string into paragraphs:
trainer = trainers.BpeTrainer(vocab_size=500)
text_sentences = re.split(r’\n+’, text)
tokenizer.train_from_iterator(text_sentences, trainer=trainer)
13.7 Further Reading
While BPE dates from 1994, subword tokenizers are quite recent in natural language processing. Their adoption is due to the emergence of very large multilingual corpora and the impossibility of encoding an unlimited number of words. This probably explains why their first applications were in machine translation; see Wu et al. (2016) inter alia. Transformers (Vaswani et al., 2017) popularized them, as subwords are their standard input format. We will see this architecture in Chap. 15, Self-Attention and Transformers.
Subword tokenizers have many intricate technical details and, when in doubt about their precise implementation, the best option is to read their code. Karpathy (2022) provides a very didactic implementation of BPE and Kudo (2017) of SentencePiece, both on
GitHub. Hugging Face created a fast implementation of many subword tokenizers
in Rust,2 as well as a good documentation of the API.3
2 https://github.com/huggingface/tokenizers/.
3 https://huggingface.co/docs/tokenizers/index.
Chapter 14
Part-of-Speech and Sequence Annotation
In Sect. 12.5, we saw that the same word may have two or more parts of speech, leading
to different morphological analyses or syntactic interpretations. The word can is an
example of such an ambiguity as we can assign it two tags: either noun or modal
verb. It is a noun in the phrase a can of soup and a modal verb in I can swim.
Ambiguity resolution, that is, retaining only one part of speech (POS) and discarding the others, is generally referred to as POS tagging or POS annotation. In this chapter, we will see how to do this automatically with machine-learning techniques. We will proceed from the simplest to more elaborate systems: We will start with an elementary baseline technique; we will then use feed-forward networks with one-hot encoding and embeddings; finally, we will use a new kind of device called recurrent neural networks.
Table 14.1 Ambiguities in part-of-speech annotation with the sentence: That round table might
collapse
Words Possible tags Example of use UPOS tags
That Subordinating conjunction That he can swim is good SCONJ
Determiner That white table DET
Adverb It is not that easy ADV
Pronoun That is the table PRON
Relative pronoun The table that collapsed PRON
round Verb Round up the usual suspects VERB
Preposition Turn round the corner ADP
Noun A big round NOUN
Adjective A round box ADJ
Adverb He went round ADV
table Noun That white table NOUN
Verb I table that VERB
might Noun The might of the wind NOUN
Modal verb She might come AUX
collapse Noun The collapse of the empire NOUN
Verb The empire can collapse VERB
14.2 Baseline
Before we start writing elaborate tagging algorithms, let us implement the baseline technique, as it requires very little effort. As we saw, this baseline tags each word with its most frequent part of speech. We can derive the frequencies from a part-of-speech annotated corpus, such as those from the Universal Dependencies (UD) repository; see Sect. 12.3. Among the English UD corpora, the largest one
is the English Web Treebank (EWT) with about 250,000 words split into training,
validation, and test sets.
We collect the training, validation, and test sets from GitHub and we store them as strings in train_sentences, val_sentences, and test_sentences. Using the CoNLL dictorizer from Sect. 12.4, we create lists of dictionaries from these strings with the column names as keys:
conll_dict = CoNLLDictorizer(col_names)
train_dict = conll_dict.transform(train_sentences)
val_dict = conll_dict.transform(val_sentences)
test_dict = conll_dict.transform(test_sentences)
We then compute the distribution of the parts of speech of a word with this function
that we apply to the training set:
def distribution(corpus, word_key=’FORM’, pos_key=’UPOS’):
word_cnt = count_word(corpus, word_key)
pos_dist = {key: Counter() for key in word_cnt.keys()}
for sentence in corpus:
for row in sentence:
distribution = pos_dist[row[word_key]]
distribution[row[pos_key]] += 1
return pos_dist
As preprocessing, we can lowercase the letters or keep them as they are. Table 14.2
shows the distribution of the parts of speech for the words in the sentence That round
table might collapse with counts extracted without changing the case.
The POS tagger model is then just a dictionary that associates a word with its most frequent part of speech. We obtain the associations with:
word_pos = {}
for word in pos_dist:
word_pos[word] = max(pos_dist[word],
key=pos_dist[word].get)
POS tagging is then very simple: It is just a dictionary lookup. Table 14.2 shows that, on our highly ambiguous sentence, the baseline accuracy is only 2/5, i.e., 40%.
Table 14.2 Part of speech distribution in the EWT. We kept the original case of the letters
Words Parts-of-speech counts Most frequent POS Correct POS
That PRON: 58, DET: 15, SCONJ: 6 PRON DET
round NOUN: 4, ADV: 3, ADJ: 2, ADP: 2 NOUN ADJ
table NOUN: 14 NOUN NOUN
might AUX: 77 AUX AUX
collapse NOUN: 2, VERB: 1 NOUN VERB
We cannot apply the tagger as is to the test set, as it contains words unseen in the training set. The last step is then to decide how to deal with such words. A first strategy is to use the most frequent tag. For this, we just need to apply a counter to
the values of word_pos:
>>> Counter(word_pos.values())
Counter({’NOUN’: 7322,
’PROPN’: 3805,
’VERB’: 3412,
...
We would then tag the unseen words as nouns. Another idea is to check which tag would give the best results on a different corpus. We have the validation set for this purpose, which we will use to tune the model before we run the final evaluation on the test set.
To implement this method, we compute the distribution of the unseen words, i.e., those not in the word_pos dictionary, with:
def unseen_words_pos_dist(corpus,
word_pos,
word_key=’FORM’,
pos_key=’UPOS’):
unseen_words = Counter()
for sentence in corpus:
for word in sentence:
if not word[word_key] in word_pos:
unseen_words[word[pos_key]] += 1
return unseen_words
These results show that if we choose to tag an unknown word as a noun, we would be right 622 times, but 739 times if we choose to tag it as a proper noun. The latter method is thus better, and we will tag the unseen words as proper nouns.
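As an illustration, here is a minimal baseline tagging function (a sketch, not the book's code) that applies the dictionary lookup with the proper-noun fallback we just chose:

def baseline_tag(sentence_words, word_pos, unknown_tag='PROPN'):
    # Dictionary lookup; unseen words receive the fallback tag
    return [word_pos.get(word, unknown_tag) for word in sentence_words]

On the sentence of Table 14.2, this returns the most-frequent-POS column: ['PRON', 'NOUN', 'NOUN', 'AUX', 'NOUN'].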
Table 14.3 Annotation of the English sentence: I can’t remember. in the EWT. Note that not is
annotated as a particle in this corpus rather than the more traditional adverb
ID FORM LEMMA UPOS FEATS
1 I I PRON Case=Nom|Number=Sing|Person=1|PronType=Prs
2–3 can’t _ _ _
2 ca can AUX VerbForm=Fin
3 n’t not PART _
4 remember remember VERB VerbForm=Inf
5 . . PUNCT _
14.3 Evaluation
Applying our tagger to the test set, we reach an accuracy of 86.4% on the EWT
corpus for English and, with the same method, we obtain 91.2% on the GSD corpus
(Guillaume et al. 2019) in French.
In the EWT and GSD corpora, contracted forms like can't, don't, or won't appear twice in the annotated sentences, as in Table 14.3: once with the original form, can't, and once as two morphologically analyzed tokens with the lemmas can and not. Similarly, in French, the words du and des are expanded into de le and de les, respectively. We did not apply any specific preprocessing to remove them, as this had insignificant consequences on the final accuracy.
In addition to accuracy, we can derive a confusion matrix that shows for each
tag how many times a word has been correctly or wrongly labeled. Table 14.4
shows the results on the EWT test set for the baseline method. It enables us to
understand and track errors. The diagonal shows the breakdown of the tags correctly
assigned, for example, 96.8% for determiners (DET). The rest of the table shows the
tags wrongly assigned, i.e., 1.7% of the determiners have been tagged as pronouns
(PRON) and 0.9% as subordinating conjunctions (SCONJ). This table is only an
excerpt, therefore the sum of rows is not equal to 100.
Table 14.4 A confusion matrix breaking down the results of the baseline method on the EWT test set. The first column corresponds to the correct tags, and for
each tag, the rows give the assigned tags. We set aside some parts of speech from the table. A perfect tagging would yield an identity matrix
Tagger →
Correct ↓   ADJ ADP ADV AUX CCONJ DET NOUN PRON PROPN SCONJ VERB
ADJ 82.8 0.6 1.5 0. 0. 0.1 1.9 0. 11.5 0. 1.6
ADP 0. 88.2 0.4 0. 0. 0. 0. 0. 0.3 0.6 0.
ADV 5.3 7.1 78.5 0.1 0.2 1.3 0.8 2.5 2.1 1.4 0.1
AUX 0. 0. 0. 88.9 0. 0. 0.1 0.1 0.3 0. 6.5
CCONJ 0. 0.1 0. 0. 99.7 0. 0. 0. 0.1 0. 0.
DET 0.2 0. 0.1 0. 0.2 96.8 0.1 1.7 0.1 0.9 0.
NOUN 0.8 0.1 0.2 0.1 0. 0. 76.2 0. 19. 0.1 3.1
PRON 0. 0. 0. 0. 0. 2.1 0. 93.2 0.1 4.4 0.
PROPN 1.1 0.3 0. 0.1 0. 0. 4.1 0. 93.6 0. 0.4
SCONJ 0. 33.3 1.6 0. 0. 0. 0. 0.3 1.6 60.4 0.
VERB 0.6 0.9 0.2 3.6 0. 0. 5.7 0. 7.2 0. 81.5
14.4 Part-of-Speech Tagging with Linear Classifiers
Fig. 14.2 The feature vectors corresponding to the X input matrix and the parts of speech to predict representing the y output
Figure 14.2 shows more feature vectors from this sentence. We first use this table to train a model; we then apply it sequentially to assign the tags. If we use logistic regression, the tagger outputs a probability, P(t_i | w_{i−2}, w_{i−1}, w_i, w_{i+1}, w_{i+2}), that we can associate with each tag of the sequence.
At the beginning and end of the sentence, the window extends beyond the
sentence boundaries. A practical way to handle this is to pad the sentence—the
words and parts of speech—with dummy symbols such as BOS (beginning of
sentence) and EOS (end of sentence) or <s> and </s>. If the window has a size
of five words, we will pad the sentence with two BOS symbols in the beginning and
two EOS symbols in the end.
We extract the features from POS-annotated corpora, and we train the models
using machine-learning libraries such as scikit-learn (Pedregosa et al. 2011) for
logistic regression or PyTorch for neural networks. Real systems would use more
features than those from the core feature set such as the word characters, prefixes
and suffixes, word bigrams, part-of-speech bigrams, etc. Nonetheless, the principles
remain the same, the only difference would be a larger feature set.
In this section, we will implement a simple tagger that uses a context of five words centered on the current word to predict the part of speech. We will first write a feature extractor, convert these features into numerical vectors, and store them in a matrix. We will then train and apply a logistic regression model. Figure 14.3 shows the overall structure of such a sequence annotator, where given an input x_i, the system predicts an output, y_i. Note that we could apply this architecture to any kind of sequence, provided that the input and output have equal length.
Fig. 14.3 Sequence annotation, here using logistic regression and applied to part-of-speech
tagging. The input and output sequences have equal length
14.5 Programming a Part-of-Speech Tagger with Logistic Regression
We load the training, validation, and test sets and convert them into lists of dictionaries as in Sects. 12.4 and 14.2. Let us call these datasets train_dict, val_dict, and test_dict.
Then, given a sentence, we extract the words before and after the current word.
We store these words in a dictionary compatible with scikit-learn transformers so
that we can vectorize them using the built-in DictVectorizer() class; see Sect. 6.4.1.
To complete this extraction, we will proceed in two steps:
1. We extract the words and parts of speech from the dictionaries;
2. We create the word contexts as tables.
We extract the words and parts of speech of a sentence with the function:
def extract_cols(sent_dict, x=’FORM’, y=’UPOS’):
(input, target) = ([], [])
for word in sent_dict:
input += [word[x]]
target += [word.get(y, None)]
return input, target
At this point, for the sentence in Table 12.10, we have the words:
>>> train_sent_words[8131]
[’Or’, ’you’, ’can’, ’visit’, ’temples’, ’or’, ’shrines’,
’in’, ’Okinawa’, ’.’]
Now, we can create the feature vectors consisting of words centered on the position of the part of speech to predict. The length of the feature lists is given by the left and right contexts of w_size. The function pads the sentence with begin-of-sentence (BOS) and end-of-sentence (EOS) symbols and extracts windows of 2 × w_size + 1 words as in Table 14.1:
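The extraction code is not reproduced in this excerpt; the sketch below is a hypothetical reconstruction following the description above. The name create_X_cat is reused from the calls later in this section, while the 'BOS'/'EOS' symbols, the feature keys, and the final comprehensions are assumptions:

def create_X_cat(sent_words, w_size=2):
    # Pad the sentence with w_size dummy symbols on each side
    padded = ['BOS'] * w_size + sent_words + ['EOS'] * w_size
    X_cat = []
    for i in range(len(sent_words)):
        # Window of 2 * w_size + 1 words centered on the current word
        window = padded[i:i + 2 * w_size + 1]
        X_cat += [{'w' + str(j - w_size): word
                   for j, word in enumerate(window)}]
    return X_cat

X_train_cat = [x for sentence in train_sent_words
               for x in create_X_cat(sentence)]
y_train_cat = [pos for sentence in train_sent_pos
               for pos in sentence]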
dict_vectorizer = DictVectorizer()
X_train = dict_vectorizer.fit_transform(X_train_cat)
The scikit-learn classifiers can handle y vectors consisting of symbols, and we do not need to convert the POS classes into numbers.
Now that we have the features in X_train and the parts of speech in y_train_cat,
we can train a classifier. We use logistic regression from the linear_model module
in scikit-learn and its fit() function:
from sklearn import linear_model
classifier = linear_model.LogisticRegression()
model = classifier.fit(X_train, y_train_cat)
We write a predict() function to carry out the POS prediction from a sentence. It
encodes the features with the transform() function from DictVectorizer, where this
time it does not need to fit the symbols, and applies the classifier with predict().
We store the predicted POS tags in the PPOS key of the dataset dictionary:
def predict_sentence(sentence,
model,
dict_vectorizer,
ppos_key=’PPOS’):
sent_words, _ = extract_cols(sentence)
X_cat = create_X_cat(sent_words)
X = dict_vectorizer.transform(X_cat)
y_pred_vec = model.predict(X)
# We add the predictions in the PPOS column
for row, y_pred in zip(sentence, y_pred_vec):
row[ppos_key] = y_pred
return sentence
14.6 Part-of-Speech Tagging with Feed-Forward Networks
We load the dataset as in the previous section and the preprocessing to build the X matrix is almost identical. The only difference is to ensure that the PyTorch tensors
have the same numerical types as the NumPy arrays. For this, we vectorize the
matrices as nonsparse and with a datatype of 32-bit floats:
dict_vectorizer = DictVectorizer(sparse=False, dtype=np.float32)
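The conversion into tensors is not shown in this excerpt; a plausible sketch, assuming pos2idx maps each part of speech to an integer index (its construction is not reproduced here), is:

import torch

X_train = torch.from_numpy(dict_vectorizer.fit_transform(X_train_cat))
y_train = torch.tensor([pos2idx[pos] for pos in y_train_cat])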
We create TensorDataset and DataLoader objects for our three sets. We set the batch_size to 512 for the training process, meaning that we will use 512 samples per update:
train_dataset = TensorDataset(X_train, y_train)
train_dataloader = DataLoader(
train_dataset, batch_size=512, shuffle=True)
We use the Sequential module as in Sect. 8.5.3. The logistic regression model
has just one linear layer. The input dimension is the length of the rows in X and the
output dimension is given by the total number of parts of speech:
model = nn.Sequential(
nn.Linear(X_train.size(dim=1), len(pos2idx))
)
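The loss and optimizer are not shown in this excerpt; a plausible choice, consistent with the cross-entropy loss and the NAdam optimizer with a 0.005 learning rate used elsewhere in this chapter, is:

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.NAdam(model.parameters(), lr=0.005)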
We can now write a training loop, where we fit the parameters on the training set and
evaluate the results on the validation set. This will enable us to monitor the gradient
descent and see when the model starts overfitting. To simplify the code, we define
first an evaluation function that computes the loss and accuracy of a model. Its input
parameters are the model, the loss function, and a DataLoader object:
def evaluate(model,
loss_fn,
dataloader) -> tuple[float, float]:
model.eval()
with torch.no_grad():
loss = 0
acc = 0
batch_cnt = 0
for X_batch, y_batch in dataloader:
batch_cnt += 1
y_batch_pred = model(X_batch)
loss += loss_fn(y_batch_pred, y_batch).item()
acc += (sum(torch.argmax(y_batch_pred, dim=-1)
== y_batch)/y_batch.size(dim=0)).item()
return loss/batch_cnt, acc/batch_cnt
The training loop is similar to those we have already written. The training part is
followed by an evaluation on the validation set:
for epoch in range(EPOCHS):
train_loss = 0
train_acc = 0
batch_cnt = 0
model.train()
for X_batch, y_batch in tqdm(train_dataloader):
batch_cnt += 1
y_batch_pred = model(X_batch)
loss = loss_fn(y_batch_pred, y_batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()
train_acc += (sum(
torch.argmax(y_batch_pred, dim=-1)
== y_batch)/y_batch.size(dim=0)).item()
train_loss += loss.item()
model.eval()
with torch.no_grad():
history[’accuracy’] += [train_acc/batch_cnt]
history[’loss’] += [train_loss/batch_cnt]
val_loss, val_acc = evaluate(model, loss_fn, val_dataloader)
history[’val_loss’] += [val_loss]
history[’val_accuracy’] += [val_acc]
We plot the training and validation loss and accuracy curves to monitor whether they follow the same trend. When the validation accuracy starts to decrease, this means that the model is overfitting and we should stop. Figure 14.4 shows the curves, where we observe an overfitting starting at epoch 17.
Fig. 14.4 Feed-forward network with one-hot input. The training loss and accuracy over the epochs
We apply the model trained on 16 epochs to the test set of the English Web
Treebank and we compute the accuracy with:
evaluate(model, loss_fn, test_dataloader)
Building a network with more layers is very easy with PyTorch. We just need to add them to the model, where the activation function of the layers will be ReLU, except for the last one.
As a rule of thumb, the number of nodes in the hidden layers should gradually decrease from the number of input nodes to the number of classes. Here we create a hidden layer with twice as many nodes as there are parts of speech:
model = nn.Sequential(
nn.Linear(X_train.size(dim=1), 2 * len(pos2idx)),
nn.ReLU(),
nn.Linear(2 * len(pos2idx), len(pos2idx)),
)
Using the same training set and hyperparameters identical to those of our previous experiment, the model starts overfitting at the second epoch. This architecture with one hidden layer does not improve the accuracy, with a score of 91.43% after two epochs on the test set of the English Web Treebank. For this task, the simpler logistic regression model is thus more effective.
14.7 Embeddings
For example, the first values of the 100-dimensional embedding vector representing the word table are:
(−0.615, 0.897, 0.568, 0.391, −0.224, 0.490, 0.109, 0.274, −0.238, −0.522, ...).
1 https://nlp.stanford.edu/projects/glove/.
To create the index, we merge the words in the training corpus with the words in
the GloVe file and we create a set of all the words.
embeddings_words = embeddings_dict.keys()
vocabulary = set(corpus_words + list(embeddings_words))
We then create a word index, where we keep index 0 for the unknown words:
idx2word = dict(enumerate(sorted(vocabulary), start=1))
word2idx = {v: k for k, v in idx2word.items()}
Finally, we replace the lowercased words with their index in the context
dictionaries:
for x_train_cat in X_train_cat:
for word in x_train_cat:
x_train_cat[word] = word2idx[
x_train_cat[word].lower()]
Note that the resulting matrices contain indices and not one-hot vectors. We process
the parts of speech the same way as we did in Sect. 14.6.1.
We create the embedding table with a random initialization
embedding_table = torch.randn(
(len(vocabulary) + 1, EMBEDDING_DIM))/10
and we fill the index values when available with the GloVe embeddings:
for word in vocabulary:
if word in embeddings_dict:
embedding_table[word2idx[word]] = embeddings_dict[word]
Fig. 14.5 Feed-forward network with embedding input. The training loss and accuracy over the
epochs
Now all is in place to create the neural network. We use a sequential model:
It consists of a trainable embedding layer that fetches the embeddings from the
word indices in the X matrix storing the contexts; we load the 100-dimension GloVe
embeddings with from_pretrained() as in Sect. 11.8; a Flatten() class that flattens
the embedded vectors; and a linear layer:
model = nn.Sequential(
nn.Embedding.from_pretrained(
embedding_table, freeze=False),
nn.Flatten(),
nn.Linear(5 * embedding_table.size(dim=1),
len(pos2idx))
)
The rest of the code and the hyperparameters are identical to those of the previous section. We also use the NAdam optimizer with a learning rate of 0.005.
Figure 14.5 shows the loss and accuracy curves. We see that the model starts overfitting at epoch 3. After training the model for two epochs on the training set of the English Web Treebank, we obtain an accuracy of 93.34% on the test set. This is nearly 2% better than with a one-hot encoding.
14.8 Recurrent Neural Networks
In Figs. 14.1 and 14.2, we predicted a part of speech t_i from the words surrounding it, (w_{i−2}, w_{i−1}, w_i, w_{i+1}, w_{i+2}). Once identified, this part of speech certainly sets constraints on the next one; for instance, a determiner will probably exclude a verb as the following tag. A possible improvement to our previous programs would be to include one or two previous predictions, t_{i−1} and t_{i−2}, in the feature vector so that we have more information on the context: P(t_i | w_{i−2}, w_{i−1}, w_i, w_{i+1}, w_{i+2}, t_{i−2}, t_{i−1}).
Fig. 14.7 The network from Fig. 14.6 unfolded in time. In this figure, the connections between the hidden states and the outputs are not trainable
At index i, the recurrent layer computes the linear combination:

W x_i + U h_{i−1} + b.
The number of rows in W corresponds to the number of hidden units. We pass the result through an activation function, usually a hyperbolic tangent, to get the output of the hidden layer at index i:

h_i = tanh(W x_i + U h_{i−1} + b).

The hidden vector h_i is passed to a linear layer and then a softmax function to predict the output vector y_i.
The design of a simple part-of-speech tagger with recurrent neural networks and PyTorch is easy. We load the datasets and we dictorize them as in Sect. 14.5.1. We then extract the x and y aligned sequences from all the sentences to create the X and Y matrices. We split the matrix construction into four steps:
1. For each sentence in the corpus, collect the parallel sequences of words and parts
of speech. This corresponds to the FORM and UPOS columns in Table 12.10;
2. Build the word and part-of-speech indices and replace each word and part of
speech in the lists with their index;
3. Pad the lists with a padding symbol so that all the lists have the same length and
create two matrices from the two lists of lists;
4. Finally, convert the numbers into embedding vectors. As in Sect. 14.7, we will
use the GloVe embeddings.
We collect the lists of words and parts of speech using the extract_cols()
function as in Sect. 14.5.1. It extracts by default the FORM and UPOS columns
and returns two lists per sentence. Using the same comprehension as in Sect. 14.5.1,
we build the lists of words and parts of speech:
train_sent_words, train_sent_pos
val_sent_words, val_sent_pos
test_sent_words, test_sent_pos
where we set the words in lowercase to be compatible with GloVe. We do the same
with the validation and test sets.
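The comprehension itself is not reproduced here; a hypothetical equivalent for the training set, lowercasing the words for GloVe compatibility as stated above, is:

train_sent_words = []
train_sent_pos = []
for sentence in train_dict:
    words, pos = extract_cols(sentence)
    train_sent_words += [[word.lower() for word in words]]
    train_sent_pos += [pos]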
Indices
To create the word and part-of-speech indices, we first collect the list of unique
words as well as the parts of speech from the training set.
corpus_words = sorted(set([word
for sentence in train_sent_words
for word in sentence]))
pos_list = sorted(set([pos
for sentence in train_sent_pos
for pos in sentence]))
We also collect the words from the embedding dictionary as in Sect. 14.7 and we
merge the vocabularies:
embeddings_words = embeddings_dict.keys()
vocabulary = set(corpus_words + list(embeddings_words))
We then create the word indices that will serve as input to the PyTorch embedding
table.
Sequences of words or characters are by nature of different lengths. As PyTorch
processes mostly tensors of rectangular dimensions, we need to pad the input so that
all the sequences have an equal length. To make a distinction with other indices, the
Embedding class has an argument that assigns the padding symbol to a given index:
padding_idx. In the embedding table, the corresponding row will be filled with zeros
and the parameters will be nontrainable.
>>> embedding_pad = nn.Embedding(vocab_size, embedding_dim,
padding_idx=0)
In the word index, we start at 2 to make provision for the padding symbol 0 and
unknown words, 1. The POS index will start at 1 to differentiate the POS indices
with the padding symbol too:
idx2word = dict(enumerate(vocabulary, start=2))
idx2pos = dict(enumerate(pos_list, start=1))
word2idx = {v: k for k, v in idx2word.items()}
pos2idx = {v: k for k, v in idx2pos.items()}
We convert the lists of training sentences and parts of speech into tensors:
X_train_idx = to_index(train_sent_words, word2idx)
Y_train_idx = to_index(train_sent_pos, pos2idx)
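The to_index() helper is not shown in this excerpt; a hypothetical sketch consistent with its use above, mapping unknown words to index 1 as decided earlier, is:

def to_index(sentences, idx, unknown=1):
    # One tensor of indices per sentence; pad_sequence() will equalize lengths
    return [torch.tensor([idx.get(token, unknown) for token in sentence])
            for sentence in sentences]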
Padding
We can now pad the sequences to an identical length with the PyTorch
pad_sequence() built-in function. This length corresponds to that of the longest
sentence in the batch. As we decided, we set the padding index to 0.
We saw earlier that the traditional format of the X input matrix is to store the features of an observation, here a sentence, in a row, each row containing the words of this sentence. We then stack the observations vertically in X. The first dimension of the input matrix is then called the batch dimension or batch axis.
PyTorch has a default format for the recurrent networks and the pad_sequence()
function, where the columns are the sentences and the rows, the words across the
sentences. For instance, the first row contains the first word of all the sentences in
the batch. It is possible to change this order with the argument batch_first=True
that restores the traditional order.
To see the difference, let us pad three short sentences in the corpus with the traditional order:

from torch.nn.utils.rnn import pad_sequence

>>> pad_sequence(X_train_idx[8131:8134], batch_first=True)

We will use the traditional batch-first order in the rest of this chapter as we are used to it.
Embeddings
We create the embedding matrix with the values from GloVe or random values for
the words in the training corpus, but not in GloVe:
EMBEDDING_DIM = 100
embedding_table = torch.randn(
(len(vocabulary) + 2, EMBEDDING_DIM))/10
for word in vocabulary:
if word in embeddings_dict:
embedding_table[word2idx[word]] = embeddings_dict[word]
Network Architecture
Now that we have processed the data, we can use it as input to an RNN. We build the network with a class derived from Module, where we stack layers in a pipeline. We start with an embedding layer; we add a PyTorch RNN() layer; and finally a Linear() layer to make the prediction. The most important parameters of RNN() are:
1. The input size, which corresponds here to the embedding dimension;
2. The hidden size, which corresponds to the output dimension of the recurrent network. We use 128 here;
3. The number of recurrent layers that we want to stack, set with num_layers. We will create two layers;
4. batch_first, which we set to true;
5. A dropout rate, which we will explain in the next section;
6. The bidirectional Boolean. By default, RNN() runs the cells from left to right as shown in Fig. 14.7 and does not take into account the RNN outputs to the right of the current word. We set bidirectional to true to run RNN() from left to right and right to left in parallel. Then, for a given word, a bidirectional RNN layer concatenates the left and right RNN outputs.
In a bidirectional network, we have twice the number of hidden nodes of a single direction. We then need to multiply the input size of the linear layer by two.
As output, an RNN object returns two values: the whole sequence of predictions and the last prediction, corresponding to the last word in the sentence. We will predict the whole sequence of parts of speech. Our class is then:
class Model(nn.Module):
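    # The body of this class is not reproduced in the excerpt; the lines
    # below are a hypothetical reconstruction following the description
    # above: a GloVe embedding layer, a stack of bidirectional RNN layers
    # with a hidden size of 128, and a final linear layer.
    def __init__(self, embedding_table, hidden_size, num_classes,
                 num_layers=2, dropout=0.0):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(
            embedding_table, freeze=False, padding_idx=0)
        self.rnn = nn.RNN(embedding_table.size(dim=1),
                          hidden_size,
                          num_layers=num_layers,
                          batch_first=True,
                          dropout=dropout,
                          bidirectional=True)
        # Twice the hidden size because the network is bidirectional
        self.fc = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, X):
        X = self.embedding(X)
        # RNN() returns the output sequence and the last hidden state;
        # we keep the whole sequence to predict one tag per word
        X, _ = self.rnn(X)
        return self.fc(X)

# A plausible instantiation (hypothetical):
model = Model(embedding_table, 128, len(pos2idx) + 1)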
where we add 1 to the number of classes to take the padding symbol into account.
We define a loss and an optimizer, where we tell the loss to ignore the padding
index:
loss_fn = nn.CrossEntropyLoss(ignore_index=0)
optimizer = torch.optim.NAdam(model.parameters(), lr=0.005)
Training Loop
The training loop is overall identical to those we have already written. The only
difference is in the computation of the loss. Our input consists of sentences vertically
stacked in a matrix, and the model outputs sequences of predictions. This output is a third-order tensor (sometimes called a 3D matrix), where the first axis is the batch; the second one, the word position in the sentence; and the third one, the probabilities of the parts of speech.
As input to the loss function, we need to provide a tensor of probabilities and a tensor of indices. To do this, we flatten the Y matrix containing the indices of the true parts of speech into one single vector of indices with Y.reshape(-1). Similarly, Y_pred is the stack of sentences, where the rows contain the probabilities of the parts of speech for each input word. We reshape them into a matrix, where the first axis represents the rows put end to end and the second axis, the prediction probabilities for each part of speech:
Y_pred.reshape(-1, Y_pred.size(dim=-1))
We integrate the evaluation and computation of the loss in the training loop:
for epoch in range(EPOCHS):
train_loss = 0
train_accuracy = 0
t_words = 0
model.train()
for X_batch, Y_batch in tqdm(train_dataloader):
Y_batch_pred = model(X_batch)
loss = loss_fn(
Y_batch_pred.reshape(
-1,
Y_batch_pred.size(dim=-1)),
Y_batch.reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()
with torch.no_grad():
n_words = torch.sum(Y_batch > 0).item()
t_words += n_words
train_loss += n_words * loss.item()
train_accuracy += torch.mul(
torch.argmax(
Y_batch_pred, dim=-1) == Y_batch,
Y_batch > 0).sum().item()
model.eval()
with torch.no_grad():
history[’accuracy’] += [
train_accuracy/t_words]
history[’loss’] += [train_loss/t_words]
val_loss, val_acc = evaluate(model,
loss_fn,
val_dataloader)
history[’val_loss’] += [val_loss]
history[’val_accuracy’] += [val_acc]
With a batch size of 128 and hyperparameters otherwise identical to our previous experiments, the model starts overfitting at epoch 3. We reach an accuracy of 93.47% on the test set of the English Web Treebank after two epochs. This is slightly better than the feed-forward network with embeddings.
14.8.2 Dropout
Fig. 14.8 RNN network with embedding input. The training loss and accuracy over the epochs
One way to fight the overfitting of neural networks is to drop out some connections, that is, to set them to zero randomly at each update in the training step. In PyTorch, we can insert dropout layers anywhere in a sequential architecture using this statement:
nn.Dropout(rate)
where rate is a number between 0 and 1 representing the fraction of dropped-out inputs: With 0, no input is dropped; with 1, all the inputs are dropped. This dropout is only for the training step, and we must be careful to set the model in evaluation mode for inference:
model.eval()
For RNN layers, we set the corresponding dropout rates with the dropout parameter
of the RNN class.
Figure 14.8 shows the accuracy and the loss using the previous hyperparameters
and a dropout rate of 0.2. We reach an accuracy of 93.60% on the test set of the
English Web Treebank after 2 epochs. This is about the same as without dropout,
but the training curves are more regular.
14.8.3 LSTM
As we have seen, simple RNNs use the previous output as input. They then have a very limited feature context. Although we can run them in both directions, they cannot memorize large word windows or model long dependencies.
Long short-term memory units (LSTM) (Hochreiter and Schmidhuber 1997) are
an extension to RNNs that can remember, possibly forget, information from longer
or more distant sequences. This has made them a popular technique for part-of-
speech tagging.
An LSTM cell first computes a base value, g_t, from the current input, x_t, and the previous output, h_{t−1}, as in a simple RNN:

g_t = tanh(W_g x_t + U_g h_{t−1} + b_g).
From the previous output and current input, we compute three kinds of filters, or gates, that will control how much information is passed through the LSTM cell. Each of these gates uses two matrices and a bias, and has the form of an RNN equation.
The first two gates, i and f, defined as:

i_t = activation(W_i x_t + U_i h_{t−1} + b_i),
f_t = activation(W_f x_t + U_f h_{t−1} + b_f),

model respectively how much we will keep from the base equation and how much we will forget from the long-term state.
To implement this selective memory, we apply the two gates to the base equation and to the previous long-term state with the element-wise product (Hadamard product), denoted ⊙, and we sum the resulting terms to get the current long-term state:

c_t = i_t ⊙ g_t + f_t ⊙ c_{t−1}.
The third gate, o_t, controls how much of the cell state is exposed as the output h_t:

o_t = activation(W_o x_t + U_o h_{t−1} + b_o),
h_t = o_t ⊙ tanh(c_t).
Figure 14.9 shows the data flow between the inputs, c_{t−1} and h_{t−1}, and the outputs, c_t and h_t. As for the other forms of neural networks, the LSTM parameters are learned during training.
Fig. 14.9 An LSTM unit showing the data flow, where g_t is the unit input, i_t, the input gate, f_t, the forget gate, and o_t, the output gate. The activation functions have been omitted
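The LSTM tagger code does not appear in this excerpt. With PyTorch, a minimal change to the hypothetical Model sketch given earlier would replace the RNN layer with an LSTM; the rest of the class is unchanged, since nn.LSTM also returns the output sequence as its first value:

self.rnn = nn.LSTM(embedding_table.size(dim=1),
                   hidden_size,
                   num_layers=2,
                   batch_first=True,
                   dropout=0.2,
                   bidirectional=True)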
Fig. 14.10 LSTM network with embedding input. The training loss and accuracy over the epochs
In Sect. 14.8, we saw that the LSTM networks enabled us to obtain the best results
in part-of-speech annotation. In fact, these techniques can apply to any sequence
of symbols, either as input or output. In this section, we will describe application
examples with partial syntactic groups and named entities and we will start with a
few definitions.
We saw in Sect. 12.1.3 that the two major parts of speech are the noun and the verb.
In a sentence, nouns and verbs are often connected to other words that modify their
meaning, for instance an adjective or a determiner for a noun; an auxiliary for a verb.
These closely connected words correspond to partial syntactic structures, or groups,
called noun groups, when the most important word in the structure is a noun or
verb groups, when it is a verb. The terms noun chunks and verb chunks are also
widely used and their identification in a sentence is called chunking.
More formally, noun groups (Table 14.5) and verb groups (Table 14.6) corre-
spond to verbs and nouns and their immediate depending words. This is often
understood, although not always, as words extending from the beginning of the
constituent to the head noun or the head verb. That is, the groups include the
headword and its dependents to the left. They exclude the postmodifiers. For
the noun groups, this means that modifying prepositional phrases or, in French,
adjectives to the right of the nouns are not part of the groups.
The principles we exposed above are very general, and exact definitions of groups
may vary in the literature. They reflect different linguistic viewpoints that may
coexist or compete. However, precise definitions are of primary importance. Like
for part-of-speech tagging, hand-annotated corpora will solve the problem. Most
corpora come with annotation guidelines. They are usually written before the hand-
annotation process. As definitions are often difficult to formulate the first time, they
are frequently modified or complemented during the annotation process. Guidelines
normally contain definitions of groups and examples of them. They should be
precise enough to enable the annotators to identify consistently the groups. The
annotated texts will then encapsulate the linguistic knowledge about groups and
make it accessible to the machine-learning techniques.
In the NLP literature, a named entity is a term closely related to that of a proper noun. The word entity is roughly a synonym for object or thing, and a named entity is an entity whose name in a text refers to a unique person, place, object, etc., such as William Shakespeare or Stratford-upon-Avon in the phrase:
William Shakespeare was born and brought up in Stratford-upon-Avon.
Again this reflects overall the distinction between common and proper nouns; see
Fig. 14.11.
Names of people or organizations are frequent in the press and the media, where
they surge and often disappear quickly. The first step before any further processing
is to identify the phrases corresponding to names of persons, organizations, or
locations (Table 14.7). Such phrases can be a single proper noun or a group of
words.
Named entity recognition also commonly extends to temporal expressions
describing times and dates, and numerical and quantity expressions, even if these
are not entities.
14.10 Group Annotation Using Tags
The most intuitive way to mark the groups, noun groups, verb groups, or named entities, is to bracket them with opening and closing parentheses. Ramshaw and Marcus (1995) give examples below of noun group bracketing, where they insert brackets between the words where appropriate:
[NG The government NG] has [NG other agencies and instruments NG] for pursuing [NG these other objectives NG].
Even [NG Mao Tse-tung NG] [NG 's China NG] began in [NG 1949 NG] with [NG a partnership NG] between [NG the communists NG] and [NG a number NG] of [NG smaller, non-communists parties NG].
Ramshaw and Marcus (1995) defined a tagset of three elements {I, O, B}, where I
means that the word is inside a noun group, O means that the word is outside, and B
(between) means that the word is at the beginning of a noun group that immediately
Fig. 14.11 Named entities: entities that we can identify by their names. Portrait: credits
Wikipedia. Map: Samuel Lewis, Atlas to the topographical dictionaries of England and Wales,
1848, credits: archive.org
follows another noun group. Using this tagging scheme, an equivalent annotation of
the sentences above is:
The/I government/I has/O other/I agencies/I and/I instruments/I for/O pursuing/O these/I
other/I objectives/I ./O
Even/O Mao/I Tse-tung/I ’s/B China/I began/O in/O 1949/I with/O a/I partnership/I
between/O the/I communists/I and/O a/I number/I of/O smaller/I ,/I non-communists/I
parties/I ./O
Starting from this original definition, Tjong Kim Sang and Veenstra (1999) proposed slight changes to the IOB format. The most widespread variant is called IOB2 or BIO,
where the first word in a group receives the B tag (begin), and the following words
the I tag. As for IOB, words outside the groups are annotated with the O tag. Using
BIO, the two examples would be annotated as:
The/B government/I has/O other/B agencies/I and/I instruments/I for/O pursuing/O these/B
other/I objectives/I ./O
Even/O Mao/B Tse-tung/I ’s/B China/I began/O in/O 1949/B with/O a/B partnership/I
between/O the/B communists/I and/O a/B number/I of/O smaller/B ,/I non-communists/I
parties/I ./O
The BIO annotation scheme gained acceptance from the Conferences on Computational Natural Language Learning (CoNLL 2000, see Sect. 14.14), which adopted it, and it became popular enough that many people now use the term "IOB scheme" when they actually mean BIO.
The BIOES tagset is a third scheme, where the B tag stands for the beginning of a group, I, for inside of a group, E, for the end of a group, S, for a group consisting of a single word, and finally O, for outside. Using BIOES, the two examples are annotated as:
The/B government/E has/O other/B agencies/I and/I instruments/E for/O pursuing/O these/B
other/I objectives/E ./O
Even/O Mao/B Tse-tung/E ’s/B China/E began/O in/O 1949/S with/O a/B partnership/E
between/O the/B communists/E and/O a/B number/E of/O smaller/B ,/I non-communists/I
parties/E ./O
We can extend the BIO scheme to annotate two or more group categories. This is
straightforward: We just need to use tags with a type suffix as for instance the tagset
{B-Type1, I-Type1, B-Type2, I-Type2, O} to markup two different group
types, Type1 and Type2.
CoNLL 2000 (Tjong Kim Sang & Buchholz 2000) is again an example of such an annotation extension. The organizers used 11 different group types: noun phrases (NP), verb phrases (VP), prepositional phrases (PP), adverb phrases (ADVP), subordinated clauses (SBAR), adjective phrases (ADJP), particles (PRT), conjunction phrases (CONJP), interjections (INTJ), list markers (LST), and unlike coordinated phrases (UCP).2
The noun phrases, verb phrases, and prepositional phrases make up more than 90% of all the groups in the CoNLL 2000 corpus.
As we saw, the CoNLL shared task in 2000 used the BIO (IOB2) tagset to annotate
syntactic groups. Figure 14.12 shows an example of it with the sentence:
He reckons the current account deficit will narrow to only £1.8 billion in September.
2 We feel that the word “phrase” has a misleading sense here. Most people in the field would
understand it differently. The CoNLL 2000 phrases correspond to what we call group or chunk in
this book: nonrecursive syntactic groups.
and where the prepositional groups are limited to the preposition to avoid recursive
groups. The dataset format is an older variant of CoNLL-U (Sect. 12.3) and has less
information. It consists of only three columns:
1. The first column contains the sentence words with one word per line and a blank
line after each sentence.
2. The second column contains the predicted parts of speech of the words. The
CoNLL 2000 organizers used Brill’s tagger (Brill 1995) trained on the Penn
Treebank (Marcus et al. 1993) to assign these parts of speech (Tjong Kim Sang &
Buchholz 2000). The POS tags are derived from the Penn Treebank and are now
considered legacy. In multilingual corpora, they are replaced by the universal
parts of speech, see Sect. 12.2;
3. The third column contains the groups with the manually assigned tags.
The topic of the CoNLL 2002 and 2003 shared tasks was to annotate named entities. These tasks reused the ideas laid down in CoNLL 2001 with the BIO (IOB2) and IOB tag sets:
• The CoNLL 2002 annotation (Tjong Kim Sang 2002) consists of two columns, the first one for the words and the second one for the named entities with four categories: persons (PER), organizations (ORG), locations (LOC), and miscellaneous (MISC). CoNLL 2002 uses BIO (IOB2). Figure 14.13, left part, shows the annotation of the sentence:
[PER Wolff PER], a journalist currently in [LOC Argentina LOC], played with [PER Del Bosque PER] in the final years of the seventies in [ORG Real Madrid ORG].
• The CoNLL 2003 annotation (Tjong Kim Sang and De Meulder 2003) has
four columns: the words, parts of speech, syntactic groups, and named entities.
Fig. 14.13 Annotation examples of named entities using the CoNLL 2002 (left) and CoNLL 2003
(right) IOB schemes. CoNLL 2002 has two columns and uses BIO (IOB2). CoNLL 2003 has four
columns and uses IOB for the groups and named entities. After data provided by Tjong Kim Sang
(2002) and Tjong Kim Sang and De Meulder (2003)
Both the syntactic groups and named entities use the original IOB scheme.
Figure 14.13, right part, shows the annotation of the sentence:
[ORG U.N. ORG] official [PER Ekeus PER] heads for [LOC Baghdad LOC].
Fig. 14.14 LSTM training curves for named entity recognition. Loss and accuracy over the epochs
We train the model on the training set, monitor the accuracy on the validation set, and evaluate the performance on the test set.
A simple architecture consists of an embedding layer, some dropout, a recurrent
layer, and a linear layer to make the decision. As parameters, we used frozen GloVe
100 embeddings, two bidirectional LSTM layers, a dropout of 20%, a hidden size
of 128, a batch size of 32, and nadam as optimizer with a learning rate of 0.001. We
measure the performance with the CoNLL 2003 script that computes the F1 score
for all the types of named entities and returns the macro-average.
Figure 14.14 shows the loss and accuracy over the epochs. We can see that the accuracy and loss reach an optimal value at epoch 6 or 7. We run the model again with 7 epochs, and we obtain a CoNLL score of 85.30% on the test set.
A named entity tagger produces a sequence of tags, where a given tag obviously depends on the previous ones and sets constraints on the subsequent ones: For instance, an End tag, in the BIOES annotation, can only be preceded by an Inside or a Begin, and, conversely, an Inside tag can only be followed by an Inside or an End.
In Sect. 10.8, we modeled the word transition probabilities with Bayes' rule. In this section, we will estimate the tag probabilities using logistic regression instead. Such a technique is called conditional random fields (CRF) (Lafferty et al. 2001). Conditional random fields come with many variants. Here we will consider a simple form: the linear chain.
Denoting y the output, here a sequence of tags, and x, a sequence of inputs, consisting for instance of the words and the characters, we will try to maximize P(y|x), i.e., ŷ = arg max_y P(y|x).
Conditional random fields define this probability as:

P(y|x) = P(y, x) / ∑_{y' ∈ Y} P(y', x),

where y' denotes a sequence, Y, the set of all the possible sequences, and ∑_{y' ∈ Y} P(y', x) is a normalizing factor so that the probabilities sum to one (Sutton & McCallum 2011, p. 288).
To represent the features, conditional random fields use weighted indicator
functions, also called feature functions. An indicator function is similar to the
dummy variables or one-hot encoding we saw in Sect. 6.3. The function has either
the value one, when a certain condition is met, or zero otherwise. For a sequence,
it can be, for instance, a transition (y_{j−1}, y_j) at a certain position index j, such as (Begin, Inside) at position 5 for a given sentence x.
If we limit ourselves to the possible transitions of y with the BIOES tagset (Sect. 14.10.3) and if Y is the complete tagset, we will have |Y|² indicator functions. Table 14.8 shows the set of functions for a NER tagset without categories. There are five tags, and hence 25 indicator functions. For each pair of tags, Table 14.8 shows the function that will be equal to 1 for this transition. For instance, we have f_9(y_{j−1}, y_j) = 1 when we have the transition I → E, and 0 for all the other transitions.
For one position j, the CRF probability is defined as the exponential of:

∑_{i=1}^{|Y|²} w_i f_i(y_{j−1}, y_j),

and, for the whole sequence of length N, as the product of all the probabilities:

∏_{j=1}^{N} exp ∑_{i=1}^{|Y|²} w_i f_i(y_{j−1}, y_j).

Normalizing this product over all the possible sequences, we obtain:

P(y|x) = ∏_{j=1}^{N} exp ∑_{i=1}^{|Y|²} w_i f_i(y_{j−1}, y_j) / ∑_{y' ∈ Y} ∏_{j=1}^{N} exp ∑_{i=1}^{|Y|²} w_i f_i(y'_{j−1}, y'_j).
In our previous presentation, we omitted the x input. Of course, it plays a role in the prediction, and that is why we need to include it in the probability. To take it into account, we rewrite the CRF probability for position j as:

exp ∑_{i=1}^{K} w_i f_i(y_{j−1}, y_j, x_j),

where x_j is the input at index j and K is the number of indicator functions. x_j can simply be the word at index j or a window of words around it. An indicator function can be the triple (y_{j−1}, y_j, x_j), for instance the tags Outside and Begin and the word the, or, more simply, the pairs (y_{j−1}, y_j) and (y_j, x_j).
For the whole sequence, we have:

∏_{j=1}^{N} exp ∑_{i=1}^{K} w_i f_i(y_{j−1}, y_j, x_j),

and the best tag sequence is:

arg max_y ∑_{j=1}^{N} ∑_{i=1}^{K} w_i f_i(y_{j−1}, y_j, x_j),

where we set aside the denominator, as it is the same for all the sequences, and the exponential.
However, we do not have access to y in advance. We could use a brute-force method: Generate all the possible sequences and keep the one with the highest probability, and hence the best label sequence. This would not scale, of course, and we need to resort to a Viterbi search instead, as in Sect. 13.4.3.
Given an input sequence

x = (x_1, x_2, ..., x_N),

Lample et al. (2016) defined the score of the predicted tag sequence

y = (y_1, y_2, ..., y_N)

as:

Score(x, y) = ∑_{i=0}^{N} a_{y_i, y_{i+1}} + ∑_{i=1}^{N} P_{i, y_i},

where a_{i,j} is the transition score from tag i to tag j and P_{i, y_i} is the logit of tag y_i at index i in the sequence obtained from the LSTM. This latter term is called the emission score.
The conditional probability is:

P(y|x) = exp Score(x, y) / ∑_{y' ∈ Y} exp Score(x, y'),

or, using logarithms:

log P(y|x) = Score(x, y) − log ∑_{y' ∈ Y} exp Score(x, y').
To fit a CRF model, we need an annotated corpus from which we extract the X and Y matrices. As for logistic regression, we maximize the log-likelihood of the observed sequences (or minimize the negative log-likelihood). We find the optimal CRF and LSTM weights with gradient descent.
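As an illustration (a hypothetical sketch, not the book's code), the score of one tag sequence can be computed directly from a transition matrix and the emission logits, ignoring the boundary transitions for simplicity:

import torch

def sequence_score(emissions, transitions, tags):
    # emissions: tensor of shape (N, num_tags), the logits from the LSTM
    # transitions: tensor of shape (num_tags, num_tags) with the a_{i,j} scores
    # tags: list of N tag indices (y_1, ..., y_N)
    idx = torch.arange(len(tags))
    score = emissions[idx, torch.tensor(tags)].sum()
    for prev, curr in zip(tags[:-1], tags[1:]):
        score = score + transitions[prev, curr]
    return score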
PyTorch has no built-in module for CRF as of today. However, the PyTorch
documentation provides a good tutorial on how to implement one for named entity
recognition3 (Guthrie 2023). The example program uses a notation that is close to
that in Lample et al. (2016).
Adding a CRF layer to the LSTM, Lample et al. (2016) report a score of 90.20
on the English part of the CoNLL 2003 dataset and even 90.94 when including the
characters in the input.
14.13 Tokenization
3 https://pytorch.org/tutorials/beginner/nlp/advanced_tutorial.html.
Shao et al. (2018) modeled tokenization as a character annotation task with the BIOES tagset of Sect. 14.10.3, where B is the beginning of a token, E, its end, I, the characters inside a token, and S, a single-character token. They also used the X tag to mark the delimiters, mostly spaces, as in the example abridged from their paper.
Shao et al. (2018) used the corpora from the Universal Dependencies to build
the training set. They created the input and output sequences from unprocessed
sentences and their tokenization in the CoNLL-U format (Sect. 12.3). The model
consists of:
1. An embedding layer, where the embedding represents the current character. For Asian languages, the embedding is the concatenation of the embeddings of the current character, of the bigram formed with the preceding character, and of the trigram formed with the two surrounding characters. We can relate this to what we saw in Sect. 11.10, here with recurrent networks;
2. Bidirectional gated recurrent units (GRU), a simpler variant of the LSTM
architecture, and
3. A CRF layer.
Once trained, the application of the model to a sequence of characters enables
us to extract the tokens from the tags: These are simply all the matches of the BI*E
and S patterns in the output string. For the example above, this corresponds to On,
considère, qu’, environ, 50 000, Allemands, du, Wartheland, ont, péri, and the period
. that ends the sentence. Note that a white space tokenizer would have wrongly
tokenized the multiword qu’environ as well as the number 50 000.
The B̄ and Ē tags are variants of B and E. In the sentence, they mean that the word
du should be further transduced into de and le as in Sect. 14.3 and Table 14.3. As
these transductions are overall unambiguous in French, the authors used a dictionary
to process them.
Shao et al. (2018) reported the best tokenization performance when their paper
was published. Their model can also be extended to sentence segmentation.
He multiplied the output probabilities from each tagging operation and searched the tag sequence that maximizes the product:

T̂ = arg max_{t_1, t_2, t_3, ..., t_n} ∏_{i=1}^{n} P(t_i | w_{i−2}, w_{i−1}, w_i, w_{i+1}, w_{i+2}, t_{i−2}, t_{i−1}).
4 https://www.cnts.ua.ac.be/conll99/npb/.
5 https://www.cnts.ua.ac.be/conll2000/chunking/.
6 https://www.cnts.ua.ac.be/conll2001/clauses/.
7 https://www.cnts.ua.ac.be/conll2002/ner/.
8 https://www.cnts.ua.ac.be/conll2003/ner/.
Chapter 15
Self-Attention and Transformers
After feedforward and recurrent networks in Chaps. 8, Neural Networks and 14,
Part-of-Speech and Sequence Annotation, transformers (Vaswani et al. 2017) are a
third form of networks (a kind of feedforward in fact). This architecture, based on
the concept of attention, features two twin processing pipelines called the encoder
and the decoder.
The encoder converts a sequence of input embeddings into a sequence of output embeddings of identical length with the same dimensionality. The input embeddings usually represent words, subwords, or characters and, adding a classifier to the output, the encoder produces a sequence of tags of equal length. This application corresponds to a sequence annotation as in Chap. 14, Part-of-Speech and Sequence Annotation and Fig. 14.3.
More precisely, an encoder consists of a stack of N identical layers, where each
layer is a pipeline of two main components shown in Fig. 15.1:
1. An attention mechanism that transforms the X input vectors into context-
dependent vectors; This attention is applied in parallel and the results, called
heads, are concatenated yielding a multihead attention, see the pink block in
Fig. 15.1;
2. The multihead attention output is then passed to a feed-forward network to
produce the final output, Y , see the green block in Fig. 15.1. The X and Y
matrices have identical size.
In this chapter, we will start with the description of attention and the encoder part
of transformers and we will outline their implementation with programs in Python.
Fig. 15.1 The encoder layer stacked N times: a multihead attention block, MultiheadAttention(X), followed by a feedforward block, FFN(X) = max(X W_1 + b_1, 0) W_2 + b_2. The X and Y matrices have identical size. After Vaswani et al. (2017)
PyTorch also has built-in modules for both attention and encoder layers that are ready to use. We will describe them as they make programming much easier.
We can use both the encoder and the decoder as standalone components. In the
next chapters, we will describe how to train the encoder transformer and then we
will see the complete encoder-decoder architecture.
15.2 Self-Attention
Once trained, the values of the word embeddings we have seen in Chap. 11, Dense
Vector Representations, either Word2Vec or GloVe, are immutable whatever the
words that surround them. In addition, these embeddings ignore the possible senses
of a word. For instance, note, while having at least two senses, as in musical note
and taking note, has only one single embedding vector. Such embeddings are said to
be static or noncontextual. While static embeddings improve the representation of
words over one-hot encodings, their insensitivity to context also entails an obvious
limitation.
and compare the embeddings of ship with those of the same word in the sentence:
We process and ship your order [in the most cost-efficient way possible]
We create the lists of lowercased words of the two sentences:
words_o = sentence_odyssey.lower().split()
words_a = sentence_amazon.lower().split()
and a PyTorch tensor, X, from the embedding dictionary for the Odyssey quote:
def embedding_matrix(words, embeddings_dict):
X = [embeddings_dict[word] for word in words]
X = torch.stack(X)
return X
X = embedding_matrix(words_o, embeddings_dict)
1 https://www.amazon.com/gp/help/customer/display.html?nodeId=GV8H5D3MMAR7JBLF.
2 https://nlp.stanford.edu/projects/glove/.
We have seen in Sect. 11.6 that we can evaluate the semantic similarity between two words, w_i and w_j, with the cosine of their embeddings, emb(w_i) and emb(w_j), defined as:

cos(emb(w_i), emb(w_j)) = emb(w_i) · emb(w_j) / (||emb(w_i)|| · ||emb(w_j)||).
With self-attention, we will compute a similarity between all the pairs of words
in a sentence or a sequence. Using this similarity, we will be able to create new
embeddings modeling their mutual influence.
Practically, we compute the cosines of the pairs with the program in Sect. 5.6.3.
Figure 15.2 shows these cosines for all the pairs of embeddings for Homer’s quote.
For a given word, the cosines of the surrounding words reflect how much meaning
they share with it. For instance, we have a score 0.78 for the pair ship and crew, two
arguably related words.
We can use the cosines as weights and define contextual embeddings as the
weighted sums of the initial embeddings. Using the values in Fig. 15.2, yellow row,
the contextual embedding of ship in this sentence is then:
Fig. 15.2 Cartesian product of the cosines of the pairs of embeddings with ship and crew in bold.
This table uses GloVe 50d vectors trained on Wikipedia 2014 and Gigaword 5
We compute these weighted sums for all the words, and hence the new embeddings, with a simple matrix product. Denoting C the cosine matrix in Fig. 15.2 and X the matrix of the initial embeddings arranged in rows, the contextual embeddings are the rows of CX. For ship, we obtain a new 50-dimensional vector, ctx_emb(ship).
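A minimal sketch of this computation with PyTorch (an illustration, not the book's code), assuming X holds the GloVe vectors of the sentence words as built above:

import torch.nn.functional as F

# Cosine similarities between all the pairs of rows of X
C = F.cosine_similarity(X.unsqueeze(1), X.unsqueeze(0), dim=-1)
# Contextual embeddings: cosine-weighted sums of the initial embeddings
ctx_embeddings = C @ X
ctx_emb_ship = ctx_embeddings[words_o.index('ship')]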
Comparing the sentence from the Odyssey with We process and ship your order
in Fig. 15.3, we see that, this time, the ship embedding receives 52% from the order
embedding reflecting thus a completely different context.
The idea of self-attention proposed by Vaswani et al. (2017) is very similar to that of the cosines; it simply replaces them with a scaled dot-product. As in the previous section, we denote X the embeddings of the sentence words and λ the scaling factor. The matrix of scaled dot-products for all the pairs is the product:

λ X X^T.

This matrix is equivalent to the weights in Fig. 15.2 and has the same size. As scaling factor, Vaswani et al. (2017) use 1/√d_model, where d_model is the dimensionality of the input embeddings. As we use GloVe 50d in our examples, we have d_model = 50. We normalize these weights with the softmax function:

softmax(X X^T / √d_model),
Fig. 15.4 Self-attention weights. This table uses GloVe 50d vectors trained on Wikipedia 2014
and Gigaword 5
softmax(x_1, x_2, ..., x_j, ..., x_k) = (e^{x_1} / ∑_{i=1}^{k} e^{x_i}, e^{x_2} / ∑_{i=1}^{k} e^{x_i}, ..., e^{x_j} / ∑_{i=1}^{k} e^{x_i}, ..., e^{x_k} / ∑_{i=1}^{k} e^{x_i}),
where, for instance, the word ship keeps 55% of its initial value and gets 13% from crew.
As with the cosines, we compute the contextual embeddings of all the words in the sentence with the product of the weights by X:

softmax(X X^T / √d_model) X.
The contextual embeddings result from the product of the weights by the input X.
To keep track of their different contributions, self-attention does not use the original embeddings directly. It first multiplies X by three trainable matrices, W^Q, W^K, and W^V, of size d_model × d_k. For a row x_i of X, this yields three vectors:

q_i = x_i W^Q,
k_i = x_i W^K,
v_i = x_i W^V,
where x_i is of size 1 × d_model and q_i, k_i, and v_i are of size 1 × d_k. In the original implementation, Vaswani et al. (2017) used the value d_model = 512. For one single such attention module, they proposed d_k = d_model.
Using the complete input sequence (all the tokens), X, we have:

Q = XW^Q,
K = XW^K,
V = XW^V,

where Q is called the query, K, the key, and V, the value.
Following Vaswani et al. (2017), the final equation of self-attention is:

Attention(Q, K, V) = softmax(QK⊤/√d_k) V,

where d_k is the number of columns of the matrices and √d_k, the scaling factor.
Figure 15.5 shows a summary of self-attention.
We will now write the code to compute a self-attention with PyTorch. We first import
the modules:
import math
import torch
import torch.nn.functional as F
Dot-product attention: Attention(Q, K, V) = softmax(QK⊤/√d_k) V
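The attention function itself is not reproduced here; a minimal sketch, consistent with the formula above and with the attn_output and attn_weights variables used below, could be:

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
    d_k = K.size(-1)
    scores = Q @ K.T / math.sqrt(d_k)
    attn_weights = F.softmax(scores, dim=-1)
    attn_output = attn_weights @ V
    return attn_output, attn_weights

# For plain self-attention on the GloVe embeddings, Q = K = V = X
attn_output, attn_weights = attention(X, X, X)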
As a result, we obtain the attention weights shown in Fig. 15.4 and new contextual embeddings, such as ctx_emb(ship) for ship.
The size of the contextual embeddings matrix is the sequence length by the dimensionality of the embeddings, n × d_model:
>>> attn_output.size()
torch.Size([11, 50])
here 11 words and 50-dimensional embeddings, and n × n for the weight matrix:
>>> attn_weights.size()
torch.Size([11, 11])
Once we have seen the effect of attention on a simple example, we can add the W^Q, W^K, and W^V matrices to the computation. We encapsulate them in a class:
class Attention(nn.Module):
def __init__(self, d_model, d_k):
super().__init__()
self.WQ = nn.Linear(d_model, d_k)
self.WK = nn.Linear(d_model, d_k)
self.WV = nn.Linear(d_model, d_k)
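The forward() method of this class is not shown here; a minimal sketch, reusing the attention() function above:

    def forward(self, X):
        # Project the input into queries, keys, and values,
        # then apply scaled dot-product attention
        Q = self.WQ(X)
        K = self.WK(X)
        V = self.WV(X)
        return attention(Q, K, V)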
We compute d_model from the size of the embeddings, here the values of the
dictionary. For this, we read the first key and its value:
d_model = embeddings_dict[
next(iter(embeddings_dict))].size(dim=-1)
As with the attention() function, this yields contextual embeddings and a weight
matrix. Both matrices have the same size as in the previous section, but their values
are different as the linear modules have a random initialization.
Vaswani et al. (2017) found they could capture more information when they
duplicated the attention in Fig. 15.5 into parallel modules. This is the idea of
multihead attention that concatenates the results of the single attention modules and
passes them to a linear function shown in Fig. 15.6.
To implement multihead attention with h heads, we create h triples of matrices (W_i^Q, W_i^K, W_i^V), with i ranging from 1 to h, where the matrices have the size d_model × d_k. For each head, we compute an attention value:

H_i = softmax(Q_i K_i⊤/√d_k) V_i.
Fig. 15.6 Multihead attention: h parallel heads H_1, H_2, ..., H_h, computed from (Q_1, K_1, V_1), ..., (Q_h, K_h, V_h), are concatenated and passed to a linear layer
We concatenate these heads and we apply the linear function in the form of a matrix W^O to reduce the output dimension to d_model:

Concat(H_1, H_2, ..., H_h) W^O.
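The book's multihead attention module is not reproduced here; a minimal sketch, assuming the Attention class above and returning only the projected concatenation of the heads:

class MultiheadAttention(nn.Module):
    def __init__(self, d_model, nhead):
        super().__init__()
        d_k = d_model // nhead
        # One attention module per head and the output projection W^O
        self.heads = nn.ModuleList(
            [Attention(d_model, d_k) for _ in range(nhead)])
        self.WO = nn.Linear(nhead * d_k, d_model)

    def forward(self, X):
        # Concatenate the head outputs and project them back to d_model
        head_outputs = [head(X)[0] for head in self.heads]
        return self.WO(torch.cat(head_outputs, dim=-1))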
The output of the multihead attention is then added to its input through a residual connection:

X + MultiheadAttention(X).
This result is then passed to a layer normalization (Ba et al. 2016), where for each
token at index i in the sequence, its embedding vector, x_i = (x_{i,1}, x_{i,2}, ..., x_{i,d_model}), is normalized individually. This function is defined as a vector standardization followed by a Hadamard product with a gain and the addition of a bias. We have:

LayerNorm(x_i)_j = g_j · (x_{i,j} − x̄_i)/σ_{x_i} + b_j,

where j is a coordinate index, x̄_i and σ_{x_i}, the mean and standard deviation of the embedding vector at index i. g and b are learnable parameter vectors of d_model
dimensionality, initialized to 1 and 0, respectively.
Fig. 15.7 Residual connection with layer normalization. After Vaswani et al. (2017)
X = LayerNorm(X + MultiheadAttention(X)).
We implement this residual network with the nn.LayerNorm() class from Py-
Torch. nn.LayerNorm() applies the normalization to the samples of a mini-batch.
Here we use it for just one sequence.
class LayerNormAttention(nn.Module):
def __init__(self, d_model, nhead):
super().__init__()
self.multihead_attn = MultiheadAttention(d_model,
nhead)
self.layer_norm = nn.LayerNorm(d_model)
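A sketch of the corresponding forward() method, which is not shown above, applying the residual connection and the normalization:

    def forward(self, X):
        # Residual connection followed by layer normalization:
        # LayerNorm(X + MultiheadAttention(X))
        return self.layer_norm(X + self.multihead_attn(X))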
The last part of our encoder layer consists of a feedforward network with two linear modules and a ReLU function, max(x, 0), in between:

FFN(x) = max(0, xW_1 + b_1)W_2 + b_2.

The size of W_1 is d_model × d_ff and that of W_2 is d_ff × d_model.
The encoder in Fig. 15.1 consists of a stack of N identical layers. Each layer is
composed of a multihead attention and a feedforward network. The implementation
of a layer is straightforward from its description. We just add the feedforward
network in the previous class:
class TransformerEncoderLayer(nn.Module):
def __init__(self, d_model, nhead, d_ff=2048):
super().__init__()
self.multihead_attn = MultiheadAttention(d_model,
nhead)
self.layer_norm_1 = nn.LayerNorm(d_model)
self.W1 = nn.Linear(d_model, d_ff)
self.relu = nn.ReLU()
self.W2 = nn.Linear(d_ff, d_model)
self.layer_norm_2 = nn.LayerNorm(d_model)
    def forward(self, X):
        # The first lines of this method are not shown in the text;
        # a reconstruction: attention sublayer with its residual
        # connection and layer normalization
        attn_output = self.multihead_attn(X)
        Xprime = self.layer_norm_1(X + attn_output)
        # Feedforward sublayer with the second residual connection
        Y = self.W2(self.relu(self.W1(Xprime)))
        Y = self.layer_norm_2(Xprime + Y)
        return Y
Using this class, we create an encoder layer for our GloVe embeddings with 5 heads:

>>> nhead = 5
>>> encoder_layer = TransformerEncoderLayer(d_model, nhead)
The complete encoder in Figs. 15.1 and 15.8 consists of a stack of N encoder
layers. We have now programmed all the modules we need to build it. Let us call it
TransformerEncoder.
The TransformerEncoderLayer class has one layer. We first create a layer object,
encoder_layer, and we pass it as input to the TransformerEncoder class. In this class,
we create a list of N layers by cloning the encoder layer with a deep copy and
we register the parameters with a ModuleList as with the multihead attention in
Sect. 15.3. In the original paper, Vaswani et al. (2017) proposed .N = 6.
class TransformerEncoder(nn.Module):
def __init__(self,
encoder_layer,
num_layers):
super().__init__()
self.encoder_stack = nn.ModuleList(
[copy.deepcopy(encoder_layer)
for _ in range(num_layers)])
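The class requires import copy for the deep copies. A sketch of its forward() method, not shown above, passing the input through the layers in sequence:

    def forward(self, X):
        # Apply the N encoder layers one after the other
        for layer in self.encoder_stack:
            X = layer(X)
        return X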
Fig. 15.8 The encoder: a stack of encoder layers on top of the input embedding and positional encoding of the input X
Before we can apply the encoder, we must convert the input words to trainable
embedding vectors. We first create index-to-word and word-to-index dictionaries
from the vocabulary. We assign the unknown token to index 1 and the padding
symbol to 0 as in Sect. 14.8:
idx2word = dict(enumerate(vocabulary, start=2))
word2idx = {word: idx for idx, word in idx2word.items()}
We then create a lookup table, input_embeddings, as in Sect. 14.8 that we fill with
GloVe50 as initial values. We initialize the input embeddings tensor with random
values:
input_embeddings = torch.rand(
(len(word2idx) + 2, d_model))/10 - 0.05
and we replace the random row vectors with those of GloVe when they exist.
for word in embeddings_dict:
input_embeddings[word2idx[word]] = embeddings_dict[word]
In Sect. 9.5.3, we presented bag-of-words techniques, where the word order did not
matter in the vectorization of a document or a sentence. So far, our encoder has the
same property and weakness. It is obvious that the order of the words in a sentence
conveys a part of its meaning and hence that bags of words miss a part of this
semantics.
Vaswani et al. (2017) proposed two techniques to add information on the word
positions. Both consist of vectors of dimension .dmodel that are summed with the
input embeddings:
1. The first one consists of trainable position embeddings, i.e. index i is associated
with a vector of dimension .dmodel that is summed with the embedding of the input
word at index i;
2. The other consists of fixed vectors encoding the word positions. For a word at index i, the vector coordinates are defined by two functions:

   PE(i, 2j) = sin(i / 10000^(2j/d_model)),
   PE(i, 2j + 1) = cos(i / 10000^(2j/d_model)).
Let us program this second function for an input sequence of length max_len. The
output data structure will be a matrix of max_len rows and d_model columns, where a row at index i contains the embedding vector at this position. We compute the angle of the sine and cosine by breaking it down into a dividend i, representing the word index, and a divisor, 10000^(2j/d_model), representing a coordinate in the embeddings. We then compute the sine and cosine of these angles alternately using the fancy
indexing properties of PyTorch:
def pos_encoding(max_len, d_model):
dividend = torch.arange(max_len).unsqueeze(0).T
divisor = torch.pow(10000.0,
torch.arange(0, d_model, 2)/d_model)
angles = dividend / divisor
pe = torch.zeros((max_len, d_model))
pe[:, 0::2] = torch.sin(angles)
pe[:, 1::2] = torch.cos(angles)
return pe
We create the positional encoding vectors with a maximum sentence length with:
max_len = 100
pos_embeddings = pos_encoding(max_len, d_model)
We can now create an Embedding class to convert the indices of the input words
to embeddings and add the positional encoding. In the __init__() method, we
create the embedding tables. We set the positional encoding as untrainable. In the
forward() method, we just add the input and positional embeddings. Following Vaswani et al. (2017), we multiply the input embeddings by √d_model and we apply a dropout to the sum of embeddings.
class Embedding(nn.Module):
def __init__(self,
vocab_size,
d_model,
dropout=0.1,
max_len=500):
super().__init__()
self.d_model = d_model
self.input_embedding = nn.Embedding(vocab_size, d_model)
pe = self.pos_encoding(max_len, d_model)
self.pos_embedding = nn.Embedding.from_pretrained(
pe, freeze=True)
self.dropout = nn.Dropout(dropout)
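The forward() method is not reproduced here; a minimal sketch following this description, as well as a plausible creation of the embedding object used below:

    def forward(self, x):
        # Positions 0, 1, 2, ... of the tokens in the sequence
        positions = torch.arange(x.size(-1))
        X = self.input_embedding(x) * math.sqrt(self.d_model)
        X = X + self.pos_embedding(positions)
        return self.dropout(X)

embedding = Embedding(len(word2idx) + 2, d_model)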
The input embeddings of this object are randomly initialized. We assign them
pretrained values with:
>>> embedding.input_embedding.weight = nn.Parameter(input_embeddings)
We then generate the input indices from the list of words from Odyssey’s quote:
>>> x = torch.LongTensor(
list(map(lambda x: word2idx.get(x, 1), words_o)))
As the Embedding class contains a dropout, it will drop samples in the training
mode. In the evaluation mode, it will keep all the samples. This is what we want
here and we apply it to the indices:
>>> embedding.eval()
>>> X = embedding(x)
In this chapter so far, we have described all the elements we need to build
the encoder part of a transformer. This enabled us to understand its structure
more deeply. To simplify their implementation, we have created functions that
apply to one sample, i.e. one sentence. In a real application, among the possible
improvements to our code, we would need to process minibatches stored in 3rd
order tensors (sometimes called 3D matrices), where the first axis corresponds to
the samples, the second to the word indices, and the third one to the embeddings.
To see how we would proceed with a minibatch, let us integrate our two sentences
from the Odyssey and Amazon.com in the X matrix. We first create a list of indices
with:
sentences = [words_o] + [words_a]
sent_idx = []
for sent in sentences:
sent_idx += [torch.LongTensor(
list(map(lambda x: word2idx.get(x, 1),
sent)))]
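The padding and embedding of these index lists is not shown above; a sketch that produces the padded index matrix, called X_idx as in the next section, and the X_batch tensor:

X_idx = torch.nn.utils.rnn.pad_sequence(sent_idx,
                                        batch_first=True)
with torch.no_grad():
    X_batch = embedding(X_idx)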
and we check that the embedding tensor consists of two samples of 11 aligned word
indices, where each word is represented by a 50-dimensional embedding:
>>> X_batch.size()
torch.Size([2, 11, 50])
However, we cannot apply our encoder stack, encoder(), to this new .Xbatch tensor
as we computed the self-attention with a matrix product. The operation:
X_batch @ X_batch.T
throws an error. Instead, we must use a batched matrix multiplication, bmm(a, b),
that computes the products of the pairs of matrices in a and b. We also transpose the
index and embedding axes of the second term:
torch.bmm(X_batch,
torch.transpose(X_batch, 1, 2))
The respective matrix products are stored in a batch that we can normalize with
a softmax function and further multiply with .Xbatch . We would then pass on the
results to the rest of the stack. Nonetheless, the complete adaptation of the encoder
to batches is left as an exercise.
Fortunately for us, PyTorch has a set of optimized modules to implement trans-
formers that can handle minibatches. We describe them now. Overall, they use the
same names as those in the previous sections and, for an elementary input, the same
parameters. We just add nn. to refer to a PyTorch module. Of course, these modules
have many more options.
In a minibatch, the index matrices are aligned to the longest sentence and thus
include padding tokens for the others. We have to remove these tokens from the
attention mechanism. We carry this out with a masking matrix, where we set the
elements to true when we have a padding symbol. For our two sentences, we have:
>>> padding_mask = (X_idx == 0)
>>> padding_mask
tensor([[False, False, False, False, False, False, False,
False, False, False, False],
[False, False, False, False, False, False, True,
True, True, True, True]])
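The creation of the PyTorch multihead attention module is not reproduced above; it would presumably be, with nhead = 5:

multihead_attn = nn.MultiheadAttention(d_model,
                                       nhead,
                                       batch_first=True)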
here with five heads and for the GloVe 50-dimensional embeddings. d_k has a value that follows the rule in Sect. 15.3: d_model // nhead. We use batches with the conventional order, where the rows correspond to the input words, and we need to set batch_first to true.
We apply the attention to the input vectors with:
attn_output, attn_weights = multihead_attn(Q, K, V)
attn_output contains the output embeddings and attn_weights are averaged across the heads by default. The W^Q, W^K, W^V, and W^O matrices are randomly initialized.
We can inspect them with the state_dict() method:
>>> multihead_attn.state_dict()
OrderedDict([(’in_proj_weight’,
tensor([[-0.0294, -0.1154, ...
(’in_proj_bias’,
tensor([0., 0., 0., ...
(’out_proj.weight’,
tensor([[ 0.1335, 0.1068, -0.0705, ....
(’out_proj.bias’,
tensor([0., 0., 0., ...)
PyTorch has an optimized class to create a fully functional encoder layer that incor-
porates the multihead attention from the previous section,
nn.TransformerEncoderLayer:
encoder_layer = nn.TransformerEncoderLayer(d_model,
nhead,
batch_first=True)
With these classes, creating the core of an encoder only takes two lines of Python
and is thus extremely easy.
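The second line, the stacking of the layers, is not reproduced above; it would presumably be, with N = 6 layers as in Vaswani et al. (2017):

encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)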
We obtain the output vectors for one layer with:
enc_layer_output = encoder_layer(
X_batch, src_key_padding_mask=padding_mask)
The input and output sequences of an encoder transformer have equal length. A
first application is then to use it as an annotator for POS tagging or named entity
recognition as in Chap. 14, Part-of-Speech and Sequence Annotation. Given an
embedding layer and an encoder stack, the output consists of vectors of .dmodel
dimensionality. We just need to add a linear layer to carry out a classification into
tags.
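The book's tagging model is not reproduced here; a minimal sketch of such a class, combining an embedding layer, an encoder stack, and a final linear layer (the class and parameter names are assumptions):

class EncoderTagger(nn.Module):
    def __init__(self, vocab_size, d_model, nhead, num_layers,
                 nbr_classes):
        super().__init__()
        self.embedding = Embedding(vocab_size, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer,
                                             num_layers)
        # Linear layer mapping each output vector to the tagset
        self.fc = nn.Linear(d_model, nbr_classes)

    def forward(self, x, padding_mask=None):
        X = self.embedding(x)
        X = self.encoder(X, src_key_padding_mask=padding_mask)
        return self.fc(X)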
The rest of the program, the dataset preparation and the training loop, is identical to that in Sect. 14.8. Training the encoder and evaluating it on the CoNLL 2003 data, we will not reach the scores we obtained with LSTMs, however. We will see how we can improve them with pretraining in the next chapter.
Fig. 15.9 Text classification with an encoder: the output h_0 of the [CLS] token is passed to a linear layer to predict the class of the input x_1, x_2, ..., x_n
We can also apply our encoder to text classification. As we have here just one output,
the class of the text, we will prefix the input sequence with a specific start token
[CLS], see Fig. 15.9. We will then use the encoded vector of it, .h0 on the figure,
as input to a linear layer. The multihead attention that collects influences from all the tokens will enable it to integrate the semantics of the sentence. Using a softmax
function, we will then output the class. This technique is similar to that of the BERT
system (Devlin et al. 2019) that we will describe in the next chapter.
To implement this idea, we just need to modify the last line of the previous class,

return self.fc(X)

so that it extracts the first output vector of each sentence in the batch and passes it to the linear layer:

return self.fc(X[:, 0, :])
So far, outside of the embeddings, where we initially used GloVe, the encoder
stack is filled with random parameters that we trained on CoNLL 2003, a relatively
small annotated dataset. Devlin et al. (2019) proposed a method to pretrain models
on very large raw corpora that found applications in the whole spectrum of
language processing applications: The bidirectional encoder representations from
transformers (BERT).
To fit the parameters, BERT proceeds in two steps:
1. The first step essentially boils down to training a language model on a large
dataset. The language model predicts masked words inside a sentence and if two
sentences follow each other. This step uses a raw corpus, needing no annotation.
It enables the model to learn associations between the words. This step is called
the pretraining;
2. The second step adapts the model to a specific application, for instance text
classification or named entity recognition. It requires an annotated dataset,
usually much smaller, where it will adjust further and more finely the parameters.
This step is called the fine-tuning.
The two main pillars of the first step are the corpus gathering and preparation
and the creation of a language model. We describe them now and, to exemplify the procedures, we will use the first two abridged sentences of the Odyssey:1
Tell me, O Muse, of that ingenious hero
Many cities did he visit
In Chap. 10, Word Sequences, we designed a language model that, given a sequence
of words, predicts the word to follow. The BERT language model is a bit different. It
draws its inspiration from cloze tests (Taylor 1953) that we already saw in Sect. 11.9.
The pretraining step consists of two simultaneous classification tasks, where each
input sample is a pair of sentences (in fact word sequences):
1. The masked language model (MLM), for which we replace some tokens of the
input sample with a specific mask token [MASK]. We train the model to predict
the value of the masked words as in this example restricted to one sentence:
Input: Tell me , O [MASK] , of that ingenious [MASK]
Predictions: Muse hero
2. The next sentence prediction (NSP), where we create pairs of sentences that may
or may not follow each other. Each pair corresponds to an input sample and we
train the model to predict if the second sentence is the successor of the first one or
not. Below are two examples with the two classes, where the possibly masked words are shown unmasked:
Sentence pair:
(1) Tell me , O Muse , of that ingenious hero
(2) Many cities did he visit
Is next: True
Sentence pair:
(1) Tell me , O Muse , of that ingenious hero
(2) Exiled from home am I ;
Is next: False
1 The complete quote is: Tell me, O Muse, of that ingenious hero who travelled far and wide after
he had sacked the famous town of Troy. Many cities did he visit, and many were the nations with
whose manners and customs he was acquainted;
Devlin et al. (2019) created pairs of sentences with masked words to match
precisely these two tasks. As pretraining corpus, BERT used the English Wikipedia,
excluding the tables and figures, and BooksCorpus, totaling about 3.3 billion words.
They formatted the corpus with one sentence per line and an empty line between
the documents. They tokenized the text with WordPiece, see Chap. 13, Subword
Segmentation, limiting the vocabulary to 30,000 tokens.
We will now write a small program to materialize these ideas. The complete implementation is available from the Google Research GitHub repository.2
Let us assume we have read the corpus, split it into sentences, and formatted the dataset as a list of sentences:
sentences = [
’Tell me, O Muse, of that ingenious hero’,
’Many cities did he visit’]
For each pair of sentences denoted A and B, we build two input sequences: the list
of tokens and the list of segment identifiers.
1. We concatenate the list of tokens of the two segments A and B. We insert two
special tokens to mark their boundaries: [CLS] at the start of the first segment and
[SEP] at the end of both segments. Considering our two sentences and setting
aside the subword tokenization, this yields the markup:
[CLS] Tell me , O Muse , of that ingenious hero [SEP] Many cities did he visit [SEP]
2. In the segment list, we replace each word in segment A with 0 and in segment B
with 1.
Below is the code to implement this, where we store the pairs in a list of
dictionaries. A dictionary contains three keys so far: tokens, the list of tokens,
segment_ids the list of segment markers, and is_next, a Boolean telling if the
two segments follow each other:
def create_sample(tokens_a, tokens_b, next=True):
tokens = [’[CLS]’] + tokens_a + [’[SEP]’]
segment_ids = len(tokens) * [0]
tokens.extend(tokens_b + [’[SEP]’])
segment_ids += (len(tokens_b) + 1) * [1]
sample = {’tokens’: tokens,
’segment_ids’: segment_ids,
’is_next’: True}
if not next:
sample[’is_next’] = False
return sample
We then tokenize the sentences and build the dataset of samples:

import re

tokenized_sents = []
for sent in sentences:
    # pat is the word tokenization regex defined earlier in the book
    tokenized_sents += [re.findall(pat, sent)]
dataset = []
for i in range(len(tokenized_sents) - 1):
sample = create_sample(tokenized_sents[i],
tokenized_sents[i + 1])
dataset += [sample]
BERT’s pretraining input consists of pairs of segments. In half of these pairs, the
sentences follow each other, as in our example. In the other half, the second sentence
is randomly sampled from another document. In BERT's original implementation, the length of the input is exactly 128 or 512 subword tokens. To create the pairs,
the complete algorithm is:
1. Append consecutive sentences of the corpus until we reach this length or surpass
it;
2. Split randomly the result into two segments, A and B, so that each segment
consists of a nonempty sequence of sentences;
3. In half of the pairs, the second segment should not be the successor of the first
one. For these samples, replace the second segment with sentences from another
document;
4. Trim excess tokens at the front and end of the longest segment to match the length
limit.
This algorithm is more complex than our examples in this chapter. We leave its
implementation to the reader.
Once we have a dataset with consecutive and nonconsecutive segments, we can
train a model. The next sentence prediction task is a binary classification with
the is_next Boolean set to True or False. It is designed to learn relationships
between sentences. For this, we add a linear layer on top of the encoder and we use
the encoded value of the ’[CLS]’ token to predict is_next as in Sect. 15.14 and
Fig. 15.9.
This MLM task is a cloze test, where Devlin et al. (2019) randomly selected 15% of
the words for masking with at least one masked word per sample. It also applies to
the pairs of segments and for our two sentences, this would correspond to 3 words
as we have 15 tokens, for instance:
[CLS] Tell me , O [MASK] , of that ingenious [MASK] [SEP] Many [MASK] did he visit
[SEP]
The program below tokenizes the sentences and replaces 15% of the tokens with
’[MASK]’. We add two keys to our dictionaries: masked_pos, the positions of the
masked words, and masked_tokens, their value.
def mask_tokens(sample,
mask_prob=0.15,
max_predictions=20):
cand_idx = [
i for i in range(len(sample[’tokens’]))
if sample[’tokens’][i] not in [’[CLS]’, ’[SEP]’]]
mask_cnt = max(1, int(round(len(sample[’tokens’])
* mask_prob)))
mask_cnt = min(mask_cnt, max_predictions)
random.shuffle(cand_idx)
sample[’masked_pos’] = sorted(cand_idx[:mask_cnt])
sample[’masked_tokens’] = [
token for i, token in
enumerate(sample[’tokens’])
if i in sample[’masked_pos’]]
sample[’tokens’] = [
’[MASK]’ if i in sample[’masked_pos’] else token
for i, token in enumerate(sample[’tokens’])]
return sample
Applying this function to our sample, we obtain:

[{'tokens': ['[CLS]', 'Tell', 'me', ',', 'O', '[MASK]', ',', 'of',
             'that', 'ingenious', '[MASK]', '[SEP]', 'Many',
             '[MASK]', 'did', 'he', 'visit', '[SEP]'],
  'segment_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                  1, 1, 1, 1, 1, 1],
  'is_next': True,
  'masked_pos': [5, 10, 13],
  'masked_tokens': ['Muse', 'hero', 'cities']}]
With this procedure, the model could rely too heavily on the [MASK] token
for predictions in other applications. To remove this dependency, in the complete
implementation, out of the 15% selected tokens:
• 80% of the time, the program keeps the replacement as it is with [MASK];
• 10% of the time, it replaces the mask with random words;
• 10% of the time, it keeps the original word.
Once we have created the pairs of sentences, we map each token to the sum of three
trainable embedding vectors: the token embedding, the segment embedding, and the
positional embedding of the token in the sentence:
• Each token in the vocabulary, 30,000 WordPiece subwords in BERT base, is
associated with a specific embedding vector;
• For the segments, we have two embedding vectors representing either the first or
second sentence;
• Each position index in the sequence is associated with a learned embedding
vector. The combined length of both sentences is less than 512, hence there
are 512 positional embeddings. Note that this differs from the fixed positional encoding of the encoder in Sect. 15.9.
In the BERT base version, each embedding is a 768-dimensional vector.
Figure 16.1 shows the input with the two sentences simplified from the previous
section:
Tell me of that hero. Many cities did he visit.
where we stripped the punctuation. For each token, BERT looks up the three corresponding vectors, token, segment, and position, and sums them to get the input representation. The first two vectors in Fig. 16.1 are E_[CLS] + E_A + E_0 and E_Tell + E_A + E_1. The final input representation for the 13 words is a matrix of size 13 × 768,
one vector (row) per word.
To get the word embeddings, we use the PyTorch Embedding class. We first need to convert the tokens into indices, including indices for five special tokens: '[CLS]', '[UNK]', '[PAD]', '[SEP]', and '[MASK]'. We extract the vocabulary, vocabulary, the vocabulary size, vocab_size, and the word-to-index and index-to-word dictionaries. This is something we have done multiple times in this book and we skip this piece of code. We set the padding index to 0. To store the indices, we add two last keys to our dictionaries: token_ids and masked_ids.
Fig. 16.1 The BERT input consists of two segments bounded by the [CLS] and [SEP] symbols.
The input embeddings are the sum of three trainable embeddings: Token, segment, and position.
Each embedding is a .dmodel -dimensional vector, where .dmodel = 768 with BERT
Computing these indices from Homer’s work and applying the word-to-index
conversion, we have for our small dataset:
>>> dataset
[{’tokens’: [’[CLS]’, ’Tell’, ’me’, ’,’, ’O’, ’[MASK]’, ’,’,
’of’, ’that’, ’ingenious’, ’[MASK]’, ’[SEP]’, ’Many’,
’[MASK]’, ’did’, ’he’, ’visit’, ’[SEP]’],
’token_ids’: [1897, 1688, 6051, 7, 1209, 1898, 7, 6428,
8726, 5463, 1898, 1899, 1064, 1898, 3680, 5075, 9308, 1899],
...}]
Let us now write a class to represent these embeddings. The forward method
sums the vectors and, following Devlin et al. (2019), we add a dropout of 10% and
we apply a layer normalization.
class BERTEmbedding(nn.Module):
def __init__(self, vocab_size,
d_model=d_model,
maxlen=512,
n_segments=2,
dropout=0.1):
super().__init__()
self.d_model = d_model
self.tok_embedding = nn.Embedding(vocab_size, d_model,
padding_idx=0)
self.pos_embedding = nn.Embedding(maxlen, d_model)
self.seg_embedding = nn.Embedding(n_segments, d_model)
self.dropout = nn.Dropout(dropout)
self.layer_norm = nn.LayerNorm(d_model)
    def forward(self, tok_ids, segment_ids):
        # The first lines of this method are not shown in the text;
        # a reconstruction: sum of the three trainable embeddings
        positions = torch.arange(tok_ids.size(-1),
                                 device=tok_ids.device)
        embedding = self.tok_embedding(tok_ids) + \
            self.pos_embedding(positions) + \
            self.seg_embedding(segment_ids)
        return self.layer_norm(self.dropout(embedding))
def make_batch(dataset, start):
    # The beginning of this batching function is not shown in the text;
    # its name, signature, and these initializations are assumptions.
    # BATCH_SIZE is assumed to be defined earlier and the token_ids and
    # segment_ids values to be stored as LongTensors.
    tok_ids, seg_ids, masked_pos, masked_ids = [], [], [], []
    y_nsp = []
if len(dataset) < start + BATCH_SIZE:
end = len(dataset)
else:
end = start + BATCH_SIZE
for i in range(start, end):
tok_ids += [dataset[i][’token_ids’]]
masked_pos += [dataset[i][’masked_pos’]]
masked_ids += [dataset[i][’masked_ids’]]
seg_ids += [dataset[i][’segment_ids’]]
if dataset[i][’is_next’]:
y_nsp += [1]
else:
y_nsp += [0]
y_nsp = torch.LongTensor(y_nsp)
tok_ids = torch.nn.utils.rnn.pad_sequence(tok_ids,
batch_first=True)
seg_ids = torch.nn.utils.rnn.pad_sequence(seg_ids,
batch_first=True)
return tok_ids, seg_ids, y_nsp, masked_pos, masked_ids
In its base version, BERT’s architecture consists of 12 layers, each layer with 8
self-attention heads. The embedding dimension is 768 and must be divisible by the
number of heads. Using PyTorch classes from Sect. 15.12, we can easily implement
it in the __init__() method. The forward() method has two arguments: the
sequence of tokens and the sequence of segment identifiers that are needed to
compute the embeddings.
As the sequences are of unequal lengths, we have aligned them with padding
symbols in the batch. We create a mask of Boolean values, input_mask,
to indicate where the padding is. We tell the encoder to ignore it with the
src_key_padding_mask argument. This is necessary to remove the corresponding
symbols from the attention mechanism.
class BERT(nn.Module):
def __init__(self, vocab_size,
d_model=768,
maxlen=128,
n_segments=2,
nhead=8,
num_layers=12):
super().__init__()
self.embeddings = BERTEmbedding(vocab_size,
d_model,
maxlen,
n_segments)
self.encoder_layer = nn.TransformerEncoderLayer(
d_model,
batch_first=True,
nhead=nhead,
dim_feedforward=4 * d_model)
self.encoder = nn.TransformerEncoder(
self.encoder_layer,
num_layers=num_layers)
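A sketch of the forward() method following this description (its body is not reproduced in the text):

    def forward(self, tok_ids, seg_ids):
        # Mask the padding symbols (index 0) in the attention
        input_mask = (tok_ids == 0)
        X = self.embeddings(tok_ids, seg_ids)
        return self.encoder(X,
                            src_key_padding_mask=input_mask)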
Once we have built the BERT structure, we can train it with the masked language
model and the next sentence prediction. We add two linear modules on top of the
encoder, see Fig. 16.2:
1. One that will use the [CLS] output and predict if the second sentence is a
successor of the first one or not;
2. A second that will read the output of the masked tokens and predict the corresponding word. In the original implementation, the masked language model uses the weights of the token embedding matrix as linear layer and adds trainable biases.3

Fig. 16.2 BERT pretraining: The [CLS] position is used to predict if the second sentence follows the first one and the [MASK] inputs to predict the actual words, here with a perfect result
The encoder outputs vectors for all the tokens in the sequence. We only consider the masked words to compute the loss. We extract them from the encoder output with the extract_rows() method before we predict their values.
class BERTLM(nn.Module):
def __init__(self, vocab_size,
d_model=768,
maxlen=128,
n_segments=2,
nhead=8,
num_layers=12):
super().__init__()
self.bert = BERT(vocab_size, d_model=d_model,
maxlen=maxlen,
n_segments=n_segments,
nhead=nhead,
num_layers=num_layers)
self.masked_lm = nn.Linear(d_model, vocab_size)
        self.masked_lm.weight = \
            self.bert.embeddings.tok_embedding.weight
self.next_sentence = nn.Linear(d_model, 2)
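The forward() method of this class is not reproduced in the text; a minimal sketch, assuming masked_pos is a padded LongTensor of the masked positions:

    def forward(self, tok_ids, seg_ids, masked_pos):
        h = self.bert(tok_ids, seg_ids)
        # Next sentence prediction from the [CLS] output vector
        y_nsp = self.next_sentence(h[:, 0, :])
        # Gather the encoder outputs at the masked positions
        # (a stand-in for the book's extract_rows() method)
        idx = masked_pos.unsqueeze(-1).expand(-1, -1, h.size(-1))
        h_masked = torch.gather(h, 1, idx)
        y_mlm = self.masked_lm(h_masked)
        return y_nsp, y_mlm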
For both predictions, the loss is the cross entropy and, for a batch of input
sequences, we sum the mean of the next sentence prediction loss and the mean
of the masked language model loss. We compute this cross entropy loss with
nn.CrossEntropyLoss().
We create the loss and the optimizer:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(bert_lm.parameters(), lr=0.01)
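A sketch of one pretraining step with these two losses, assuming a batch produced as above with masked_pos and masked_ids padded into LongTensors (names and shapes are assumptions):

y_nsp_pred, y_mlm_pred = bert_lm(tok_ids, seg_ids, masked_pos)
loss_nsp = loss_fn(y_nsp_pred, y_nsp)
# Flatten the MLM predictions to (nbr. of masked tokens, vocab_size)
loss_mlm = loss_fn(y_mlm_pred.view(-1, y_mlm_pred.size(-1)),
                   masked_ids.view(-1))
loss = loss_nsp + loss_mlm
optimizer.zero_grad()
loss.backward()
optimizer.step()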
Devlin et al. (2019) applied the pretrained BERT encoder to a set of applications:
text classification, classification of pairs of sentences, sequence tagging, and
question answering. We first describe the classification of sentences and pairs of
sentences.
16.8.1 Classification
As datasets for classification, Devlin et al. (2019) used the General Language
Understanding Evaluation (GLUE) benchmark (Wang et al. 2018). GLUE has two
sentence classification tasks and six regarding pairs of sentences:
• The first task is a binary classification. It uses two corpora, SST-2 (Socher
et al. 2013) and CoLA (Warstadt et al. 2018), where the sentences in SST-2 are
annotated with a positive or negative sentiment and those in CoLA, whether they
are linguistically acceptable or not;
• The second task uses six corpora with pairs of sentences as input: MNLI
(Williams et al. 2018), QQP (Iyer et al. 2017), QNLI (Wang et al. 2018), STS-B
(Cer et al. 2017), MRPC (Dolan & Brockett 2005), and RTE (Bentivogli et al.
2009). MNLI labels each pair with either entailment, contradiction, or neutral;
QQP and MRPC whether two questions or sentences are equivalent; STS-B
whether two sentences are semantically equivalent with a score ranging from
1 to 5; and RTE whether the second sentence is an entailment of the first one.
Finally, in the QNLI corpus, each pair consists of a question and a sentence. The
pair is positive if the sentence contains the answer to the question and negative
otherwise.
In the first two tasks, we classify a sentence, (x_1, x_2, ..., x_n), and in the six others, a pair of sentences, (x_1, x_2, ..., x_n) and (x′_1, x′_2, ..., x′_p). Devlin et al. (2019) represented this input similarly to that in Sect. 16.2. They added a [CLS] token at the beginning of the sequence:

([CLS], x_1, x_2, ..., x_n)

and, for the pairs, they also inserted a [SEP] token in between:

([CLS], x_1, x_2, ..., x_n, [SEP], x′_1, x′_2, ..., x′_p).
The classifier consists of the pretrained encoder model that they supplemented with
an additional linear layer, W, on top of the encoder output of the [CLS] token, h_0 in Fig. 16.3. The vector of predicted probabilities is then given by:

ŷ = softmax(h_0 W).
For each corpus, they fine-tuned the classifier either by fitting the last layer weights, W, or the parameters of the complete encoder, including W. This fine-tuning task requires a much smaller corpus than the pretraining task and is much faster.
With their pretraining and fine-tuning procedure, Devlin et al. (2019) outperformed previous architectures and methods on the GLUE tasks.
Fig. 16.3 BERT applications: sentence classification, left part, and pair classification, right part
Fig. 16.4 BERT applications: Sequence annotation, left part, and question answering, right part, where (x_1, x_2, ..., x_n) corresponds to the question tokens and (x′_1, x′_2, ..., x′_p) to the passage ones. The span (h′_i, ..., h′_j) maximizes the S · h′_i + E · h′_j score and corresponds to the answer prediction (x′_i, ..., x′_j)
For sequence annotation, left part of Fig. 16.4, the input is the sequence to tag prefixed with a [CLS] token:

([CLS], x_1, x_2, ..., x_n).

The encoder outputs one vector per token:

(h_0, h_1, h_2, ..., h_n),

and a linear layer W followed by a softmax predicts the tag of each token:

ŷ_i = softmax(h_i W).
Finally, Devlin et al. (2019) applied BERT to a question answering task, more
precisely to the Stanford question answering datasets, SQuAD 1.1 and 2.0 (Ra-
jpurkar et al. 2016, 2018). SQuAD consists of pairs of questions and passages from
Wikipedia as input and Rajpurkar et al. (2016) formulate question answering as the
retrieval of an answer in the form of a text span in the input passage.
As an example, starting from the introductory paragraph of the article on
precipitation in Wikipedia:
In meteorology, precipitation is any product of the condensation of atmospheric water vapor
that falls under gravity. The main forms of precipitation include drizzle, rain, sleet, snow,
graupel and hail... Precipitation forms as smaller droplets coalesce via collision with other
rain drops or ice crystals within a cloud. Short, intense periods of rain in scattered locations
are called “showers.”
Rajpurkar et al. (2016) posed three questions for which the answers are under-
lined in the text:
1. What causes precipitation to fall? gravity;
2. What is another main form of precipitation besides drizzle, rain, snow, sleet and
hail? graupel;
3. Where do water droplets collide with ice crystals to form precipitation? within a
cloud.
In SQuAD 1.1, the 100,000 questions always have an answer in their associated
passage. This is not the case in SQuAD 2.0, where the authors added 50,000
unanswerable questions to it.
To fine-tune BERT, Devlin et al. (2019) formatted the question-passage pairs as in the next sentence prediction:

([CLS], x_1, x_2, ..., x_n, [SEP], x′_1, x′_2, ..., x′_p),

where x_1, x_2, ..., x_n are the question tokens, x′_1, x′_2, ..., x′_p, the passage tokens, and [SEP], the separation symbol.
The SQuAD question answering resembles BIO sequence annotation in
Sect. 14.10 as, given a question, we can mark the answer in the passage with
begin and inside tags and the rest with outside tags. Devlin et al. (2019) chose to
estimate two tags only, start and end, with two trainable weight matrices denoted
S and E (Fig. 16.4, right part). In fact, as we have a binary classification, S and E
are simply two vectors. The classification is then comparable to logistic regression
with the dot products S · h_i and E · h_i to determine if index i is the start or end of
the answer.
Logistic regression could yield multiple positive start and end tags in a passage.
Instead Devlin et al. (2019) computed the dot products of the start position over
the output sequence. They normalized them with the softmax function yielding a
probability distribution:
ŷ_start_i = e^(S · h_i) / Σ_j e^(S · h_j),
ŷ_end_i = e^(E · h_i) / Σ_j e^(E · h_j).
They finally defined the score of a candidate span as s(i, j) = S · h_i + E · h_j with i ≤ j and the answer prediction as the index pair maximizing it:

(i_answer, j_answer) = argmax_{i ≤ j} s(i, j).

The predicted answer is then the (x′_{i_answer}, ..., x′_{j_answer}) passage span.
Devlin et al. (2019) fine-tuned the model in Fig. 16.4, right part, with a loss set
to the sum of the cross entropies of the correct S and E positions.
In SQuAD 2.0, some questions are unanswerable. Devlin et al. (2019) extended their model with a prediction from the encoded output of the [CLS] token, h_0:

s_null = S · h_0 + E · h_0.

They predict a non-null answer when the score of the best candidate span, s(i_answer, j_answer), is greater than s_null + τ, where τ is adjusted on the validation set so that it maximizes the final score.
Devlin et al. (2019) reported SQuAD F1 scores that outperformed all the other
systems at the time they published their paper.
The SQuAD benchmark alone does not reflect the entire complexity of a real
question answering system. In this section, we outline a few practical points.
Before a system can answer any question, we must first collect a large set of documents that will form its knowledge source and store it in a repository. As with SQuAD, Wikipedia is often a starting point for such a collection. Practically, we collect Wikipedia articles with the scraping techniques described in Sect. 4.5. The size of the collection can range from a subset of articles to all of a Wikipedia language version, the whole Wikipedia in all the languages, or include even more resources, either available from the web or nonpublic documents.
The Wikipedia articles are sometimes quite long and thus impossible to process for a transformer. We usually split them into shorter paragraphs or passages. We then index them with the techniques in Sect. 9.5.1. Many question answering systems use the Lucene4 open-source tool for this. It is an excellent indexer that can process large quantities of text.
Transformers are too slow to examine the millions of documents sometimes needed to answer a question. That is why we must divide the answering process into two steps: the retrieval of a short list of passage candidates with a fast algorithm and then the identification of the answer in these passages. We call this combination the retriever-reader model:
1. Given a question, the passage retriever selects the paragraphs of the collection
that may contain the answer and ranks them. The ranking algorithm represents
the passages with vector space techniques, as in Sect. 9.5.3, or dense vectors, as
in Chap. 11, Dense Vector Representations; we can use Sentence-BERT (SBERT)
(Reimers and Gurevych 2019), for instance, to create sentence or paragraph
embeddings; the passage retriever then computes the similarity between the
question and a passage with cosines. Again, we can use Lucene for this or a
vector database;
2. Using the short list of candidates from the previous step, the reader applies a
classifier to decide if the answer is in the passage or not, and if yes, identifies it
with the transformer.
Pretraining a BERT model with the methods in Sects. 16.1 and 16.6 is beyond the
computing capacity of many programmers. Fortunately, the BERT team made their
pretrained models available and they were soon followed by many others. In this
section, we will describe how to use a pretrained BERT model for a classification
task. We will rely on the Hugging Face library (Wolf et al. 2020) and we will use
the IMDB dataset of movie reviews (Maas et al. 2011).
To fine-tune a model, we need three components:
1. The model, for instance BERT, pretrained on a large unannotated corpus. The
fine-tuning step will either reuse it as is, the parameters are said to be frozen, or
refine its existing parameters;
2. The subword tokenizer that matches the tokens used to pretrain the model;
3. The classification head that projects the [CLS] output vector to the number of
classes. This classification head will be trained from scratch.
4 https://lucene.apache.org/.
Transformer models are very large. In this example, we will use a leaner version
of BERT called DistilBERT (Sanh et al. 2019). DistilBERT has 40% fewer parameters than BERT base uncased and still reaches 95% of BERT's performance on the GLUE benchmark. This makes the experiment possible on a personal computer
with no graphics processing unit (GPU).
Hugging Face provides a large repository of datasets that are compatible with
their subsequent processing pipelines. The IMDB dataset consists of movie reviews
annotated as positive or negative. We import it with these statements:
from datasets import load_dataset
imdb = load_dataset(’imdb’)
This downloads the dataset from a Hugging Face server and stores it in a cache
folder such as: ~/.cache/huggingface/datasets. Subsequent calls will load it
from the cache.
Once downloaded, imdb returns:
>>> imdb
DatasetDict({
train: Dataset({
features: [’text’, ’label’],
num_rows: 25000
})
test: Dataset({
features: [’text’, ’label’],
num_rows: 25000
})
unsupervised: Dataset({
features: [’text’, ’label’],
num_rows: 50000
})
})
telling that the dataset consists of two annotated sets split into training and test and
an unannotated set called unsupervised.
We assign the training and test sets with these statements
train_set = imdb[’train’]
test_set = imdb[’test’]
The negative and positive reviews are arranged as two blocks of equal lengths. We
examine excerpts of the first and 12,500th reviews with:
>>> train_set[0]
{’text’: ’I rented I AM CURIOUS-YELLOW from my video store...
where the first review is categorized as negative and the second as positive.
16.11 Tokenization
Each Hugging Face model has its tokenizer, WordPiece in the case of DistilBERT, trained on a large corpus and ready to use. We create it from the model name:
from transformers import AutoTokenizer
model_name = ’distilbert-base-uncased’
tokenizer = AutoTokenizer.from_pretrained(model_name)
These statements will download the tokenizer model the first time we execute them.
Again, it will be stored in the cache. Alternatively, we can download the whole
DistilBERT model explicitly, including the tokenizer, with the statement:
git clone https://huggingface.co/distilbert-base-uncased
The input_ids are the token indices and the attention_mask marks the valid tokens. Remember that in a batch, we have to pad the sequences to a same length to create a tensor. We mark the padded tokens with a 0 in attention_mask. Here, there is only one sentence and all the tokens are valid.
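The call producing the output below is not reproduced above; it is presumably the tokenizer applied to the first review:

>>> tokenizer(train_set[0]['text'])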
{’input_ids’: [101, 1045, 12524, 1045, 2572,..., 102],
’attention_mask’: [1, 1, 1, 1, 1, ..., 1]}
Let us now see the effects of the attention mask on this small corpus with
sentences of different lengths:
classics = [
’Tell me, O Muse, of that hero’,
’Many cities did he visit’,
’Exiled from home am I ;’]
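The output below presumably comes from tokenizing this list with padding enabled:

>>> tokenizer(classics, padding=True)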
{’input_ids’: [
[101, 2425, 2033, 1010, 1051, 18437, 1010, 1997, 2008, 5394, 102],
[101, 2116, 3655, 2106, 2002, 3942, 102, 0, 0, 0, 0],
[101, 14146, 2013, 2188, 2572, 1045, 1025, 102, 0, 0, 0]],
’attention_mask’: [
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]}
We tokenize the dataset and, to speed up the process, we use the Dataset.map function. We set the maximal length to 128 to save memory:
imdb_tokenized = imdb.map(lambda x: tokenizer(x[’text’],
truncation=True,
padding=True,
max_length=128),
batched=True)
16.12 Fine-Tuning
16.12.1 Architecture
Now we can proceed to the fine-tuning task. The architecture consists of a pretrained
model and a classification head. There are predesigned models from Hugging Face
that just do that and that we will reuse. We just need to specify the number of classes,
here two, to get the right output:
from transformers import AutoModelForSequenceClassification
num_labels = 2
model = (AutoModelForSequenceClassification
.from_pretrained(model_name, num_labels=num_labels)
.to(’cpu’))
When printing the model, we see its structure starting with the embeddings, the
transformer with six blocks, removed from the text here, and finally the classification
head with two linear layers, pre_classifier and classifier:
DistilBertForSequenceClassification(
(distilbert): DistilBertModel(
(embeddings): Embeddings(
(word_embeddings): Embedding(30522, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(transformer): Transformer(
(layer): ModuleList(
(0-5): 6 x TransformerBlock(
...
)
(pre_classifier): Linear(in_features=768, out_features=768, bias=True)
(classifier): Linear(in_features=768, out_features=2, bias=True)
(dropout): Dropout(p=0.2, inplace=False))
We then define the training arguments with the TrainingArguments class: the output folder, the number of epochs, the learning rate, the batch sizes, the weight decay, and the evaluation strategy:

from transformers import TrainingArguments

model_output_name = 'finetuned_imdb'
training_args = TrainingArguments(output_dir=model_output_name,
num_train_epochs=2,
learning_rate=2e-5,
per_device_train_batch_size=32,
per_device_eval_batch_size=32,
weight_decay=0.01,
evaluation_strategy=’epoch’,
disable_tqdm=False)
16.12.3 Training
We can now start the fine-tuning of the parameters. We create a Trainer object and
we start it. Once trained, we save the model. With this statement, we also store the
tokenizer in the model to be able to reuse it.
from transformers import Trainer

# Most arguments of the Trainer constructor are not shown in the text;
# this is a plausible reconstruction
trainer = Trainer(model=model,
                  args=training_args,
                  compute_metrics=compute_accuracy,
                  train_dataset=imdb_tokenized['train'],
                  eval_dataset=imdb_tokenized['test'],
                  tokenizer=tokenizer)
trainer.train()
trainer.save_model()
We added a function to compute the accuracy of the model on the evaluation set,
here the test set:
def compute_accuracy(eval_output):
labels_ids = eval_output.label_ids
predictions = eval_output.predictions.argmax(axis=-1)
return {’accuracy’: np.mean(predictions == labels_ids)}
16.12.4 Prediction
We can run predictions with the model we have just trained. We will then need
to apply the tokenizer before. The model input will consist of the input ids, the
attention mask, and possibly the labels if we want to compute the loss:
finetuned_model = AutoModelForSequenceClassification.from_pretrained(
’finetuned_imdb’)
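A minimal sketch of such a prediction with a made-up review (the model outputs the logits of the two classes):

import torch

inputs = tokenizer('A moving and beautifully shot film',
                   return_tensors='pt')
with torch.no_grad():
    logits = finetuned_model(**inputs).logits
predicted_class = logits.argmax(dim=-1).item()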
We can also create a prediction pipeline that will include the tokenizer and the
model:
from transformers import pipeline
ftuned_model_name = ’finetuned_imdb’
classifier = pipeline(’text-classification’,
model=ftuned_model_name)
When training a model, by default we fit all the parameters. In fact, the different layers of a transformer learn different kinds of patterns, from very generic semantic relations in the lower layers to application-ready properties in the upper ones.
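The count_parameters() helper is not defined in this chapter; it is presumably something like:

def count_parameters(model):
    # Number of trainable parameters
    return sum(p.numel()
               for p in model.parameters() if p.requires_grad)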
>>> count_parameters(model)
We can posit that the accuracy of the IMDB classification will depend mostly on the last classification layers, which have a much smaller number of parameters:
>>> count_parameters(model.pre_classifier)
>>> count_parameters(model.classifier)
By default, all the parameters are trainable. We “freeze” them with this loop that
disables the gradients:
for param in model.parameters():
param.requires_grad = False
We then set the two last classification layers as trainable with these loops:
for param in model.pre_classifier.parameters():
    param.requires_grad = True
for param in model.classifier.parameters():
    param.requires_grad = True
>>> count_parameters(model)
Loading the original pretrained model again and freezing the parameters as described, we create a new Trainer and fine-tune the classifier layers. After two epochs, this simpler setup yields accuracies that are well below those of the original one, though: about 77.87% on the test set compared with 87.83% when training all the parameters.
BERT has been one of the first successful outcomes of transformers. It showed
the capacity of large language models to encapsulate a massive amount of textual
knowledge in a pipeline of matrices. Its excellent scores on multiple benchmarks
proved its versatility. Instead of keeping their code secret, BERT’s authors made
the implementation available on GitHub.5 This ensured it an immense success soon
followed by a multitude of replicas, extensions, or modifications: RoBERTa (Liu
et al. 2019), Multilingual BERT, DistilBERT, etc. to name a few, either mono or
multilingual.
Devlin et al. (2019, Sect. 5.2) showed that the encoder performance was linked
to the size of the model: the larger, the better. This caused a race to gigantism with
transformer models, all architectures included, now reaching a trillion parameters
(Ren et al. 2023). Fitting such an astronomic number of parameters is not free
however. Strubell et al. (2019) highlighted how much NLP models rely on the
intensive use of electricity-hungry computing platforms and that they come at a
considerable energy cost.
In this chapter, we fine-tuned and applied a text classification model with the
Hugging Face application programming interface. Tunstall et al. (2022) describe
in more detail this interface with application examples. In addition to IMDB
and DistilBERT, the Hugging Face repository hosts numerous datasets, pretrained
models, and ready-to-use classes to implement tasks such as the ones we described
here, including sequence annotation and question answering. Due to its large
number of open-source tools, Hugging Face has become a popular resource in the
field of large language models.
5 https://github.com/google-research/bert/.
Chapter 17
Sequence-to-Sequence Architectures:
Encoder-Decoders and Decoders
Given the relatively long history of machine translation, a variety of methods have
been experimented on and applied. Starting from the pioneering work of Brown
et al. (1993), machine translation used statistical models and parallel corpora. We
introduce them now.
Parallel corpora are the main data source of machine translation. Administrative
or parliamentary texts of multilingual countries or organizations are widely used
because they are easy to obtain and are often free. The Canadian Hansard or
the European Parliament proceedings are examples of them. Table 17.1 shows an
excerpt of the Swiss federal law in German, French, and Italian on the quality of milk production.

Table 17.1 Parallel texts from the Swiss federal law on milk transportation

German: Art. 35 Milchtransport
1. Die Milch ist schonend und hygienisch in den Verarbeitungsbetrieb zu transportieren. Das Transportfahrzeug ist stets sauber zu halten. Zusammen mit der Milch dürfen keine Tiere und milchfremde Gegenstände transportiert werden, welche die Qualität der Milch beeinträchtigen können.
2. Wird Milch ausserhalb des Hofes zum Abtransport bereitgestellt, so ist sie zu beaufsichtigen.
3. Milchpipelines sind nach den Anweisungen des Herstellers zu reinigen und zu unterhalten.

French: Art. 35 Transport du lait
1. Le lait doit être transporté jusqu'à l'entreprise de transformation avec ménagement et conformément aux normes d'hygiène. Le véhicule de transport doit être toujours propre. Il ne doit transporter avec le lait aucun animal ou objet susceptible d'en altérer la qualité.
2. Si le lait destiné à être transporté est déposé hors de la ferme, il doit être placé sous surveillance.
3. Les lactoducs des exploitations d'estivage doivent être nettoyés et entretenus conformément aux instructions du fabricant.

Italian: Art. 35 Trasporto del latte
1. Il latte va trasportato verso l'azienda di trasformazione in modo accurato e igienico. Il veicolo adibito al trasporto va mantenuto pulito. Con il latte non possono essere trasportati animali e oggetti estranei, che potrebbero pregiudicarne la qualità.
2. Se viene collocato fuori dall'azienda in vista del trasporto, il latte deve essere sorvegliato.
3. I lattodotti vanno puliti e sottoposti a manutenzione secondo le indicazioni del fabbricante.
The idea of machine translation with parallel texts is simple: given a sentence,
a phrase, or a word in a source language, find its equivalent in the target
language. The translation procedure splits the text to translate into fragments, finds
a correspondence for each source fragment in the parallel corpora, and composes
the resulting target pieces to form a translated text. Using the titles in Table 17.1, we
can build pairs from the phrases transport du lait ‘milk transportation’ in French,
Milchtransport in German, and trasporto del latte in Italian.
The idea of translating with the help of parallel texts is not new and has been
applied by many people. A notable example is the Egyptologist and linguist Jean-
François Champollion, who used the famous Rosetta Stone, an early parallel text,
to decipher Egyptian hieroglyphs from Greek.
17.2 Alignment
The parallel texts must be aligned before using them in machine translation.
This corresponds to a preliminary segmentation and mark-up that determines the
corresponding paragraphs, sentences, phrases, and possibly words across the texts.
Alignment of texts in Table 17.1 is made easier because paragraphs are numbered
and have the same number of sentences in each language. Some corpora, like Tatoeba,1 even have pairs of translated sentences. This is not always the case,
however, and some texts show a significantly different sentence structure.
Gale and Church (1993) describe a simple and effective method based on the
idea that
longer sentences in one language tend to be translated into longer sentences in the other
language, and that shorter sentences tend to be translated into shorter sentences.
Their method generates pairs of sentences from the target and source texts,
assigns them a score, which corresponds to the difference of lengths in characters of
the aligned pairs, and uses dynamic programming to find the maximum likelihood
alignment of sentences.
The sentences in the source language are denoted s_i, 1 ≤ i ≤ I, and the sentences in the target language t_j, 1 ≤ j ≤ J. D(i, j) is the minimum distance between sentences s_1, s_2, ..., s_i and t_1, t_2, ..., t_j, and d(source_1, target_1; source_2, target_2) is the distance function between sentences. The algorithm identifies six possible cases of alignment through insertion, deletion, substitution, expansion, contraction, or merger. They are expressed by the formula below:
D(i, j) = min(
    D(i, j − 1) + d(0, t_j; 0, 0),
    D(i − 1, j) + d(s_i, 0; 0, 0),
    D(i − 1, j − 1) + d(s_i, t_j; 0, 0),
    D(i − 1, j − 2) + d(s_i, t_j; 0, t_{j−1}),
    D(i − 2, j − 1) + d(s_i, t_j; s_{i−1}, 0),
    D(i − 2, j − 2) + d(s_i, t_j; s_{i−1}, t_{j−1})
),
where l_1 and l_2 are the lengths of the sentences under consideration, c the average number of characters in the source language L2 per character in the target language L1, and s² its variance. Gale and Church (1993) found a value of c of 1.06 for the pair French–English and 1.1 for German–English. This means that French and German texts are longer than their English counterparts: 6% longer for French and 10% for German. They found s² = 7.3 for German–English and s² = 5.6 for French–English.
Using Bayes' theorem, we can derive a new distance function from the probability of an alignment given the length difference δ:
d(s_1, t_1; s_2, t_2) = −log P(alignment | δ),
where P(alignment | δ) is proportional to P(δ | alignment) P(alignment).
Gale and Church (1993) estimated the probability P(alignment) of their six possible alignments with these figures: substitution 1–1: 0.89, deletion and insertion 0–1 or 1–0: 0.0099, expansion and contraction 2–1 or 1–2: 0.089, and merger 2–2: 0.011. They rewrote P(δ | alignment) as 2(1 − P(|δ|)), which can be computed from statistical tables; see Gale and Church's original article for the details.
1 https://tatoeba.org/.
The alignment of words and phrases uses similar techniques; it is, however, more complex. Figures 17.1 and 17.2 show examples of alignment from Brown et al. (1993).
generator(y_1, y_2, …, y_{i−1}) = y_i.
The y_i symbol is appended to the input and the generation repeats until an end condition is satisfied. We also saw in Chap. 16, Pretraining an Encoder: The BERT Language Model, that the encoder-transformers could build a semantic representation of their input.
The transformer consists of two similar components, an encoder like the one in
Chap. 15 Self-Attention and Transformers and a decoder, assembled to carry out
sequence-to-sequence transductions:
1. The encoder builds a representation of an input sequence, (x_1, x_2, …, x_n), using a language model. Vaswani et al. (2017) called this representation the memory, M,
encoder(x_1, x_2, …, x_n) = M;
2. The decoder, equipped with a similar language model, uses M and a start symbol <s> as first input to generate a target sequence. At each step, the decoder generates a new output symbol from the concatenation of the previous input and the last output. In addition to the tag to mark the start of the sequence, it uses a second one to tell it to stop, </s>:
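With the same notation as the generator above, each decoding step can schematically be written as:
decoder(M, <s>, y_1, y_2, …, y_{i−1}) = y_i.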
For two source and target sequences of characters in English and French:
Source: ('H', 'e', 'l', 'l', 'o')
Target: ('B', 'o', 'n', 'j', 'o', 'u', 'r')
we first encode the source into a memory:
encoder(H, e, l, l, o) = M;
and, using M and starting with <s>, iteratively decode the target sequence:
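Schematically, the decoding then proceeds as:
decoder(M, <s>) = B
decoder(M, <s>, B) = o
decoder(M, <s>, B, o) = n
…
decoder(M, <s>, B, o, n, j, o, u, r) = </s>,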
where </s> stops the generation. In fact, we simplified the decoder output as it is
a vector. To be complete, we would need a linear layer to map it to the character
prediction. Figure 17.3 summarizes this architecture.
(Fig. 17.3: an encoder and a decoder, each with an embedding layer; the decoder produces the output B o n j o u r </s>)
In Chap. 15, Self-Attention and Transformers, we saw the encoder architecture with
multihead attention and feed-forward networks and, in Chap. 16, Pretraining an
Encoder: The BERT Language Model, how we can pretrain it as a large language
model and use it in classification, sequence annotation, and question answering.
Figure 17.4 now shows the complete transformer architecture, which consists of an encoder to the left and a decoder to the right.
The decoder also builds on multihead attention, but with slight differences: its first attention module uses a mask to make it auto-regressive, and the second one merges the source and the target so that the sequence positions in the decoder can attend to those in the encoder. We describe these differences now.
Output:                  B o n j o u r </s>
Input: <s> H e l l o </s>     Shifted output: <s> B o n j o u r
When translating a sequence, the source is known in advance, while the target is decoded one symbol at a time, either word, subword, or character. This does not fit the format of the dataset when we train the model because the attention function, as we described it earlier, has complete access to the source and target pairs. To have an autoregressive decoder, the first attention layer adds an upper triangular mask U−∞, whose values above the diagonal are −∞ and 0 elsewhere, to the attention scores before the softmax.
Fig. 17.4 The transformer consisting of N identical encoder layers (left) and N decoder layers
(right). After Vaswani et al. (2017)
The masked attention of the first decoder layer is then:
$$\mathrm{MaskedAttention}(Q, K, V, U_{-\infty}) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + U_{-\infty}\right) V.$$
Let us modify the attention() function from Sect. 15.2.5 to include this U−∞ matrix. We first create a mask with a size parameter corresponding to the length of the sequence with the function:
def attn_mask(size):
U = torch.empty(size, size).fill_(float(’-inf’))
return torch.triu(U, diagonal=1)
that fills a square matrix with −∞ values and sets the lower part to 0 with torch.triu():
>>> attn_mask(5)
tensor([[0., -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf],
        [0., 0., 0., 0., -inf],
        [0., 0., 0., 0., 0.]])
Fig. 17.5 Masked self-attention weights. This table uses GloVe 50d vectors trained on Wikipedia 2014 and Gigaword 5. Compare with the results in Fig. 15.4
We can now repeat the experiment with I must go back to my ship and to my
crew from Chap. 15 Self-Attention and Transformers and compute the attention
weights. We see that while the initial weights in Fig. 15.4 take into account the
complete sentence, the masked ones in Fig. 17.5 consider the past words only. When
multiplying the weights with the V matrix, the word I keeps its initial embedding vector; the embedding vector of must is the weighted sum of its original one and that of I; for go, it is a weighted sum of itself, I, and must, etc.
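To make this experiment concrete, here is a minimal sketch of a masked scaled dot-product attention that simply adds the mask to the scores before the softmax. The function name masked_attention and the use of the GloVe vectors directly as queries, keys, and values are our assumptions; the attention() function of Sect. 15.2.5 may differ in its details:

import torch
import torch.nn.functional as F

def masked_attention(Q, K, V, mask):
    # Scaled dot-product attention with an additive mask of -inf values
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores + mask, dim=-1)
    return weights @ V, weights

# With X, the matrix of the GloVe vectors of the words of the sentence
# (one row per word), the masked weights would be computed as:
# output, weights = masked_attention(X, X, X, attn_mask(X.size(0)))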
17.4.2 Cross-Attention
In the early days of neural machine translation, Bahdanau et al. (2015) showed that encoder-decoder architectures benefited from the decoder attending to all the positions of an encoded sequence. This is precisely the purpose of the decoder's second
attention layer in Fig. 17.4, which is fed with the memory, the encoded value of the
input, and combined with the result of the decoder’s first attention. In this operation,
the memory provides the key and the value, .Kenc and .Venc , and the decoder the
query, .Qdec :
$$\mathrm{softmax}\!\left(\frac{Q_{dec} K_{enc}^{\top}}{\sqrt{d_k}}\right) V_{enc}.$$
As the memory provides both the keys and the values, we can write this cross-attention directly as a function of Q and M:
$$\mathrm{CrossAttention}(Q, M) = \mathrm{softmax}\!\left(\frac{Q_{dec} M^{\top}}{\sqrt{d_{model}}}\right) M.$$
The product Q_dec M^⊤ has a size of p × n, which is not modified by the softmax function, and, after multiplying it by M, we obtain a p × d_model matrix. This confirms that the lengths of the target input sequence of the decoder (bottom right of Fig. 17.4) and that of the output sequence of the transformer (top right of the figure) are identical: p tokens.
In addition to the connection created by cross-attention, the transformer shares the embedding matrices between the input, the output, and the last linear layer of the decoder.
The programming scheme of a decoder is the same as with the encoder. We first
create a decoder layer with TransformerDecoderLayer():
decoder_layer = nn.TransformerDecoderLayer(d_model,
nhead,
batch_first=True)
2 https://tatoeba.org/.
3 https://www.manythings.org/anki/.
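The decoder object itself stacks such layers; a minimal sketch of its creation (the number of layers, 6, is an assumption matching the base model):

decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)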
Once created, the input to a decoder object is a target and a memory. As optional
arguments, we usually add an upper triangular target mask, tgt_mask, to have
an auto-regressive model and two padding masks for the target and the memory,
respectively tgt_key_padding_mask and memory_key_padding_mask. They remove the
padding tokens of the mini-batches from the attention mechanism as we did with
the encoder in Sect. 15.12.
dec_output = decoder(
target,
memory,
tgt_mask=U,
tgt_key_padding_mask=tgt_padding,
memory_key_padding_mask=mem_padding)
The nn.Transformer class is even easier to use than the decoder. The statement
transformer = nn.Transformer(batch_first=True)
creates a transformer with default arguments that are the same as in the base model
of Vaswani et al.:
• d_model = 512;
• nhead = 8;
• num_encoder_layers and num_decoder_layers set to 6; and
• dim_feedforward = 2048.
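As a minimal sketch of a forward call with these defaults (src, tgt, U, and the padding masks are assumed to be already defined; the keyword names are those of PyTorch's nn.Transformer.forward()):

output = transformer(src, tgt,
                     tgt_mask=U,
                     src_key_padding_mask=src_padding,
                     tgt_key_padding_mask=tgt_padding,
                     memory_key_padding_mask=mem_padding)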
where the encoder encodes the source and returns a memory and the decoder
decodes this memory and the target. We access the encoder and decoder objects
with:
>>> transformer.encoder
TransformerEncoder(
(layers): ModuleList(
(0-5): 6 x TransformerEncoderLayer(
(self_attn): MultiheadAttention(
...
and
>>> transformer.decoder
TransformerDecoder(
(layers): ModuleList(
(0-5): 6 x TransformerDecoderLayer(
(self_attn): MultiheadAttention(
...
The transformer has padding mask arguments as with the encoder or decoder alone.
Additionally, nn.Transformer has a static method to create an upper triangular mask
that we denoted U in the previous sections:
nn.Transformer.generate_square_subsequent_mask(size)
This is already quite a large number and we limit it to 50,000 pairs. We create two lists of source and target sentences of at most 100 characters:
max_len = 100
num_samples = 50000
4 https://tatoeba.org/downloads.
5 https://www.manythings.org/anki/fra-eng.zip.
# The enclosing function and the reading of the corpus file are abridged
# here; the header below (load_pairs and its arguments) is an assumption.
def load_pairs(lines, max_len, num_samples):
    src_seqs, tgt_seqs = [], []
    random.shuffle(lines)
    for line in lines:
        src_seq, tgt_seq, _ = line.split('\t')
        if len(src_seq) <= max_len and len(tgt_seq) <= max_len:
            src_seqs.append(src_seq)
            tgt_seqs.append(tgt_seq)
    return src_seqs[:num_samples], tgt_seqs[:num_samples]
French will be the source language and English the target one. We have:
>>> src_seqs[5]
’Je suis sans argent.’
>>> tgt_seqs[5]
"I haven’t got any money."
We can now split our dataset into training and validation sets with the proportion
80/20. We compute the train/validation index:
TRAIN_PERCENTAGE = 0.8
train_val = int(TRAIN_PERCENTAGE * num_samples)
train_src_seqs = src_seqs[:train_val]
train_tgt_seqs = tgt_seqs[:train_val]
val_src_seqs = src_seqs[train_val:]
val_tgt_seqs = tgt_seqs[train_val:]
This results in 40,000 training samples and 10,000 validation ones.
So far, our data consists of strings of characters. We will now convert them into lists
of indices. This is a procedure we have already done a couple of times. The language
pair will share the same vocabulary as in Vaswani et al. (2017).
We first collect the characters from the source and target training sets and we
extract the character set:
src_chars = set(’’.join(train_src_seqs))
tgt_chars = set(’’.join(train_tgt_seqs))
charset = sorted(
list(set.union(src_chars,
tgt_chars)))
We reserve four special tokens for the padding and unknown symbols as well as for
the beginning and end of sequences:
special_tokens = [’<pad>’, ’<unk>’, ’<s>’, ’</s>’]
We add them to the character set and we build index-to-token and token-to-index
dictionaries:
charset = special_tokens + charset
idx2token = dict(enumerate(charset))
token2idx = {char: idx for idx, char in idx2token.items()}
We have now:
>>> idx2token
{0: ’<pad>’, 1: ’<unk>’, 2: ’<s>’, 3: ’</s>’, 4: ’ ’,
5: ’!’, 6: ’"’, 7: ’$’, ...}
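The conversion functions seqs2tensors() and tensors2seqs() map between character strings and tensors of indices; they are used just below and in the following sections. A minimal sketch consistent with their outputs, which may differ from the original code, could be:

def seqs2tensors(seqs, token2idx):
    # Convert each string into a tensor of indices framed by <s> and </s>;
    # unknown characters map to <unk>
    return [torch.tensor(
        [token2idx['<s>']]
        + [token2idx.get(char, token2idx['<unk>']) for char in seq]
        + [token2idx['</s>']])
        for seq in seqs]

def tensors2seqs(tensors, idx2token):
    # Convert tensors of indices back to lists of tokens,
    # skipping the <pad> and <s> symbols
    return [[idx2token[idx.item()] for idx in tensor
             if idx2token[idx.item()] not in ('<pad>', '<s>')]
            for tensor in tensors]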
We have:
>>> train_tgt_seqs[:2]
["I’ve always been very proud of you.",
"I’m aware of the consequences."]
and
>>> seqs2tensors(train_tgt_seqs[:2], token2idx)
[tensor([ 2, 36, 9, 75, 58, 4, 54, ..., 3]),
tensor([ 2, 36, 9, 66, 4, 54, 76, ..., 3])]
The data input in Fig. 17.4 starts with the embedding of the source and target
sequences. We already implemented an Embedding class in Sect. 15.10 that we reuse
here. The transformer itself is the PyTorch class from the previous section and the
last layer is just a linear module. We compose a model from these components so
that it encapsulates all the parameters we have to fit.
• In the __init__() method, we create the embedding layers, the transformer itself,
and the last linear layer. As in Vaswani et al. (2017), we share the embedding
weights;
• In the forward() method, we embed the source and the target sequences; we create the target autoregressive mask with the built-in transformer method; and we pass them to the transformer. We also pass the padding masks.
We use the same defaults as the nn.Transformer() class. Most of the code lines are
parameter initializations:
class Translator(nn.Module):
def __init__(self,
d_model=512,
nhead=8,
num_encoder_layers=6,
num_decoder_layers=6,
dim_feedforward=2048, # 4 x d_model
dropout=0.1,
vocab_size=30000,
max_len=128):
super().__init__()
self.embedding = Embedding(vocab_size,
d_model,
max_len=max_len)
self.transformer = nn.Transformer(
d_model,
nhead,
num_encoder_layers,
num_decoder_layers,
dim_feedforward,
dropout,
batch_first=True)
self.fc = nn.Linear(d_model, vocab_size)
self.fc.weight = \
self.embedding.input_embedding.weight
    def forward(self, src, tgt,
                src_padding=None, tgt_padding=None):
        # Reconstructed from the description above: embed the source and
        # the target, create the autoregressive target mask with the
        # built-in transformer method, and apply the transformer with the
        # padding masks. The exact code may differ from the original.
        src = self.embedding(src)
        tgt = self.embedding(tgt)
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(
            tgt.size(1))
        x = self.transformer(src, tgt,
                             tgt_mask=tgt_mask,
                             src_key_padding_mask=src_padding,
                             tgt_key_padding_mask=tgt_padding)
        return self.fc(x)
We can now apply our transformer to an input and examine its output. We first format the source and target characters: we select a mini-batch of 32 samples, convert the two lists of character strings to two lists of tensors, and pad them to obtain two tensors:
src_batch = pad_sequence(seqs2tensors(
train_src_seqs[:32], token2idx),
batch_first=True, padding_value=0)
tgt_batch = pad_sequence(seqs2tensors(
train_tgt_seqs[:32], token2idx),
batch_first=True, padding_value=0)
We have now two tensors representing the source and the target.
We create the padding masks so that we remove the padding symbols from the
attention:
src_padding_mask = (src_batch == 0)
tgt_padding_mask = (tgt_batch == 0).float()
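We can then apply the model to this mini-batch; a minimal sketch (the variable name output is ours):

output = translator(src_batch, tgt_batch,
                    src_padding_mask, tgt_padding_mask)
output.size()  # for instance torch.Size([32, 64, 110])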
As a result, we have 32 samples, each 64 characters long, the maximal length of our mini-batch here, and logits for 110 characters, the size of our character set.
We can even extract the predictions by taking the maximal values on the last axis and mapping them to characters with
>>> tensors2seqs(
torch.argmax(
translator(src_batch, tgt_batch,
src_padding_mask, tgt_padding_mask),
dim=-1),
idx2token)
Training a model without an expensive GPU can take a lot of time. Here we will use smaller parameters that will not require such hardware for our limited experiment. We set d_model to 512, d_ff to 512, and the number of layers to 3. We add 2 to the maximal sequence length as we will pad the sequences with <s> and </s> symbols:
translator = Translator(d_model=512,
nhead=8,
num_decoder_layers=3,
num_encoder_layers=3,
dim_feedforward=512,
vocab_size=len(token2idx),
max_len=max_len + 2)
Vaswani et al. (2017) used the Adam optimizer with a variable learning rate. We simplify it with a constant one instead; the rest of the parameters are the same:
optimizer = torch.optim.Adam(
translator.parameters(), lr=0.0001, betas=(0.9, 0.98),
eps=1e-9)
In the previous chapters, we stored our datasets with the built-in TensorDataset, a
subclass of Dataset, the generic representation of a dataset in PyTorch. Here our
input consists of pairs of tensors. Such a class does not exist and we have to create
it. To derive a class from Dataset, according to the PyTorch documentation, we must implement three methods: __init__(), __len__(), and __getitem__(), which respectively create the dataset object, return its length, and extract an item from it.
In __init__, we pass the dataset in the form of lists of source and target
sequences; in __len__(), we compute the size of the dataset, and in __getitem__(),
we return a pair at a certain index. To create a batch, the dataloader uses __getitem__
as many times as we have samples in the batch. As the sequences have different
lengths, we add a collate function to convert the source and target batches into two tensors.
class PairDataset(Dataset):
def __init__(self, src_seqs, tgt_seqs,
token2idx):
self.src_seqs = src_seqs
self.tgt_seqs = tgt_seqs
self.token2idx = token2idx
def __len__(self):
return len(self.src_seqs)
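    # A possible completion of the class; the exact methods in the book may
    # differ, but they must be consistent with the collate function passed
    # to the dataloader below.
    def __getitem__(self, idx):
        # Return one (source, target) pair as tensors of character indices
        return (seqs2tensors([self.src_seqs[idx]], self.token2idx)[0],
                seqs2tensors([self.tgt_seqs[idx]], self.token2idx)[0])

    def collate(self, batch):
        # Pad the source and the target sequences of a batch into two tensors
        src_batch, tgt_batch = zip(*batch)
        return (pad_sequence(src_batch, batch_first=True, padding_value=0),
                pad_sequence(tgt_batch, batch_first=True, padding_value=0))

# Creation of the training dataset (the arguments follow __init__ above)
train_dataset = PairDataset(train_src_seqs, train_tgt_seqs, token2idx)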
and we pass it to the dataloader that will carry out the iteration:
train_dataloader = DataLoader(train_dataset, batch_size=32,
shuffle=True,
collate_fn=train_dataset.collate)
The collate_fn argument indicates the function to collate the samples into a batch.
We create similar objects for the validation dataset: val_dataset and
val_dataloader.
We can now train the model. We first write an evaluation function as it is a bit easier.
We build the target output by removing the first character, <s> and the target input by
removing the last one so that the target sequences have the same length. We create
the padding mask and we apply the model. The computation of the loss is similar to
that in Sect. 14.8.1.
def evaluate(model, loss_fn, dataloader):
model.eval()
with torch.no_grad():
t_loss = 0
t_correct, t_chars = 0, 0
for src_batch, tgt_batch in dataloader:
tgt_input = tgt_batch[:, :-1]
The training function adds the gradient computation and update to the previous
function.
def train(model, loss_fn, optimizer, dataloader):
model.train()
t_loss = 0
t_correct, t_chars = 0, 0
for src_batch, tgt_batch in tqdm(dataloader):
tgt_input = tgt_batch[:, :-1]
tgt_output = tgt_batch[:, 1:]
src_padding_mask = (src_batch == 0)
tgt_padding_mask = (tgt_input == 0).float()
tgt_output_pred = model(src_batch, tgt_input,
src_padding_mask,
tgt_padding_mask)
optimizer.zero_grad()
loss = loss_fn(
tgt_output_pred.reshape(
-1,
tgt_output_pred.size(dim=-1)),
tgt_output.reshape(-1))
loss.backward()
optimizer.step()
with torch.no_grad():
n_chars = (tgt_output != 0).sum()
t_chars += n_chars
t_loss += loss.item() * n_chars
char_pred = torch.argmax(tgt_output_pred, dim=-1)
char_correct = torch.mul((char_pred == tgt_output),
(tgt_output != 0)).sum()
t_correct += char_correct.item()
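To drive the training, we need a loss function and an epoch loop; a minimal sketch, assuming a cross-entropy loss that ignores the padding index (loss_fn is our name) and the 10 epochs of Fig. 17.6, could be:

loss_fn = nn.CrossEntropyLoss(ignore_index=0)

for epoch in range(10):
    train(translator, loss_fn, optimizer, train_dataloader)
    evaluate(translator, loss_fn, val_dataloader)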
Fig. 17.6 Training curves of the transformer with a Tatoeba French-to-English corpus. Loss and
accuracy over 10 epochs
Figure 17.6 shows the training curves across the epochs. They show that the
model is still steadily improving and more epochs would be needed to fit it
completely.
Now that we have trained a model, we can translate a sequence. We first embed the
source characters and encode them into a memory. Then, starting from <s>, we embed
the current target characters and decode them with the memory. We predict the next
character with a linear layer that we apply to the last encoded vector. We select the
highest probability. As the decoder is auto-regressive, we form the next input by the
concatenation of the previous one and of the last output.
We repeat this operation, always selecting the highest probability for the last character. This is a greedy decoding similar to that proposed by Rush (2018).
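The greedy_decode() function follows the steps we have just described; a minimal sketch, assuming the Translator attributes defined above (embedding, transformer, and fc) and the global token2idx dictionary, could be:

def greedy_decode(model, src, max_len):
    # Encode the source characters into a memory
    src_emb = model.embedding(src.unsqueeze(0))
    memory = model.transformer.encoder(src_emb)
    # Start the target with <s> and decode one character at a time
    tgt = torch.tensor([[token2idx['<s>']]])
    for _ in range(max_len):
        tgt_emb = model.embedding(tgt)
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(
            tgt.size(1))
        out = model.transformer.decoder(tgt_emb, memory, tgt_mask=tgt_mask)
        # Predict the next character from the last decoded vector
        next_char = torch.argmax(model.fc(out[:, -1]), dim=-1)
        tgt = torch.cat([tgt, next_char.unsqueeze(0)], dim=1)
        if next_char.item() == token2idx['</s>']:
            break
    return tgt.squeeze(0)[1:]  # Drop the initial <s>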
The top-level translation function is just a conversion of the source string into a tensor and a call to the decoder, where we set a maximal length for the target. We remove the end-of-sequence token from the resulting list and convert it to a string:
def translate(model, src_sentence):
model.eval()
src = seqs2tensors([src_sentence], token2idx)[0]
num_chars = src.size(dim=0)
tgt_chars = greedy_decode(
model, src, max_len=num_chars + 20)
tgt_chars = tensors2seqs([tgt_chars], idx2token)[0]
if tgt_chars[-1] == ’</s>’:
tgt_chars = tgt_chars[:-1]
tgt_str = ’’.join(tgt_chars)
return tgt_str
We can now run the translator on a few examples. Although far from perfect
and sometimes really wrong, the results are quite promising for such a small model
using characters only:
>>> translate(translator, ’Bonjour !’)
’Good day.’
>>> translate(translator, "Va-t’en !")
’Go away.’
>>> translate(translator, ’Attends-moi !’)
’Wait me.’
>>> translate(translator, ’Viens à la maison ce soir’)
’Come home at home tonight.’
>>> translate(translator, ’Viens manger !’)
’Come any money.’
The program we have written is merely a starting point and we can improve its performance with a few modifications. The best way is to follow Vaswani et al. (2017) and the didactic implementation of Rush (2018) in The Annotated Transformer notebook.6 We outline a few options here:
1. First, we can increase the amount of training data. The datasets from the Workshops on Machine Translation, for instance WMT 2014, are easy to obtain and among the most popular ones in this field. They also serve as a benchmark in the paper;7
2. Then, we should replace the character sequences with subwords. We can use
the BPE algorithm to tokenize the texts, see Chap. 13, Subword Segmentation,
and the SentencePiece program.8 Vaswani et al. used 32,000 subwords for the
English-French pair;
3. Vaswani et al.'s optimizer used a variable learning rate, where they started with very low values, increased the rate linearly with the number of steps in the so-called warmup phase, and then decreased it with the number of steps;
4. Vaswani et al. also obtained an improvement by smoothing the output symbols, characters or subwords. When we compute a classic cross entropy, we encode the truth with unit vectors (one-hot encoding). Instead, for a given prediction, smoothing scales down the true unit vector, e_i, and assigns the remaining probability mass to the other symbols e_{j≠i}. The new true distribution is:
$$(1 - \epsilon)\, e_i + \frac{\epsilon}{N_{subwords} - 1} \sum_{j=1,\, j \neq i}^{N_{subwords}} e_j,$$
where ϵ = 0.1. We measure the loss between the true and predicted distributions with the KL-divergence of Sect. 6.1.4. We can use the PyTorch nn.KLDivLoss() class to implement it. We can also set the label_smoothing parameter of the nn.CrossEntropyLoss() class to 0.1, as sketched after this list;
5. In our implementation, we used a greedy decoding. Vaswani et al. (2017) used a beam search instead, where they decoded four concurrent sequences. At each decoding step, for each of the four sequences (<s>, x_1^b, x_2^b, …, x_i^b), with b ranging from one to four, beam search considers the four next tokens x_{i+1}^b with the highest scores. In total, we have 16 candidates for the four sequences. The product
6 https://nlp.seas.harvard.edu/annotated-transformer/.
7 https://huggingface.co/datasets/wmt14.
8 https://github.com/google/sentencepiece.
$$P(x_{i+1}^b \mid \langle s \rangle, x_1^b, x_2^b, \ldots, x_i^b) \prod_{j=1}^{i} P(x_j^b \mid \langle s \rangle, x_1^b, x_2^b, \ldots, x_{j-1}^b),$$
ranks the 16 sequences, and proceeds with the four best ones.
6. All this will require more computing capacity than that available on a common
laptop. GPUs yield a significant acceleration. The PyTorch documentation
provides tutorials on how to adapt a program to GPU computing.9
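For point 4 above, the label smoothing option reduces to a single argument in PyTorch; a sketch (ignore_index=0 matches our padding index):

loss_fn = nn.CrossEntropyLoss(ignore_index=0, label_smoothing=0.1)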
17.6 Decoders
After the encoder in Chaps. 15, Self-Attention and Transformers, and 16, Pretraining an Encoder: The BERT Language Model, and the encoder-decoder in this chapter, the standalone decoder is a third transformer architecture. In fact, such a decoder is an encoder with a masked input so that we can train it and run it with an autoregressive procedure, see Fig. 17.7, left.
The rationale behind it is that, if the encoder can build a semantic representation of the input, the decoder can do it as well, as it has nearly the same structure. So why not feed the encoder with the input directly and then let it predict the output in an autoregressive way? Radford et al. (2018) expressed this as the maximization of the
product of each prediction:
$$\prod_{i=1}^{N} P(x_i \mid x_1, x_2, \ldots, x_{i-1}).$$
As an example, Radford et al. (2019) proposed to train the model with sequences such as these:
answer the question, document, question, answer
translate to french, english text, french text
to answer questions and translate sentences. In both tasks, we help the decoder with a specification of what we expect it to do: answer a question or translate to French. For the rest, the data are similar to those in the BERT application in Sect. 16.8.3 or in translation in Sect. 17.4. Then, given a task like
answer the question, document, question
translate to french, english text
the encoder would generate autoregressively either the answer or the French text.
Fig. 17.7 The transformer decoder with the post layer normalization, left, and prenormalization,
right
P(output | input, task).
The condition is usually called the prompt and the output, the answer or the
completion.
X′ = X + MultiheadAttention(Norm(X)).
$$\mathrm{RMSNorm}(x_{i,1}, x_{i,2}, \ldots, x_{i,d_{model}}) = \sqrt{\frac{d_{model}}{\sum_{j=1}^{d_{model}} x_{i,j}^2}} \cdot (x_{i,1}, x_{i,2}, \ldots, x_{i,d_{model}}) = \frac{\sqrt{d_{model}}}{\|x_i\|} \cdot x_i.$$
This normalization is faster than that in Sect. 15.4, as the input is simply rescaled.
3. All the ReLU activation functions, max(0, x), are replaced with a SwiGLU function (Shazeer 2020). This function builds on the Swish function, defined as Swish_β(x) = x · σ(βx), where σ is the logistic function. ReLU sets all the negative numbers to zero and is nonlinear; SwiGLU is continuous and will not zero small negative values. Shazeer (2020) showed that transformers obtained better results on the GLUE benchmark with the latter function.
4. Touvron et al. (2023a) replaced the absolute positional encoding with rotary
positional embeddings (RoPE) (Su et al. 2024). Using a band matrix, we obtain
them by applying rotations to pairs of coordinates in the initial position encoding.
This technique enables the attention mechanism to inform two tokens of their
relative positions.
In addition to these features, encoders contain scores of technical details that, accumulated, can make a major difference. We refer to Touvron et al. (2023a), Touvron et al. (2023b), and Jiang et al. (2023) for some recent ones.
As with the BERT encoder, decoders are first pretrained on raw corpora. These corpora now have considerable sizes and represent significant portions of the textual web, including Wikipedia and CommonCrawl,10 scientific papers from arXiv,11 and programming language code from GitHub. The internet contains many duplicates
10 https://commoncrawl.org/.
11 https://arxiv.org/.
and low-quality content. Wenzek et al. (2020) describe a technique to identify them
using hashcodes and a language model. When deduplicated and cleaned, some
pretraining corpora now surpass 2 trillion tokens.
So far, we did not specify the architecture parameters of an encoder and their number. As a general principle, the encoder size must match the quantity of textual information. Current models, like Llama, have stacks of 32 to 80 encoding layers, a d_model value ranging from 4096 to 8192, 32 to 64 attention heads, and a number of model parameters going from 7 to 70 billion. Once the model is created, the pretraining procedure to fit its parameters follows that of a supervised autoregressive language model and predicts the next token from the previous ones.
Once we have a pretrained model, we must adapt it to the different tasks we want it to complete. A first possibility is to fine-tune it on a set of examples as we saw with BERT (Sect. 16.8). The fine-tuning dataset consists of pairs of prompts and their completions. Given a pair as input, the model is fine-tuned with a supervised autoregressive technique just as in the pretraining step. The model parameters are adjusted so that they predict the correct completion. As a public dataset of instruction pairs, Longpre et al. (2023) describe the Flan Collection of more than 15 million examples. This dataset is multilingual and the authors divided the tasks into 1800 different categories such as translate, summarize, or explain.
Decoder models are autoregressive. This means that, given an input, whatever it is, they start generating a sequence as in Sect. 10.8. An astonishing property is that they often output reasonable answers simply from their pretrained parameters. In addition to fine-tuning, a second technique is then to use this property, called zero-shot. This can be complemented with one example, one-shot, or a few examples, few-shot, to guide the answer even more in a process called prompt engineering. For all these settings, the instruction, the examples, and the prompt are given as input and the model generates an answer. Brown et al. (2020) give examples of translation that we reproduce in Fig. 17.8.
Why this works so well is not completely understood; it is probably due to the
mass of knowledge stored in the matrices. Nonetheless, prompt engineering is not
always trivial and may require a few iterations before the user gets what s/he wants
from the machine.
In this last chapter, we have covered the transformer and transformer decoder ar-
chitectures. These systems enabled natural language processing to make substantial
progress in machine translation, question answering, or text generation. What makes
them appealing is that they can scale in size with no or few architectural changes and
digest a massive amount of knowledge. As a result, this started a race to gigantism,
when large companies began to train transformers on colossal corpora, requiring
huge computing resources. This development is unfortunately beyond the reach of
most of us so far.
Fortunately, many authors of models now publish their parameters as well as
their training programs. This spurs an incredible activity and creativity in the field,
as anyone can reuse the models, sometimes in unexpected applications, fine-tune
the parameters, optimize their representation, and even find ways to re-create them
from scratch at a much lower cost. New models and benchmarking tasks appear
at a stunning pace and, in such a rapidly developing field, it is difficult to write a
definitive conclusion. The future is not written yet and the best we can do, to avoid
obsolescence, is to get involved and try to shape it.
References
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G.,
Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker,
P., Vasudevan, V., Warden, P., et al. (2016). Tensorflow: A system for large-scale machine
learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and
Implementation, OSDI’16, Berkeley (pp. 265–283). USENIX Association.
Abeillé, A., & Clément, L. (2003). Annotation morpho-syntaxique. Les mots simples – Les mots
composés. Corpus Le Monde. Technical report, LLF, Université Paris 7, Paris.
Abeillé, A., Clément, L., & Toussenel, F. (2003). Building a treebank for French. In A. Abeillé
(Ed.), Treebanks: Building and using parsed corpora. Text, speech and language technology
(chap. 10, vol. 20, pp. 165–187). Kluwer Academic.
Allauzen, C., Riley, M., Schalkwyk, J., Skut, W., & Mohri, M. (2007). OpenFst: A general and
efficient weighted finite-state transducer library. In J. Holub & J. Žd’árek (Eds.), Implemen-
tation and Application of Automata. 12th International Conference on Implementation and
Application of Automata, CIAA 2007, Prague, July 2007. Revised Selected Papers. Lecture
notes in computer science (vol. 4783, pp. 11–23). Springer.
Antworth, E. L. (1994). Morphological parsing with a unification-based word grammar. In North
Texas Natural Language Processing Workshop, University of Texas at Arlington.
Antworth, E. L. (1995). User’s guide to PC-KIMMO, Version 2. Summer Institute of Linguistics,
Dallas.
Apache OpenNLP Development Community. (2012). Apache OpenNLP developer documentation.
The Apache Software Foundation, 1.5.2 edition.
Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. http://arxiv.org/abs/1607.06450.
Baeza-Yates, R., & Ribeiro-Neto, B. (2011). Modern information retrieval: The concepts and
technology behind search (2nd edn.). Addison-Wesley.
Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning
to align and translate. In Y. Bengio & Y. LeCun (Eds.), 3rd International Conference on
Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track
Proceedings.
Beesley, K. R., & Karttunen, L. (2003). Finite state morphology. CSLI Publications.
Beltrami, E. (1873). Sulle funzioni bilineari. Giornale di Matematiche, 11, 98–106.
Bentivogli, L., Clark, P., Dagan, I., & Giampiccolo, D. (2009). The sixth pascal recognizing textual
entailment challenge. In Text Analysis Conference.
Bentley, J., Knuth, D., & McIlroy, D. (1986). Programming pearls. Communications of the ACM, 29(6), 471–483.
Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., & Specia, L. (2017). SemEval-2017 task 1:
Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings
of the 11th International Workshop on Semantic Evaluation (SemEval-2017) (pp. 1–14).
Association for Computational Linguistics.
Charniak, E. (1993). Statistical language learning. MIT Press.
Chen, S. F., & Goodman, J. (1998). An empirical study of smoothing techniques for language
modeling. Technical Report TR-10-98, Harvard University, Cambridge.
Chollet, F. (2021). Deep learning with Python (2nd edn.). Manning Publications.
Chrupała, G. (2006). Simple data-driven context-sensitive lemmatization. In Proceedings of
SEPLN, Zaragoza.
Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography.
Computational Linguistics, 16(1), 22–29.
Church, K. W., & Mercer, R. L. (1993). Introduction to the special issue on computational
linguistics using large corpora. Computational Linguistics, 19(1), 1–24.
Cooper, D. (1999). Corpora: Kwic concordances with Perl. CORPORA Mailing List Archive,
Concordancing thread.
Crystal, D. (1997). The Cambridge encyclopedia of language (2nd edn.). Cambridge University
Press.
d’Arc, S. J. (Ed.). (1970). Concordance de la Bible, Nouveau testament. Éditions du Cerf –
Desclées De Brouwer.
de la Briandais, R. (1959). File searching using variable length keys. In Proceedings of the Western
Joint Computer Conference (pp. 295–298). AFIPS.
de Marneffe, M.-C., Manning, C. D., Nivre, J., & Zeman, D. (2021). Universal dependencies.
Computational Linguistics, 47(2), 255–308.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing
by latent semantic analysis. Journal of the American Society for Information Science, 41(6),
391–407.
Dermatas, E., & Kokkinakis, G. K. (1995). Automatic stochastic tagging of natural language texts.
Computational Linguistics, 21(2), 137–163.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep
bidirectional transformers for language understanding. In Proceedings of the 2019 Conference
of the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171–4186). Association for
Computational Linguistics.
Dolan, W. B., & Brockett, C. (2005). Automatically constructing a corpus of sentential paraphrases.
In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).
Dozat, T. (2016). Incorporating Nesterov Momentum into Adam. In Proceedings of the 4th
International Conference on Learning Representations (pp. 1–4)
Ducrot, O., & Schaeffer, J.-M. (Eds.). (1995). Nouveau dictionnaire encyclopédique des sciences
du langage. Éditions du Seuil.
Dumais, S. T. (1991). Improving the retrieval of information from external sources. Behavior
Research Methods, Instruments, & Computers, 23(2), 229–236.
Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational
Linguistics, 19(1), 61–74.
Déjean, H. (1998). Morphemes as necessary concept for structures discovery from untagged
corpora. In Proceedings of the Joint Conference on New Methods in Language Processing and
Computational Natural Language Learning (pp. 295–298). Macquarie University.
Eckart, C., & Young, G. (1936). The approximation of one matrix by another of lower rank.
Psychometrika, 1(3), 211–218.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211.
Estoup, J.-B. (1912). Gammes sténographiques: Recueil de textes choisis pour l’acquisition
méthodique de la vitesse (3e edn.). Institut sténographique.
Fano, R. M. (1961). Transmission of information: A statistical theory of communications. MIT
Press.
Ferrucci, D. A. (2012). Introduction to “This is Watson”. IBM Journal of Research and Develop-
ment, 56(3.4), 1:1 –1:15
Francis, W. N., & Kucera, H. (1982). Frequency analysis of English usage. Houghton Mifflin.
Franz, A., & Brants, T. (2006). All our n-gram are belong to you. Retrieved November 7, 2013,
from https://googleresearch.blogspot.se/2006/08/all-our-n-gram-are-belong-to-you.html
Fredholm, I. (1903). Sur une classe d’équations fonctionnelles. Acta Mathematica, 27, 365–390.
Friedl, J. E. F. (2006). Mastering regular expressions (3rd edn.). O’Reilly.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of
Statistics, 29, 1189–1232.
Gage, P. (1994). A new algorithm for data compression. The C User Journal, 12(2), 23–38.
Gale, W. A., & Church, K. W. (1993). A program for aligning sentences in bilingual corpora.
Computational Linguistics, 19(1), 75–102.
Galton, F. (1886). Regression towards mediocrity in hereditary stature. Journal of the Anthropo-
logical Institute, 15, 246–263.
Good, I. J. (1953). The population frequencies of species and the estimation of population
parameters. Biometrika, 40(16), 237–264.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. The MIT Press.
Google (2019). WordPieceTokenizer in BERT. Retrieved February 05, 2023, from https://github.
com/google-research/bert/blob/master/tokenization.py#L300-L359
Goyvaerts, J., & Levithan, S. (2012). Regular expressions cookbook (2nd edn.). O’Reilly Media.
Grefenstette, G., & Tapanainen, P. (1994). What is a word, what is a sentence? Problems of
tokenization. MLTT Technical Report 4, Xerox.
Grus, J. (2019). Data science from scratch: First principles with Python (2nd edn.). O’Reilly
Media.
Guilbaud, A. (2017). L’ENCCRE, édition numérique collaborative et critique de l’Encyclopédie.
In Recherches sur Diderot et sur l’Encyclopédie (pp. 5–22)
Guillaume, B., de Marneffe, M.-C., & Perrier, G. (2019). Conversion et améliorations de corpus
du français annotés en universal dependencies [conversion and improvement of universal
dependencies french corpora]. Traitement Automatique des Langues, 60(2), 71–95.
Guo, P. (2014). Python is now the most popular introductory teaching language at top
U.S. universities. Retrieved November 23, 2015, from https://cacm.acm.org/blogs/blog-
cacm/176450-python-is-now-the-most-popular-introductory-teaching-language-at-top-us-
universities/fulltext
Guthrie, R. (2023). Bi-LSTM conditional random field discussion. https://pytorch.org/tutorials/
beginner/nlp/advanced_tutorial.html.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA
data mining software: An update. SIGKDD Explorations, 11(1), 10–18.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining,
inference, and prediction (2nd edn.). Springer.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In 2016
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770–778). IEEE
Computer Society.
Heafield, K. (2011). KenLM: Faster and smaller language model queries. In Proceedings of the
Sixth Workshop on Statistical Machine Translation (pp. 187–197). Association for Computa-
tional Linguistics.
Hinton, G. (2012). RMSProp: Divide the gradient by a running average of its recent magnitude.
Coursera: Neural Networks for Machine Learning, 4, 26–31.
Ho, T. K. (1995). Random decision forests. In Proceedings of 3rd International Conference on
Document Analysis and Recognition (vol. 1, pp. 278–282). IEEE.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8),
1735–1780.
Hoerl, A. E. (1962). Application of ridge analysis to regression problems. Chemical Engineering
Progress, 58(3), 54–59.
Hopcroft, J. E., Motwani, R., & Ullman, J. D. (2007). Introduction to automata theory, languages,
and computation (3rd edn.). Addison-Wesley.
Hornby, A. S. (Ed.). (1974). Oxford advanced learner’s dictionary of current english (3rd edn.).
Oxford University Press.
Huang, J., Gao, J., Miao, J., Li, X., Wang, K., & Behr, F. (2010). Exploring web scale language
models for search query processing. In Proceedings of the 19th International World Wide Web
Conference (pp. 451–460). Raleigh.
Ide, N., & Véronis, J. (1995). Text encoding initiative: Background and context. Kluwer Academic.
Imbs, P., & Quemada, B. (Eds.), (1971–1994). Trésor de la langue française. Dictionnaire de la
langue française du XIXe et du XXe siècle (1789–1960) (16 volumes). Éditions du CNRS puis
Gallimard.
ISO/IEC. (2016). Information technology – Document description and processing languages –
Office open XML file formats (volume ISO/IEC 29500-1:2016(E)). ISO/IEC.
Iyer, S., Dandekar, N., & Csernai, K. (2017). First Quora dataset release: Question pairs. https://
quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An introduction to statistical learning:
With applications in R (2nd edn.). Springer.
Jelinek, F. (1990). Self-organized language modeling for speech recognition. In A. Waibel & K.-
F. Lee (Eds.), Readings in speech recognition. Morgan Kaufmann. Reprinted from an IBM
Report, 1985.
Jelinek, F. (1997). Statistical methods for speech recognition. MIT Press.
Jelinek, F., & Mercer, R. L. (1980). Interpolated estimation of Markov source parameters from
sparse data. In E. S. Gelsema & L. N. Kanal (Eds.), Pattern recognition in practice (pp. 38–
397). North-Holland.
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand,
F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L.,
Lavril, T., Wang, T., Lacroix, T., & Sayed, W. E. (2023). Mistral 7b. https://doi.org/10.48550/
arXiv.2310.06825.
Johnson, J., Douze, M., & Jégou, H. (2019). Billion-scale similarity search with GPUs. IEEE
Transactions on Big Data, 7(3), 535–547.
Jordan, C. (1874). Mémoire sur les formes bilinéaires. Journal de Mathématiques Pures et
Appliquées, Deuxième Série, 19, 35–54.
Jurafsky, D., & Martin, J. H. (2008). Speech and language processing, an introduction to natural
language processing, computational linguistics, and speech recognition (2nd edn.). Pearson
Education.
Kaeding, F. W. (1897). Häufigkeitswörterbuch der deutschen Sprache. Selbstverlag des Herausge-
bers.
Kaplan, R. M., & Kay, M. (1994). Regular models of phonological rule systems. Computational
Linguistics, 20(3), 331–378.
Karpathy, A. (2022). mingpt. https://github.com/karpathy/minGPT/blob/master/mingpt/bpe.py
Karttunen, L. (1983). KIMMO: A general morphological processor. Texas Linguistic Forum, 22,
163–186.
Karttunen, L., Kaplan, R. M., & Zaenen, A. (1992). Two-level morphology with composition. In
Proceedings of the 15th Conference on Computational Linguistics, COLING-92 (vol. 1, pp.
141–148). Nantes.
Katz, S. M. (1987). Estimation of probabilities from sparse data for the language model component
of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3),
400–401.
Kernighan, M. D., Church, K. W., & Gale, W. A. (1990). A spelling correction program based
on a noisy channel model. In Papers Presented to the 13th International Conference on
Computational Linguistics (COLING-90), Helsinki (vol. II, pp. 205–210).
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980.
Kiraz, G. A. (2001). Computational nonlinear morphology: With emphasis on semitic languages.
Studies in natural language processing. Cambridge University Press.
Kiss, T., & Strunk, J. (2006). Unsupervised multilingual sentence boundary detection. Computa-
tional Linguistics, 32(4), 485–525.
Klang, M. (2023). Hashing with MD5. Personal communication.
Kleene, S. C. (1956). Representation of events in nerve nets and finite automata. In C. E. Shannon
& J. McCarthy (Eds.), Automata studies (pp. 3–42). Princeton University Press.
Knuth, D. E. (1986). The TeXbook. Addison-Wesley.
Kong, Q., Siauw, T., & Bayen, A. (2020). Python programming and numerical methods: A guide
for engineers and scientists. Academic Press.
Kornai, A. (Ed.). (1999). Extended finite state models of language. Studies in natural language
processing. Cambridge University Press.
Koskenniemi, K. (1983). Two-level morphology: A general computation model for word-form
recognition and production. Technical Report 11, University of Helsinki, Department of
General Linguistics.
Kudo, T. (2017). SentencePiece. Retrieved on June 02, 2023, from https://github.com/google/
sentencepiece
Kudo, T. (2018). Subword regularization: Improving neural network translation models with
multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers) (pp. 66–75). Association for Compu-
tational Linguistics.
Kudo, T., & Richardson, J. (2018). SentencePiece: A simple and language independent subword
tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference
on Empirical Methods in Natural Language Processing: System Demonstrations (pp. 66–71).
Association for Computational Linguistics.
Kudoh, T., & Matsumoto, Y. (2000). Use of support vector learning for chunk identification. In
Proceedings of CoNLL-2000 and LLL-2000, Lisbon (pp. 142–144)
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical
Statistics, 22(1), 79–86.
Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models
for segmenting and labeling sequence data. In Proceedings of the Eighteenth International
Conference on Machine Learning (ICML-01) (pp. 282–289). Morgan Kaufmann Publishers.
Lallot, J. (Ed.). (1998). La grammaire de Denys le Thrace (2e edn.). CNRS Éditions, Collection
Science du langage. Text in Greek, translated in French by Jean Lallot.
Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural
architectures for named entity recognition. In K. Knight, A. Nenkova, & O. Rambow (Eds.),
Proceedings of the 2016 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies (pp. 260–270). Association for
Computational Linguistics.
Laplace, P. (1820). Théorie analytique des probabilités (3rd edn.). Coursier.
Le Cun, Y. (1987). Modèle connexionniste de l’apprentissage. Ph.D. Thesis, Université Paris 6.
Legendre, A.-M. (1805). Nouvelles méthodes pour la détermination des orbites des comètes. Firmin Didot.
Lewis, D. D., Yang, Y., Rose, T. G., & Li, F. (2004). RCV1: A new benchmark collection for text
categorization research. Journal of Machine Learning Research, 5, 361–397.
Lin, T., Wang, Y., Liu, X., & Qiu, X. (2022). A survey of transformers. AI Open, 3, 111–132.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., &
Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. http://arxiv.org/
abs/1907.11692.
Longpre, S., Hou, L., Vu, T., Webson, A., Chung, H. W., Tay, Y., Zhou, D., Le, Q. V., Zoph, B.,
Wei, J., et al. (2023). The flan collection: Designing data and methods for effective instruction
tuning. arXiv:2301.13688.
Lutz, M. (2013). Learning Python (5th edn.). O’Reilly Media.
Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011). Learning word
vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association
for Computational Linguistics: Human Language Technologies (pp. 142–150). Association for
Computational Linguistics.
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval.
Cambridge University Press.
Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing.
MIT Press.
Marcus, M., Marcinkiewicz, M. A., & Santorini, B. (1993). Building a large annotated corpus of
English: The Penn Treebank. Computational Linguistics, 19(2), 313–330.
Matthes, E. (2019). Python crash course: A hands-on, project-based introduction to programming
(2nd edn.). No Starch Press.
Mauldin, M. L., & Leavitt, J. R. R. (1994). Web-agent related research at the Center for
Machine Translation. In Proceedings of the ACM SIG on Networked Information Discovery
and Retrieval. McLean.
McKinney, W. (2010). Data structures for statistical computing in Python. In S. van der Walt & J.
Millman (Eds.), Proceedings of the 9th Python in Science Conference (pp. 56–61)
McKinney, W. (2022). Python for data analysis: Data wrangling with pandas, NumPy, and Jupyter
(3rd edn.). O’Reilly Media.
McMahon, J. G., & Smith, F. J. (1996). Improving statistical language models performance with
automatically generated word hierarchies. Computational Linguistics, 22(2), 217–247.
Merialdo, B. (1994). Tagging English text with a probabilistic model. Computational Linguistics,
20(2), 155–171.
Mikheev, A. (2002). Periods, capitalized words, etc. Computational Linguistics, 28(3), 289–318.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013a). Efficient estimation of word repre-
sentations in vector space. In First International Conference on Learning Representations
(ICLR2013), Workshop Proceedings.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013b). Distributed representations
of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z.
Ghahramani, & K. Q. Weinberger (Eds.), Advances in neural information processing systems
(vol. 26, pp. 3111–3119). Curran Associates.
Mohri, M., Pereira, F. C. N., & Riley, M. (1998). A rational design for a weighted finite-state
transducer library. In D. Wood, & S. Yu (Eds.), Automata Implementation. Second International
Workshop on Implementing Automata, WIA ’97, London, Ontario, September 1997. Revised
Papers. Lecture notes in computer science (vol. 1436, pp. 144–158). Springer.
Mohri, M., Pereira, F. C. N., & Riley, M. (2000). The design principles of a weighted finite-state
transducer library. Theoretical Computer Science, 231(1), 17–32.
Monachini, M., & Calzolari, N. (1996). Synopsis and comparison of morphosyntactic phenomena
encoded in lexicons and corpora: A common proposal and applications to European languages.
Technical Report, Istituto di Linguistica Computazionale del CNR, Pisa. EAGLES Document
EAG–CLWG–MORPHSYN/R.
Moore, E. H. (1920). On the reciprocal of the general algebraic matrix, abstract. Bulletin of the
American Mathematical Society, 26, 394–395.
Murphy, K. P. (2022). Probabilistic machine learning: An introduction. MIT Press.
Ney, H., Essen, U., & Kneser, R. (1994). On structuring probabilistic dependences in stochastic
language modelling. Computer Speech and Language, 8(1), 1–38.
Nivre, J., Agić, Ž., Ahrenberg, L., Antonsen, L., Aranzabe, M. J., Asahara, M., Ateyah, L.,
Attia, M., Atutxa, A., Augustinus, L., Badmaeva, E., Ballesteros, M., Banerjee, E., Bank,
S., Barbu Mititelu, V., Bauer, J., Bengoetxea, K., Bhat, R. A., Bick, et al. (2017). Universal
dependencies 2.1. LINDAT/CLARIN digital library at the Institute of Formal and Applied
Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
Norvig, P. (2007). How to write a spelling corrector. Retrieved November 30, 2015, from https://
norvig.com/spell-correct.html
Norvig, P. (2009). Natural language corpus data. In Beautiful data: The stories behind elegant data
solutions (pp. 219–242). O’Reilly Media.
Oliphant, T. E. (2015). Guide to NumPy (2nd edn.). CreateSpace Independent Publishing Platform.
Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. In E. Brill & K.
Church (Eds.), Proceedings of the Conference on Empirical Methods in Natural Language
Processing, Philadelphia (pp. 133–142).
Ray, E. T. (2003). Learning XML (2nd edn.). O’Reilly.
Reimers, N., & Gurevych, I. (2019). Sentence-BERT: sentence embeddings using siamese BERT-
networks. In K. Inui, J. Jiang, V. Ng, & X. Wan (Eds.), Proceedings of the 2019 Conference
on Empirical Methods in Natural Language Processing and the 9th International Joint
Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 3982–3992). Association
for Computational Linguistics, Hong Kong.
Ren, X., Zhou, P., Meng, X., Huang, X., Wang, Y., Wang, W., Li, P., Zhang, X., Podolskiy, A.,
Arshinov, G., Bout, A., Piontkovskaya, I., Wei, J., Jiang, X., Su, T., Liu, Q., & Yao, J. (2023).
Pangu-Σ: Towards trillion parameter language model with sparse heterogeneous computing.
https://doi.org/10.48550/arXiv.2303.10845.
Reynar, J. C. (1998). Topic Segmentation: Algorithms and Applications. Ph.D. Thesis, University
of Pennsylvania, Philadelphia.
Reynar, J. C., & Ratnaparkhi, A. (1997). A maximum entropy approach to identifying sentence
boundaries. In Proceedings of the Fifth Conference on Applied Natural Language Processing,
Washington (pp. 16–19).
Ritchie, G. D., Russell, G. J., Black, A. W., & Pulman, S. G. (1992). Computational morphology.
Practical mechanisms for the english lexicon. MIT Press.
Roche, E., & Schabes, Y. (1995). Deterministic part-of-speech tagging with finite-state transducers.
Computational Linguistics, 21(2), 227–253.
Roche, E., & Schabes, Y. (Eds.). (1997). Finite-state language processing. MIT Press.
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organi-
zation in the brain. Psychological Review, 65(6), 386–408.
Rouse, R. H., & Rouse, M. A. (1974). The verbal concordance to the scriptures. Archivum Fratrum
Praedicatorum, 44, 5–30.
Ruder, S. (2017). An overview of gradient descent optimization algorithms. http://arxiv.org/abs/
1609.04747.
Rush, A. (2018). The annotated transformer. In E. L. Park, M. Hagiwara, D. Milajevs, & L.
Tan (Eds.), Proceedings of Workshop for NLP Open Source Software (NLP-OSS) (pp. 52–60).
Association for Computational Linguistics.
Salton, G. (1988). Automatic text processing: The transformation, analysis, and retrieval of
information by computer. Addison-Wesley.
Salton, G., & Buckley, C. (1987). Term weighting approaches in automatic text retrieval. Technical
Report TR87-881, Department of Computer Science, Cornell University, Ithaca.
Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT:
Smaller, faster, cheaper and lighter. In 5th Workshop on Energy Efficient Machine Learning
and Cognitive Computing @ NeurIPS 2019.
Saporta, G. (2011). Probabilités, analyse des données et statistiques (3rd edn.). Éditions Technip.
Schuster, M., & Nakajima, K. (2012). Japanese and Korean voice search. In 2012 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto (pp.
5149–5152)
Sennrich, R., Haddow, B., & Birch, A. (2016). Neural machine translation of rare words with
subword units. In Proceedings of the 54th Annual Meeting of the Association for Computa-
tional Linguistics (Volume 1: Long Papers) (pp. 1715–1725). Association for Computational
Linguistics.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal,
27, 379–423, 623–656.
Shao, Y., Hardmeier, C., & Nivre, J. (2018). Universal word segmentation: Implementation and
interpretation. Transactions of the Association for Computational Linguistics, 6, 421–435.
Shazeer, N. (2020). Glu variants improve transformer. https://arxiv.org/abs/2002.05202.
Silveira, N., Dozat, T., de Marneffe, M.-C., Bowman, S., Connor, M., Bauer, J., & Manning,
C. D. (2014). A gold standard dependency corpus for English. In Proceedings of the Ninth
International Conference on Language Resources and Evaluation (LREC-2014).
Simone, R. (2007). Fondamenti di linguistica (10th edn.). Laterza.
Sinclair, J. (Ed.). (1987). Collins COBUILD english language dictionary. Collins.
Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A., & Potts, C. (2013). Recursive
deep models for semantic compositionality over a sentiment treebank. In Proceedings of the
2013 Conference on Empirical Methods in Natural Language Processing (pp. 1631–1642).
Association for Computational Linguistics.
Spärck Jones, K. (1972). A statistical interpretation of term specificity and its application in
retrieval. Journal of Documentation, 28(1), 11–21.
Sproat, R. (1992). Morphology and computation. MIT Press.
Stevens, E., Antiga, L., & Viehmann, T. (2020). Deep learning with PyTorch. Manning Publica-
tions.
Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and policy considerations for deep
learning in NLP. In A. Korhonen, D. Traum, & L. Màrquez (Eds.), Proceedings of the 57th
Annual Meeting of the Association for Computational Linguistics (pp. 3645–3650). Association
for Computational Linguistics.
Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2024). Roformer: Enhanced transformer
with rotary position embedding. https://doi.org/10.1016/j.neucom.2023.127063.
Suits, D. B. (1957). Use of dummy variables in regression equations. Journal of the American
Statistical Association, 52(280), 548–551.
Sutton, C., & McCallum, A. (2011). An introduction to conditional random fields. Foundations
and Trends in Machine Learning, 4(4), 267–373.
Taylor, W. L. (1953). “cloze procedure”: A new tool for measuring readability. Journalism
Quarterly, 30, 415–433.
TEI Consortium, e. (2023). TEI P5: Guidelines for electronic text encoding and interchange.
Retrieved September 29, 2023, from https://www.tei-c.org/Guidelines/P5/.
The Unicode Consortium. (2012). The unicode standard, version 6.1 – Core specification. Unicode
Consortium, Mountain View.
The Unicode Consortium. (2022). The unicode standard, version 15.0 – Core specification.
Unicode Consortium, Mountain View.
Thompson, K. (1968). Regular expression search algorithm. Communications of the ACM, 11(6),
419–422.
Tjong Kim Sang, E. F. (2002). Introduction to the CoNLL-2002 shared task: Language-
independent named entity recognition. In Proceedings of CoNLL-2002, Taipei (pp. 155–158).
Tjong Kim Sang, E. F., & Buchholz, S. (2000). Introduction to the CoNLL-2000 shared task:
Chunking. In Proceedings of CoNLL-2000 and LLL-2000, Lisbon (pp. 127–132).
Tjong Kim Sang, E. F., & De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task:
Language-independent named entity recognition. In Proceedings of CoNLL-2003, Edmonton
(pp. 142–147).
Tjong Kim Sang, E. F., & Veenstra, J. (1999). Representing text chunks. In Ninth Conference of
the European Chapter of the ACL, Bergen (pp. 173–179).
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal,
N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., & Lample, G. (2023a). Llama:
Open and efficient foundation language models. arxiv:2302.13971.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S.,
Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu,
D., Fernandes, J., Fu, J., Fu, et al. (2023b). Llama 2: Open foundation and fine-tuned chat
models. https://doi.org/10.48550/arXiv.2307.09288.
Tunstall, L., Von Werra, L., & Wolf, T. (2022). Natural language processing with transformers,
Revised Edition. O’Reilly Media.
van Noord, G., & Gerdemann, D. (2001). An extendible regular expression compiler for finite-
state approaches in natural language processing. In O. Boldt & H. Jürgensen (Eds.), Automata
Index
W
Wang, A., 3, 460, 461
Wang, K., 270
Wang, W., viii
Warstadt, A., 461
Web scraping, 106
Weight vector, 176
Weka, 160
Wenzek, G., 498
Whistler, K., 97, 98, 109
Wikipedia, 8
Williams, A., 461
Witten, I. H., 160
Wolf, T., 447, 465
Word embeddings, 286, 294
Word preference measurements, 276
word2vec, 304, 322
X
XML attribute, 104
XML element, 104
XML entity, 105
XML-TEI, 338
Y
Young, G., 287
Yu, H.F., 249
Z
Zampolli, A., 83
Zaragoza, H., 248
Zhang, B., 497
Zip, 33