PluColl - The UNIPEN/NICI/HP data collection of
Summer/Autumn 1994 *
Lambert Schomaker



August 1994

1

Introduction

The necessity for cataloguing Western handwriting styles becomes more and more apparent
as on-line handwriting recognition algorithms currently reach an asymptote in their performance, and a limited generalization from laboratory training set to real life conditions is
observed. Although the algorithms as such still need to be refined, and an optimal approach
has not as yet been identified, performance improvement is most likely to result from the
availability of much larger training sets of on-line handwriting data than is current practice.
Indeed, in the comparable fields of speech recognition and optical recognition of handwriting the situation is different. The speech recognition area already has a large, commonly
accepted test bed for evaluating recognizers, like the TIMIT database. In the optical recognition of handwriting, the main international post companies all have a huge base of scanned
texts from actual mail envelopes, and the continuous flow of data is regularly sampled to
retrain recognizers in order to capture trends in the change of styles. Consequently, the research
areas of off-line optical recognition and, especially, speech recognition are in a more
advanced technological state than on-line handwriting recognition. In the
HP/NICI collaboration project, the problem of handwriting style has been analyzed as
consisting of two components:
1. Between-writer Style Variation
2. Within-writer Variability

* Supported by Hewlett-Packard, Bristol. The name PluColl refers to Pluto collection, a tongue-in-cheek
reference to the mars.dll software component of Microsoft PenWindows.
 NICI, Nijmegen Institute for Cognition and Information, University of Nijmegen, P.O.Box 9104, 6500
HE Nijmegen, The Netherlands, Tel: +31 80 616029 / Fax: +31 80 616066, E-mail: schomaker@nici.kun.nl


Figure 1: Style variation between writers. Different samples of the word optimum for 32
different writers. The plot has been produced with the PRINT button in the program Upview
V1.03, which generates a PostScript file. The words from several files were combined into a
single Unipen file with the program Upread.

1.1

Between-writer variation

Ad 1. In Western culture, a huge variation in writing styles exists. Between different European countries there are clear style differences. Even within a country, there are style
variations (Figure 1) caused, e.g., by differences in writing methods at primary school. As a
consequence, there may also be clear differences between writers from different school generations. Apart from work in forensic handwriting analysis (e.g., the German B.K.A. system
FISH), there exists no catalogue of Western handwriting styles and little is known about
algorithms to calculate quantitative measures which can be utilized in on-line recognition
systems.

1.2

Within-writer variability

Ad 2. In addition to differences between writers, however, there is also the phenomenon of
variability of handwriting within an individual writer. Four types of variability exist:

(a) geometrical variability without change in the "topological" characteristics of characters;
(b) omission of strokes (fusion) due to fast or careless writing;
(c) insertion of strokes or ligatures, in elaborate writing or in the case of hesitations or
spurious pen movement;
(d) letter shape (allograph) variability due to stylistic choice.
The first type of variability (a) comes from the neural noise in the human motor output
system, and leads to geometrical variability in the form of slant and roundness deviations
per stroke, while essentially preserving the "topology" of the characters (Figure 2).

Figure 2: Within-writer variation: the case of limited human-motor noise. Several samples
of the word /algebra/. Rows represent eight different writers; the four columns represent different replications of the word, written at different points in time. Words written in columns
1 and 2 (and in columns 3 and 4) are separated by at most 2 hours. The two leftmost columns (1
and 2) are separated by at least two weeks from the two rightmost columns. In row
1 (cursive), the loop in the /g/ is missing in one replication, whereas the other three replications of /g/ are
looped. In row 2 (mixed cursive), the pen is lifted at different points in different replications.
One closed and three extremely open variants of /a/ are produced. In row 8 (mixed cursive),
two allographs of the /r/ are used.


The second type of variability (b), stroke fusion, can theoretically be explained as follows.
Let us assume that we can make a distinction between a central pattern generator and a
pipeline of transforming filters, initially being neural, but the final filter being composed of
the biomechanical effector system. The filtering properties of the output channel as a whole
are essentially of a low-pass nature. The observed bandwidth of handwriting is about 10
Hz (Teulings & Maarse, 1984). According to the minimum-jerk theory (Flash & Hogan,
1985), the movement trajectory is generated on the basis of the constraint that so-called "via
points" are reached (in our case, topologically important points in a single character), and
that the rms value of the first derivative of acceleration (the jerk) is minimized. The pattern generator
plans the sequence of x,y via points. Under conditions of reduced mental concentration or
high speed requirements, the central pattern generator (partially) omits some via points in its
output, leading to fused strokes and less prominent character details (Figure 3).
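
The effect of an omitted via point can be illustrated with a small numerical sketch. It is a simplification and not the via-point model of Flash & Hogan (1985): each segment between two consecutive via points is treated here as an independent rest-to-rest minimum-jerk movement, for which the normalized displacement follows the fifth-order polynomial 10τ³ − 15τ⁴ + 6τ⁵. The via-point coordinates and all function names are hypothetical and chosen only for illustration.

```python
import math

def minimum_jerk(p0, p1, n=50):
    """Rest-to-rest minimum-jerk path from p0 to p1, sampled at n time steps."""
    points = []
    for i in range(n):
        tau = i / (n - 1)                               # normalized time, 0..1
        s = 10 * tau**3 - 15 * tau**4 + 6 * tau**5      # normalized displacement
        points.append((p0[0] + s * (p1[0] - p0[0]),
                       p0[1] + s * (p1[1] - p0[1])))
    return points

def trajectory(via_points, n=50):
    """Chain rest-to-rest segments through the via points (a simplification)."""
    path = []
    for start, end in zip(via_points, via_points[1:]):
        path.extend(minimum_jerk(start, end, n))
    return path

def path_length(path):
    """Total ink length of a sampled path."""
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))

# Hypothetical via points of a zigzag-like letter fragment.
full_via = [(0, 0), (1, 2), (2, 0), (3, 2)]
# The same fragment with the third via point omitted by the pattern generator.
fused_via = [full_via[0], full_via[1], full_via[3]]

full_length = path_length(trajectory(full_via))
fused_length = path_length(trajectory(fused_via))
print(f"full: {full_length:.3f}  fused: {fused_length:.3f}")
```

In this toy case the fused trajectory skips the downstroke entirely, so the valley between the two peaks disappears and the ink path becomes shorter, which is the kind of lost character detail described above.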


Figure 3: Fusion of strokes, differences within a single writer. (3a) The words /borax/ and
/bouquet/ show that the /or/ transition leads to a fusion of the last stroke of the /o/ into
the connection stroke with the /r/ in /borax/, whereas the /o/ in /bouquet/ is neat and
complete. A similar phenomenon occurs in the /ax/ transition in /borax/.
(3b) The word /fjord/ shows a similar stroke fusion in /or/.

The third (c) form of within-writer variability is caused either by high-level processes similar to those in (b), this time, however, inserting strokes at will, or, alternatively, by interruption
of the central patterning process. The latter can be self-induced, when the writer thinks
about the formulation of the text to come. This phenomenon is called "phonemic-graphemic
interference". The phonemes of words-to-come are activated subliminally (i.e., without giving rise to speech musculature activation), but with sufficient levels of activation to produce
a premature spelling process activation. The resulting allograph "breaks into the current
motor output buffer". Other causes of erroneously inserted strokes are external events, such
as loud noises, doors opening, phones ringing, etc., after which the writing process resumes.
The fourth (d) form of within-writer variability originates at a higher, cognitive level in
the human writing system and has to do with the choice of letter shapes (allographs). For
example, it is often observed in user-trainable systems that writers enter different shapes in
the training stage compared to the letter and word shapes entered in the actual use of an
application. Within a single writer, there may even be a seemingly random choice of styles
as different as isolated handprint and connected cursive.
Both components of variability in handwriting, Between-Writer Style Variation and
Within-Writer Shape Variability, can only be handled effectively by on-line recognition algorithms if more is known about their statistics: Which variables are essential, what are
their distributions, and can we identify clusters of generic writing styles?
In order to approach this problem in the area of on-line recognition of handwriting,
the HP/NICI collaboration project team has designed a data collection setup fulfilling the
following purposes.

2

Criteria for an on-line handwriting data set suitable
for addressing the variability problem

The data set to be collected:
1. must capture style variation among writers,
2. must capture style variability within a writer, as measured at occasions sufficiently
spaced apart in time,
3. must be large enough to allow for a number of large-scale training/testing experiments,
4. must be compatible with the UNIPEN project, so that data from other institutions
may also be used in such massive training and testing,
5. must be of high quality as regards the signal properties, since deteriorated signal
conditions can easily be imposed post hoc.

2.1

Additional constraints: input unit scope

The data collection is WORD-oriented, since recognizers at both HP and NICI are based on
isolated word recognition. Also, this is the input chunk size currently handled by most free
style or connected cursive recognition systems. The letter level is only suited for isolated
handprint and digit data. The sentence level and higher (paragraphs, pages) impose
additional word segmentation problems which are difficult to handle at the moment. It is
not completely possible to compute word segmentation on the basis of bottom-up features
like white space or ink clustering: Often lexical or even syntactical top-down information
would be necessary to disambiguate here. In many applications, however, word-based
input is already useful, especially if recognition is fast enough not to disturb the
human word production process ("train of thought") (Nakagawa et al., 1993). The WORDS
will consist of lower case characters.


2.2

Additional constraints: word lexicon

The composition of a word list in handwriting collection setups is usually a subject of hot debate
due to the large number of possible criteria for inclusion (size, word length, character content,
digram content, trigram content, linguistic frequency of usage, etc.). In the collection setup,
two basic constraints were chosen, sacrificing some other criteria:
2.2.1

Bilinguality

The list must be bilingual in the sense that the same list can be written by Dutch and
English writers. This allows for the incremental collection of words in both Nijmegen and
Bristol. It will ensure that the Dutch writers will not feel uneasy writing a foreign language.
2.2.2

Maximized digram coverage

In connected-cursive and mixed-cursive handwriting, the current character shape is determined by both predecessor and successor. The connecting strokes come from a previous
character, retaining effects from the starting position and the angular velocity (clockwise,
sharp, counter-clockwise), and may exert an effect on the first strokes of the current character itself. Similarly, the anticipation of the next character may lead to distortions of the final
stroke(s) of the current character. To obtain a reliable overview of character production
strategies, as many digrams as possible from the 26x26 transition matrix must be present in the word
list. Actually, there are 27 symbols, including the space symbol (identifying Begin-Of-Word
and End-Of-Word conditions).
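
As an illustration of the 27-symbol digram notion, the following minimal sketch extracts the digrams of a word, including the Begin-Of-Word and End-Of-Word transitions, and counts them over a word list. The boundary symbol "#" follows the legend of the digram table in Appendix A; the function names are hypothetical and the three example words are taken from the word list.

```python
from collections import Counter

BOUNDARY = "#"  # the blank that marks Begin-Of-Word and End-Of-Word

def digrams(word):
    """Return the ordered digrams of a word, including both word boundaries."""
    padded = BOUNDARY + word.lower() + BOUNDARY
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def digram_counts(words):
    """Count how often each digram occurs in a list of words."""
    counts = Counter()
    for word in words:
        counts.update(digrams(word))
    return counts

print(digrams("alp"))  # ['#a', 'al', 'lp', 'p#']

counts = digram_counts(["alp", "album", "algebra"])
print(counts["al"], counts["a#"])  # 3 1
```

A word of n letters thus contributes n + 1 digrams, two of which involve the boundary symbol.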
In order to build a word list that fulfills the aforementioned criteria, the following approach was taken.
2.2.3

Steps in determining the word list

- Word List 1: 50k Dutch words.
- Word List 2: 50k English words.
- These two word lists were run through Unix comm, yielding a list with 3251 words
common to both languages.
- As the resulting list was too large for the data collection process, it was condensed with a
dedicated program in C which created a subset of words with the criterion of maximum
digram coverage. This means that all (27x27) digrams present in the input list will be
present in the output list. The program is based on stochastic optimization, iteratively
picking a word from the input list with a low probability, and only adding it to the
output list if it contains new, unseen digrams. This was done several times, choosing a
final list which was acceptable (decency, not too difficult to spell, etc.). The resulting
word list contained 210 words. Due to the selection algorithm, the words are slightly
longer than average English words.
- A number of words were manually added because of their interesting (but low-frequency)
digrams. An example is the /x-y/ digram in "xylophone". For this word, the English
spelling was used, which is more acceptable to Dutch writers than "xylofoon" would
be for English writers. The final list consists of 210 words (Appendix A).
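
The steps above can be sketched as follows. This is not the original C program, whose details are not given here; it is a minimal Python illustration under stated assumptions: the intersection is computed with a set operation (which yields the lines common to both lists, as comm does when its first two output columns are suppressed), the pick probability of 0.05 is an arbitrary stand-in for "a low probability", and passes over the input are repeated until every digram of the input list is covered. All names and the tiny word lists are hypothetical.

```python
import random

def digrams(word, boundary="#"):
    """Set of digrams of a word, including the word-boundary transitions."""
    padded = boundary + word + boundary
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def common_words(list_a, list_b):
    """Words that occur in both lists, sorted alphabetically."""
    return sorted(set(list_a) & set(list_b))

def select_for_coverage(words, pick_probability=0.05, seed=0):
    """Stochastically pick words until all digrams of the input are covered.

    A word is considered only with a low probability on each pass, and it is
    added only if it contributes at least one digram not yet covered.
    """
    rng = random.Random(seed)
    target = set().union(*(digrams(w) for w in words))
    covered, selected = set(), []
    while covered != target:
        for word in words:
            if rng.random() >= pick_probability:
                continue                      # word not picked on this pass
            new = digrams(word) - covered
            if new:                           # keep only words adding coverage
                selected.append(word)
                covered |= new
    return selected

# Toy stand-ins for the two 50k-word lists.
dutch = ["album", "alp", "boom", "algebra", "fjord", "yoga"]
english = ["album", "alp", "tree", "algebra", "fjord", "yoga"]

shared = common_words(dutch, english)
subset = select_for_coverage(shared)
print(shared)
print(subset)
```

Because the result depends on the random picks, running the selection with several seeds and choosing among the outcomes corresponds to the step of repeating the procedure and keeping an acceptable list.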
The word list contains many international concepts (e.g., "algebra"), geographical names,
technical terms, words of Latin origin, words of French origin, as well as words which happen to
be spelled the same in both languages, but may have a different meaning ("trekking").
After the writing sessions, the subjects were asked which (unmentioned) language they
thought the word list came from, and they were also asked to mark words which they thought were
difficult to write. The list appears to be of medium difficulty, and there were no specific
complaints by the subjects.

3

Recording Setup

Since a representative "real-life" application does not yet exist, it was decided to collect
words in a visually prompted word setup with a provision for rewriting words that the subject
considers badly legible. Words are randomized in each session. Writers sat at a
table in a room with dimmed fluorescent lamps to prevent glare from the Wacom PL-100V
LCD screen. The Wacom was placed on a normal desktop in an orientation preferred by
the subject. A separation panel was placed between experimenter and subject to prevent
additional stress or performance pressure, which often develops in experimental setups. Subjects are eager to please experimenters, and sometimes wary of hidden motives (intelligence
or personality tests). For our purpose it was important that writers used their own, i.e.,
their most frequently used handwriting, rather than a style they thought was acceptable. There was
an introductory text on a sheet of paper, and writers were allowed to get accustomed to
the setup by writing 20 habituation words. Classical music was played in the background to
maintain a pleasant atmosphere during this more or less dull writing condition.

4

Session Schedule

The subject came to the lab three times (Sessions), spaced two weeks apart. At each Session,
two Sets of the 210 words were produced, yielding six Sets (totalling 1260 words written per
writer). Within a Set, the writer was allowed to pause after 100 words.
Run 1, assistant Natasha.
Session 1:
Set 1
Set 2
(two weeks)
Session 2:
Set 3
Set 4
(two weeks)
Session 3:
Set 5
Set 6
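
A prompting plan for this schedule can be sketched as follows. This is an illustration only and not the PLUCOLL implementation. It assumes that every set receives its own random word order (the text states only that words are randomized per session) and that the optional pause falls exactly after the 100th word. The function name, the dictionary keys and the placeholder word list are hypothetical.

```python
import random

def plan_sessions(word_list, n_sessions=3, sets_per_session=2,
                  pause_after=100, seed=None):
    """Build a list of sets, each with a freshly randomized prompting order."""
    rng = random.Random(seed)
    plan = []
    set_number = 0
    for session in range(1, n_sessions + 1):
        for _ in range(sets_per_session):
            set_number += 1
            prompts = rng.sample(word_list, len(word_list))  # random permutation
            plan.append({
                "session": session,
                "set": set_number,
                "before_pause": prompts[:pause_after],
                "after_pause": prompts[pause_after:],
            })
    return plan

# Placeholder for the 210-word list of Appendix A.
words = [f"word{i:03d}" for i in range(210)]
plan = plan_sessions(words, seed=1)
total_prompts = sum(len(s["before_pause"]) + len(s["after_pause"]) for s in plan)
print(len(plan), total_prompts)  # 6 1260
```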

Data from 19 subjects has been collected, writers producing the word list 6 times each.
The result is a total of 19 * 6 * 210 = 23940 words,
19 * 6 * 1514 = 172596 letters.
The second run in the collection process was done according to the following schedule:
Run 2, assistant Eliane.
Session 1:
Set 1
Set 2
(two weeks)
Session 2:
Set 3
Set 4
For the second run, data from 16 subjects, writing the word list 4 times each, has been
collected thus far. Subjects were asked if they were available for later collection occasions.
The result is a total of 16 * 4 * 210 = 13440 words,
16 * 4 * 1514 = 96896 letters.

Currently the totals for the NICI collection are:
37380 words (269492 letters).
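
The totals can be checked with a few lines of arithmetic. The sketch below uses only the figures stated in the text (210 words and 1514 letters per set, 19 subjects with 6 sets and 16 subjects with 4 sets); the variable and function names are hypothetical.

```python
WORDS_PER_SET = 210
LETTERS_PER_SET = 1514

# (number of subjects, number of sets written per subject)
runs = {"run 1": (19, 6), "run 2": (16, 4)}

def run_totals(subjects, sets_per_subject):
    """Return (words, letters) collected in one run."""
    n_sets = subjects * sets_per_subject
    return n_sets * WORDS_PER_SET, n_sets * LETTERS_PER_SET

totals = {name: run_totals(*shape) for name, shape in runs.items()}
total_words = sum(words for words, _ in totals.values())
total_letters = sum(letters for _, letters in totals.values())

print(totals)                      # {'run 1': (23940, 172596), 'run 2': (13440, 96896)}
print(total_words, total_letters)  # 37380 269492
```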

5

Recording Software

The recording software consists of a Visual Basic application (PLUCOLL) and a DLL package written in C (PLUTO) for the actual sampling of the pen-tip coordinates. The output
consists of individual UNIPEN-format files per word (the .INK files), as well as a writer
description and a setup description file, written to the local hard disk on the PC. After
each session the collected .INK files and information files are combined in a single UNIPEN
file for a set (e.g., SET1.DAT). This is done by the program UNIWRAP, which produces
a UNIPEN file on the basis of a checklist of constituent file names. PC-NFS was used for
Unix disk access (the UNIWRAP output files are written to a remote disk on an HP 9000/735
workstation).
Environment: DOS 6.x, Windows 3.1, Windows for Pen Computing 1.01a,
Visual Basic V3. VBXs, PC-NFS V5.0a.

6

Recording Hardware


PC: IBM 486SLC2-66 MHz motherboard, 4 MB.
Tablet: PL-100V
3COM 3C509 Ethernet adaptor.
Tablet details are contained in the UNIPEN files.

7

Subject Group

In this data collection setup, we tried to avoid the usual population of co-researchers and
students. The target group was older than 25 years, and a number of professions in which
writing is a usual activity were included. This was done by recruiting people through a
newspaper advertisement in a medium-sized Dutch paper. The average age is about 30 years.
Handedness (L/R) is distributed in proportion to the whole population (approx. 1 in 10
left-handed). The average computer experience is 5.5 years; this is partly due to three subjects
having more than 10 years of experience. Two subjects have no computer experience. About
half of the subjects have university training, the other half having various backgrounds.
The professions were mainly from "Services" (other categories were: Medical, Industrial,
Education, Office, Technical, Research, None). The majority of the subjects wrote mixed
cursive, according to their own judgment. The others claimed to write cursive. (They were
shown four word samples from the categories Block print, Handprint, Mixed cursive, and
Cursive.)

8

Data Annotation

The UNIPEN program UPVIEW was used to annotate the SETx.DAT files word by word.
Clicking on a word box in UPVIEW opens a flat text editor, with the label of the word
that should have been written on the first line. The annotator can place remarks in this
file. The following categories of special, non-optimal word quality cases were defined:

Coding Category   Explanation
/spelling/        This is the worst possible category: human readers read a
                  different word from what has been written.
/stroking/        This category refers to fused or omitted strokes.
/punctuation/     Refers to unsolicited punctuation/diacritics.
/capitals/        Lower case characters were solicited only.
/disconnected/    As in /cl/ or /ol/ denoting /d/, with a very clear white
                  space in between two components.

The annotation appears in individual files, e.g., the fifth word of set1.dat will be annotated in a separate file set1.dat-segment-4.log. More details are given in Appendix B.
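
The naming of the annotation files can be sketched as below. The only case given in the text is that the fifth word of set1.dat is annotated in set1.dat-segment-4.log; the sketch assumes that this generalizes to a zero-based segment index for every word, and the function name is hypothetical.

```python
def annotation_log_name(set_file, word_position):
    """Name of the log file for the word at a 1-based position in a set file.

    Assumes zero-based segment numbering, inferred from the single example
    "fifth word -> segment-4".
    """
    if word_position < 1:
        raise ValueError("word_position is 1-based")
    return f"{set_file}-segment-{word_position - 1}.log"

print(annotation_log_name("set1.dat", 5))  # set1.dat-segment-4.log
```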


9

State of the Work in Progress

Currently, individual character labeling is performed interactively. Words are sent to the
NICI script recognizer. The recognizer is set to a strict recognition mode, i.e., individual
characters must have an a posteriori probability of p > 0.05. Furthermore, all individual characters in a word must be identified, yielding a contiguous letter path representing the correct
word, never missing more than two strokes between two letters. If the word is recognized,
the resulting labels are stored (in wordnnn.lbR files, where "R" stands for Recognized). If a
word is not recognized, the operator labels all the characters in a word manually, including
the connecting strokes. If characters are illegible to a human reader or if the words are misspelled,
the corresponding characters are not labeled. The labels produced by the human operator
are stored in separate files (named wordnnn.lbl). In order to maintain a consistent labeling
strategy, the process is supervised regularly.
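
The acceptance rule of the strict recognition mode can be sketched as follows. This is not the NICI recognizer; it only restates the three stated conditions (every character with p > 0.05, a letter path spelling the correct word, and at most two unassigned strokes between consecutive letters) on a hypothetical data structure. The class, the field names and the three-digit reading of "wordnnn" are assumptions.

```python
from dataclasses import dataclass

MIN_POSTERIOR = 0.05    # characters must have p > 0.05
MAX_GAP_STROKES = 2     # never more than two strokes missing between letters

@dataclass
class LetterHypothesis:
    letter: str
    posterior: float
    first_stroke: int   # index of the first stroke assigned to this letter
    last_stroke: int    # index of the last stroke assigned to this letter

def accept(target_word, path):
    """Return True if the letter path satisfies the strict-mode conditions."""
    if "".join(h.letter for h in path) != target_word:
        return False
    if any(h.posterior <= MIN_POSTERIOR for h in path):
        return False
    for prev, nxt in zip(path, path[1:]):
        unassigned = nxt.first_stroke - prev.last_stroke - 1
        if unassigned > MAX_GAP_STROKES:
            return False
    return True

def label_file_name(word_number, recognized):
    """wordnnn.lbR for recognizer labels, wordnnn.lbl for manual labels."""
    extension = "lbR" if recognized else "lbl"
    return f"word{word_number:03d}.{extension}"

path = [LetterHypothesis("a", 0.40, 0, 2), LetterHypothesis("b", 0.12, 4, 6)]
ok = accept("ab", path)
print(ok, label_file_name(7, ok))  # True word007.lbR
```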

10

References

Flash, T., & Hogan, N. (1985). The coordination of arm movement: An experimentally
confirmed mathematical model. Journal of Neuroscience, 5, 1688-1703.
Nakagawa, M., Machii, K., & Kato, N. (1993). Lazy Recognition as a Principle of Pen
Interfaces. Conference handout (nakagawa@tuatg.tuat.ac.jp).
Teulings, H.L. & Maarse, F.J. (1984). Digital recording and processing of handwriting
movements. Human Movement Science, 3, 193-217.

11

Appendices

In Appendix A, the list of used words is shown, dubbed the NLUK-210 list. Also the digram
frequency table is given for this word list.
In Appendix B, the coding categories in global word annotation are given. These codes were
used in truthing the word labels.
In Appendix C, some basic statistics of a subset of the collected data are shown, such as
slant, and number of pen-down pieces. Look at the GrandMean, which is the average of the
writer averages over each 210-word set.
Appendix D summarizes the database quantities and the state of the data.
In Appendix E, fictitious writer names are shown which will be used to identify these sets
in the future. In the development of knowledge on style clusters, it will be easier to refer to
such styles using these names (as a kind of "font" name).
Appendix F shows the correspondence between what writers thought was their handwriting
style, and a simple measure of "connected-cursiveness", i.e., the average number of pen-down
ink pieces per word (N piece), for each writer. Indeed, writers who claim to write cursive
have the lowest average values of N piece ≈ 1.8, whereas writers claiming to write handprint
yield an average of N piece ≈ 8.6.


A

The 210-word NLUK list

abdomen
abstinent
adherent
adjunct
advocate
afghanistan
album
aldehyde
algebra
alluvium
alp
amanuensis
analyst
anecdote
angst
antecedent
aorta
appendix
aqua
arcsin
auschwitz
backup
badminton
bangkok
batik
bauhaus
bazaar
bhagwan
bijouterie
bladder
bobby
bodyguard
bolster
borax
bouquet
boutique
bradford
breakdown
brisbane
budget
buffet
byte

calcium
charisma
checklist
chevron
chloride
cockpit
cocktail
colonnade
comfort
concubine
conjunct
copywriter
cornwall
corps
cowboy
crawl
croquet
cycle
czerny
darwin
dashboard
deadline
debugger
dejeuner
delhi
delinquent
deodorant
diagnose
disjunct
dixieland
dizzy
dozen
drink
edelweiss
entertainment
equilibrium
equipment
essay
excellent
exodus
export
extract

exuberant
fascist
feedback
finland
fjord
flipflop
frankfurt
fuchsia
genre
gladiator
god
guyana
gymnast
halfback
halve
hamster
hoffman
hotdog
hulk
huxley
hyena
hypotheses
immigrant
inconvenient
inexact
informant
inhumane
input
interviews
israeli
istanbul
jacques
jitter
jujube
kafka
kamchatka
keyboard
kidnapping
kiwi
knowhow
kremlin
landcode

The list contains 1514 characters.


larynx
lincoln
lunchroom
luxe
macbeth
magtape
major
masker
maxwell
mazurka
megahertz
mysteries
native
newton
nihilist
object
ohm
onyx
optimum
oxford
paperback
papyrus
partner
persistent
pigment
pneumococcus
poet
popcorn
portfolio
potpourri
potsdam
projector
prospectus
quota
reflex
rembrandt
revue
rhesus
samovar
sandwich
scherzo
sheriffs

showman
shuttle
sightseeing
sleep
snob
society
software
squaw
stanza
stewards
stockholm
stopwatch
strychnine
studio
stuttgart
sweatshirt
symposium
tableau
teamwork
tokyo
tomahawk
tonic
transfer
trapezium
trekking
triplet
turf
turquoise
update
upgrade
vacuum
virgin
voltmeter
walrus
wonderland
workshop
wyoming
xylophone
yoga
yucca
zigzag
zwei

Digram Frequency Table for the NLUK-210 List.

[Table: 27 x 27 matrix of digram counts over the symbols #, a, b, ..., z.]

Legend:
The "#" code denotes a blank. A − denotes a zero count, and was used in this table
instead of 0 because of its lower perceptual density.


B

Coding Categories in Annotation

For the remarks in the log files, the following remark categories were used:

- CAPS To indicate the use of (a) capital(s).
- CONNECTED If the connection of two or more characters could result in
  ambiguity. Example: [ASCII sketch of connected strokes] = "or"
- DISCONNECTED For a character that is not properly connected, for example, a
  "d" -> "o l".
- PUNCT To indicate the use of punctuation marks that were not requested.
- SPELLING(ADD/DEL/SUBST)
  ADD: If a character was added;
  DEL: If a character was missing;
  SUBST: If a character had been replaced by another character.
  ADD, DEL and SUBST are notated in order of occurrence. For example,
  SPELLING(ADD,SUBST): "all(l)uviu(n)", where it should have been "alluvium".
- STROKE(AMB)
  STROKE: to indicate that a stroke of a character was not (properly) finished
  or to indicate an irregular stroke.
  STROKE(AMB): to indicate that a stroke could result in visual ambiguity. For
  example, a "c" looking like an "e" and vice versa.


C

Some basic statistics of the collected data

Analysis for 172 sets (210 words each)

Variable     Min     Max   GrandMean     SD
nstrok      22.2    35.8        27.2    2.5
npiece       1.5     9.2         5.2    2.4
ycorp        1.0     5.2         2.1    0.8
slant       51.9   110.8        83.8   15.8
width       12.1    36.6        22.2    5.6
nbars        0.0     1.0         0.1    0.2
ndots        0.0     1.2         0.5    0.2

Legend:
nstrok   Average number of velocity-based strokes/word
npiece   Average number of pen-down segments/word
ycorp    Average vertical size of small letters (corpus, "x"-size) in [mm]
slant    Average angle of downstrokes at point of max. velocity [degrees]
width    Average horizontal size of words in [mm]
nbars    Average number of vertical bar strokes/word
ndots    Average number of dots/word

D

Database Quantities / State of the data

There are two sets: the 6-pack, collected by assistant Natasha,
with data from 19 subjects, writing the word list 6 times each.
The result is a total of 19 * 6 * 210 = 23940 words,
19 * 6 * 1514 = 172596 letters.
The second set, collected by assistant Eliane, with data from
16 subjects, writing the word list 4 times each.
The result is a total of 16 * 4 * 210 = 13440 words,
16 * 4 * 1514 = 96896 letters.
Currently the totals for the NICI collection are:
37380 words (269492 letters).


List of produced data (6-pack: Natasha)
Writer
01 aa
02 as
03 ax
04 az
05 ch
06 eh
07 fe
08 hk
09 jf
10 jn
11 mj
12 mk
13 ph
14 px
15 pz
16 sm
17 ss
18 tn
19 ts

set1
XA
XA
XA
XA
XA
XA
XA
XA
XA
XA
XA
XA
XA
XA
XA
XA
XA
XA
XA

set2
XA
XA
XA
XA
XA
XA
XA
XA
XA
XA
XA
XA
XA
XA
XA
XA
XA
XA
XA

set3
XA
XA
XA
XA
XA
X
X
X
X
X
X
X
X
XA
XA
X
X
X
X

set4
XA
XA
XA
XA
XA
X
X
X
X
X
X
X
X
XA
X
X
X
X
X

set5
XA
XA
XA
XA
XA
X
X
X
X
X
X
X
X
X
X
X
X
X
X

set6
XA
XA
XA
XA
XA
X
X
X
X
X
X
X
X
X
X
X
X
X

List of produced data (4-pack: Eliane)
Writer   set1   set2   set3   set4
20 cb    XAL    XAL    XA*    XAL
21 cs    XAL    XAL    XAL    XA*
22 db    XAL    XA*    XAL    XAL
23 es    XA*    XAL    XAL    XAL
24 jj    XA     XA     X      X
25 kd    XAL    XAL    XA*    XAL
26 pa    XAL    XAL    XAL    XA*
27 rh    XA     XA     XA     XA
28 ah    XAL    XA*    XAL    XAL
29 cm    XA     XA     XA     XA
30 jh    XA     XA     XA     XA
31 jr    XAL    XAL    XAL    XA*
32 lr    XA     XA     XA     XA
33 mh    XA     XA     XA     XA
34 rd    XA*    XAL    XAL    XAL
35 tb    XA     XA     XA     XA

X=collected, A=annotated, L=labeled, *=testset


E

Writer Names

Typical Dutch names were attached to the writer sets, to be able to refer to the specific
styles later.
Internal
Writer           Dutch Writer
Code      Sex    Name
aa        F      BEATRIJS
ah        M      WILLEM
as        M      KLAAS
ax        M      PIET
az        F      ANNEMIEK
cb        F      MARIEKE
ch        F      INEKE
cm        M      KAREL
cs        F      JANNEKE
db        M      TEUN
eh        F      CORRIE
es        M      JOHAN
fe        M      ONNO
hk        F      SASKIA
jf        M      EELCO
jh        M      ANTON
jj        F      MONIEK
jn        M      FLORIS
jr        M      GERRIT
kd        F      JULIANA
lr        F      MIEP
mh        F      KATRIEN
mj        M      MARTIJN
mk        M      RUUD
pa        M      JOOST
ph        M      MARK
px        F      KLAARTJE
pz        F      LOESJE
rd        M      KEES
rh        M      JEROEN
sm        F      HELEEN
ss        M      KOEN
tb        F      HANNIE
tn        F      ANGELIEN
ts        M      KOOS


F

Coarse writing style classification on the basis of the
average number of pen-down pieces per word

writer      Npiece/word   standard deviation   self-reported style
ineke              1.49                 0.69   CURSIVE
angelien           1.56                 0.74   CURSIVE
onno               1.60                 0.72   CURSIVE
floris             1.79                 0.85   CURSIVE
jeroen             1.86                 0.93   CURSIVE
ruud               2.27                 1.19   CURSIVE
johan              2.32                 1.35   CURSIVE
willem             2.59                 1.38   CURSIVE
gerrit             2.69                 1.49   MIXED
koos               2.82                 1.70   CURSIVE
miep               3.49                 1.65   CURSIVE
piet               4.09                 1.74   MIXED
loesje             4.65                 1.72   MIXED
mark               4.91                 1.82   MIXED
marieke            5.47                 1.96   MIXED
heleen             5.58                 1.93   CURSIVE
corrie             5.70                 2.11   MIXED
juliana            6.10                 2.01   MIXED
martijn            6.17                 2.20   MIXED
hannie             6.30                 1.99   MIXED
klaas              6.44                 1.93   MIXED
janneke            6.55                 2.31   MIXED
klaartje           6.58                 2.18   MIXED
saskia             6.77                 2.21   MIXED
katrien            6.96                 2.21   MIXED
moniek             7.26                 2.48   MIXED
kees               7.55                 2.43   MIXED
eelco              7.60                 2.06   MIXED
annemiek           8.00                 2.36   MIXED
anton              8.05                 2.20   MIXED
teun               8.22                 2.48   PRINT
joost              8.32                 2.39   PRINT
karel              8.60                 2.51   PRINT
koen               8.80                 2.61   MIXED
beatrijs           8.89                 2.56   PRINT
