Multilingual Language Processing From Bytes
Dan Gillick, Cliff Brunk, Oriol Vinyals, Amarnag Subramanya
Google Research
{dgillick, cliffbrunk, vinyals, asubram}@google.com
Abstract
We describe an LSTM-based model which
we call Byte-to-Span (BTS) that reads text
as bytes and outputs span annotations of the
form [start, length, label] where start posi-
tions, lengths, and labels are separate entries
in our vocabulary. Because we operate di-
rectly on unicode bytes rather than language-
specific words or characters, we can analyze
text in many languages with a single model.
Due to the small vocabulary size, these multi-
lingual models are very compact, but produce
results similar to or better than the state-of-
the-art in Part-of-Speech tagging and Named
Entity Recognition that use only the provided
training datasets (no external data sources).
Our models are learning “from scratch” in that
they do not rely on any elements of the stan-
dard pipeline in Natural Language Processing
(including tokenization), and thus can run in
standalone fashion on raw text.
1 Introduction
The long-term trajectory of research in Natural
Language Processing has seen the replacement of
rules and specific linguistic knowledge with ma-
chine learned components. Perhaps the most stan-
dardized way that knowledge is still injected into
largely statistical systems is through the processing
pipeline: Some set of basic language-specific tokens
are identified in a first step. Sequences of tokens
are segmented into sentences in a second step. The
resulting sentences are fed one at a time for syntac-
tic analysis: Part-of-Speech (POS) tagging and pars-
ing. Next, the predicted syntactic structure is typi-
cally used as features in semantic analysis, Named
Entity Recognition (NER), Semantic Role Labeling,
etc. While each step of the pipeline now relies more
on data and models than on hand-curated rules, the
pipeline structure itself encodes one particular un-
derstanding of how meaning attaches to raw strings.
One motivation for our work is to try removing
this structural dependence. Rather than rely on the
intermediate representations invented for specific
subtasks (for example, Penn Treebank tokenization),
we are allowing the model to learn whatever internal
structure is most conducive to producing the annota-
tions of interest. To this end, we describe a Recur-
rent Neural Network (RNN) model that reads raw in-
put string segments, one byte at a time, and produces
output span annotations corresponding to specific
byte regions in the input1. This is truly language
annotation from scratch (see Collobert et al. (2011)
and Zhang and LeCun (2015)).
Two key innovations facilitate this approach.
First, Long Short Term Memory (LSTM) models
(Hochreiter and Schmidhuber, 1997) allow us to re-
place the traditional independence assumptions in
text processing with structural constraints on mem-
ory. While we have long known that long-term de-
pendencies are important in language, we had no
mechanism other than conditional independence to
keep sparsity in check. The memory in an LSTM,
however, is not constrained by any explicit assump-
tions of independence. Rather, its ability to learn
patterns is limited only by the structure of the net-
work and the size of the memory (and of course the
1Our span annotation model can be applied to any sequence
labeling task; it is not immediately applicable to predicting
more complex structures like trees.
[Figure 1 diagram: the segment "Óscar Romero was born in El Salvador." is read as UTF-8 bytes (0xc3 0x93 for Ó, 0x73 for s, 0x63 for c, ...) and emitted as the span sequence [S0, L13, PER] [S26, L11, LOC] STOP.]
Figure 1: A diagram showing the way the Byte-to-Span (BTS) model converts an input text segment to a sequence of span
annotations. The model reads the input segment one byte at a time (this can involve multibyte unicode characters), then a special
Generate Output (GO) symbol, then produces the argmax output of a softmax over all possible start positions, lengths, and labels
(as well as STOP, signifying no additional outputs). The prediction from the previous time step is fed as an input to the next time
step.
amount of training data).
Second, sequence-to-sequence models (Sutskever
et al., 2014) allow for flexible input/output dynam-
ics. Traditional models, including feedforward neu-
ral networks, read fixed-length inputs and generate
fixed-length outputs by following a fixed set of com-
putational steps. Instead, we can now read an entire
segment of text before producing an arbitrary num-
ber of outputs, allowing the model to learn a function
best suited to the task.
We leverage these two ideas with a basic strategy:
Decompose inputs and outputs into their component
pieces, then read and predict them as sequences.
Rather than read words, we are reading a sequence
of unicode bytes2; rather than producing a label for
each word, we are producing triples [start, length,
label], that correspond to the spans of interest, as a
sequence of three separate predictions (see Figure
1). This forces the model to learn how the compo-
nents of words and labels interact so all the structure
typically imposed by the NLP pipeline (as well as
the rules of unicode) are left to the LSTM to model.
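To make the decomposition concrete, the short Python sketch below reproduces the Figure 1 example: the segment is encoded as UTF-8 bytes, and each entity is expressed as a [start, length, label] triple over byte offsets. This is an illustrative sketch only; the helper byte_span is not part of the system described here.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: the Figure 1 segment decomposed into UTF-8 bytes, with
# entity spans as (start, length, label) triples over byte offsets.

segment = "Óscar Romero was born in El Salvador."
byte_seq = segment.encode("utf-8")            # model input: one byte per time step

# 'Ó' occupies two bytes (0xc3 0x93), so byte offsets differ from character offsets.
print([hex(b) for b in byte_seq[:4]])         # ['0xc3', '0x93', '0x73', '0x63']

def byte_span(text, substring, label):
    """Return the (start, length, label) triple for `substring`, in byte offsets."""
    char_start = text.index(substring)
    start = len(text[:char_start].encode("utf-8"))
    length = len(substring.encode("utf-8"))
    return (start, length, label)

print(byte_span(segment, "Óscar Romero", "PER"))  # (0, 13, 'PER')
print(byte_span(segment, "El Salvador", "LOC"))   # (26, 11, 'LOC')
```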
Decomposed inputs and outputs have a few im-
portant benefits. First, they reduce the size of the
2We use the variable-length UTF-8 encoding to keep the
vocabulary as small as possible.
vocabulary relative to word-level inputs, so the re-
sulting models are extremely compact (on the or-
der of a million parameters). Second, because uni-
code is essentially a universal language, we can train
models to analyze many languages at once. In fact,
by stacking LSTMs, we are able to learn represen-
tations that appear to generalize across languages,
improving performance significantly (without using
any additional parameters) over models trained on
a single language. This is the first account, to our
knowledge, of a multilingual model that achieves
good results across many languages, thus bypass-
ing all the language-specific engineering usually re-
quired to build models in different languages3. We
describe results similar to or better than the state-
of-the-art in Part-of-Speech tagging and Named En-
tity Recognition that use only the provided training
datasets (no external data sources).
The rest of this paper is organized as follows. Sec-
tion 2 discusses related work; Section 3 describes
our model; Section 4 gives training details includ-
ing a new variety of dropout (Hinton et al., 2012);
3These multilingual models are able to handle code-mixed
text, an important practical problem that’s received relatively
little attention. However, we do not have any annotated data
that contains code switching, so we cannot report any results.

Section 5 gives inference details; Section 6 presents
results on POS tagging and NER across many lan-
guages; Finally, we summarize our contributions in
section 7.
2 Related Work
One important feature of our work is the use of
byte inputs. Character-level inputs have been used
with some success for tasks like NER (Klein et al.,
2003), parallel text alignment (Church, 1993), and
authorship attribution (Peng et al., 2003) as an ef-
fective way to deal with n-gram sparsity while still
capturing some aspects of word choice and mor-
phology. Such approaches often combine char-
acter and word features and have been especially
useful for handling languages with large character
sets (Nakagawa, 2004). However, there is almost
no work that explicitly uses bytes – one exception
uses byte n-grams to identify source code author-
ship (Frantzeskou et al., 2006) – but there is noth-
ing, to the best of our knowledge, that exploits bytes
as a cross-lingual representation of language. Work
on multilingual parsing using Neural Networks that
share some subset of the parameters across lan-
guages (Duong et al., 2015) seems to benefit the
low-resource languages; however, we are sharing all
the parameters among all languages.
Recent work has shown that modeling the se-
quence of characters in each token with an LSTM
can more effectively handle rare and unknown words
than independent word embeddings (Ling et al.,
2015; Ballesteros et al., 2015). Similarly, language
modeling, especially for morphologically complex
languages, benefits from a Convolutional Neural
Network (CNN) over characters to generate word
embeddings (Kim et al., 2015). Rather than de-
compose words into characters, Chitnis and DeNero
(2015) encode rare words with Huffman codes, al-
lowing a neural translation model to learn something
about word subcomponents. In contrast to this line
of research, our work has no explicit notion of to-
kens and operates on bytes rather than characters.
Our work is philosophically similar to Col-
lobert et al.’s (2011) experiments with “almost from
scratch” language processing. They avoid task-
specific feature engineering, instead relying on a
multilayer feedforward (or convolutional) Neural
Network to combine word embeddings to produce
features useful for each task. In the Results sec-
tion, below, we compare NER performance on the
same dataset they used. The “almost” in the ti-
tle actually refers to the use of preprocessed (low-
ercased) tokens as input instead of raw sequences
of letters. Our byte-level models can be seen as a
realization of their comment: “A completely from
scratch approach would presumably not know any-
thing about words at all and would work from letters
only.” Recent work with convolutional neural net-
works that read character-level inputs (Zhang et al.,
2015) shows some interesting results on a variety of
classification tasks, but because their models need
very large training sets, they do not present compar-
isons to established baselines on standard tasks.
Finally, recent work on Automatic Speech Recog-
nition (ASR) uses a similar sequence-to-sequence
LSTM framework to produce letter sequences di-
rectly from acoustic frame sequences (Chan et al.,
2015; Bahdanau et al., 2015). Just as we are dis-
carding the usual intermediate representations used
for text processing, their models make no use of pho-
netic alignments, clustered triphones, or pronunci-
ation dictionaries. This line of work – discarding
intermediate representations in speech – was pio-
neered by Graves and Jaitly (2014) and earlier, by
Eyben et al. (2009).
3 Model
Our model is based on the sequence-to-sequence
model used for machine translation (Sutskever et al.,
2014), an adaptation of an LSTM that encodes a
variable length input as a fixed-length vector, then
decodes it into a variable number of outputs4.
Generally, the sequence-to-sequence LSTM is
trained to estimate the conditional probability
P(y1, ..., yT′ | x1, ..., xT) where (x1, ..., xT) is an
input sequence and (y1, ..., yT′) is the correspond-
ing output sequence whose length T′ may dif-
fer from T. The encoding step computes a
fixed-dimensional representation v of the input
(x1, ...,xT ) given by the hidden state of the LSTM
4Related translation work adds an attention mechanism
(Bahdanau et al., 2014), allowing the decoder to attend directly
to particularly relevant inputs. We tried adding the same mech-
anism to our model but saw no improvement in performance on
the NER task, though training converged in fewer steps.

after reading the last input xT . The decoding step
computes the output probability P(y1, ..., yT′) with
the standard LSTM formulation for language mod-
eling, except that the initial hidden state is set to v:
P(y1, ..., yT′ | x1, ..., xT) = ∏_{t=1}^{T′} P(yt | v, y1, ..., yt−1)        (1)
Sutskever et al. used a separate LSTM for the en-
coding and decoding tasks. While this separation
permits training the encoder and decoder LSTMs
separately, say for multitask learning or pre-training,
we found our results were no worse if we used a sin-
gle set of LSTM parameters for both encoder and
decoder.
3.1 Vocabulary
The primary difference between our model and the
translation model is our novel choice of vocabulary.
The set of inputs includes all 256 possible bytes, a
special Generate Output (GO) symbol, and a spe-
cial DROP symbol used for regularization, which
we will discuss below. The set of outputs includes
all possible span start positions (byte 0..k), all pos-
sible span lengths (0..k), all span labels (PER, LOC,
ORG, MISC for the NER task), as well as a special
STOP symbol. A complete span annotation includes
a start, a length, and a label, but as shown in Fig-
ure 1, the model is trained to produce this triple as
three separate outputs. This keeps the vocabulary
size small and in practice, gives better performance
(and faster convergence) than if we use the cross-
product space of the triples.
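As a rough sketch of these vocabularies (the symbol inventory follows the text; the integer id layout is an arbitrary illustrative choice):

```python
# Sketch of the input/output vocabularies described above. Symbol names (GO,
# DROP, STOP, S*, L*, labels) follow the paper; the id assignment is arbitrary.

K = 60                                    # segment length used in the experiments
NER_LABELS = ["PER", "LOC", "ORG", "MISC"]

# Inputs: all 256 byte values plus the special GO and DROP symbols.
input_vocab = {f"BYTE_{b}": b for b in range(256)}
input_vocab["GO"] = 256
input_vocab["DROP"] = 257

# Outputs: every start position, every length, every label, and STOP.
output_vocab = {}
for s in range(K + 1):
    output_vocab[f"S{s}"] = len(output_vocab)   # start positions S0..S60
for l in range(K + 1):
    output_vocab[f"L{l}"] = len(output_vocab)   # span lengths L0..L60
for label in NER_LABELS:
    output_vocab[label] = len(output_vocab)
output_vocab["STOP"] = len(output_vocab)

# The gold annotation [S0, L13, PER] is emitted as three separate predictions:
triple = [output_vocab["S0"], output_vocab["L13"], output_vocab["PER"]]
```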
More precisely, the prediction at time t is condi-
tioned on the full input and all previous predictions
(via the chain rule). By splitting each span anno-
tation into a sequence [start, length, label], we are
making no independence assumption; instead we are
relying on the model to maintain a memory state that
captures the important dependencies.
Each output distribution P(yt|v, y1, ...,yt-1) is
given by a softmax over all possible items in the out-
put vocabulary, so at a given time step, the model is
free to predict any start, any length, or any label (in-
cluding STOP). In practice, because the training data
always has these complete triples in a fixed order,
we seldom see malformed or incomplete spans (the
decoder simply ignores such spans). During train-
ing, the true label yt-1 is fed as input to the model
at step t (see Figure 1), and during inference, the
argmax prediction is used instead. Note also that
the training procedure tries to maximize the proba-
bility in Equation 1 (summed over all the training
examples). While this does not quite match our task
objectives (F1 over labels, for example), it is a rea-
sonable proxy.
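To make the inference-time feedback loop concrete, here is a minimal sketch under an assumed interface; model.initial_state() and model.step(symbol, state) -> (scores, new_state) are hypothetical names introduced only for illustration.

```python
# Greedy decoding sketch for one segment: read the bytes, feed GO, then feed
# each argmax prediction back in until STOP. The `model` interface is
# hypothetical and only illustrates the control flow described in the text.

def greedy_decode(model, byte_ids, go_id, stop_id, max_outputs=100):
    state = model.initial_state()
    for b in byte_ids:                        # encode the segment one byte at a time
        _, state = model.step(b, state)
    outputs, prev = [], go_id
    for _ in range(max_outputs):
        scores, state = model.step(prev, state)
        prev = max(range(len(scores)), key=scores.__getitem__)   # argmax output id
        if prev == stop_id:
            break
        outputs.append(prev)
    return outputs    # flat sequence of [start, length, label, ...] output ids
```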
3.2 Independent segments
Ideally, we would like our input segments to cover
full documents so that our predictions are condi-
tioned on as much relevant information as possible.
However, this is impractical for a few reasons. From
a training perspective, a Recurrent Neural Network
is unrolled to resemble a deep feedforward network,
with each layer corresponding to a time step. It
is well-known that running backpropagation over a
very deep network is hard because it becomes in-
creasingly difficult to estimate the contribution of
each layer to the gradient, and further, RNNs have
trouble generalizing to different length inputs (Er-
han et al., 2009).
So instead of document-sized input segments,
we make a segment-independence assumption: We
choose some fixed length k and train the model on
segments of length k (any span annotation not com-
pletely contained in a segment is ignored). This has
the added benefit of limiting the range of the start
and length label components. It can also allow for
more efficient batched inference since each segment
is decoded independently. Finally, we can generate a
large number of training segments by sliding a win-
dow of size k one byte at a time through a document.
Note that the resulting training segments can begin
and end mid-word, and indeed, mid-character. For
both tasks described below, we set the segment size
k = 60.
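A minimal sketch of this segment generation, assuming annotations is a list of (start, length, label) triples in document-level byte offsets (the function and its interface are illustrative, not the actual implementation):

```python
# Sliding-window training segments of length k, shifted one byte at a time.
# Spans not completely contained in a window are dropped, as described above.

def training_segments(doc_bytes, annotations, k=60):
    last_offset = max(0, len(doc_bytes) - k)
    for offset in range(last_offset + 1):
        window = doc_bytes[offset:offset + k]
        spans = [(start - offset, length, label)
                 for (start, length, label) in annotations
                 if start >= offset and start + length <= offset + k]
        yield window, spans        # spans are re-indexed relative to the window
```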
3.3 Sequence ordering
Our model differs from the translation model in one
more important way. Sutskever et al. found that
feeding the input words in reverse order and gen-
erating the output words in forward order gave sig-
nificantly better translations, especially for long sen-
tences. In theory, the predictions are conditioned on
the entire input, but as a practical matter, the learn-

Page 5
ing problem is easier when relevant information is
ordered appropriately since long dependencies are
harder to learn than short ones.
Because the byte order is more meaningful in the
forward direction (the first byte of a multibyte char-
acter specifies the length, for example), we found
somewhat better performance with forward order
than reverse order (less than 1% absolute). But un-
like translation, where the outputs have a complex
order determined by the syntax of the language, our
span annotations are more like an unordered set. We
tried sorting them by end position in both forward
and backward order, and found a small improvement
(again, less than 1% absolute) using the backward
ordering (assuming the input is given in the forward
order). This result validates the translation ordering
experiments: the modeling problem is easier when
the sequence-to-sequence LSTM is used more like a
stack than a queue.
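For illustration, a sketch of target construction consistent with this choice (note that Figure 1 shows the forward ordering; this sketch uses the backward-by-end-position ordering found to work slightly better):

```python
# Build the decoder target for one segment: spans ordered backward by end
# position, each flattened to [start, length, label], followed by STOP.
# Symbol names follow Figure 1; the function itself is illustrative.

def target_sequence(spans):
    # spans: (start, length, label) triples relative to the segment
    ordered = sorted(spans, key=lambda s: s[0] + s[1], reverse=True)
    target = []
    for start, length, label in ordered:
        target += [f"S{start}", f"L{length}", label]
    return target + ["STOP"]

# Figure 1's segment:
print(target_sequence([(0, 13, "PER"), (26, 11, "LOC")]))
# ['S26', 'L11', 'LOC', 'S0', 'L13', 'PER', 'STOP']
```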
3.4 Model shape
We experimented with a few different architectures
and found no significant improvements in using
more than 320 units for the embedding dimension
and LSTM memory and 4 stacked LSTMs (see Table
4). This observation holds for both models trained
on a single language and models trained on many
languages. Because the vocabulary is so small, the
total number of parameters is dominated by the size
of the recurrent matrices. All the results reported
below use the same architecture (unless otherwise
noted) and thus have roughly 900k parameters.
4 Training
We trained our models with Stochastic Gradient De-
scent (SGD) on mini-batches of size 128, using an
initial learning rate of 0.3. For all other hyper-
parameter choices, including random initialization,
learning rate decay, and gradient clipping, we fol-
low Sutskever et al. (2014). Each model is trained
on a single CPU over a period of a few days, at
which point, development set results have stabilized.
Distributed training on GPUs would likely speed up
training to just a few hours.
4.1 Dropout and byte-dropout
Neural Network models are often trained using
dropout (Hinton et al., 2012), which tends to im-
prove generalization by limiting correlations among
hidden units. During training, dropout randomly ze-
roes some fraction of the elements in the embedding
layer and the model state just before the softmax
layer (Zaremba et al., 2014).
We were able to further improve generalization
with a technique we are calling byte-dropout: We
randomly replace some fraction of the input bytes in
each segment with a special DROP symbol (without
changing the corresponding span annotations). Intu-
itively, this results in a more robust model, perhaps
by forcing it to use longer-range dependencies rather
than memorizing particular local sequences.
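A sketch of byte-dropout on an input segment (illustrative only; DROP is the extra input symbol from the vocabulary sketch above, and the span annotations are left unchanged):

```python
import random

# Byte-dropout sketch: replace a fraction of input byte ids with the special
# DROP symbol; the target span annotations are not modified. The rates used
# in the experiments below are 0.2 (POS) and 0.3 (NER), tuned on development data.

DROP = 257   # id of the DROP input symbol in the sketch vocabulary

def byte_dropout(byte_ids, rate=0.2, rng=random):
    return [DROP if rng.random() < rate else b for b in byte_ids]
```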
It is worth noting that noise is often added at
training time to images in image classification and
speech in speech recognition where the added noise
does not fundamentally alter the input, but rather
blurs it. By using a byte representation of language,
we are now capable of achieving something like
blurring with text. Indeed, if we removed 20% of
the characters in a sentence, humans would be able
to infer words and meaning reasonably well.
5 Inference
We perform inference on a segment by (greedily)
computing the most likely output at each time step
and feeding it to the next time step. Experiments
with beam search show no meaningful improve-
ments (less than 0.2% absolute). Because we as-
sume that each segment is independent, we need to
choose how to break up the input into segments and
how to stitch together the results.
The simplest approach is to divide up the input
into segments with no overlapping bytes. Because
the model is trained to ignore incomplete spans, this
approach misses all spans that cross segment bound-
aries, which, depending on the choice of k, can be a
significant number. We avoid the missed-span prob-
lem by choosing segments that overlap such that
each span is likely to be fully contained by at least
one segment.
For our experiments, we create segments with a
fixed overlap (k/2 = 30). This means that with
the exception of the first segment in a document, the
model reads 60 bytes of input, but we only keep pre-
dictions about the last 30 bytes.
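A sketch of this stitching scheme, assuming a decode(segment_bytes) function that returns (start, length, label) triples relative to the segment; the rule for deciding which predictions to keep (here, spans that start in the last k/2 bytes) is one reasonable reading of the description above.

```python
# Overlapping-segment inference sketch: segments of length k with overlap k/2.
# Except for the first segment, only predictions about the last k/2 bytes are
# kept. `decode` is an assumed per-segment decoder.

def annotate_document(doc_bytes, decode, k=60):
    results = []
    for offset in range(0, max(1, len(doc_bytes)), k // 2):
        segment = doc_bytes[offset:offset + k]
        keep_from = 0 if offset == 0 else k // 2
        for start, length, label in decode(segment):
            if start >= keep_from:
                results.append((offset + start, length, label))
    return results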

6 Results
Here we describe experiments on two datasets that
include annotations across a variety of languages.
The multilingual datasets allow us to highlight the
advantages of using byte-level inputs: First, we can
train a single compact model that can handle many
languages at once. Second, we demonstrate some
cross-lingual abstraction that improves performance
of a single multilingual model over each single-
language model. In the experiments, we refer to
the LSTM setup described above as Byte-to-Span or
BTS.
Most state-of-the-art results in POS tagging and
NER leverage unlabeled data to improve a super-
vised baseline. For example, word clusters or word
embeddings estimated from a large corpus are of-
ten used to help deal with sparsity. Because our
LSTM models are reading bytes, it is not obvious
how to insert information like a word cluster iden-
tity. Recent results with sequence-to-sequence auto-
encoding (Dai and Le, 2015) seem promising in this
regard, but here we limit our experiments to use just
annotated data.
Each task specifies separate data for training, de-
velopment, and testing. We used the development
data for tuning the dropout and byte-dropout pa-
rameters (since these likely depend on the amount
of available training data), but did not tune the re-
maining hyperparameters. In total, our training set
for POS Tagging across 13 languages included 2.87
million tokens and our training set for NER across
4 languages included 0.88 million tokens. Recall,
though, that our training examples are 60-byte seg-
ments obtained by sliding a window through the
training data, shifting by 1 byte each time. This re-
sults in 25.3 million and 6.0 million training seg-
ments for the two tasks.
6.1 Part-of-Speech Tagging
Our part-of-speech tagging experiments use Version
1.1 of the Universal Dependency data5, a collection
of treebanks across many languages annotated with
a universal tagset (Petrov et al., 2011). The most
relevant recent work (Ling et al., 2015) uses differ-
ent datasets, with different finer-grained tagsets in
each language. Because we are primarily interested
5http://universaldependencies.github.io/docs/
in multilingual models that can share language-
independent parameters, the universal tagset is im-
portant, and thus our results are not immediately
comparable. However, we provide baseline results
(for each language separately) using a Conditional
Random Field (Lafferty et al., 2001) with an exten-
sive collection of features with performance compa-
rable to the Stanford POS tagger (Manning, 2011).
For our experiments, we chose the 13 languages that
had at least 50k tokens of training data. We did not
subsample the training data, though the amount of
data varies widely across languages, but rather shuf-
fled all training examples together. These languages
represent a broad range of linguistic phenomena and
character sets so it was not obvious at the outset that
a single multilingual model would work.
Table 1 compares the baselines with (CRF+) and
without (CRF) externally trained cluster features
with our model trained on all languages (BTS) as
well as each language separately (BTS*). The single
BTS model improves on average over the CRF mod-
els trained using the same data, though clearly there
is some benefit in using external resources. Note
that BTS is particularly strong in Finnish, surpass-
ing even CRF+ by nearly 1.5% (absolute), probably
because the byte representation generalizes better to
agglutinative languages than word-based models, a
finding validated by Ling et al. (2015). In addi-
tion, the baseline CRF models, including the (com-
pressed) cluster tables, require about 50 MB per lan-
guage, while BTS is under 10 MB. BTS improves
on average over BTS*, suggesting that it is learning
some language-independent representation.
6.2 Named Entity Recognition
Our main motivation for showing POS tagging re-
sults was to demonstrate how effective a single BTS
model can be across a wide range of languages. The
NER task is a more interesting test case because,
as discussed in the introduction, it usually relies
on a pipeline of processing. We use the 2002 and
2003 CoNLL shared task datasets6 for multilingual
NER because they contain data in 4 languages (En-
glish, German, Spanish, and Dutch) with consistent
annotations of named entities (PER, LOC, ORG,
and MISC). In addition, the shared task competition
6http://www.cnts.ua.ac.be/conll200{2,3}/ner

Language     CRF+   CRF    BTS    BTS*
Bulgarian    97.97  97.00  97.84  97.02
Czech        98.38  98.00  98.50  98.44
Danish       95.93  95.06  95.52  92.45
German       93.08  91.99  92.87  92.34
Greek        97.72  97.21  97.39  96.64
English      95.11  94.51  93.87  94.00
Spanish      96.08  95.03  95.80  95.26
Farsi        96.59  96.25  96.82  96.76
Finnish      94.34  92.82  95.48  96.05
French       96.00  95.93  95.75  95.17
Indonesian   92.84  92.71  92.85  91.03
Italian      97.70  97.61  97.56  97.40
Swedish      96.81  96.15  95.57  93.17
AVERAGE      96.04  95.41  95.85  95.06
Table 1: Part-of-speech tagging accuracy for two CRF base-
lines and 2 versions of BTS. CRF+ uses resources external to
the training data (word clusters) and CRF uses only the training
data. BTS (unlike CRF+ and CRF) is a single model trained
on all the languages together, while BTS* is a separate Byte-to-
Span model for each language.
produced strong baseline numbers for comparison.
However, most published results use extra informa-
tion beyond the provided training data which makes
fair comparison with our model more difficult.
The best competition results for English and Ger-
man (Florian et al., 2003) used a large gazetteer
and the output of two additional NER classifiers
trained on richer datasets. Since 2003, better results
have been reported using additional semi-supervised
techniques (Ando and Zhang, 2005) and more re-
cently, Passos et al. (2014) claimed the best En-
glish results (90.90% F1) using features derived
from word-embeddings. The 1st place submission
in 2002 (Carreras et al., 2002) comments that with-
out extra resources for Spanish, their results drop by
about 2% (absolute).
Perhaps the most relevant comparison is the over-
all 2nd place submission in 2003 (Klein et al., 2003).
They use only the provided data and report results
with character-based models which provide a useful
comparison point to our byte-based LSTM. The per-
formance of a character HMM alone is much worse
than their best result (83.2% vs 92.3% on the En-
glish development data), which includes a variety of
word and POS-tag features that describe the context
(as well as some post-processing rules). For English
(assuming just ASCII strings), the character HMM
uses the same inputs as BTS, but is hindered by
some combination of the independence assumption
and smaller capacity.
Collobert et al.’s (2011) convolutional model (dis-
cussed above) gives 81.47% F1 on the English test
set when trained on only the gold data. However, by
using carefully selected word-embeddings trained
on external data, they are able to increase F1 to
88.67%. Huang et al. (2015) improve on Collobert’s
results by using a bidirectional LSTM with a CRF
layer where the inputs are features describing the
words in each sentence. Either by virtue of the more
powerful model, or because of more expressive fea-
tures, they report 84.26% F1 on the same test set
and 90.10% when they add pretrained word embed-
ding features. Dos Santos et al. (2015) represent
each word by concatenating a pretrained word em-
bedding with a character-level embedding produced
by a convolutional neural network.
There is relatively little work on multilingual
NER, and most research is focused on building sys-
tems that are unsupervised in the sense that they use
resources like Wikipedia and Freebase rather than
manually annotated data. Nothman et al. (2013) use
Wikipedia anchor links and disambiguation pages
joined with Freebase types to create a huge amount
of somewhat noisy training data and are able to
achieve good results on many languages (with some
extra heuristics). These results are also included in
Table 2.
While BTS does not improve on the state-of-
the-art in English, its performance is better than
the best previous results that use only the provided
training data. BTS improves significantly on the
best known results in German, Spanish, and Dutch
even though these leverage external data. In addi-
tion, the BTS* models, trained separately on each
language, are worse than the single BTS model
(with the same number of parameters as each single-
language model) trained on all languages combined,
again suggesting that the model is learning some
language-independent representation of the task.
One interesting shortcoming of the BTS model is
that it is not obvious how to tune it to increase re-
call. In a standard classifier framework, we could
simply increase the prediction threshold to increase

Model        en     de     es     nl
Passos       90.90  -      -      -
Ando         89.31  75.27  -      -
Florian      88.76  72.41  -      -
Carreras     -      -      81.39  77.05
dos Santos   -      -      82.21  -
Nothman      85.2   66.5   79.6   78.6

Klein        86.07  71.90  -      -
Huang        84.26  -      -      -
Collobert    81.47  -      -      -

BTS          86.50  76.22  82.95  82.84
BTS*         84.57  72.08  81.83  78.08
Table 2: A comparison of NER systems. The results are F1
scores, where a correct span annotation exactly matches a gold
span annotation (start, length, and entity type must all be cor-
rect). Results of the systems described in the text are shown for
English, German, Spanish, and Dutch. BTS* shows the results
of the BTS model trained separately on each language while
BTS is a single model trained on all 4 languages together. The
top set of results leverage resources beyond the training data;
the middle set do not, and thus are most comparable to our re-
sults (bottom set).
precision and decrease the prediction threshold to in-
crease recall. However, because we only produce
annotations for spans (non-spans are not annotated),
we can adjust a threshold on total span probability
(the product of the start, length, and label probabili-
ties) to increase precision, but there is no clear way
to increase recall. The untuned model tends to pre-
fer precision over recall already, so some heuristic
for increasing recall might improve our overall F1
results.
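As a small illustration of the precision-side filtering just mentioned (our own sketch of the idea, not an implemented component):

```python
# Filter predicted spans by total span probability: the product of the start,
# length, and label softmax probabilities. Raising the threshold trades recall
# for precision; there is no analogous knob for raising recall.

def filter_spans(predictions, threshold):
    # predictions: list of ((start, length, label), (p_start, p_length, p_label))
    return [span for span, (p_start, p_length, p_label) in predictions
            if p_start * p_length * p_label >= threshold]
```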
6.3 Dropout and Stacked LSTMs
There are many modeling options and hyperparam-
eters that significantly impact the performance of
Neural Networks. Here we show the results of a
few experiments that were particularly relevant to
the performance obtained above.
First, Table 3 shows how dropout and byte-
dropout improve performance for both tasks. With-
out any kind of dropout, the training process starts to
overfit (development data perplexity starts increas-
ing) relatively quickly. For POS tagging, we set
dropout and byte-dropout to 0.2, while for NER, we
set both to 0.3. This significantly reduces the over-
fitting problem.
BTS Training     POS Accuracy   NER F1
Vanilla          94.78          74.75
+ Dropout        95.35          78.76
+ Byte-dropout   95.85          82.13
Table 3: BTS Part-of-speech tagging average accuracy across
all 13 evaluated languages and Named Entity Recognition aver-
age F1 across all 4 evaluated languages with various modifica-
tions to the vanilla training setup. Dropout is standard in Neural
Network model training because it often improves generaliza-
tion; Byte-dropout randomly replaces input bytes with a special
DROP marker.
Depth   Width=320   Width=640
1       76.15       77.59
2       79.40       79.73
3       81.44       81.93
4       82.13       82.18
Table 4: Macro-averaged (across 4 languages) F1 for the NER
task using different model architectures.
Second, Table 4 shows how performance im-
proves as we increase the size of the model in two
ways: the number of units in the model’s state
(width) and the number of stacked LSTMs (depth).
Increasing the width of the model improves perfor-
mance less than increasing the depth, and once we
use 4 stacked LSTMs, the added benefit of a much
wider model has disappeared. This result suggests
that rather than learning to partition the space of in-
puts according to the source language, the model is
learning some language-independent representation
at the deeper levels.
To validate our claim about language-independent
representation, Figure 2 shows the results of a tSNE
plot of the LSTM’s memory state when the output
is one of PER, LOC, ORG, MISC across the four
languages. While the label clusters are neatly sepa-
rated, the examples of each individual label do not
appear to be clustered by language. Thus rather than
partitioning each (label, language) combination, the
model is learning unified label representations that
are independent of the language.
7 Conclusions
We have described a model that uses a sequence-to-
sequence LSTM framework that reads a segment of

[Figure 2 plot: points are labeled by NER label (LOC, MISC, ORG, PER) and language (en, de, es, nl).]
Figure 2: A tSNE plot of the BTS model’s memory state just before the softmax layer produces one of the NER labels.
text one byte at a time and then produces span anno-
tations over the inputs. This work makes a number
of novel contributions:
First, we use the bytes in variable length unicode
encodings as inputs. This makes the model vocab-
ulary very small and also allows us to train a mul-
tilingual model that improves over single-language
models without using additional parameters. We in-
troduce byte-dropout, an analog to added noise in
speech or blurring in images, which significantly im-
proves generalization.
Second, the model produces span annotations,
where each is a sequence of three outputs: a start
position, a length, and a label. This decomposi-
tion keeps the output vocabulary small and marks a
significant departure from the typical Begin-Inside-
Outside (BIO) scheme used for labeling sequences.
Finally, the models are much more compact than
traditional word-based systems and they are stan-
dalone – no processing pipeline is needed. In par-
ticular, we do not need a tokenizer to segment text
in each of the input languages.
Acknowledgments
Many thanks to Fernando Pereira and Dan Ramage
for their insights about this project from the outset.
Thanks also to Cree Howard for creating Figure 1.
References
[Ando and Zhang2005] Rie Kubota Ando and Tong
Zhang. 2005. A framework for learning predictive
structures from multiple tasks and unlabeled data. The
Journal of Machine Learning Research, 6:1817–1853.
[Bahdanau et al.2014] Dzmitry Bahdanau, Kyunghyun
Cho, and Yoshua Bengio. 2014. Neural machine
translation by jointly learning to align and translate.
arXiv preprint arXiv:1409.0473.
[Bahdanau et al.2015] Dzmitry Bahdanau, Jan Chorowski,
Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio.
2015. End-to-end attention-based large vocabulary
speech recognition. arXiv preprint arXiv:1508.04395.
[Ballesteros et al.2015] Miguel Ballesteros, Chris Dyer,
and Noah A Smith. 2015. Improved transition-based
parsing by modeling characters instead of words with
lstms. arXiv preprint arXiv:1508.00657.

[Carreras et al.2002] Xavier Carreras, Lluís Màrquez, and
Lluís Padró. 2002. Named entity extraction using ad-
aboost. In Proceedings of CoNLL-2002, pages 167–
170. Taipei, Taiwan.
[Chan et al.2015] William Chan, Navdeep Jaitly, Quoc V
Le, and Oriol Vinyals. 2015. Listen, attend and spell.
arXiv preprint arXiv:1508.01211.
[Chitnis and DeNero2015] Rohan Chitnis and John DeN-
ero. 2015. Variable-length word encodings for neural
translation models. In Proceedings of the 2015 Con-
ference on Empirical Methods in Natural Language
Processing, pages 2088–2093.
[Church1993] Kenneth Ward Church. 1993. Char align:
a program for aligning parallel texts at the character
level. In Proceedings of the 31st annual meeting on
Association for Computational Linguistics, pages 1–8.
Association for Computational Linguistics.
[Collobert et al.2011] Ronan Collobert, Jason Weston,
Léon Bottou, Michael Karlen, Koray Kavukcuoglu,
and Pavel Kuksa. 2011. Natural language process-
ing (almost) from scratch. The Journal of Machine
Learning Research, 12:2493–2537.
[Dai and Le2015] Andrew M Dai and Quoc V Le. 2015.
Semi-supervised sequence learning. arXiv preprint
arXiv:1511.01432.
[dos Santos et al.2015] Cícero dos Santos and Victor
Guimarães. 2015. Boosting named entity recognition
with neural character embeddings. In Proceedings of
NEWS 2015, The Fifth Named Entities Workshop, page 25.
[Duong et al.2015] Long Duong, Trevor Cohn, Steven
Bird, and Paul Cook. 2015. Low resource depen-
dency parsing: Cross-lingual parameter sharing in a
neural network parser. In 53rd Annual Meeting of the
Association for Computational Linguistics and the 7th
International Joint Conference on Natural Language
Processing (Volume 2: Short Papers), pages 845–850.
[Erhan et al.2009] Dumitru Erhan, Pierre-Antoine Man-
zagol, Yoshua Bengio, Samy Bengio, and Pascal Vin-
cent. 2009. The difficulty of training deep architec-
tures and the effect of unsupervised pre-training. In
International Conference on artificial intelligence and
statistics, pages 153–160.
[Eyben et al.2009] Florian Eyben, Martin Wöllmer, Björn
Schuller, and Alex Graves. 2009. From speech to
letters-using a novel neural network architecture for
grapheme based asr. In Automatic Speech Recognition
& Understanding, 2009. ASRU 2009. IEEE Workshop
on, pages 376–380. IEEE.
[Florian et al.2003] Radu Florian, Abe Ittycheriah,
Hongyan Jing, and Tong Zhang. 2003. Named
entity recognition through classifier combination. In
Proceedings of the seventh conference on Natural
language learning at HLT-NAACL 2003-Volume
4, pages 168–171. Association for Computational
Linguistics.
[Frantzeskou et al.2006] Georgia Frantzeskou, Efstathios
Stamatatos, Stefanos Gritzalis, and Sokratis Katsikas.
2006. Effective identification of source code authors
using byte-level information. In Proceedings of the
28th international conference on Software engineer-
ing, pages 893–896. ACM.
[Graves and Jaitly2014] Alex Graves and Navdeep Jaitly.
2014. Towards end-to-end speech recognition with re-
current neural networks. In Proceedings of the 31st In-
ternational Conference on Machine Learning (ICML-
14), pages 1764–1772.
[Hinton et al.2012] Geoffrey E Hinton, Nitish Srivas-
tava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R
Salakhutdinov. 2012. Improving neural networks by
preventing co-adaptation of feature detectors. arXiv
preprint arXiv:1207.0580.
[Hochreiter and Schmidhuber1997] Sepp Hochreiter and
Jürgen Schmidhuber. 1997. Long short-term memory.
Neural computation, 9(8):1735–1780.
[Huang et al.2015] Zhiheng Huang, Wei Xu, and Kai Yu.
2015. Bidirectional lstm-crf models for sequence tag-
ging. arXiv preprint arXiv:1508.01991.
[Kim et al.2015] Yoon Kim, Yacine Jernite, David Son-
tag, and Alexander M Rush. 2015. Character-
aware neural language models.
arXiv preprint
arXiv:1508.06615.
[Klein et al.2003] Dan Klein, Joseph Smarr, Huy Nguyen,
and Christopher D Manning. 2003. Named entity
recognition with character-level models. In Proceed-
ings of the seventh conference on Natural language
learning at HLT-NAACL 2003-Volume 4, pages 180–
183. Association for Computational Linguistics.
[Lafferty et al.2001] John Lafferty, Andrew McCallum,
and Fernando CN Pereira. 2001. Conditional random
fields: Probabilistic models for segmenting and label-
ing sequence data.
[Ling et al.2015] Wang Ling, Tiago Luís, Luís Marujo,
Ramón Fernandez Astudillo, Silvio Amir, Chris Dyer,
Alan W Black, and Isabel Trancoso. 2015. Finding
function in form: Compositional character models for
open vocabulary word representation. arXiv preprint
arXiv:1508.02096.
[Manning2011] Christopher D Manning. 2011. Part-of-
speech tagging from 97% to 100%: is it time for some
linguistics? In Computational Linguistics and Intelli-
gent Text Processing, pages 171–189. Springer.
[Nakagawa2004] Tetsuji Nakagawa. 2004. Chinese
and japanese word segmentation using word-level and
character-level information. In Proceedings of the

20th international conference on Computational Lin-
guistics, page 466. Association for Computational Lin-
guistics.
[Nothman et al.2013] Joel Nothman, Nicky Ringland,
Will Radford, Tara Murphy, and James R Curran.
2013. Learning multilingual named entity recognition
from wikipedia. Artificial Intelligence, 194:151–175.
[Passos et al.2014] Alexandre Passos, Vineet Kumar, and
Andrew McCallum. 2014. Lexicon infused phrase
embeddings for named entity resolution.
arXiv
preprint arXiv:1404.5367.
[Peng et al.2003] Fuchun Peng, Dale Schuurmans, Shao-
jun Wang, and Vlado Keselj. 2003. Language in-
dependent authorship attribution using character level
language models. In Proceedings of the tenth confer-
ence on European chapter of the Association for Com-
putational Linguistics-Volume 1, pages 267–274. As-
sociation for Computational Linguistics.
[Petrov et al.2011] Slav Petrov, Dipanjan Das, and Ryan
McDonald. 2011. A universal part-of-speech tagset.
arXiv preprint arXiv:1104.2086.
[Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and
Quoc V Le. 2014. Sequence to sequence learning with
neural networks. In Advances in neural information
processing systems, pages 3104–3112.
[Zaremba et al.2014] Wojciech Zaremba, Ilya Sutskever,
and Oriol Vinyals. 2014. Recurrent neural network
regularization. arXiv preprint arXiv:1409.2329.
[Zhang et al.2015] Xiang Zhang, Junbo Zhao, and Yann
LeCun. 2015. Character-level convolutional networks
for text classification. In Advances in Neural Informa-
tion Processing Systems, pages 649–657.