LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding

Yiheng Xu1∗, Tengchao Lv1, Lei Cui1, Guoxin Wang2, Yijuan Lu2, Dinei Florencio2, Cha Zhang2, Furu Wei1
1 Microsoft Research Asia    2 Microsoft Azure AI
{t-yihengxu,tengchaolv,lecu,guow,yijlu,dinei,chazhang,fuwei}@microsoft.com
∗ Work done during internship at Microsoft Research Asia.

arXiv:2104.08836v3 [cs.CL] 9 Sep 2021

Abstract

Multimodal pre-training with text, layout, and image has achieved SOTA performance for visually-rich document understanding tasks recently, which demonstrates the great potential for joint learning across different modalities. In this paper, we present LayoutXLM, a multimodal pre-trained model for multilingual document understanding, which aims to bridge the language barriers for visually-rich document understanding. To accurately evaluate LayoutXLM, we also introduce a multilingual form understanding benchmark dataset named XFUND, which includes form understanding samples in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese), with key-value pairs manually labeled for each language. Experiment results show that the LayoutXLM model has significantly outperformed the existing SOTA cross-lingual pre-trained models on the XFUND dataset. The pre-trained LayoutXLM model and the XFUND dataset are publicly available at https://aka.ms/layoutxlm.

1 Introduction

Multimodal pre-training for visually-rich Document Understanding (VrDU) has achieved new SOTA performance on several public benchmarks recently (Xu et al., 2020a,b), including form understanding (Jaume et al., 2019), receipt understanding (Park et al., 2019), complex layout understanding (Graliński et al., 2020), document image classification (Harley et al., 2015) and document VQA (Mathew et al., 2020), due to the advantage that text, layout and image information is jointly learned end-to-end in a single framework. Meanwhile, we are well aware of the demand from the non-English world, since nearly 40% of digital documents on the web are in non-English languages. Simply translating these documents automatically with machine translation services might help, but it is often not satisfactory due to the poor translation quality on document images (Afli and Way, 2016). Therefore, it is vital to pre-train the LayoutLM model using real document datasets from around the world for the multilingual VrDU task.

Multilingual pre-trained models such as mBERT (Devlin et al., 2018), XLM (Lample and Conneau, 2019), XLM-RoBERTa (Conneau et al., 2020), mBART (Liu et al., 2020), and the recent InfoXLM (Chi et al., 2020) and mT5 (Xue et al., 2020) have pushed many SOTA results on cross-lingual natural language understanding tasks by pre-training Transformer models on different languages. These models have successfully bridged the language barriers in a number of cross-lingual transfer benchmarks such as XNLI (Conneau et al., 2018) and XTREME (Hu et al., 2020). Although a large amount of multilingual text data has been used in these cross-lingual pre-trained models, text-only multilingual models cannot be easily used for VrDU tasks because they are usually fragile in analyzing documents due to the format/layout diversity of documents in different countries, and even in different regions of the same country. Hence, to accurately understand these visually-rich documents in various languages, it is crucial to pre-train the multilingual models with not only textual information but also layout and image information in a multimodal framework.

To this end, we present a multimodal pre-trained model for multilingual VrDU tasks in this paper, aka LayoutXLM, which is a multilingual extension of the recent LayoutLMv2 model (Xu et al., 2020a). LayoutLMv2 integrates the image information in the pre-training stage by taking advantage of the Transformer architecture to learn the cross-modality interaction between visual and textual information. In addition, LayoutLMv2 uses two new training objectives in addition to the masked visual-language model, which are the image-text matching and image masking prediction tasks.
[Figure 1: Architecture of the LayoutXLM model, where the semantic entity recognition and relation extraction tasks are also demonstrated. The diagram shows a visual encoder and an OCR system feeding visual and text embeddings (with 1D and 2D position embeddings) into multi-modal Transformer encoder layers with spatial-aware self-attention; a BIO-tagging head performs semantic entity recognition and a biaffine attention classifier performs relation extraction over entity pairs.]

In this way, the pre-trained models absorb cross-modal knowledge from different document types, where the local invariance among layouts and formats is preserved. Inspired by the LayoutLMv2 model, LayoutXLM adopts the same architecture for multimodal pre-training and is initialized from the SOTA multilingual pre-trained InfoXLM model (Chi et al., 2020). In addition, we pre-train the model with the IIT-CDIP dataset (Lewis et al., 2006) as well as a great number of publicly available digital-born multilingual PDF files from the internet, which helps the LayoutXLM model learn from real-world documents. In this way, the model obtains textual and visual signals from a variety of document templates/layouts/formats in different languages, thereby taking advantage of the local invariance property from textual, visual and linguistic perspectives. Furthermore, to facilitate the evaluation of the pre-trained LayoutXLM model, we employ human annotators to label a multilingual form understanding dataset covering 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese), which yields a multilingual benchmark dataset named XFUND in which key-value pairs are annotated for each language. Experiment results show that the pre-trained LayoutXLM outperforms several SOTA cross-lingual pre-trained models on the XFUND benchmark dataset, which also demonstrates the potential of the multimodal pre-training strategy for multilingual document understanding.

The contributions of this paper are summarized as follows:

• We propose LayoutXLM, a multimodal pre-trained model for multilingual document understanding, which is trained with large-scale real-world scanned/digital-born documents.

• We also introduce XFUND, a multilingual form understanding benchmark dataset that includes human-labeled forms with key-value pairs in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese).

• LayoutXLM has outperformed other SOTA multilingual baseline models on the XFUND dataset, which demonstrates the great potential of multimodal pre-training for the multilingual VrDU task. The pre-trained LayoutXLM model and the XFUND dataset are publicly available at https://aka.ms/layoutxlm.

2 Approach

In this section, we introduce the model architecture, pre-training objectives, and pre-training dataset.
[Figure 2: Language distribution of the dataset for pre-training LayoutXLM.]

We follow the LayoutLMv2 (Xu et al., 2020a) architecture and transfer the model to large-scale multilingual document datasets.

2.1 Model Architecture

Similar to the LayoutLMv2 framework, we build the LayoutXLM model with a multimodal Transformer architecture. The framework is shown in Figure 1. The model accepts information from three different modalities, including text, layout, and image, which are encoded respectively by the text embedding, layout embedding, and visual embedding layers. The text and image embeddings are concatenated and then added to the layout embedding to obtain the input embeddings. The input embeddings are encoded by a multimodal Transformer with the spatial-aware self-attention mechanism. Finally, the output contextual representations can be utilized by the following task-specific layers. For brevity, we refer to Xu et al. (2020a) for further details on the architecture.
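To make the embedding combination described above concrete, the following is a minimal PyTorch sketch of the input-embedding step. Module names, dimensions, and the visual-feature interface are illustrative assumptions, not the released implementation (which additionally handles segment embeddings, width/height features, and normalization).

```python
import torch
from torch import nn

class InputEmbeddingSketch(nn.Module):
    """Illustrative embedding layer: text tokens and visual patches are
    concatenated along the sequence axis, then layout (2D position) and 1D
    position embeddings are added. All sizes here are assumptions."""

    def __init__(self, vocab_size=250002, hidden=768, max_pos=1024, max_coord=1000):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)
        self.pos_emb = nn.Embedding(max_pos, hidden)     # 1D position
        self.x_emb = nn.Embedding(max_coord, hidden)     # 2D position for x0 / x1
        self.y_emb = nn.Embedding(max_coord, hidden)     # 2D position for y0 / y1
        self.visual_proj = nn.Linear(256, hidden)        # project visual features

    def forward(self, token_ids, token_boxes, visual_feats, visual_boxes):
        # token_ids: (B, T); *_boxes: (B, *, 4) integer coords in [0, max_coord)
        text = self.token_emb(token_ids)                 # (B, T, hidden)
        visual = self.visual_proj(visual_feats)          # (B, V, hidden)
        seq = torch.cat([text, visual], dim=1)           # concatenate modalities
        boxes = torch.cat([token_boxes, visual_boxes], dim=1)
        layout = (self.x_emb(boxes[..., 0]) + self.y_emb(boxes[..., 1])
                  + self.x_emb(boxes[..., 2]) + self.y_emb(boxes[..., 3]))
        pos = self.pos_emb(torch.arange(seq.size(1), device=seq.device))
        return seq + layout + pos                        # input to the Transformer
```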
2.2 Pre-training

The pre-training objectives of LayoutLMv2 have shown their effectiveness in modeling visually-rich documents. Therefore, we naturally adapt this pre-training framework to multilingual document pre-training. Following the idea of cross-modal alignment, our pre-training framework for document understanding contains three pre-training objectives: Multilingual Masked Visual-Language Modeling (text-layout alignment), Text-Image Alignment (fine-grained text-image alignment), and Text-Image Matching (coarse-grained text-image alignment).

Multilingual Masked Visual-Language Modeling  The Masked Visual-Language Modeling (MVLM) objective was originally proposed in the vanilla LayoutLM and is also used in LayoutLMv2, aiming to model the rich text in visually-rich documents. In this pre-training objective, the model is required to predict a masked text token based on the remaining text context and the whole layout clues. Similar to LayoutLM/LayoutLMv2, we train LayoutXLM with the Multilingual Masked Visual-Language Modeling (MMVLM) objective.

In LayoutLM/LayoutLMv2, an English word is treated as the basic unit, and its layout information is obtained by extracting the bounding box of each word with OCR tools; the subtokens of each word then share the same layout information. However, this strategy is not applicable to LayoutXLM because the definition of the linguistic unit differs from language to language. To avoid language-specific pre-processing, we instead obtain character-level bounding boxes. After tokenization using SentencePiece with a unigram language model, we calculate the bounding box of each token by merging the bounding boxes of all characters it contains (see the sketch below). In this way, we can efficiently unify the multilingual multimodal inputs.
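As a rough illustration of this unification step, the helper below merges character-level boxes into token-level boxes. It assumes the tokenizer (e.g., SentencePiece) can report, for each token, the half-open span of characters it covers; that alignment step is an assumption about the surrounding pipeline rather than a documented API.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1)

def merge_char_boxes(char_boxes: List[Box], token_spans: List[Tuple[int, int]]) -> List[Box]:
    """For each token, take the union of the bounding boxes of the
    characters it covers (half-open character spans [start, end))."""
    token_boxes = []
    for start, end in token_spans:
        xs0, ys0, xs1, ys1 = zip(*char_boxes[start:end])
        token_boxes.append((min(xs0), min(ys0), max(xs1), max(ys1)))
    return token_boxes

# Toy example: three characters split by the tokenizer into two subword tokens.
chars = [(10, 20, 18, 32), (19, 20, 27, 32), (30, 21, 38, 33)]
print(merge_char_boxes(chars, [(0, 2), (2, 3)]))
# [(10, 20, 27, 32), (30, 21, 38, 33)]
```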
Text-Image Alignment  The Text-Image Alignment (TIA) task is designed to help the model capture the fine-grained alignment relationship between text and image. We randomly select some text lines and then cover their corresponding image regions on the document image. The model needs to predict a binary label for each token based on whether it is covered or not.
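A simplified sketch of how TIA training targets could be produced, assuming OCR text lines with pixel boxes are available; the covering ratio, fill value, and data layout are illustrative choices rather than the paper's exact settings.

```python
import random
import numpy as np

def make_tia_targets(page_image, line_spans, cover_prob=0.15):
    """Text-Image Alignment targets (illustrative): randomly cover the image
    region of some text lines and label every token of a covered line as 1.
    line_spans: list of (token_indices, (x0, y0, x1, y1)) per OCR text line."""
    labels = {}
    for token_indices, (x0, y0, x1, y1) in line_spans:
        covered = random.random() < cover_prob
        if covered:
            page_image[y0:y1, x0:x1] = 0      # blank out the covered region
        for idx in token_indices:
            labels[idx] = int(covered)
    return page_image, labels

# Toy usage: a 100x100 grayscale page with two text lines.
page = np.full((100, 100), 255, dtype=np.uint8)
_, token_labels = make_tia_targets(page, [([0, 1], (5, 5, 95, 20)), ([2, 3], (5, 30, 95, 45))])
```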
Text-Image Matching  For Text-Image Matching (TIM), we aim to align the high-level semantic representations of text and image. To this end, we require the model to predict whether the text and image come from the same document page.

2.3 Pre-training Data

The LayoutXLM model is pre-trained with documents in 53 languages. Figure 2 shows the distribution of pre-training languages. In this section, we briefly describe the pipeline for preparing the large-scale multilingual document collection.
Data Collection  To collect a large-scale multilingual visually-rich document collection, we download and process publicly available multilingual digital-born PDF documents following the principles and policies of Common Crawl (https://commoncrawl.org). Using digital-born PDF documents benefits both the collecting and pre-processing steps. On the one hand, we do not have to identify scanned documents among natural images; on the other hand, we can directly extract accurate text with the corresponding layout information using off-the-shelf PDF parsers, saving the time of running expensive OCR tools.

Pre-processing  The pre-processing step is needed to clean the dataset since the raw multilingual PDFs are often noisy. We use an open-source PDF parser called PyMuPDF (https://github.com/pymupdf/PyMuPDF) to extract text, layout, and document images from the PDF documents. After PDF parsing, we discard documents with fewer than 200 characters. We use the language detector from the BlingFire library (https://github.com/microsoft/BlingFire) to split the data per language. Following CCNet (Wenzek et al., 2019), we assign a document to a language if the language score is higher than 0.5; otherwise, unclear PDF files with a language score of 0.5 or less are discarded.
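A rough sketch of this parsing and filtering step, using PyMuPDF for word-level extraction (the word-tuple layout matches recent PyMuPDF versions); the detect_language callable is a placeholder for whichever language identifier is used (the paper uses BlingFire), not that library's actual API.

```python
import fitz  # PyMuPDF

def parse_pdf(path, min_chars=200):
    """Extract words and their bounding boxes from a digital-born PDF;
    documents with fewer than 200 characters are dropped, mirroring the
    filter described above."""
    doc = fitz.open(path)
    pages, total_chars = [], 0
    for page in doc:
        # get_text("words") yields (x0, y0, x1, y1, word, block_no, line_no, word_no)
        words = [(w[4], (w[0], w[1], w[2], w[3])) for w in page.get_text("words")]
        total_chars += sum(len(text) for text, _ in words)
        pages.append(words)
    doc.close()
    return pages if total_chars >= min_chars else None

def assign_language(text, detect_language, threshold=0.5):
    """detect_language is a stand-in callable returning (lang, score).
    Documents whose best score does not exceed the threshold are discarded."""
    lang, score = detect_language(text)
    return lang if score > threshold else None
```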
Data Sampling  After splitting the data per language, we use the same sampling probability p_l ∝ (n_l / n)^α as XLM (Lample and Conneau, 2019) to sample batches from the different languages. Following InfoXLM (Chi et al., 2020), we use α = 0.7 for LayoutXLM to make a reasonable compromise between performance on high- and low-resource languages. The resulting language distribution is shown in Figure 2. Finally, we follow this distribution and sample a multilingual document dataset with 22 million visually rich documents. In addition, we also sample 8 million scanned English documents from the IIT-CDIP dataset, so that we use 30 million documents in total to pre-train LayoutXLM, and the model can benefit from the visual information of both scanned and digital-born document images.
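The exponentiated sampling distribution is straightforward to compute; the sketch below uses illustrative document counts, not the actual corpus statistics. With α = 0.7, low-resource languages are sampled more often than their raw share of the corpus.

```python
def sampling_probabilities(doc_counts, alpha=0.7):
    """p_l proportional to (n_l / n) ** alpha, as in XLM/InfoXLM; alpha < 1
    up-weights low-resource languages relative to their raw frequency."""
    n = sum(doc_counts.values())
    weights = {lang: (count / n) ** alpha for lang, count in doc_counts.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

# Illustrative counts only.
print(sampling_probabilities({"en": 8_000_000, "zh": 600_000, "pt": 50_000}))
```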
3 XFUND: A Multilingual Form Understanding Benchmark

In recent years, many evaluation datasets for document understanding tasks have been proposed, such as PubLayNet (Zhong et al., 2019), FUNSD (Jaume et al., 2019), SROIE (https://rrc.cvc.uab.es/?ch=13), TableBank (Li et al., 2020a), DocBank (Li et al., 2020b), and DocVQA (Mathew et al., 2020). They have successfully helped to evaluate the proposed neural network models and to show the performance gap between deep learning models and human beings, which significantly empowers the development of document understanding research. However, almost all of these evaluations and benchmarks focus solely on English documents, which limits research on non-English document understanding tasks. To this end, we introduce a new benchmark for multilingual form understanding, XFUND, by extending the FUNSD dataset to 7 other languages, including Chinese, Japanese, Spanish, French, Italian, German, and Portuguese; sampled documents are shown in Figure 3. Next, we introduce the key-value extraction task in our benchmark, as well as the data collection and labeling pipeline and the dataset statistics.

3.1 Task description

Key-value extraction is one of the most critical tasks in form understanding. Similar to FUNSD, we define this task with two sub-tasks: semantic entity recognition and relation extraction.

Semantic Entity Recognition  Given a visually-rich document D, we acquire a discrete token set t = {t_0, t_1, ..., t_n}, where each token t_i = (w, (x_0, y_0, x_1, y_1)) consists of a word w and its bounding box coordinates (x_0, y_0, x_1, y_1). C = {c_0, c_1, ..., c_m} is the set of semantic entity labels into which the tokens are classified. Semantic entity recognition is the task of extracting semantic entities and classifying them into the given entity types. In other words, we intend to find a function F_SER : (D, C) → E, where E is the predicted semantic entity set:

E = {({t_0^0, ..., t_{n_0}^0}, c_0), ..., ({t_0^k, ..., t_{n_k}^k}, c_k)}
[Figure 3: Two sampled forms from the XFUND benchmark dataset, (a) Chinese and (b) Italian, where red denotes the headers, green denotes the keys, and blue denotes the values.]

Relation Extraction  Equipped with the document D and the semantic entity label set C, relation extraction aims to predict the relation between any two predicted semantic entities. Defining R = {r_0, r_1, ..., r_m} as the set of semantic relation labels, we intend to find a function F_RE : (D, C, R, E) → L, where L is the predicted semantic relation set:

L = {(head_0, tail_0, r_0), ..., (head_k, tail_k, r_k)}

where head_i and tail_i are two semantic entities. In this work, we mainly focus on key-value relation extraction.
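To make the two output structures concrete, here is a toy instance of E and L under the definitions above; the token indices and the relation label name are illustrative, while the entity types follow XFUND (header, question, answer).

```python
# A toy prediction for a form with three entities. Tokens 3-4 form a key
# ("Name:") and tokens 5-7 its value, so they are linked as a key-value pair.
E = [
    ({0, 1, 2}, "header"),
    ({3, 4}, "question"),
    ({5, 6, 7}, "answer"),
]
L = [
    (({3, 4}, "question"), ({5, 6, 7}, "answer"), "key-value"),
]
```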
3.2 Data Collection and Labeling

Form Templates  Forms are usually used to collect information in different business scenarios. To avoid leaking sensitive information from real-world documents, we collect documents that are publicly available on the internet and remove their content, keeping only the templates, which are then manually filled with synthetic information. We collect form templates in 7 languages from the internet. After that, human annotators manually fill synthetic information into these form templates following the corresponding requirements. Each template is allowed to be used only once, which means each form is different from the others. Besides, since the original FUNSD documents contain both digitally filled-out forms and handwritten forms, we also ask annotators to fill in the forms by typing or by handwriting. The completed forms are finally scanned into document images for further OCR processing and key-value labeling.

Key-value Pairs  Key-value pairs are also annotated by human annotators. Given the synthetic forms, we use the Microsoft Read API (https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/overview-ocr) to generate OCR tokens with bounding boxes. With a GUI annotation tool, annotators are shown the original document images and a visualization of the bounding boxes of all OCR tokens. The annotators are asked to group the discrete tokens into entities and to assign pre-defined labels to the entities. Also, if two entities are related, they should be linked together as a key-value pair.

3.3 Dataset Statistics

The XFUND benchmark includes 7 languages with 1,393 fully annotated forms. Each language includes 199 forms, where the training set includes 149 forms and the test set includes 50 forms. Detailed information is shown in Table 1.

lang  split     header  question  answer  other   total
ZH    training     229     3,692   4,641   1,666  10,228
ZH    testing       58     1,253   1,732     586   3,629
JA    training     150     2,379   3,836   2,640   9,005
JA    testing       58       723   1,280   1,322   3,383
ES    training     253     3,013   4,254   3,929  11,449
ES    testing       90       909   1,218   1,196   3,413
FR    training     183     2,497   3,427   2,709   8,816
FR    testing       66     1,023   1,281   1,131   3,501
IT    training     166     3,762   4,932   3,355  12,215
IT    testing       65     1,230   1,599   1,135   4,029
DE    training     155     2,609   3,992   1,876   8,632
DE    testing       59       858   1,322     650   2,889
PT    training     185     3,510   5,428   2,531  11,654
PT    testing       59     1,288   1,940     882   4,169

Table 1: Statistics of the XFUND dataset. Each number in the table indicates the number of entities in each category.

3.4 Baselines

Semantic Entity Recognition  For this task, we simply follow the typical sequence labeling paradigm with the BIO labeling format and build task-specific layers over the text part of LayoutXLM.
Relation Extraction  Following Bekoulis et al. (2018), we first incrementally construct the set of relation candidates by producing all possible pairs of the given semantic entities. For every pair, the representation of the head/tail entity is the concatenation of the first token vector of the entity and an entity type embedding obtained from a dedicated type embedding layer. After being projected by two separate FFN layers, the representations of the head and the tail are concatenated and then fed into a biaffine classifier.
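The following is a minimal sketch of such a relation head, assuming the entity first-token vectors and type ids have already been gathered from the encoder output; the layer sizes, candidate construction via permutations, and the einsum-based biaffine scoring are illustrative choices, not the exact released head.

```python
import itertools
import torch
from torch import nn

class BiaffineRelationHead(nn.Module):
    """Illustrative relation-extraction head: each entity is represented by
    (first-token vector ++ entity-type embedding), projected by head/tail
    FFNs, and every candidate pair is scored with a biaffine classifier."""

    def __init__(self, hidden=768, num_types=4, type_dim=128, proj=512, num_labels=2):
        super().__init__()
        self.type_emb = nn.Embedding(num_types, type_dim)
        self.head_ffn = nn.Linear(hidden + type_dim, proj)
        self.tail_ffn = nn.Linear(hidden + type_dim, proj)
        # One (proj+1) x (proj+1) biaffine weight matrix per relation label.
        self.bilinear = nn.Parameter(torch.randn(num_labels, proj + 1, proj + 1) * 0.02)

    def forward(self, first_token_vecs, entity_types):
        # first_token_vecs: (E, hidden); entity_types: (E,) integer type ids.
        ent = torch.cat([first_token_vecs, self.type_emb(entity_types)], dim=-1)
        pairs = list(itertools.permutations(range(ent.size(0)), 2))  # all candidates
        head = torch.relu(self.head_ffn(ent[[h for h, _ in pairs]]))
        tail = torch.relu(self.tail_ffn(ent[[t for _, t in pairs]]))
        ones = torch.ones(head.size(0), 1, device=head.device)       # bias term
        h1, t1 = torch.cat([head, ones], -1), torch.cat([tail, ones], -1)
        logits = torch.einsum("pi,rij,pj->pr", h1, self.bilinear, t1)
        return pairs, logits                                          # (num_pairs, num_labels)
```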

4 Experiments

In this section, we introduce the experiment settings for pre-training LayoutXLM. To verify the effectiveness of the pre-trained LayoutXLM model, we evaluate all the pre-trained models on our human-labeled XFUND benchmark.

4.1 Settings

Pre-training LayoutXLM  Following the original LayoutLMv2 recipe, we train LayoutXLM models in two sizes. For the LayoutXLM_BASE model, we use a 12-layer Transformer encoder with 12 heads and set the hidden size to d = 768. For the LayoutXLM_LARGE model, we increase the number of layers to 24 with 16 heads and the hidden size to d = 1,024. ResNeXt101-FPN is used as the visual backbone in both models. The numbers of parameters in the two models are approximately 345M and 625M, respectively. During the pre-training stage, we first initialize the Transformer encoder along with the text embeddings from InfoXLM and initialize the visual embedding layer with a Mask R-CNN model trained on PubLayNet. The rest of the parameters are initialized randomly. Our models are trained with 64 Nvidia V100 GPUs.
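For reference, the two reported configurations can be summarized as a small config object; the field names are illustrative, while the numbers are those given above.

```python
from dataclasses import dataclass

@dataclass
class LayoutXLMConfig:
    """Model sizes reported in Section 4.1; field names are illustrative."""
    num_layers: int
    num_heads: int
    hidden_size: int

BASE = LayoutXLMConfig(num_layers=12, num_heads=12, hidden_size=768)    # ~345M parameters
LARGE = LayoutXLMConfig(num_layers=24, num_heads=16, hidden_size=1024)  # ~625M parameters
```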
Fine-tuning on XFUND  We conduct experiments on the XFUND benchmark. Besides the typical language-specific fine-tuning experiments, we also design two additional settings to demonstrate the ability to transfer knowledge among different languages: zero-shot transfer learning and multitask fine-tuning. Specifically, (1) language-specific fine-tuning refers to the typical paradigm of fine-tuning on language X and testing on language X; (2) zero-shot transfer learning means the models are trained on English data only and evaluated on each target language; (3) multitask fine-tuning requires the model to train on the data of all languages. We evaluate models in these three settings over the two sub-tasks in XFUND, semantic entity recognition and relation extraction, and compare LayoutXLM with two strong cross-lingual language models: XLM-R and InfoXLM.

4.2 Results

We evaluate the LayoutXLM model on the language-specific fine-tuning tasks, and the results are shown in Table 2. Compared with pre-trained models such as XLM-R and InfoXLM, the LayoutXLM_LARGE model achieves the highest F1 scores on both the SER and RE tasks. The significant improvement shows LayoutXLM's capability to transfer the knowledge obtained from pre-training to downstream tasks, which further confirms the effectiveness of our multilingual pre-training framework.

For cross-lingual zero-shot transfer, we present the evaluation results in Table 3. Although the models are only fine-tuned on the FUNSD dataset (in English), they can still transfer the knowledge to different languages. In addition, it is observed that the LayoutXLM model significantly outperforms the other text-based models. This verifies that LayoutXLM can capture the common layout invariance among different languages and transfer it to other languages for form understanding.
Task  Model              FUNSD   ZH      JA      ES      FR      IT      DE      PT      Avg.
SER   XLM-RoBERTa_BASE   0.667   0.8774  0.7761  0.6105  0.6743  0.6687  0.6814  0.6818  0.7047
SER   InfoXLM_BASE       0.6852  0.8868  0.7865  0.6230  0.7015  0.6751  0.7063  0.7008  0.7207
SER   LayoutXLM_BASE     0.794   0.8924  0.7921  0.7550  0.7902  0.8082  0.8222  0.7903  0.8056
SER   XLM-RoBERTa_LARGE  0.7074  0.8925  0.7817  0.6515  0.7170  0.7139  0.711   0.7241  0.7374
SER   InfoXLM_LARGE      0.7325  0.8955  0.7904  0.6740  0.7140  0.7152  0.7338  0.7212  0.7471
SER   LayoutXLM_LARGE    0.8225  0.9161  0.8033  0.7830  0.8098  0.8275  0.8361  0.8273  0.8282
RE    XLM-RoBERTa_BASE   0.2659  0.5105  0.5800  0.5295  0.4965  0.5305  0.5041  0.3982  0.4769
RE    InfoXLM_BASE       0.2920  0.5214  0.6000  0.5516  0.4913  0.5281  0.5262  0.4170  0.4910
RE    LayoutXLM_BASE     0.5483  0.7073  0.6963  0.6896  0.6353  0.6415  0.6551  0.5718  0.6432
RE    XLM-RoBERTa_LARGE  0.3473  0.6475  0.6798  0.6330  0.6080  0.6171  0.6189  0.5762  0.5910
RE    InfoXLM_LARGE      0.3679  0.6775  0.6604  0.6346  0.6096  0.6659  0.6057  0.5800  0.6002
RE    LayoutXLM_LARGE    0.6404  0.7888  0.7255  0.7666  0.7102  0.7691  0.6843  0.6796  0.7206

Table 2: Language-specific fine-tuning accuracy (F1) on the XFUND dataset (fine-tuning on language X, testing on language X), where "SER" denotes semantic entity recognition and "RE" denotes relation extraction.

Task  Model              FUNSD   ZH      JA      ES      FR      IT      DE      PT      Avg.
SER   XLM-RoBERTa_BASE   0.667   0.4144  0.3023  0.3055  0.371   0.2767  0.3286  0.3936  0.3824
SER   InfoXLM_BASE       0.6852  0.4408  0.3603  0.3102  0.4021  0.2880  0.3587  0.4502  0.4119
SER   LayoutXLM_BASE     0.794   0.6019  0.4715  0.4565  0.5757  0.4846  0.5252  0.539   0.5561
SER   XLM-RoBERTa_LARGE  0.7074  0.5205  0.3939  0.3627  0.4672  0.3398  0.418   0.4997  0.4637
SER   InfoXLM_LARGE      0.7325  0.5536  0.4132  0.3689  0.4909  0.3598  0.4363  0.5126  0.4835
SER   LayoutXLM_LARGE    0.8225  0.6896  0.519   0.4976  0.6135  0.5517  0.5905  0.6077  0.6115
RE    XLM-RoBERTa_BASE   0.2659  0.1601  0.2611  0.2440  0.2240  0.2374  0.2288  0.1996  0.2276
RE    InfoXLM_BASE       0.2920  0.2405  0.2851  0.2481  0.2454  0.2193  0.2027  0.2049  0.2423
RE    LayoutXLM_BASE     0.5483  0.4494  0.4408  0.4708  0.4416  0.4090  0.3820  0.3685  0.4388
RE    XLM-RoBERTa_LARGE  0.3473  0.2421  0.3037  0.2843  0.2897  0.2496  0.2617  0.2333  0.2765
RE    InfoXLM_LARGE      0.3679  0.3156  0.3364  0.3185  0.3189  0.2720  0.2953  0.2554  0.3100
RE    LayoutXLM_LARGE    0.6404  0.5531  0.5696  0.5780  0.5615  0.5184  0.4890  0.4795  0.5487

Table 3: Zero-shot transfer accuracy (F1) on the XFUND dataset (fine-tuning on FUNSD, testing on language X), where "SER" denotes semantic entity recognition and "RE" denotes relation extraction.

Finally, Table 4 shows the evaluation results for multitask learning. In this setting, the pre-trained LayoutXLM model is fine-tuned with all 8 languages simultaneously and evaluated on each specific language, in order to investigate whether improvements can be obtained by multilingual fine-tuning. We observe that multitask learning further improves the model performance compared to language-specific fine-tuning, which again confirms that document understanding can benefit from the layout invariance among different languages.
Task  Model              FUNSD   ZH      JA      ES      FR      IT      DE      PT      Avg.
SER   XLM-RoBERTa_BASE   0.6633  0.883   0.7786  0.6223  0.7035  0.6814  0.7146  0.6726  0.7149
SER   InfoXLM_BASE       0.6538  0.8741  0.7855  0.5979  0.7057  0.6826  0.7055  0.6796  0.7106
SER   LayoutXLM_BASE     0.7924  0.8973  0.7964  0.7798  0.8173  0.821   0.8322  0.8241  0.8201
SER   XLM-RoBERTa_LARGE  0.7151  0.8967  0.7828  0.6615  0.7407  0.7165  0.7431  0.7449  0.7502
SER   InfoXLM_LARGE      0.7246  0.8919  0.7998  0.6702  0.7376  0.7180  0.7523  0.7332  0.7534
SER   LayoutXLM_LARGE    0.8068  0.9155  0.8216  0.8055  0.8384  0.8372  0.853   0.8650  0.8429
RE    XLM-RoBERTa_BASE   0.3638  0.6797  0.6829  0.6828  0.6727  0.6937  0.6887  0.6082  0.6341
RE    InfoXLM_BASE       0.3699  0.6493  0.6473  0.6828  0.6831  0.6690  0.6384  0.5763  0.6145
RE    LayoutXLM_BASE     0.6671  0.8241  0.8142  0.8104  0.8221  0.8310  0.7854  0.7044  0.7823
RE    XLM-RoBERTa_LARGE  0.4246  0.7316  0.7350  0.7513  0.7532  0.7520  0.7111  0.6582  0.6896
RE    InfoXLM_LARGE      0.4543  0.7311  0.7510  0.7644  0.7549  0.7504  0.7356  0.6875  0.7037
RE    LayoutXLM_LARGE    0.7683  0.9000  0.8621  0.8592  0.8669  0.8675  0.8263  0.8160  0.8458

Table 4: Multitask fine-tuning accuracy (F1) on the XFUND dataset (fine-tuning on all 8 languages, testing on language X), where "SER" denotes semantic entity recognition and "RE" denotes relation extraction.

5 Related Work

Multimodal pre-training has become popular in recent years due to its successful applications in vision-language representation learning. Lu et al. (2019) proposed ViLBERT for learning task-agnostic joint representations of image content and natural language by extending the popular BERT architecture to a multimodal two-stream model. Su et al. (2020) proposed VL-BERT, which adopts the Transformer model as the backbone and extends it to take both visual and linguistic embedded features as input. Li et al. (2019) proposed VisualBERT, which consists of a stack of Transformer layers that implicitly align elements of an input text and regions in an associated input image with self-attention. Chen et al. (2020) introduced UNITER, which learns through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions) and can power heterogeneous downstream V+L tasks with joint multimodal embeddings. Li et al. (2020c) proposed a new learning method, Oscar (Object-Semantics Aligned Pre-training), which uses object tags detected in images as anchor points to significantly ease the learning of alignments. Inspired by these vision-language pre-trained models, we would like to introduce vision-language pre-training into the document intelligence area, where the text, layout, and image information can be jointly learned to benefit the VrDU tasks.

Multilingual pre-trained models have pushed many SOTA results on cross-lingual natural language understanding tasks by pre-training Transformer models on different languages. These models have successfully bridged the language barriers in a number of cross-lingual transfer benchmarks such as XNLI (Conneau et al., 2018) and XTREME (Hu et al., 2020). Devlin et al. (2018) introduced a new language representation model called BERT and extended it to a multilingual version called mBERT, which is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; as a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks. Lample and Conneau (2019) proposed two methods to learn cross-lingual language models (XLMs): one unsupervised, which relies only on monolingual data, and one supervised, which leverages parallel data with a new cross-lingual language model objective. Conneau et al. (2020) proposed to train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data, which significantly outperforms mBERT on a variety of cross-lingual benchmarks. Recently, Chi et al. (2020) formulated cross-lingual language model pre-training as maximizing mutual information between multilingual multi-granularity texts; this unified view helps to better understand the existing methods for learning cross-lingual representations, and the information-theoretic framework inspires a new pre-training task based on contrastive learning. Liu et al. (2020) presented mBART, a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages using the BART objective. Xue et al. (2020) introduced mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages. The pre-trained LayoutXLM model is built on these multilingual textual models as the initialization, which benefits VrDU tasks in different languages worldwide.

6 Conclusion

In this paper, we present LayoutXLM, a multimodal pre-trained model for multilingual visually-rich document understanding. The LayoutXLM model is pre-trained with 30 million scanned and digital-born documents in 53 languages. Meanwhile, we also introduce the multilingual form understanding benchmark XFUND, which includes key-value labeled forms in 7 languages. Experimental results have illustrated that the pre-trained LayoutXLM model significantly outperforms the SOTA baselines for multilingual document understanding, which bridges the language gap in real-world document understanding tasks. We make LayoutXLM and XFUND publicly available to advance document understanding research.

For future research, we will further enlarge the multilingual training data to cover more languages as well as more document layouts and templates. In addition, as there are a great number of business documents with the same content but in different languages, we will also investigate how to leverage contrastive learning of parallel documents for multilingual pre-training.
References

Haithem Afli and Andy Way. 2016. Integrating optical character recognition and machine translation of historical documents. In Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH), pages 109–116, Osaka, Japan. The COLING 2016 Organizing Committee.

Giannis Bekoulis, Johannes Deleu, Thomas Demeester, and Chris Develder. 2018. Joint entity recognition and relation extraction as a multi-head selection problem. Expert Syst. Appl., 114:34–45.

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. UNITER: Universal image-text representation learning.

Zewen Chi, Li Dong, Furu Wei, Nan Yang, Saksham Singhal, Wenhui Wang, Xia Song, Xian-Ling Mao, Heyan Huang, and Ming Zhou. 2020. InfoXLM: An information-theoretic framework for cross-lingual language model pre-training.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Filip Graliński, Tomasz Stanisławek, Anna Wróblewska, Dawid Lipiński, Agnieszka Kaliska, Paulina Rosalska, Bartosz Topolski, and Przemysław Biecek. 2020. Kleister: A novel task for information extraction involving long documents with complex layout.

Adam W Harley, Alex Ufkes, and Konstantinos G Derpanis. 2015. Evaluation of deep convolutional nets for document image classification and retrieval. In International Conference on Document Analysis and Recognition (ICDAR).

Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization.

Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran. 2019. FUNSD: A dataset for form understanding in noisy scanned documents. In 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW).

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining.

D. Lewis, G. Agam, S. Argamon, O. Frieder, D. Grossman, and J. Heard. 2006. Building a test collection for complex document information processing. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '06, pages 665–666, New York, NY, USA. Association for Computing Machinery.

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. VisualBERT: A simple and performant baseline for vision and language.

Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, and Zhoujun Li. 2020a. TableBank: Table benchmark for image-based table detection and recognition. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 1918–1925, Marseille, France. European Language Resources Association.

Minghao Li, Yiheng Xu, Lei Cui, Shaohan Huang, Furu Wei, Zhoujun Li, and Ming Zhou. 2020b. DocBank: A benchmark dataset for document layout analysis. In Proceedings of the 28th International Conference on Computational Linguistics, pages 949–960, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. 2020c. Oscar: Object-semantics aligned pre-training for vision-language tasks.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation.

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks.

Minesh Mathew, Dimosthenis Karatzas, R. Manmatha, and C. V. Jawahar. 2020. DocVQA: A dataset for VQA on document images.

Seunghyun Park, Seung Shin, Bado Lee, Junyeop Lee, Jaeheung Surh, Minjoon Seo, and Hwalsuk Lee. 2019. CORD: A consolidated receipt dataset for post-OCR parsing.

Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2020. VL-BERT: Pre-training of generic visual-linguistic representations.

Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. 2019. CCNet: Extracting high quality monolingual datasets from web crawl data. CoRR, abs/1911.00359.

Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, and Lidong Zhou. 2020a. LayoutLMv2: Multi-modal pre-training for visually-rich document understanding.

Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. 2020b. LayoutLM: Pre-training of text and layout for document image understanding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '20, pages 1192–1200, New York, NY, USA. Association for Computing Machinery.

Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2020. mT5: A massively multilingual pre-trained text-to-text transformer.

Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. 2019. PubLayNet: Largest dataset ever for document layout analysis. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1015–1022. IEEE.
