The National Library of Luxembourg (BnL) started its first newspaper digitization initiative, with layout recognition and OCR at the article level, back in 2006. Service providers were asked to create images of excellent quality, to run an optical layout recognition process, to identify articles and to run OCR on them. The data was modeled according to the METS/ALTO standard. Since then, however, the capabilities of OCR software have increased considerably.
Developed by BnL in the context of its Open Data initiative, Nautilus-OCR uses these improvements in technology and the already structured data to rerun and enhance OCR. Nautilus-OCR can be used in two ways:
- Main purpose: enhance the OCR quality of original (ori) METS/ALTO packages. The Nautilus-OCR METS/ALTO to METS/ALTO pipeline:
  - Extracts all ori image/text pairs
  - Targets a specific set of block types
  - Uses enhancement prediction on every target to decide whether to rerun OCR
  - Integrates new outputs into an updated METS/ALTO package
- Alternatively: use as a regular OCR engine that is applied to a set of images.
Nautilus-OCR provides the possibility to visually compare ori (left) to new (right) outputs.
Key features:
- Custom model training.
- Included pre-trained OCR, font recognition and enhancement prediction models.
- METS/ALTO to METS/ALTO using enhancement prediction.
- Fast, multi-font OCR pipeline.
Nautilus-OCR is mainly built on open-source libraries, combined with some proprietary contributions. Please note that the project is a generalized version of an implementation originally tailored to the specific needs of BnL.
- Quick Start
- Requirements
- Installation
- Workflow
- Modules
- Models
- Ground Truth
- Articles
- Libraries
- License
- Credits
- Contact
Once the installation instructions have been followed, Nautilus-OCR can be run using the included BnL models and example METS/ALTO data.
With `nautilusocr/` as the current working directory, first copy the BnL models to the `final/` folder:1

```
cp models/bnl/* models/final/
```
Next, run enhance on the `examples/` directory, containing a single `mets-alto-package/`:

```
python3 src/main.py enhance -d examples/ -r 0.02
```

This generates new ALTO files for every block with a minimum enhancement prediction of 2%. Finally, the newly generated files can be found in `output/`.
1 As explained in `models/final/README.md`, the models within `models/final/` are automatically applied when executing the enhance, train-epr, ocr and test-ocr actions. Models outside of `models/final/` are meant to be stored for testing and comparison purposes.
Nautilus-OCR requires:

- Linux / macOS
  The software requires dependencies that only work on Linux and macOS. Windows is not supported at the moment.
- Python 3.8+
  The software has been developed using Python 3.8.5.
- Dependencies
  Access to the libraries listed in `requirements.txt`.
- METS/ALTO
  METS/ALTO packages as data, or alternatively TextBlock images representing single-column snippets of text.
With Python 3 (tested on version 3.8.5) installed, clone this repository and install the required dependencies:

```
git clone https://github.com/natliblux/nautilusocr
cd nautilusocr
pip3 install -r requirements.txt
```
The Hunspell dependency might require:

```
apt-get install libhunspell-dev
brew install hunspell
```

The OpenCV dependency might require:

```
apt install libgl1-mesa-glx
apt install libcudart10.1
```
You can test that all dependencies have been successfully installed by running

```
python3 src/main.py -h
```

and looking for the following output:
```
Starting Nautilus-OCR
usage: main.py [-h] {set-ocr,train-ocr,test-ocr,enhance,ocr,set-fcr,train-fcr,test-fcr,test-seg,train-epr,test-epr} ...

Nautilus-OCR Command Line Tool

positional arguments:
  {set-ocr,train-ocr,test-ocr,enhance,ocr,set-fcr,train-fcr,test-fcr,test-seg,train-epr,test-epr}
                        sub-command help

optional arguments:
  -h, --help            show this help message and exit
```
The command-line tool consists of four different modules, with each one exposing a predefined set of actions:
- ocr - optical character recognition
- seg - text line segmentation
- fcr - font class recognition
- epr - enhancement prediction
To get started, one should take note of the options available in `config.ini`, most importantly setting the device (CPU/GPU) parameter and deciding on the set of font_classes and supported_languages.
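The excerpt below is a hypothetical sketch, not the shipped file: the keys device, font_classes and supported_languages are named above, while the section header and the values are placeholders.

```
# hypothetical config.ini excerpt; section name and values are placeholders
[DEFAULT]
device = cpu
font_classes = antiqua,fraktur
supported_languages = de,fr,lb
```

With the configuration in place, a general workflow could look as follows: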
- Test the seg algorithm using test-seg to see whether any parameters need to be adjusted.
- Create a fcr train set using set-fcr based on font ground truth information.
- Train a fcr model using train-fcr.
- Test the fcr model accuracy using test-fcr.
- Create an ocr train set using set-ocr based on ocr ground truth information.
- Train an ocr model for every font class using train-ocr.
- Test the ocr model for every font class using test-ocr.
- Train an epr model based on ground truth and ori data using train-epr.
- Test the epr model accuracy using test-epr.
- Enhance METS/ALTO packages using enhance.
- Alternatively: Run ocr on a set of images using ocr.
This is done by calling `main.py` followed by the desired action and options:

```
python3 src/main.py [action] [options]
```
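As a sketch of the workflow above, a full pass could look as follows; every jsonl path, set name and model name is a placeholder, and the expected jsonl formats are described in the module sections below.

```
python3 src/main.py set-fcr -j fcr-lines.jsonl -s fcr-train-set
python3 src/main.py train-fcr -s fcr-train-set -m fcr-model
python3 src/main.py test-fcr -j fcr-test.jsonl -m fcr-model
python3 src/main.py set-ocr -j ocr-lines.jsonl -m fcr-model -s ocr-train-set
# repeat train-ocr/test-ocr for every font class
python3 src/main.py train-ocr -s ocr-train-set -f antiqua -m ocr-antiqua
python3 src/main.py test-ocr -j ocr-test.jsonl
python3 src/main.py train-epr -j epr-blocks.jsonl -m epr-model
python3 src/main.py test-epr -m epr-model
python3 src/main.py enhance -d packages/ -r 0.02
```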
The following module sections will list all available actions and options.
Creates an ocr train set consisting of image/text line pairs. Every pair is of type New, Generated or Existing:
- New: Extracted using an image and ALTO file.
- Generated: Image part of pair is generated artificially based on given input text.
- Existing: Pair exists already (has been prepared beforehand) and is included in the train set.
Option | Default | Explanation
---|---|---
-j --jsonl | | Path to jsonl file referencing image and ALTO files 1 2
-c --confidence | 9 (max tolerant) | Highest tolerated confidence value for every character in line
-m --model | fcr-model | Name of fcr model to be used in absence of font class indication 3
-e --existing | | Path to directory containing existing pairs 4 5
-g --generated | 0 (none) | Number of artificially generated pairs to be added per font class 6 7
-t --text | | Path to text file containing text for artificial pairs 8
-n --nlines | -1 (max) | Maximum number of pairs per font class
-s --set | ocr-train-set | Name of ocr train set
1 Example lines:

```
{"image": "/path/image1.png", "gt": "/path/alto1.xml"}
{"image": "/path/image2.png", "gt": "/path/alto2.xml", "gt-block-id": "TB1"}
{"image": "/path/image3.png", "gt": "/path/alto3.xml", "gt-block-id": "TB2", "font": "fraktur"}
```
2 Key gt-block-id can optionally reference a single block in a multi-block ALTO file.
3 In the absence of the font key, the -m option must be set so that the font class can be determined automatically.
4 Naming convention for existing pairs: `[pair-name].png/.tif` & `[pair-name].gt.txt`.
5 The image part of existing pairs is supposed to be unbinarized.
6 Artificially generated lines represent lower-quality examples for the model to learn from.
7 Fonts in `fonts/artificial/` are used randomly and can be adjusted per font class.
8 Text is given by a `.txt` file with individual words delimited by spaces and line breaks.
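For instance, a hypothetical invocation that builds a train set from ground truth lines and pads every font class with artificially generated pairs (all paths and names are placeholders):

```
python3 src/main.py set-ocr -j ocr-lines.jsonl -c 8 -m fcr-model -g 1000 -t corpus.txt -s ocr-train-set
```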
Trains an ocr model for a specific font using an ocr train set.
Option | Default | Explanation
---|---|---
-s --set | | Name of ocr train set to be used
-f --font | | Name of font that ocr model should be trained on
-m --model | ocr-model | Name of ocr model to be created
Tests models in `models/final/` on a test set defined by a jsonl file. A comparison to the original ocr data can optionally be drawn.
Option | Default | Explanation
---|---|---
-j --jsonl | | Path to jsonl file referencing image and ground truth ALTO files 1 2
-i --image | False | Generate output image comparing ocr output with source image
-c --confidence | False | Add ocr confidence (through font greyscale level) to output image
1 Example lines:

```
{"id": "001", "image": "/path/image1.png", "gt": "/path/alto1.xml"}
{"id": "002", "image": "/path/image2.png", "gt": "/path/alto2.xml", "gt-block-id": "TB1"}
{"id": "003", "image": "/path/image3.png", "gt": "/path/alto3.xml", "gt-block-id": "TB2", "ori": "/path2/alto3.xml"}
{"id": "004", "image": "/path/image4.png", "gt": "/path/alto4.xml", "gt-block-id": "TB3", "ori": "/path2/alto4.xml", "ori-block-id": "TB4"}
```
2 Keys ori and ori-block-id can optionally reference original ocr output for comparison purposes.
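For example, a hypothetical run that also renders comparison images with confidence shading (the jsonl path is a placeholder; -i and -c are assumed to act as plain flags, given their False defaults):

```
python3 src/main.py test-ocr -j ocr-test.jsonl -i -c
```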
Applies ocr on a set of original METS/ALTO packages, while aiming to enhance ocr accuracy.1 An optional enhancement prediction model can prevent running ocr for some target blocks. Models in `models/final/` are automatically used for this action.2
Option | Default | Explanation
---|---|---
-d --directory | | Path to directory containing all original METS/ALTO packages 3 4
-r --required | 0.0 | Value for minimum required enhancement prediction 5
1 Target text block types can be adjusted in `config.ini`.
2 The presence of an epr model is optional.
3 METS files need to end in `-mets.xml`.
4 Every package name should be unique and is defined as the directory name of the METS file.
5 Enhancement predictions are in range [-1,1]; set the value to -1 to disable epr and automatically reprocess all target blocks.
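As a concrete example, the following hypothetical call (the directory name is a placeholder) disables enhancement prediction and reprocesses every target block:

```
python3 src/main.py enhance -d packages/ -r -1
```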
Applies ocr on a directory of images while using the models in `models/final/`.
Option | Default | Explanation
---|---|---
-d --directory | | Path to directory containing target ocr source images 1
-a --alto | False | Output ocr in ALTO format
-i --image | False | Generate output image comparing ocr with source image
-c --confidence | False | Add ocr confidence (through font greyscale level) to output image
1 Subdirectories are possible; images should be in `.png` or `.tif` format.
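A hypothetical invocation that produces ALTO output as well as comparison images with confidence shading (the directory is a placeholder; -a, -i and -c are assumed to act as plain flags):

```
python3 src/main.py ocr -d images/ -a -i -c
```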
Tests the CombiSeg segmentation algorithm on a test set defined by a jsonl file.
The correct functioning of the segmentation algorithm is essential for most other modules and actions.
The default parameters should generally work well; however, they can be adjusted.1
Explanations concerning the parameters, as well as the algorithm itself, can be found here.
Option | Default | Explanation
---|---|---
-j --jsonl | | Path to jsonl file referencing image and ALTO files 2

1 Algorithm parameters can be adjusted in `config.ini` in case of unsatisfactory performance.
2 Example lines:

```
{"image": "/path/image1.png", "gt": "/path/alto1.xml"}
{"image": "/path/image2.png", "gt": "/path/alto2.xml", "gt-block-id": "TB1"}
```
Creates a fcr train set consisting of individual character images.
Option | Default | Explanation
---|---|---
-j --jsonl | | Path to jsonl file referencing image files and the respective font classes 1
-n --nchars | max | Maximum number of characters extracted from every image 2
-s --set | fcr-train-set | Name of fcr train set
1 Example line:

```
{"image": "/path/image.png", "font": "fraktur"}
```
2 Extracting fewer characters from a larger number of images generally leads to a more diverse train set.
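For instance, a hypothetical call that caps extraction at 50 characters per image (the jsonl path and set name are placeholders):

```
python3 src/main.py set-fcr -j fcr-lines.jsonl -n 50 -s fcr-train-set
```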
Trains a fcr model using a fcr train set.
Option | Default | Explanation
---|---|---
-s --set | | Name of fcr train set
-m --model | fcr-model | Name of fcr model to be created
Tests a fcr model on a test set defined by a jsonl file.
Option | Default | Explanation
---|---|---
-j --jsonl | | Path to jsonl file referencing image files and the respective font classes 1
-m --model | fcr-model | Name of fcr model to be tested
1 Example line:

```
{"image": "/path/image.png", "font": "fraktur"}
```
This module requires language dictionaries. For every language xx in supported_languages in `config.ini`, please add either a list of words as `xx.txt` or the Hunspell files `xx.dic` and `xx.aff` to `dicts/`.
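As a sketch, assuming supported_languages includes de and that Hunspell dictionaries are installed on the system (the source paths are placeholders for wherever such files live on your machine):

```
cp /usr/share/hunspell/de_DE.dic dicts/de.dic
cp /usr/share/hunspell/de_DE.aff dicts/de.aff
```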
Trains an epr model (for use in enhance) that predicts the enhancement in ocr accuracy (from ori to new) and can hence be used to prevent ocr from running on all target blocks.
Please take note of the parameters in `config.ini` before starting training. This action uses the models in `models/final/`.
Option | Default | Explanation
---|---|---
-j --jsonl | | Path to jsonl file referencing image, ground truth ALTO and original ALTO files 1
-m --model | epr-model | Name of epr model to be created
1 Example lines:

```
{"image": "/path/image1.png", "gt": "/path/alto1.xml", "ori": "/path/alto1.xml", "year": 1859}
{"image": "/path/image2.png", "gt": "/path/alto2.xml", "gt-block-id": "TB1", "ori": "/path/alto2.xml", "year": 1859}
{"image": "/path/image3.png", "gt": "/path/alto3.xml", "gt-block-id": "TB2", "ori": "/path/alto3.xml", "ori-block-id": "TB2", "year": 1859}
```
Tests an epr model and returns the mean average error after applying leave-one-out cross-validation (kNN algorithm).
Option | Default | Explanation |
---|---|---|
-m --model | epr-model | Name of epr model to be tested |
Nautilus-OCR includes four pre-trained models:
- bnl-ocr-antiqua.mlmodel
OCR model built with kraken and trained on the antiqua data (70k pairs) of an extended version of bnl-ground-truth-newspapers-before-1878 that is not limited to the cut-off date of 1878.
- bnl-ocr-fraktur.mlmodel
OCR model built with kraken and trained on the fraktur data (43k pairs) of an extended version of bnl-ground-truth-newspapers-before-1878 that is not limited to the cut-off date of 1878.
- bnl-fcr.h5
Binary font recognition model built with TensorFlow and trained to perform classification using the font classes [antiqua, fraktur]. Please note that the fcr module automatically extends the set of classes to [antiqua, fraktur, unknown], to cover the case where the neural network input preprocessing fails. The model has been trained on 50k individual character images and showed 100% accuracy on a 200-image test set.
- bnl-epr-de-fr-lb.jsonl
Enhancement prediction model trained on more than 4.5k text blocks for the language set [de, fr, lb]. The training data was published between 1840 and 1960. Enhancement is predicted for the application of bnl-ocr-antiqua.mlmodel and bnl-ocr-fraktur.mlmodel, and is therefore based on the font class set [antiqua, fraktur]. The model makes use of the dictionaries for all three languages within `dicts/`. Using leave-one-out cross-validation (kNN algorithm), a mean average error of 0.024 was achieved.
bnl-ground-truth-newspapers-before-1878
OCR ground truth dataset including more than 33k text line image/text pairs, split into antiqua (19k) and fraktur (14k) font classes. The set is based on Luxembourg historical newspapers in the public domain (published before 1878), written mostly in German, French and Luxembourgish. Transcription was done using a double-keying technique with a minimum accuracy of 99.95%. The font class was automatically determined using bnl-fcr.h5.
Combining Morphological and Histogram based Text Line Segmentation in the OCR Context
Journal of Data Mining & Digital Humanities
Text line segmentation is one of the pre-stages of modern optical character recognition systems. The algorithmic approach proposed by this paper has been designed for this exact purpose. Its main characteristic is the combination of two different techniques, morphological image operations and horizontal histogram projections. The method was developed to be applied on a historic data collection that commonly features quality issues, such as degraded paper, blurred text, or presence of noise. For that reason, the segmenter in question could be of particular interest for cultural institutions that want access to robust line bounding boxes for a given historic document. Because of the promising segmentation results, joined by low computational cost, the algorithm was incorporated into the OCR pipeline of the National Library of Luxembourg, in the context of the initiative of reprocessing its historic newspaper collection. The general contribution of this paper is to outline the approach and to evaluate the gains in terms of accuracy and speed, comparing it to the segmentation algorithm bundled with the open-source OCR software used.
Rerunning OCR: A Machine Learning Approach to Quality Assessment and Enhancement Prediction
Journal of Data Mining & Digital Humanities
Iterating with new and improved OCR solutions enforces decision making when it comes to targeting the right candidates for reprocessing. This especially applies when the underlying data collection is of considerable size and rather diverse in terms of fonts, languages, periods of publication and, consequently, OCR quality. This article captures the efforts of the National Library of Luxembourg to support those targeting decisions. They are crucial in order to guarantee low computational overhead and reduced quality degradation risks, combined with a more quantifiable OCR improvement. In particular, this work explains the methodology of the library with respect to text block level quality assessment. Through extension of this technique, a regression model that is able to take into account the enhancement potential of a new OCR engine is also presented. Both mark promising approaches, especially for cultural institutions dealing with historical data of lower quality.
Nautilus - An End-To-End METS/ALTO OCR Enhancement Pipeline
LIBER Quarterly
When a digital collection has been processed by OCR, the usability expectations of patrons and researchers are high. While the former expect full text search to return all instances of terms in historical collections correctly, the latter are more familiar with the impacts of OCR errors but would still like to apply big data analysis or machine-learning methods. All of these use cases depend on high quality textual transcriptions of the scans. This is why the National Library of Luxembourg (BnL) has developed a pipeline to improve OCR for existing digitised documents. Enhancing OCR in a digital library not only demands improved machine learning models, but also requires a coherent reprocessing strategy in order to apply them efficiently in production systems. The newly developed software tool, Nautilus, fulfils these requirements using METS/ALTO as a pivot format. The BnL has open-sourced it so that other libraries can re-use it on their own collections. This paper covers the creation of the ground truth, the details of the reprocessing pipeline, its production use on the entirety of the BnL collection, along with the estimated results. Based on a quality prediction measure, developed during the project, approximately 28 million additional text lines now exceed the quality threshold.
Nautilus-OCR is mostly built on open-source libraries, with the most important ones being:
- kraken : http://kraken.re/
- TensorFlow : https://www.tensorflow.org/
- OpenCV : https://opencv.org/
See `COPYING` for the full license text.
Thanks and credits go to the Lexicolux project, whose work is the basis for the generation of `dicts/lb.txt`.
If you want to get in touch, please contact us here.