Releases: OpenNMT/Tokenizer
Tokenizer 1.37.1
Fixes and improvements
- Consider escaped characters as single characters in BPE
- Ignore undefined scripts when resolving inherited or common scripts
Tokenizer 1.37.0
New features
- Add tokenization option allow_isolated_marks to allow combining marks to appear isolated in the tokenization output in specific conditions
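For readers unfamiliar with the term, a combining mark is a code point whose Unicode general category starts with M (for example U+0301, the combining acute accent). The sketch below uses only Python's standard unicodedata module to locate marks that have no base character in front of them. The helper name isolated_mark_positions and its notion of "isolated" (a mark at the start of the text or right after whitespace) are illustrative assumptions, not the conditions the Tokenizer itself applies.

```python
import unicodedata
from typing import List


def is_combining_mark(char: str) -> bool:
    """Return True if char is in a Unicode mark category (Mn, Mc or Me)."""
    return unicodedata.category(char).startswith("M")


def isolated_mark_positions(text: str) -> List[int]:
    """Return the indices of marks that have no base character before them.

    Assumption for illustration only: a mark counts as isolated when it
    starts the text or directly follows whitespace.
    """
    positions = []
    for index, char in enumerate(text):
        if not is_combining_mark(char):
            continue
        if index == 0 or text[index - 1].isspace():
            positions.append(index)
    return positions


attached = "cafe\u0301"   # the accent follows its base letter "e"
detached = "cafe \u0301"  # the accent follows a space instead
print(isolated_mark_positions(attached))
print(isolated_mark_positions(detached))
```

In the first string the accent has a base letter, so nothing is reported; in the second it follows a space, which is the kind of input where a tokenizer has to decide whether a standalone mark is acceptable.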
Fixes and improvements
- Fix infinite loop when the text contains an invalid Unicode character
- Fix segmentation fault when the BPELearner does not find any pairs of characters in the tokenized data
- [Python] Update ICU to 72.1
Tokenizer 1.36.0
New features
- [Python] Add argument vocabulary in the Tokenizer constructor to set the vocabulary with a list of tokens instead of using a file
- [Python] Add function pyonmttok.is_valid_language to check if a language code is valid and can be passed to the Tokenizer constructor
Tokenizer 1.35.0
New features
- [Python] Add pickling support to pyonmttok.Vocab
Fixes and improvements
- Update pybind11 to 2.10.1
- Update cibuildwheel to 2.11.2
Tokenizer 1.34.0
Changes
- [Python] Wheels are now built under manylinux2014 and require pip >= 19.3 for installation
New features
- [Python] Build wheels for Python 3.11
Fixes and improvements
- Improve error handling when reading token frequencies in the vocabulary file
- [Python] Fix possible crash when pyonmttok is imported before torch
- [Python] Update ICU to 71.1
- [C++] Fix static compilation with -DBUILD_SHARED_LIBS=OFF
- [C++] Fix CMake warning when compiling the tests
Tokenizer 1.33.0
New features
- [Python] Build ARM64 wheels for macOS
Fixes and improvements
- [CLI] Fix error when the option --segment_alphabet is not set
- Fix SentencePiece build warning when compiling with Clang
Tokenizer 1.32.0
New features
- Add property pyonmttok.Vocab.counters to retrieve the number of occurrences of each token
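To picture what per-token occurrence counts look like, here is a standard-library sketch. It does not call pyonmttok: build_counted_vocab and its minimum_frequency parameter are hypothetical names chosen for this example, and the real pyonmttok.Vocab.counters property may be laid out differently.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def build_counted_vocab(
    tokens: Iterable[str], minimum_frequency: int = 1
) -> Tuple[List[str], List[int]]:
    """Return (vocabulary, counters) as two parallel lists.

    Hypothetical helper: tokens are ordered by decreasing count, and
    tokens seen fewer than minimum_frequency times are dropped.
    """
    counts = Counter(tokens)
    kept = [(tok, n) for tok, n in counts.most_common() if n >= minimum_frequency]
    vocabulary = [tok for tok, _ in kept]
    counters = [n for _, n in kept]
    return vocabulary, counters


corpus_tokens = "the cat sat on the mat the end".split()
vocabulary, counters = build_counted_vocab(corpus_tokens)
for token, count in zip(vocabulary, counters):
    print(token, count)
```

Keeping the counts next to the tokens makes it possible to prune or resize a vocabulary later without re-reading the training data.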
Fixes and improvements
- Update pybind11 to 2.10.0
- Update cxxopts to 3.0.0
Tokenizer 1.31.0
New features
- Add utilities to build and use vocabularies:
  - pyonmttok.Vocab
  - pyonmttok.build_vocab_from_tokens
  - pyonmttok.build_vocab_from_lines
- Define the method Tokenizer.__call__ to simplify the tokenizer usage when additional features are unused: tokens = tokenizer(text)
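The call shortcut can be pictured with a toy class. This is a sketch, not the pyonmttok implementation: WhitespaceTokenizer is a hypothetical stand-in that only splits on whitespace, and the assumption that the full tokenize method returns a (tokens, features) pair should be checked against the pyonmttok documentation.

```python
from typing import List, Optional, Tuple


class WhitespaceTokenizer:
    """Hypothetical stand-in used only to illustrate the __call__ shortcut."""

    def tokenize(self, text: str) -> Tuple[List[str], Optional[List[List[str]]]]:
        # Full interface: tokens plus optional additional features (none here).
        return text.split(), None

    def __call__(self, text: str) -> List[str]:
        # Shortcut: discard the features and return only the tokens.
        tokens, _ = self.tokenize(text)
        return tokens


tokenizer = WhitespaceTokenizer()
tokens = tokenizer("Hello world !")
print(tokens)
```

Callers that never use the additional features can then skip unpacking a tuple at every call site.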
Fixes and improvements
- Update pybind11 to 2.9.1
Tokenizer 1.30.1
Fixes and improvements
- Fix deprecated language codes in ICU being incorrectly considered invalid (e.g. "tl" for Tagalog)
Tokenizer 1.30.0
New features
- [Python] Build wheels for AArch64 Linux
Fixes and improvements
- [Python] Update ICU to 70.1