Releases · OpenNMT/Tokenizer

01 Mar 13:09

guillaumekln

v1.37.1

e52317c

Tokenizer 1.37.1 Latest

Fixes and improvements

Consider escaped characters as single characters in BPE
Ignore undefined scripts when resolving inherited or common scripts
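
To illustrate the first fix, here is a minimal sketch of splitting a token into initial BPE symbols while keeping each escape sequence whole. It assumes an escape sequence is written as the fullwidth percent sign `％` followed by four hexadecimal digits. That format and the helper name `split_symbols` are assumptions made for this example, not the library's code.

```python
import re

# Assumed escape format: fullwidth percent sign followed by four hex digits.
# The alternation tries the escape pattern first, then falls back to one character.
ESCAPE_OR_CHAR = re.compile(r"％[0-9A-Fa-f]{4}|.", re.DOTALL)

def split_symbols(token):
    """Split a token into initial BPE symbols, keeping each escape sequence whole."""
    return ESCAPE_OR_CHAR.findall(token)

symbols = split_symbols("a％0020b")
print(symbols)
```

Treating the escape as one symbol means a later merge step cannot pair up fragments of the escape with neighbouring characters.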

28 Feb 15:06

guillaumekln

v1.37.0

5a6c087

Tokenizer 1.37.0

New features

Add tokenization option allow_isolated_marks to allow combining marks to appear isolated in the tokenization output in specific conditions
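
The release note does not spell out the "specific conditions", so the following is only a sketch of what an isolated combining mark looks like in a token list. It uses the standard `unicodedata` module, and the helper names `is_combining_mark` and `find_isolated_marks` are hypothetical, not part of the Tokenizer API.

```python
import unicodedata

def is_combining_mark(char):
    # Unicode general categories Mn, Mc and Me all start with "M".
    return unicodedata.category(char).startswith("M")

def find_isolated_marks(tokens):
    """Return the tokens that consist only of combining marks."""
    return [t for t in tokens if t and all(is_combining_mark(c) for c in t)]

# U+0301 is COMBINING ACUTE ACCENT; in the second token it stands alone.
tokens = ["e", "\u0301", "cafe\u0301"]
isolated = find_isolated_marks(tokens)
print(len(isolated))
```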

Fixes and improvements

Fix infinite loop when the text contains an invalid Unicode character
Fix segmentation fault when the BPELearner does not find any pairs of characters in the tokenized data
[Python] Update ICU to 72.1

13 Jan 15:05

guillaumekln

v1.36.0

bf9c1af

Tokenizer 1.36.0

New features

[Python] Add argument vocabulary in the Tokenizer constructor to set the vocabulary with a list of tokens instead of using a file
[Python] Add function pyonmttok.is_valid_language to check if a language code is valid and can be passed to the Tokenizer constructor

06 Dec 10:41

guillaumekln

v1.35.0

003e7da

Tokenizer 1.35.0

New features

[Python] Add pickling support to pyonmttok.Vocab
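
As a rough picture of what pickling support involves, the sketch below round-trips a vocabulary-like object through `pickle`. `SketchVocab` is a hypothetical stand-in written for this example. It is not `pyonmttok.Vocab`, and the real class may serialize its state differently.

```python
import pickle

class SketchVocab:
    """Hypothetical stand-in for a vocabulary object (not pyonmttok.Vocab)."""

    def __init__(self, tokens):
        self._ids_to_tokens = list(tokens)
        self._tokens_to_ids = {t: i for i, t in enumerate(self._ids_to_tokens)}

    def __getstate__(self):
        # Serialize only the ordered token list; the reverse index is rebuilt on load.
        return {"tokens": self._ids_to_tokens}

    def __setstate__(self, state):
        self.__init__(state["tokens"])

    def lookup(self, token):
        """Return the token id, or None if the token is unknown."""
        return self._tokens_to_ids.get(token)

vocab = SketchVocab(["<unk>", "hello", "world"])
restored = pickle.loads(pickle.dumps(vocab))
print(restored.lookup("world"))
```

Pickling matters in practice when an object has to cross a process boundary, for example when it is passed to `multiprocessing` workers.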

Fixes and improvements

Update pybind11 to 2.10.1
Update cibuildwheel to 2.11.2

13 Sep 09:31

guillaumekln

v1.34.0

c7cb612

Tokenizer 1.34.0

Changes

[Python] Wheels are now built under manylinux2014 and require pip >= 19.3 for installation

New features

[Python] Build wheels for Python 3.11

Fixes and improvements

Improve error handling when reading token frequencies in the vocabulary file
[Python] Fix possible crash when pyonmttok is imported before torch
[Python] Update ICU to 71.1
[C++] Fix static compilation with -DBUILD_SHARED_LIBS=OFF
[C++] Fix CMake warning when compiling the tests

29 Aug 12:34

guillaumekln

v1.33.0

f22a8a7

Tokenizer 1.33.0

New features

[Python] Build ARM64 wheels for macOS

Fixes and improvements

[CLI] Fix error when the option --segment_alphabet is not set
Fix SentencePiece build warning when compiling with Clang

25 Jul 09:56

guillaumekln

v1.32.0

4807909

Tokenizer 1.32.0

New features

Add property pyonmttok.Vocab.counters to retrieve the number of occurrences of each token
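
The note does not show what `counters` returns. As a plain-Python illustration of per-token occurrence counts, the sketch below uses `collections.Counter`. The function `count_tokens` and the whitespace splitting are assumptions made for this example and do not reflect how pyonmttok stores its counts.

```python
from collections import Counter

def count_tokens(lines):
    """Count occurrences of whitespace-separated tokens across lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

counts = count_tokens(["the cat", "the dog"])
# Most frequent tokens first, as one might order a vocabulary.
ordered = [token for token, _ in counts.most_common()]
print(counts["the"])
```

Occurrence counts like these are typically what a frequency threshold or a maximum vocabulary size is applied to.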

Fixes and improvements

Update pybind11 to 2.10.0
Update cxxopts to 3.0.0

07 Mar 10:10

guillaumekln

v1.31.0

559b8e7

Tokenizer 1.31.0

New features

Add utilities to build and use vocabularies:
- pyonmttok.Vocab
- pyonmttok.build_vocab_from_tokens
- pyonmttok.build_vocab_from_lines
Define the method Tokenizer.__call__ to simplify the tokenizer usage when additional features are unused:

tokens = tokenizer(text)
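
The call syntax above relies on Python's `__call__` protocol. The sketch below shows the pattern with a hypothetical `SketchTokenizer` that splits on whitespace. It does not model the additional token features mentioned in the note and is not the pyonmttok implementation.

```python
class SketchTokenizer:
    """Hypothetical whitespace tokenizer used only to illustrate __call__."""

    def tokenize(self, text):
        return text.split()

    def __call__(self, text):
        # Delegate to tokenize so the instance can be used like a function.
        return self.tokenize(text)

tokenizer = SketchTokenizer()
tokens = tokenizer("Hello world !")
print(tokens)
```

Because the instance is callable, it can be passed anywhere a plain function is expected, for example to `map`.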

Fixes and improvements

Update pybind11 to 2.9.1

25 Jan 15:59

guillaumekln

v1.30.1

4c66c81

Tokenizer 1.30.1

Fixes and improvements

Fix deprecated ICU language codes being incorrectly considered invalid (e.g. "tl" for Tagalog)

29 Nov 14:58

guillaumekln

v1.30.0

ebec281

Tokenizer 1.30.0

New features

[Python] Build wheels for AArch64 Linux

Fixes and improvements

[Python] Update ICU to 70.1
