Apache Tika 1.22
The most notable changes in Tika 1.22 over the previous release are:
- NOTE: Known regression: PDFBOX-4587 -- PDF passwords with codepoints between 0xF000 and 0xF0000 will cause an exception.
- Add parser for HWP v5 files via SooMyung Lee (soomyung) and JinSup Kim (ddoleye) (TIKA-2909).
- Fix order of closing streams to avoid "Failed to close temporary resource" exception in TesseractOCRParser (TIKA-2908).
- Improve AutoDetectReader performance by caching the encoding detector (TIKA-1568).
- Prevent RTFParser from outputting illegal tag combinations (TIKA-2889).
- Fix RereadableInputStream to release all resources (TIKA-2903).
- Implement a custom language identifier in the tika-eval module based on OpenNLP's language detector; add 18 languages and add common word lists for all 121 languages (TIKA-2790).
- Fix NPE in MimeTypesReader.releaseParser() via Eamonn Saunders (TIKA-2896).
- Fix RTFParser to extract more content (TIKA-2883).
- Add clientSubmitTime to the metadata extracted from PST files (TIKA-2898).
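
For the known regression noted above (PDFBOX-4587), callers may want to check a PDF password before passing it to the parser. The following is a minimal sketch in plain Java, not part of Tika or PDFBox: the class and method names are hypothetical, and treating both bounds of the range as inclusive is an assumption, since the note only says "between".

```java
public class PasswordCodepointCheck {
    // Bounds copied from the release note; inclusive handling is an assumption.
    static final int LOW = 0xF000;
    static final int HIGH = 0xF0000;

    /** Returns true if any code point of the password falls in [LOW, HIGH]. */
    static boolean mayTriggerRegression(String password) {
        // codePoints() iterates full Unicode code points, so surrogate pairs
        // are evaluated as a single value rather than as two chars.
        return password.codePoints().anyMatch(cp -> cp >= LOW && cp <= HIGH);
    }

    public static void main(String[] args) {
        System.out.println(mayTriggerRegression("secret"));
        System.out.println(mayTriggerRegression("pw\uF8FF"));
    }
}
```

A check like this only flags passwords that fall in the range stated in the note; whether a given password actually fails depends on the PDFBox version in use.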
The following people have contributed to Tika 1.22 by submitting or commenting on the issues resolved in this release:
- Andrzej Bialecki
- Eamonn Saunders
- Kevin Ng
- Luis Filipe Nassif
- Marichi Gupta
- Mike Cantrell
- Pandurang
- Paul Woods
- Peter Fassev
- Richard Lehane
- Rohit Sureshrao Shelhalkar
- Sebb
- T Craig
- T. Schmidt
- Tim Allison
- ddoleye
- mungeol heo
- soomyung
See https://s.apache.org/zpngc for more details on these contributions.