Apache Tika 1.19
The most notable changes in Tika 1.19 over the previous release are:
- Require Java 8 (TIKA-2679).
- Enable building with Java 11 (TIKA-2668).
- Add an option to make tika-server robust against infinite loops, OOMs, and memory leaks (TIKA-2725).
- Allow configuration of the Tesseract parser via the standard tika-config.xml options (TIKA-2705).
- Improve handling of empty cells across table-based formats (TIKA-2479).
- Add a standards-compliant HTML encoding detector, contributed by Gerard Bouchar (TIKA-2673).
- Improve XML parsing by limiting default entity expansions to 20. To raise this limit, add -Djdk.xml.entityExpansionLimit=XXX to your command line.
- MIME magic improvements for Olympus RAW (TIKA-2658), interpreted server-side languages via HTTP (TIKA-2648), and MHTML (TIKA-2723).
- Add absolute timeout to ForkParser rather than testing for active (TIKA-2656).
- Make the RecursiveParserWrapper work with the ForkParser (TIKA-2655).
- Allow the ForkParser to specify a directory containing tika-app.jar for use by the ForkServer. This lets users keep most of the parser dependencies out of their own code, and makes it easy to add optional jars for Parser dependencies, such as the xerial sqlite jar (TIKA-2653).
- Use a pool for SAXParsers and DOMBuilders rather than creating a new parser/builder for every parse. For better performance, call XMLReaderUtils.setPoolSize() with the number of threads you're using with Tika (TIKA-2645).
- Add the RecursiveParserWrapperHandler to improve the RecursiveParserWrapper API slightly (TIKA-2644).
- Upgraded to Commons-Compress 1.18 (TIKA-2707).
- Upgraded to Apache POI 4.0.0 (TIKA-2552).
- Upgraded to Apache PDFBox 2.0.11 (TIKA-2681).
- Upgraded to deeplearning4j 1.0.0-beta2 (TIKA-2672).
- Upgraded jmatio to 1.4 (TIKA-2667).
- Upgraded Apache Lucene to 7.4.0 in tika-eval and tika-examples (TIKA-2695).
- Upgraded junrar to 1.0.1 (TIKA-2664).
- Numerous other upgrades (TIKA-2692).
- Excluded Spring as a transitive dependency (TIKA-2721).
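
To illustrate what the entity expansion limit mentioned above means in practice, here is a minimal sketch that uses only the JDK's built-in JAXP SAX parser, not Tika itself. The class and method names (EntityLimitSketch, docWithEntityRefs, parses) are hypothetical, and it is an assumption that Tika's own parsers react to the limit in the same way. The sketch sets jdk.xml.entityExpansionLimit to 20 programmatically, which should be equivalent to passing -Djdk.xml.entityExpansionLimit=20 on the command line, and then parses one document with a few entity references and one with more than 20.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of Tika.
public class EntityLimitSketch {

    // Builds a small document that references one internal entity `refs` times.
    static String docWithEntityRefs(int refs) {
        StringBuilder sb = new StringBuilder("<!DOCTYPE r [<!ENTITY a \"x\">]><r>");
        for (int i = 0; i < refs; i++) {
            sb.append("&a;");
        }
        return sb.append("</r>").toString();
    }

    // Returns true if the JDK parser accepts the document and false if it
    // rejects it with a SAXException (for example, because a limit was hit).
    static boolean parses(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            // DefaultHandler rethrows fatal errors, so a rejected document
            // surfaces here as a SAXException.
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("Unexpected parser setup or I/O failure", e);
        }
    }

    public static void main(String[] args) {
        // Respect a value passed with -D; otherwise use 20 for the demo.
        if (System.getProperty("jdk.xml.entityExpansionLimit") == null) {
            System.setProperty("jdk.xml.entityExpansionLimit", "20");
        }
        System.out.println("5 entity references accepted:  " + parses(docWithEntityRefs(5)));
        System.out.println("30 entity references accepted: " + parses(docWithEntityRefs(30)));
    }
}
```

Starting the same program with a larger value, for example -Djdk.xml.entityExpansionLimit=1000, should let the second document through as well, which is the adjustment the note above describes.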
The following people have contributed to Tika 1.19 by submitting or commenting on the issues resolved in this release:
- Abhijit Rajwade
- Adam Rauch
- Andreas Meier
- Annie Didier
- Celpan Valeria
- Chris A. Mattmann
- Gerard Bouchar
- Hans Brende
- Karanjeet Singh
- Karl Wright
- Ken Krugler
- Konstantin Gribov
- Lewis John McGibbney
- Sebastian Nagel
- Slava G
- Thorsten Schäfer
- Tim Allison
- Vincent van Donselaar
- Yuriy Koval
- Yury Kats
See https://s.apache.org/dG8B for more details on these contributions.