Contribute to Apache Tika
Apache Tika is an Open Source project built and maintained by a diverse range of contributors. We welcome contributions of all types to the project - code, documentation, testing, bug triage, user support, and more! Send an email to the Tika development list if you're looking for somewhere to help.
Source Code
To download the source code for the latest release of Apache Tika, please see the Download page.
The master copy of the Apache Tika source code is held in Git. You can clone the source from GitHub, where you are welcome to fork the repository and open pull requests.
You can also clone the code from https://gitbox.apache.org/repos/asf/tika.git, and you can browse it online through the Git web interface.
Reporting Issues
Tika uses the ASF JIRA instance for issue tracking, under the Tika project.
When reporting an issue, please try to include the details, steps and documents required to reproduce it. If several documents trigger the issue, a small one that we can use in unit testing would be great. A JUnit unit test showing the problem can be helpful, but isn't required.
If you're new to reporting problems, you might find the How to Report Bugs Effectively essay (amongst many others) useful for learning more about what makes an effective and helpful bug report.
New Parsers, Detectors and Mime Types
The Parser Quick Start Guide provides instructions on adding new mime types and new parsers to Tika.
If your new Parser or Detector depends on libraries which we cannot include in Tika for license reasons, you are encouraged to list it on the 3rd Party Parser Plugins page on the Tika wiki.
Submitting Enhancements and Fixes
All enhancements and fixes should have a JIRA Issue or Enhancement opened for them. This should describe the problem and the proposed fix / new code. The JIRA can be used for discussions on the code, and provides a single identifier for the change.
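Git - Git users can run git diff --no-prefix to generate a patch of their changes, which can then be attached to an issue. New, untracked files only appear in the patch once they have been added with git add (use git diff --cached to diff staged changes), and binary content is only included when --binary is passed.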
GitHub Pull Requests - If you are working from our GitHub mirror, it is possible to open a pull request for your change. Please include the JIRA Issue number in the pull request, so it can be linked by the ASF GitHub bot.
ReviewBoard - If you have a Work-In-Progress patch for which you would like feedback / review / assistance, you can use the Apache ReviewBoard Instance to post your code. Please reference the JIRA Issue number from the review request, and add a link to it to the JIRA Issue.
Unit tests, License Headers - Wherever possible, we like new functionality and fixes to include small-ish unit tests. Whenever you make changes, please re-run the unit test suite (mvn install is one way to trigger this), and ensure your changes don't break anything. If adding new files, please include the Apache License v2 license header at the top of the file.
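As an illustration of the license header point above, here is a minimal sketch of how a new Java file might start. The header text is the standard ASF source header; the class FileExtensions and its method are hypothetical and are not part of Tika.

```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
import java.util.Locale;

/**
 * Hypothetical helper (not a Tika class), shown only to illustrate
 * that the license header sits at the very top of a new file.
 */
final class FileExtensions {

    private FileExtensions() {
    }

    /** Returns the lower-cased extension of a file name, or "" if there is none. */
    static String of(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }
}
```

A small unit test for a class like this would typically assert a normal case plus the edge cases (no dot, trailing dot), which keeps the test short while still covering the new behaviour.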
Dependencies
Any new dependencies introduced must be under a suitable license. Broadly, they must be Open Source, and must not place restrictions on larger works they are incorporated within. A list of the allowed licenses is maintained by the ASF Legal Affairs Committee. If in doubt, check on the dev list.
All new and updated dependencies must be in Maven Central, because Apache releases cannot depend on additional repositories in their POMs. If possible, the project producing the dependency should be asked to publish it to Central, such as through the Sonatype OSS Maven Repo. If that isn't possible, someone will need to upload it via the Sonatype 3rd Party OSS Artifacts process. This will need to be completed before any patches depending on the new library can be committed to Tika.
Code Formatting
Java code should be indented with 4 spaces, no tabs. Opening brackets should normally be on the same line as the statement. Java coding standards are normally followed, but if in doubt follow what the existing code does!
Imports should normally be explicit; wildcard (foo.*) imports should not normally be used. Imports should be ordered with javax first, then java, then everything else.
From time to time, you may find that code you are working on doesn't follow these rules. If you find that, please don't submit a single patch with logic changes + formatting together, as those are very hard to review. Instead, please submit two patches, one to correct formatting problems, and a second for your logic changes / fixes.
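The formatting rules above can be sketched in a short example: 4-space indentation, opening braces on the same line as the statement, explicit imports, and imports ordered javax, then java, then other. The class RootElementReader is hypothetical and uses only JDK classes; separating the import groups with blank lines is an assumption made here for readability, so follow the existing code if it differs.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import java.io.IOException;
import java.io.StringReader;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical class (not part of Tika), used only to show the layout rules.
class RootElementReader {

    // Returns the name of the root element of the given XML text.
    static String rootElementName(String xml) {
        if (xml == null) {
            throw new IllegalArgumentException("xml must not be null");
        }
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            Document document = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return document.getDocumentElement().getNodeName();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Wrap checked exceptions so callers get a single failure type.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

Each import names exactly one class, so a reviewer can see at a glance what the file depends on.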
Other Resources
- The Apache Community Development project (ComDev) provides general advice on getting started with contributing to Apache projects
- The Apache Nutch project provides a comprehensive guide on becoming a Nutch Developer, much of which applies equally to Apache Tika
- The book Tika in Action has a lot of great information on how Tika works, and how to extend it