Increase OpenSearch mapping limit dynamically during indexing of csv/jsonl data #3257

Open
wants to merge 3 commits into
base: master
from

Conversation

jkppr
Collaborator

@jkppr jkppr commented Jan 8, 2025

This change dynamically increases the OpenSearch mapping limit during the indexing process to ensure successful data ingestion.

Problem:

Timesketch, when indexing timelines with a large number of unique fields, can encounter OpenSearch's default mapping limit (typically 1000 fields). This results in indexing failures and data loss.

Solution:

This PR introduces a mechanism to:

  1. Track unique fields: Monitor the number of unique fields encountered during the indexing of each timeline.
  2. Calculate new limit: Dynamically calculate a new mapping limit based on the number of unique fields, adding a configurable buffer percentage (default 20%) to account for future growth.
  3. Update OpenSearch settings: Update the index.mapping.total_fields.limit setting in OpenSearch to the newly calculated limit if it exceeds the current limit.
  4. Enforce upper limit: Implement an upper mapping limit (configurable, default 2000) to prevent uncontrolled growth and potential performance issues. If the calculated limit exceeds this upper limit, the indexing process will fail with a clear error message.
  5. Improve error reporting: Enhance error messages to provide more specific information about mapping limit issues, including the current number of mapped fields, the calculated limit, and the upper limit.
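
Steps 1 to 4 could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: the names `calculate_mapping_limit`, `build_settings_body` and `MappingLimitError`, the default current limit of 1000, and the rounding via `math.ceil` are all hypothetical. Only the setting name `index.mapping.total_fields.limit` and the defaults (20% buffer, upper limit of 2000) come from the description above.

```python
import math


class MappingLimitError(Exception):
    """Hypothetical error for a required limit above the upper limit."""


def calculate_mapping_limit(
    unique_fields, current_limit=1000, buffer=0.2, upper_limit=2000
):
    """Return the new limit to apply, or None if the current one suffices.

    Assumption: the buffer is applied on top of the observed number of
    unique fields and the result is rounded up to a whole number.
    """
    new_limit = math.ceil(unique_fields * (1 + buffer))

    # Nothing to do if the index already allows enough fields.
    if new_limit <= current_limit:
        return None

    # Fail with a message that names all three numbers involved.
    if new_limit > upper_limit:
        raise MappingLimitError(
            f"Timeline has {unique_fields} unique fields; the calculated "
            f"mapping limit {new_limit} exceeds the upper limit {upper_limit}."
        )
    return new_limit


def build_settings_body(new_limit):
    """Build the settings body for index.mapping.total_fields.limit."""
    return {"index": {"mapping": {"total_fields": {"limit": new_limit}}}}
```

With an `opensearch-py` client, the resulting body could then be passed to `client.indices.put_settings(index=index_name, body=build_settings_body(new_limit))`, where `client` and `index_name` are placeholders for an existing connection and the timeline's index.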

Configuration:

Two new configuration options are added to timesketch.conf:

  • OPENSEARCH_MAPPING_BUFFER: A float giving the buffer, as a fraction, that is added to the calculated mapping limit (default: 0.2, i.e. 20%).
  • OPENSEARCH_MAPPING_UPPER_LIMIT: An integer representing the maximum allowed mapping limit (default: 2000).
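
As an illustration, the two options might appear in timesketch.conf like this. The excerpt assumes the file's usual Python-style `KEY = value` assignments and simply uses the defaults stated above; the comments are not taken from the PR.

```python
# Hypothetical excerpt from timesketch.conf (values are the stated defaults).

# Fractional buffer added to the calculated mapping limit: 0.2 means 20%.
OPENSEARCH_MAPPING_BUFFER = 0.2

# Maximum mapping limit an index may be raised to before indexing fails.
OPENSEARCH_MAPPING_UPPER_LIMIT = 2000
```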

Benefits:

  • Reduced indexing failures: Prevents indexing failures due to exceeding the default mapping limit.
  • Improved data ingestion: Allows Timesketch to handle timelines with a larger number of unique fields.
  • Controlled growth: Prevents uncontrolled mapping growth by enforcing an upper limit.
  • Better error handling: Provides more informative error messages to help users troubleshoot mapping-related issues.

Note:

Increasing the mapping limit can impact OpenSearch cluster performance and storage requirements. Users should carefully consider the OPENSEARCH_MAPPING_UPPER_LIMIT setting and monitor their cluster's resource usage.

Alternatives considered

  • Hard fail when the default limit of 1000 field mappings is hit and require the user to reduce the data import.
    • This would have the least resource impact on the OpenSearch cluster but the most impact on the Analyst and data pipeline.
  • Increase the mapping limit for the whole OpenSearch cluster as default for each new index.
    • Testing has shown that this will also increase the resource consumption of the cluster even for indices that don't make use of the additional mapping limit.

@jkppr jkppr added the "Data import" label (All things that are with importing data) Jan 8, 2025
@jkppr jkppr self-assigned this Jan 8, 2025
@jkppr jkppr requested a review from jaegeral January 8, 2025 13:28
Collaborator

@jaegeral jaegeral left a comment

Please add at least one of the following:

  • unit tests
  • e2e tests

Otherwise this looks good IMHO. I have already fixed a small typo.

@berggren berggren self-requested a review January 8, 2025 15:47
Contributor

@berggren berggren left a comment

LGTM - let's make sure we monitor this for unforeseen cluster consequences.

Labels
Data import All things that are with importing data
3 participants