Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: SimpleDirectoryReader metadata handles timezones inconsistently #17715

Closed
yozlet opened this issue Feb 5, 2025 · 3 comments · Fixed by #17724
Closed

[Bug]: SimpleDirectoryReader metadata handles timezones inconsistently #17715

yozlet opened this issue Feb 5, 2025 · 3 comments · Fixed by #17724
Labels
bug Something isn't working triage Issue needs to be triaged/prioritized

Comments

@yozlet
Copy link
Contributor

yozlet commented Feb 5, 2025

Bug Description

SimpleDirectoryReader's methods for returning a file's date & time metadata (e.g. last modified date) will occasionally output timestamps using UTC instead of the local timezone.

The cause is the date-formatting function _format_file_metadata which uses datetime.fromtimestamp() when it needs to output just the date, but datetime.utcfromtimestamp() for outputting the whole datetime.

I only noticed the problem when this test failed on my machine. The test was merged over a month ago so it's interesting that it didn't fail for anyone else; I assume it's because most contributors are developing inside Docker containers or other environments which default to UTC.

Version

latest

Steps to Reproduce

My computer and I are on Pacific Time (UTC-08:00). I have a document that I modified at 18:39 local time, which was after midnight UTC. Let's read its metadata with SimpleDirectoryReader:

from datetime import datetime
from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader(input_dir="data")

# most data loading methods use the default metadata loader, which calls
# _format_file_metadata(ts, include_time: False)
lmd_1 = reader.load_data()[0].metadata['last_modified_date']
print(f"Document last modified date (document metadata): {lmd_1}")

# ... but reader.get_resource_info() calls
# _format_file_metadata(ts, include_time: True)
lmd_2 = reader.get_resource_info('data/my_document.txt')['last_modified_date']
print(f"Document last modified date (get_resource_info): {lmd_2[0:10]}")
print(f"Document last modified datetime (get_resource_info): {lmd_2}")

When running the code against the document, note the difference in dates:

% python test_tz.py
Document last modified date (document metadata): 2025-02-04
Document last modified date (get_resource_info): 2025-02-05
Document last modified datetime (get_resource_info): 2025-02-05T02:39:35Z

Relevant Logs/Tracbacks

@yozlet yozlet added bug Something isn't working triage Issue needs to be triaged/prioritized labels Feb 5, 2025
Copy link

dosubot bot commented Feb 5, 2025

Hey @yozlet! I'm here to help you with any bugs, questions, or contributions you have. Let's tackle this issue together!

To address the inconsistency in timezone handling in the SimpleDirectoryReader, you can modify the _format_file_timestamp function to ensure that all timestamps are consistently formatted using UTC. Here's a suggested modification:

from datetime import datetime, timezone

def _format_file_timestamp(
    timestamp: float | None, include_time: bool = False
) -> str | None:
    """
    Format file timestamp to a %Y-%m-%d string with timezone awareness.

    Args:
        timestamp (float): timestamp in float
        include_time (bool): whether to include time in the formatted string

    Returns:
        str: formatted timestamp
        None: if the timestamp passed was None
    """
    if timestamp is None:
        return None

    dt = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    if include_time:
        return dt.strftime("%Y-%m-%dT%H:%M:%SZ")
    return dt.strftime("%Y-%m-%d")

This change ensures that both date-only and full datetime outputs are consistently in UTC, which should resolve the inconsistency you're experiencing [1].

To continue talking to Dosu, mention @dosu.


Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Other

@logan-markewich
Copy link
Collaborator

Feel free to open a pr to fix, seems like an easy change

@yozlet
Copy link
Contributor Author

yozlet commented Feb 5, 2025

@logan-markewich Are you happy with defaulting to UTC? Could cause confusion amongst those who aren't expecting it, but it avoids accidental time-travel around daylight saving switchovers. I'll put something in the docs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working triage Issue needs to be triaged/prioritized
Projects
None yet
2 participants