Destination S3: date type written as dictionary for parquet format #14028
I was able to reproduce the issue with an integration account. Other fields are unnested correctly.
Comment made from Zendesk by Marcos Marx on 2022-06-22 at 20:39:
I'm wondering if there has been any progress on this issue yet? We are currently using the S3 destination to populate our data lake environment with Parquet files, and this is preventing us from being able to execute queries on any tables that contain a datetime value.
This issue is very strange in that:
There is no progress on this issue so far. Unfortunately we don't currently have enough bandwidth for S3, so I also can't give an estimate of when this can be fixed.
For my use case we are actually using Postgres as the source, so the format conversion seems to be specific to the S3 destination rather than anything to do with the source format, unless there is a common issue in the intermediate representation and in how type information is assigned in the schema passed to the S3 destination. If anyone has a suggestion of where in the S3 plugin to look, I am happy to try my hand at debugging and fixing the issue.
I've been doing some further debugging to narrow down the potential source of the issue. So far I have the following information:
Looking at the code in the S3 connector, it seems that Airbyte relies on the upstream Apache Parquet Java library to translate from Avro to Parquet. This suggests either a bug in the upstream implementation or a bug in how it is being used in Airbyte. It is entirely possible that there is another source of the problem that I am overlooking, but this is what I have discovered so far.
Looking again at the dependencies, the S3 destination is using the
I believe that this change in the upstream Parquet library will resolve the bug, so updating to 1.12.3 will be the necessary fix.
@marcosmarxm @tuliren I found what appears to be the necessary fix for this bug and pushed a PR that bumps the upstream Parquet dependency to the latest version. Can you take a look and see about getting a new release pushed so I can test it? Thanks!
Thanks @blarghmatey!
Taking a closer look at the Avro file that I generated while testing, I noticed that the actual schema for the timestamp-with-timezone fields is an Avro union type, which is not supported in Parquet. So it seems that what is happening is that the conversion treats the Avro union as a Parquet struct, resulting in the behavior that we're seeing. The Avro schema I'm working with:
Generated Parquet schema:
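The schema snippets referenced above were not captured here, but as a hedged illustration of the shape being described (the field name `updated_at` and the exact union branches below are hypothetical, not copied from the actual connector output): a nullable timestamp column becomes an Avro union, and since Parquet has no union type, the converter emits a group with one `memberN` field per non-null branch.

```python
# Illustrative only: reconstructs the *shape* of the schemas discussed above.
# The field name "updated_at" and the union branches are hypothetical.

# Avro: a nullable timestamp that may also arrive as a string is a union
# of three branches.
avro_field = {
    "name": "updated_at",
    "type": [
        "null",
        {"type": "long", "logicalType": "timestamp-micros"},
        "string",
    ],
    "default": None,
}

# Parquet has no union type, so the converter emits a group (struct) with one
# "memberN" field per non-null branch, which is what surfaces in the data as
# {'member0': <datetime>, 'member1': None}.
parquet_group = {
    "name": "updated_at",
    "repetition": "optional",
    "fields": ["member0 (INT64, TIMESTAMP_MICROS)", "member1 (BINARY, UTF8)"],
}
```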
The block of code in the upstream Parquet library that generates the struct from a union type is https://github.com/apache/parquet-mr/blob/e990eb3f14c39273e46a9fce07ec85d2edf7fccb/parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java#L264-L272. It seems that the solution is to modify the Airbyte code that generates the Avro schema from the JSON schema to not include
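The direction suggested above, changing the Avro schema generation so that simple nullable fields don't end up as multi-branch unions, can be sketched roughly like this (a sketch only; `collapse_nullable_union` is a hypothetical helper, not Airbyte code):

```python
def collapse_nullable_union(avro_type):
    """Collapse a two-branch Avro union ["null", X] down to plain X.

    Hypothetical helper: a union with a single non-null branch can map to an
    optional Parquet column instead of a struct. Unions with several non-null
    branches are returned unchanged, since those genuinely need a struct.
    """
    if isinstance(avro_type, list):
        non_null = [t for t in avro_type if t != "null"]
        if len(non_null) == 1:
            return non_null[0]
    return avro_type


# A nullable timestamp collapses to the bare logical type...
ts = {"type": "long", "logicalType": "timestamp-micros"}
assert collapse_nullable_union(["null", ts]) == ts
# ...while a genuine multi-branch union is left alone.
assert collapse_nullable_union(["null", ts, "string"]) == ["null", ts, "string"]
```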
Any updates? Currently trying to sync data from MySQL to S3 Parquet, and timestamp fields are being converted to a struct type.
Hi there, will there be an ongoing fix here, or is it intentionally written as a struct?
Any updates on this? I'm trying to copy a Postgres table into S3 but some of the datetime columns contain the aforementioned struct.
Observed the same during our MySQL to S3 Parquet testing: `{'member0': datetime.datetime(2016, 9, 25, 19, 57, 31, tzinfo=), 'member1': None}`
@robertomczak
Not directly in Airbyte; the current idea is to process the data in a further ETL step using pandas and extract the column from the nested object into a top-level column of the pandas DataFrame. It would be nice to fix it, as keeping an empty-string artifact in the intermediate data lake is not good practice. 🤷♂️
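The pandas cleanup described above might look roughly like this (a sketch, assuming the struct column deserializes to Python dicts with a `member0` key; `promote_member0` is a hypothetical name, not an Airbyte or pandas API):

```python
import datetime

import pandas as pd


def promote_member0(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Replace a struct-like column of {'member0': ts, 'member1': ...} dicts
    with the member0 value itself, leaving non-dict values untouched."""
    df = df.copy()
    df[column] = df[column].apply(
        lambda v: v.get("member0") if isinstance(v, dict) else v
    )
    return df


# Example with the row shape reported in this thread:
raw = pd.DataFrame(
    {"created_at": [{"member0": datetime.datetime(2016, 9, 25, 19, 57, 31),
                     "member1": None}]}
)
fixed = promote_member0(raw, "created_at")
```

After this step the column holds plain timestamps and can be treated as a normal datetime column downstream.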
Aren't the destination files getting bigger? For example, in my case a 20MB table was transformed into two files of 97MB and 20MB. I bet the struct has something to do with it.
We have also experienced this using a Snowflake connector on any field of type DATE, reading to a Parquet file in S3. Our data is not large (100k rows), chunked into a single file, so I don't think it has to do with data scale.
Hello, I'm experiencing the same issue with PostgreSQL timestamptz data being written into S3 Parquet files. Any update on the issue since 2022...?
In case it helps anyone: I was processing some of the Parquet data in PySpark through AWS Glue and added this transformation to solve the struct.member0 -> DateTime issue (imports added for completeness):

```python
from pyspark.sql import DataFrame
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, TimestampType


def extract_timestamp_from_struct(df: DataFrame) -> DataFrame:
    '''
    Converts any "structs" that are really datetimes into datetimes.
    Documented here: https://github.com/airbytehq/airbyte/issues/14028

    Parameters:
    - df: The input DataFrame.
    Returns:
    - DataFrame with datetime columns as datetimes.
    '''
    for field in df.schema:
        if isinstance(field.dataType, StructType):
            if any(
                child.name == "member0" and isinstance(child.dataType, TimestampType)
                for child in field.dataType.fields
            ):
                df = df.withColumn(field.name, col(f"{field.name}.member0"))
    return df
```
Thanks @jamsi for sharing. I ended up doing exactly the same in a Glue job, but it would be better if another service weren't required in the mix just to fix date formats.
Any news about this? I'm facing the same behaviour.
I'm also seeing this issue with Snowflake > GCS in Parquet format |
MySQL to S3 connector: |
This is an insanely frustrating bug |
I clung to the tiniest ember of hope and rebuilt the destination-s3 connector with `implementation ('org.apache.parquet:parquet-avro:1.13.1') { exclude group: 'org.slf4j', module: 'slf4j-log4j12' }`, and it's still the same ...
I see the above in both Parquet and Avro files. Is it on the roadmap to address this?
Zendesk ticket #5666 has been linked to this issue. |
Hey! Any hope to get this bug fixed soon? :) |
This Github issue is synchronized with Zendesk:
Ticket ID: #1152
Priority: normal
Group: User Success Engineer
Assignee: Marcos Marx
Original ticket description: