Deterministic gzip compressed outputs #28103

Closed
dhimmel opened this issue Aug 22, 2019 · 6 comments · Fixed by #35645
Labels
Enhancement IO Data IO issues that don't fit into a more specific label
Milestone

Comments

@dhimmel
Contributor

dhimmel commented Aug 22, 2019

GZ-compression writes the filename and timestamp into compressed data's header. This means that compressing the same data at different times will produce outputs that are not byte-for-byte identical.
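
As a minimal sketch of where the timestamp lives: the gzip header stores a 4-byte little-endian MTIME field at byte offsets 4 to 7. The helper name `gzip_header_mtime` is hypothetical, and the `mtime` keyword of `gzip.compress` assumes Python 3.8 or later.

```python
import gzip
import struct


def gzip_header_mtime(blob: bytes) -> int:
    """Return the MTIME field (bytes 4-7, little-endian) of a gzip header."""
    if blob[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    (mtime,) = struct.unpack("<I", blob[4:8])
    return mtime


data = b"data to compress"
# Same payload, two different header timestamps (requires Python 3.8+).
early = gzip.compress(data, mtime=1)
late = gzip.compress(data, mtime=2)

print(gzip_header_mtime(early), gzip_header_mtime(late))
# Only the header differs; everything after the 10-byte header is identical.
print(early == late, early[10:] == late[10:])
```

Because the compressed payload and checksum are unchanged, pinning the header fields should be enough to make the whole output reproducible.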

In the past, this has presented problems for OS distros. It now presents problems for data science. Specifically, I frequently track compressed files using Git LFS, which detects whether a file has changed by its hash. Therefore, if I run a pipeline to create gzip-compressed dataframes exported from Pandas, the .gz outputs will differ every time.

Currently, users can use this hack, which globally sets gzip.time to a fake time, to create deterministic gzip compression from pandas.DataFrame.to_csv. I propose either of the following approaches, which would be much cleaner:

  1. Changing Pandas' default behavior to set gzip's mtime to a constant but erroneous time. Whatever gzip --no-name sets would probably be best.

  2. Having some module or function level setting that users could activate for deterministic gzip compression.

Personally, I don't see much benefit to gzip's timestamp, and therefore prefer solution 1 to 2. It's pretty confusing to users to see gzip outputs change and have to figure out that it's the changing timestamp.

In either case, we should look into the other supported compression methods and check their determinism. We can check time-dependent output with:

import gzip, time
data = b'data to compress'
output_1 = gzip.compress(data)
time.sleep(2)
output_2 = gzip.compress(data)
output_1 == output_2
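
A rough sketch of the same check applied to other standard-library codecs. This assumes the same library version and settings for both calls. zlib is included only for comparison, and other formats such as zip are not covered here.

```python
import bz2
import lzma
import zlib

data = b"data to compress"

# Map a label to each one-shot compress function from the standard library.
codecs = {"bz2": bz2.compress, "xz": lzma.compress, "zlib": zlib.compress}

# Compress the same input twice per codec and compare the bytes.
repeatable = {name: compress(data) == compress(data) for name, compress in codecs.items()}
print(repeatable)
```

None of these codecs takes a timestamp argument, so no sleep between the two calls is needed to exercise them.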

Here are the docs for gzip --no-name:

-n --no-name
When compressing, do not save the original file name and time stamp by default. (The original name is always saved if the name had to be truncated.) When decompressing, do not restore the original file name if present (remove only the gzip suffix from the compressed file name) and do not restore the original time stamp if present (copy it from the compressed file). This option is the default when decompressing.

@WillAyd
Member

WillAyd commented Aug 23, 2019

I don't think pandas should stray from the default of the gzip library. Can you not just construct the file you need to write to manually with the argument you want?

@TomAugspurger
Contributor

TomAugspurger commented Aug 23, 2019 via email

@TomAugspurger
Contributor

Ah, it seems Python 3.8 is providing an option: https://docs.python.org/3.8/library/gzip.html#gzip.compress

@dhimmel
Contributor Author

dhimmel commented Aug 23, 2019

Can you not just construct the file you need to write to manually with the argument you want?

Yes, but it's not pretty:

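import gzip
import io
import pandas

df = pandas.DataFrame({'a': [1, 2], 'b': [3, 4]})
# mtime=0 pins the header timestamp so repeated writes are byte-identical
with gzip.GzipFile(filename='test-gzip-pandas.csv.gz', mtime=0, mode='wb') as gzip_file:
    # to_csv needs a text handle here; closing the wrapper also closes gzip_file
    with io.TextIOWrapper(gzip_file) as text_file:
        df.to_csv(text_file)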

The objective here is to give users the consistency and convenience of the builtin compression inference and gzip support of pandas.to_csv.

it seems Python 3.8 is providing an option:

Python 3.8 added the mtime option to gzip.compress in python/cpython#9704, although the ability to set mtime in gzip.GzipFile has existed since Python 3.1. However, gzip.open, which is the primary way users interact with the API, still does not accept mtime.
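
Since gzip.open lacks the argument, one possible stopgap is a small wrapper around gzip.GzipFile. The sketch below is illustrative only: `open_gzip_text` is a hypothetical helper, not part of gzip or pandas, and the file name and CSV text are made up for the demonstration.

```python
import gzip
import io
import os
import tempfile


def open_gzip_text(path, mtime=0, encoding="utf-8"):
    """Hypothetical helper: text-mode gzip writer with a fixed header mtime."""
    binary = gzip.GzipFile(filename=path, mode="wb", mtime=mtime)
    # newline="" avoids platform-specific newline translation;
    # closing the wrapper also closes the underlying GzipFile.
    return io.TextIOWrapper(binary, encoding=encoding, newline="")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.csv.gz")
    outputs = []
    for _ in range(2):
        with open_gzip_text(path) as handle:
            handle.write("a,b\n1,3\n2,4\n")
        with open(path, "rb") as raw:
            outputs.append(raw.read())

first, second = outputs
print(first == second)
```

Note that GzipFile also records the file's base name in the header, so outputs written to differently named paths would still differ.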

we could add keywords to pass the option through.

This is possible, including prior to Python 3.8. However, what argument would you suggest? Something like reproducible_gzip=True, gzip_no_name=True, or gzip_mtime=0? This is definitely an option, but would it be awkward to have arguments that only have an effect in a very specific case?

@TomAugspurger
Contributor

However, what argument would you suggest?

Whatever Python calls it. I'm happy to just follow Python here. If gzip.GzipFile already takes mtime, and if we use GzipFile internally, then passing through mtime seems fine.
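
A minimal sketch of what such a pass-through could look like, assuming a dict-style compression argument whose extra keys are forwarded to gzip.GzipFile. The function `open_for_write` is hypothetical and is not pandas' actual implementation.

```python
import gzip
import io


def open_for_write(fileobj, compression):
    """Hypothetical sketch: forward every key except "method" to gzip.GzipFile."""
    options = dict(compression)  # copy so the caller's dict is not mutated
    method = options.pop("method")
    if method != "gzip":
        raise ValueError(f"unsupported method in this sketch: {method!r}")
    return gzip.GzipFile(fileobj=fileobj, mode="wb", **options)


buffer = io.BytesIO()
with open_for_write(buffer, {"method": "gzip", "mtime": 0}) as handle:
    handle.write(b"a,b\n1,3\n")

payload = buffer.getvalue()
print(payload[4:8])  # the MTIME field of the gzip header
```

Reusing Python's own keyword names means no pandas-specific argument has to be invented or documented separately.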

@jbrockmendel jbrockmendel added the IO Data IO issues that don't fit into a more specific label label Oct 27, 2019
@jreback jreback added this to the 1.2 milestone Aug 12, 2020
rousik added a commit to rousik/pudl that referenced this issue Dec 14, 2020
gzip files are adding mtimes in the headers which results in
non-deterministic checksums of the resulting files. This change
adds a workaround to ensure that mtime is not set and this should
allow us to generate compressed datapackages with deterministic
checksums.

Unfortunately, due to the existing known-bug, the workaround is somewhat
dirty and requires us to pass the iobuffer wrapped zip files that we
have to open by hand with the necessary parameters instead of relying
on the pandas/gzip integration.

See pandas-dev/pandas#28103 for more details.
@dhimmel
Contributor Author

dhimmel commented Aug 13, 2021

Noting the method enabled by #35645:

df.to_csv(path, compression={"method": 'gzip', "mtime": 0})
