Deterministic gzip compressed outputs #28103

dhimmel · 2019-08-22T21:04:28Z

GZ-compression writes the filename and timestamp into compressed data's header. This means that compressing the same data at different times will produce outputs that are not byte-for-byte identical.

In the past, this has presented problems for OS distros. It now presents problems for data science. Specifically, I frequently track compressed files using Git LFS, which detects whether a file has changed by its hash. Therefore, if I run a pipeline to create gzip-compressed dataframes exported from Pandas, the .gz outputs will differ every time.

Currently, users can use this hack, which globally sets gzip.time to a fake time, to get deterministic gzip compression from pandas.DataFrame.to_csv. I propose either of the following approaches, which would be much cleaner:

1. Changing Pandas' default behavior to set gzip's mtime to a constant but erroneous time. Whatever gzip --no-name sets would probably be best.
2. Having some module or function level setting that users could activate for deterministic gzip compression.

Personally, I don't see much benefit to gzip's timestamp, and therefore prefer solution 1 to 2. It's pretty confusing to users to see gzip outputs change and have to figure out that it's the changing timestamp.

In either case, we should look into the other supported compression methods and check their determinism. We can check time-dependent output with:

import gzip, time
data = b'data to compress'
output_1 = gzip.compress(data)
time.sleep(2)
output_2 = gzip.compress(data)
output_1 == output_2
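For comparison, here is a minimal sketch of the same check with the timestamp pinned. It assumes only the standard library and goes through gzip.GzipFile, which accepts mtime; `compress_fixed` is a hypothetical helper name, not something gzip or pandas provides.

```python
import gzip
import io
import time


def compress_fixed(data: bytes, mtime: int = 0) -> bytes:
    """Gzip-compress data with a fixed header timestamp (hypothetical helper)."""
    buffer = io.BytesIO()
    # BytesIO has no name attribute, so no file name is written to the header;
    # mtime pins the timestamp field instead of using the current time.
    with gzip.GzipFile(fileobj=buffer, mode="wb", mtime=mtime) as gz:
        gz.write(data)
    return buffer.getvalue()


data = b"data to compress"
output_1 = compress_fixed(data)
time.sleep(1.1)  # cross a one-second boundary between the two runs
output_2 = compress_fixed(data)
print(output_1 == output_2)  # True
```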

Here are the docs for gzip --no-name:

-n --no-name
When compressing, do not save the original file name and time stamp by default. (The original name is always saved if the name had to be truncated.) When decompressing, do not restore the original file name if present (remove only the gzip suffix from the compressed file name) and do not restore the original time stamp if present (copy it from the compressed file). This option is the default when decompressing.
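To see the two fields that --no-name affects, the sketch below reads the fixed 10-byte gzip header as laid out in RFC 1952. `read_gzip_header` is a hypothetical helper written for illustration; it is not part of the gzip module.

```python
import gzip
import io
import struct

FNAME = 0x08  # header flag bit: an original file name follows the fixed header


def read_gzip_header(blob: bytes) -> dict:
    """Return the MTIME field and whether a file name is stored (hypothetical helper)."""
    # Layout: magic (2 bytes), method, flags, mtime (uint32 LE), extra flags, OS.
    magic, method, flags, mtime, xfl, os_code = struct.unpack("<2sBBIBB", blob[:10])
    if magic != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    return {"mtime": mtime, "has_name": bool(flags & FNAME)}


buffer = io.BytesIO()
with gzip.GzipFile(filename="table.csv.gz", fileobj=buffer, mode="wb", mtime=0) as gz:
    gz.write(b"a,b\n1,3\n")

header = read_gzip_header(buffer.getvalue())
print(header)  # {'mtime': 0, 'has_name': True}
```

Note that fixing mtime alone still leaves the file name in the header, so two archives written under different names would not be byte-identical.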

WillAyd · 2019-08-23T14:15:55Z

I don't think pandas should stray from the default of the gzip library. Can you not just construct the file you need to write to manually with the argument you want?

TomAugspurger · 2019-08-23T14:31:57Z

If this were added to the standard library then we could add keywords to pass the option through.

TomAugspurger · 2019-08-23T14:52:12Z

Ah, it seems Python 3.8 is providing an option: https://docs.python.org/3.8/library/gzip.html#gzip.compress

dhimmel · 2019-08-23T15:16:58Z

> Can you not just construct the file you need to write to manually with the argument you want?

Yes, but it's not pretty:

import gzip
import pandas
import io
df = pandas.DataFrame({'a': [1, 2], 'b': [3, 4]})
with gzip.GzipFile(filename='test-gzip-pandas.csv.gz', mtime=0, mode='wb') as write_file:
    with io.TextIOWrapper(write_file) as write_file:
        df.to_csv(write_file)

The objective here is to give users the consistency and convenience of the builtin compression inference and gzip support of pandas.to_csv.

> it seems Python 3.8 is providing an option:

Python 3.8 added the mtime option to gzip.compress in python/cpython#9704, but the ability to set mtime in gzip.GzipFile has existed since Python 3.1. gzip.open, which is the primary way users interact with the API, still does not accept mtime.
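Since gzip.open has no mtime parameter, one workaround is a thin wrapper around gzip.GzipFile. The sketch below is only an assumption about how such a wrapper could look: `open_gzip_text` is a hypothetical name, and it covers text-mode writing only.

```python
import gzip
import io
import tempfile
import time
from pathlib import Path


def open_gzip_text(path, mtime=0, encoding="utf-8"):
    """Open path for gzip-compressed text writing with a fixed mtime (hypothetical helper)."""
    binary = gzip.GzipFile(filename=path, mode="wb", mtime=mtime)
    # newline="" leaves line endings exactly as written; closing the
    # TextIOWrapper also closes the underlying GzipFile.
    return io.TextIOWrapper(binary, encoding=encoding, newline="")


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.csv.gz"
    outputs = []
    for _ in range(2):
        with open_gzip_text(path) as handle:
            handle.write("a,b\n1,3\n2,4\n")
        outputs.append(path.read_bytes())
        time.sleep(1.1)  # cross a one-second boundary between writes

print(outputs[0] == outputs[1])  # True
```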

> we could add keywords to pass the option through.

This is possible, including prior to Python 3.8. However, what argument would you suggest? Something like reproducible_gzip=True, gzip_no_name=True, or gzip_mtime=0? This is definitely an option, but wouldn't it be awkward to have arguments that only have an effect in one very specific case?

TomAugspurger · 2019-08-23T18:21:12Z

> However, what argument would you suggest?

Whatever Python calls it. I'm happy to just follow Python here. If gzip.GzipFile already takes mtime, and if we use GzipFile internally, then passing through mtime seems fine.

gzip files are adding mtimes in the headers which results in non-deterministic checksums of the resulting files. This change adds a workaround to ensure that mtime is not set and this should allow us to generate compressed datapackages with deterministic checksums. Unfortunately, due to the existing known-bug, the workaround is somewhat dirty and requires us to pass the iobuffer wrapped zip files that we have to open by hand with the necessary parameters instead of relying on the pandas/gzip integration. See pandas-dev/pandas#28103 for more details.

dhimmel · 2021-08-13T20:46:31Z

Noting the method enabled by #35645:

df.to_csv(path, compression={"method": 'gzip', "mtime": 0})
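A quick way to confirm this is to write the same frame twice and compare hashes. This is a sketch that assumes a pandas version that includes #35645 (the 1.2 milestone or later); the file name and the one-second pause are illustrative choices.

```python
import hashlib
import tempfile
import time
from pathlib import Path

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "test-gzip-pandas.csv.gz"
    digests = []
    for _ in range(2):
        # mtime=0 is forwarded to the gzip writer, pinning the header timestamp.
        df.to_csv(path, compression={"method": "gzip", "mtime": 0})
        digests.append(hashlib.sha256(path.read_bytes()).hexdigest())
        time.sleep(1.1)  # cross a one-second boundary between writes

# Expected to be True when mtime is honoured.
print(digests[0] == digests[1])
```

Both writes go to the same path here. The gzip header can also carry the file name, so outputs written under different names may still differ even with a fixed mtime.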

jbrockmendel added the IO Data label Oct 27, 2019

mroeschke added the Enhancement label May 2, 2020

shntnu mentioned this issue May 31, 2020

Feature selected files available, with more row annotations included broadinstitute/lincs-cell-painting#48

Merged

twoertwein mentioned this issue Aug 9, 2020

BUG/ENH: consistent gzip compression arguments #35645

Merged

5 tasks

jreback added this to the 1.2 milestone Aug 12, 2020

jreback closed this as completed in #35645 Aug 13, 2020

rousik mentioned this issue Dec 14, 2020

Ensure deterministic checksums on csv.gz outputs catalyst-cooperative/pudl#856

Merged

gwaybio mentioned this issue May 27, 2021

Update pycytominer version to ignore timestamp diffs in gz files broadinstitute/pooled-cell-painting-profiling-template#23

Closed
