ABSTRACT
We compare a variety of lossless image compression methods on a large sample of astronomical images and show how the compression ratios and speeds of the algorithms are affected by the amount of noise (that is, entropy) in the images. In the ideal case where the image pixel values have a random Gaussian distribution, the equivalent number of uncompressible noise bits per pixel is given by Nbits = log2(σ√12) = log2(σ) + 1.792, where σ is the standard deviation of the Gaussian, and the lossless compression ratio is given by R = BITPIX/(Nbits + K), where BITPIX is the bit length of the pixel values (typically 16 or 32), and K is a measure of the efficiency of the compression algorithm. We show that real astronomical CCD images also closely follow these same relations, by using a robust algorithm for measuring the equivalent number of noise bits from the dispersion of the pixel values in background regions of the image. We perform image compression tests on a large sample of 16-bit integer astronomical CCD images using the GZIP compression program and using a newer FITS tiled-image compression method that currently supports four compression algorithms: Rice, Hcompress, PLIO, and the same Lempel-Ziv algorithm that is used by GZIP. Overall, the Rice compression algorithm strikes the best balance of compression and computational efficiency; it is 2–3 times faster and produces about 1.4 times greater compression than GZIP (the uncompression speeds are about the same). The Rice algorithm has a measured K value of 1.2 bits per pixel, and thus produces 75%–90% (depending on the amount of noise in the image) as much compression as an ideal algorithm with K = 0. Hcompress produces slightly better compression but at the expense of three times more CPU time than Rice. Compression tests on a sample of 32-bit integer images show similar results, but the relative speed and compression ratio advantage of Rice over GZIP is even greater. We also briefly discuss a technique for compressing floating point images that converts the pixel values to scaled integers. The image compression and uncompression utility programs used in this study (called fpack and funpack) are publicly available from the HEASARC web site. A simple command-line interface may be used to compress or uncompress any FITS image file.
1. INTRODUCTION
The size of astronomical data archives continues to increase enormously, so it is in the interests of both data providers and users to make use of the most effective image compression techniques. Compression reduces the storage media costs and the network bandwidth needed to transmit the files to users. Image compression also reduces the number of bytes of data that are transferred to or from local disks during data analysis operations.
The extensive literature on astronomical image compression can be divided into two main categories: lossy compression techniques, in which some of the hard to compress information (ideally only noise) is discarded; and lossless compression techniques, where all the information is preserved so that the original data can be exactly reconstructed from the compressed data. Lossy compression techniques (e.g., Louys et al. 1999 and references therein) can provide higher compression than lossless techniques; however, users must be careful to ensure that the required amount of photometric and astrometric precision in the compressed image is preserved. Lossless compression (see Grunler et al. 2006, for a comprehensive summary of previous work) by definition preserves all the information in the images and is often preferred or required in situations where the data provider must be certain that no information is lost, even if that means having to deal with larger volumes of data.
In this article we compare several lossless compression techniques on a large sample of astronomical images. We describe the different compression methods used in this study in § 2. Then in § 3 we review how the noise content of an image sets an upper limit on the lossless compression ratio. We use two sets of synthetic images with known noise properties to compare how well the compression ratios agree with the theoretical expectations. This is followed in §§ 4 and 5 with a detailed comparison of how well the different algorithms perform on actual 16-bit and 32-bit integer astronomical images taken mainly with CCD detectors. Section 6 briefly discusses compression methods for floating point images, and § 7 discusses how the choice of tiling pattern affects the compression performance. Section 8 then summarizes the main results of this study. Finally, the appendix gives a derivation of the formula for the equivalent number of noise bits in an image and shows how this relates to the seminal work by Shannon (1948) on entropy in communication theory.
2. COMPRESSION METHODS
In this study we use a relatively new compressed image format that is based on the FITS tiled-image compression convention (Pence et al. 2000; Seaman et al. 2007). Under this convention, the image is first divided into a rectangular grid of "tiles." Usually the image is tiled on a row by row basis, but any other rectangular tile size may be specified. Each tile of pixels is then compressed using one of several available compression algorithms, and the compressed stream of bytes is stored in a variable length array column in a FITS binary table. Each row of the FITS binary table corresponds to one tile in the image. Our software uses the CFITSIO library (Pence 1999) to transparently read and write these compressed files as if they were ordinary FITS images, even though they are physically stored in a table format. One of the advantages of using this tiled image convention, compared to the other technique currently used by most observatories and data archive centers of externally compressing the entire FITS image with the GZIP utility, is that the compressed FITS image is itself a valid FITS file and the image header keywords remain uncompressed, which provides faster read and write access. Another advantage is that each image in a multiextension FITS file is compressed separately and can be read without having to uncompress the entire FITS file. Similarly, when reading a small section of the image, only the corresponding tiles need to be uncompressed.
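As a concrete illustration, the short C program below copies an image HDU into a tile-compressed output file. It is a minimal sketch based on our reading of the CFITSIO interface: the routines shown (fits_open_file, fits_create_file, fits_copy_hdu, fits_close_file) are standard CFITSIO calls, while the bracketed "[compress R]" qualifier in the output file name requests Rice compression through the CFITSIO extended file name syntax and should be checked against the CFITSIO documentation; the file names are illustrative.

    /* Minimal sketch: copy a FITS image into a tile-compressed output file. */
    #include <stdio.h>
    #include "fitsio.h"

    int main(void)
    {
        fitsfile *infptr = 0, *outfptr = 0;
        int status = 0;

        fits_open_file(&infptr, "image.fits", READONLY, &status);
        fits_create_file(&outfptr, "!image_rice.fits[compress R]", &status);

        /* copy the current HDU; the pixels are tiled and compressed on output,
           while the header keywords remain directly readable */
        fits_copy_hdu(infptr, outfptr, 0, &status);

        fits_close_file(outfptr, &status);
        fits_close_file(infptr, &status);
        fits_report_error(stderr, status);   /* print any CFITSIO error message */
        return status;
    }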
The current implementation of this convention in the CFITSIO library supports four lossless compression algorithms: Rice, Hcompress, PLIO, and GZIP. In principle, any other compression algorithm, e.g., JPEG 2000 or bzip2, could be added in the future. We note, however, that the JPEG 2000 lossless algorithm, JPEG-LS, uses the Golomb-Rice coding which is similar to the Rice algorithm, and the bzip2 algorithm typically only provides a few percent more compression than GZIP, but requires much more CPU processing time, especially when uncompressing the image (Yang & Koziol 2002). The main features of each of the algorithms used in this study are described here.
Rice.—The Rice algorithm (Rice et al. 1993; White & Becker 1998) is very simple (additions, subtractions, and some bit masking and shifts), making it computationally efficient. In fact, it has been implemented in hardware for use on spacecraft and in embedded systems, and has been considered for use in compressing images from future space telescopes (Nieto-Santisteban et al. 1999). In its usual implementation, it encodes the differences of consecutive pixels using a variable number of bits. Pixel differences near zero are coded with few bits and large differences require more bits. The algorithm adapts to the noise by determining the number of pure noise bits to strip off the bottom of the difference and include directly in the output bitstream (with no coding). The best value for this noise scale is computed independently for each block of 16 or 32 pixels. With such short blocks, the algorithm requires little memory and adapts quickly to any variations in pixel statistics across the image.
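To make the split-off of the noise bits concrete, the C fragment below sketches how a coder in the Rice family might choose the number of low-order "noise" bits for one block of 16 pixel differences. It is a simplified illustration of the idea described above, not the CFITSIO/fpack implementation: each difference is mapped to a non-negative integer, the lowest k bits are passed through verbatim, the high part is coded in unary, and the cheapest k for the block is kept.

    #include <limits.h>

    /* Encoded size in bits of one block of 16 pixel differences for the best
     * split level k (the number of low-order "noise" bits passed through
     * verbatim); *best_k returns the chosen level.  The high part of each
     * value is coded in unary: (value >> k) zero bits plus a stop bit. */
    static long rice_block_cost(const int diff[16], int *best_k)
    {
        unsigned u[16];
        long best = LONG_MAX;
        int i, k;

        /* map signed differences to non-negative integers:
           0, -1, 1, -2, 2, ...  ->  0, 1, 2, 3, 4, ... */
        for (i = 0; i < 16; i++)
            u[i] = diff[i] >= 0 ? 2u * diff[i] : 2u * (-diff[i]) - 1;

        for (k = 0; k < 16; k++) {          /* try each candidate noise-bit count */
            long bits = 0;
            for (i = 0; i < 16; i++)
                bits += (long)(u[i] >> k) + 1 + k;  /* unary part + stop bit + k raw bits */
            if (bits < best) { best = bits; *best_k = k; }
        }
        return best;
    }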
Hcompress.—The Hcompress algorithm was written to compress the Space Telescope Science Institute digitized sky survey images (White et al. 1992). It involves (1) a wavelet transform called the H-transform (a Haar transform generalized to two dimensions), followed by (2) an optional quantization that discards noise in the image while retaining the signal on all scales, followed by (3) a quadtree coding of the quantized coefficient bitplanes. In this study we omitted the quantization step, which makes Hcompress lossless. The H-transform computes sums and differences within pixel blocks, starting with small 2 × 2 blocks and then increasing by factors of two to 4 × 4, 8 × 8, etc., blocks. This is an exactly reversible, integer arithmetic operation, so a losslessly encoded set of the H-transform coefficients can be uncompressed and inversely transformed to recover the original image. The H-transform can be performed in-place in memory and requires enough memory to hold the original image (or image tile). To avoid overflow problems when summing the pixel values, the memory array is expanded by a factor of two so that each pixel has twice as many bits as in the original image. The Hcompress bitplane coding, which proceeds by first compressing the most significant bit of each coefficient (mostly zeros) and working down to the least significant bit (usually noise), has the effect of ordering the image description so that the data stream gives a progressively better approximation to the original image as more bits are received. This was used to create an efficient adaptive scheme for image transmission (Percival & White 1996).
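The kernel of the H-transform is just integer sums and differences over 2 × 2 pixel blocks. The C fragment below sketches one level of this transform applied in place; the sign conventions are illustrative, and the bookkeeping that controls the factor-of-two growth of the coefficients (which keeps the production Hcompress transform exactly reversible) is omitted here.

    /* One level of the two-dimensional H-transform applied in place to an
     * nx-by-ny image of long integers: each 2x2 block is replaced by its
     * sum and three differences. */
    static void htransform_level(long *img, int nx, int ny)
    {
        int i, j;
        for (j = 0; j + 1 < ny; j += 2) {
            for (i = 0; i + 1 < nx; i += 2) {
                long a = img[j * nx + i],       b = img[j * nx + i + 1];
                long c = img[(j + 1) * nx + i], d = img[(j + 1) * nx + i + 1];

                img[j * nx + i]           = a + b + c + d;   /* smoothed (sum) term   */
                img[j * nx + i + 1]       = a - b + c - d;   /* horizontal difference */
                img[(j + 1) * nx + i]     = a + b - c - d;   /* vertical difference   */
                img[(j + 1) * nx + i + 1] = a - b - c + d;   /* diagonal difference   */
            }
        }
    }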
PLIO.—The IRAF (Tody 1993) Pixel List I/O (PLIO) algorithm was developed to store integer image masks in a compressed form. This special-purpose run-length encoding algorithm is very effective on typical masks consisting of isolated high or low values embedded in extended regions that have a constant pixel value. Our implementation of this algorithm only supports pixel values in the range 0 to 2^23. Because of the specialized nature of the PLIO algorithm, we only discuss its use for compressing data masks, in § 4.3.
GZIP.—The popular GZIP file compression utility (Gailly & Adler 1992) is the de facto standard compression method currently used in the astronomical community. Nearly all major observatories and data archive centers distribute their data as GZIP-compressed files. For this reason, GZIP serves as the baseline of comparison for the other compression methods in our study. GZIP uses a variation of the Lempel-Ziv algorithm (Ziv & Lempel 1977) to build a dictionary of repeated byte sequences occurring in the input, replacing each occurrence with a short code. The most important distinguishing characteristic of GZIP compared to the other compression algorithms used in this study is that GZIP treats each 8-bit byte of the input data stream as an independent datum, whereas the other compression methods operate on the numerical value of the input image pixels as multibyte quantities. This puts GZIP at a distinct disadvantage when compressing astronomical images with 16-bit or 32-bit pixel values because, unlike the Rice and Hcompress algorithms, GZIP cannot use the numerical difference between adjacent pixels as a means of improving the compression. As a result, it becomes less effective when increasing noise makes repeated bit patterns less common.
It should be noted that the GZIP algorithm has a user-selectable parameter for fine-tuning the trade-off between speed and compression ratio, where a value of one gives the fastest compression at the expense of file size and nine gives the highest compression at the expense of speed. Using the fastest value of one instead of the default value of six can increase the compression speed by a factor of two or more while only increasing the compressed file size by a few percent; we therefore used this setting in all the speed comparison tests in this study. One small side effect, however, is that it increases the subsequent image uncompression time by about 10%.
Within this study, the GZIP algorithm is used in two different processing contexts which have significantly different speeds. In the first context, the GZIP program on the host computer is used to externally compress the FITS image, and in the other context the GZIP algorithm is used within the FITS tiled image convention to compress each image tile. The numerical algorithm is identical in both cases; however, the host-GZIP program only takes about half as much CPU time as the tiled-GZIP method to compress the same image. This difference is mainly due to the fact that the host-GZIP program can more efficiently read and write the input and output files as sequential streams of bytes, whereas the tiled image compression method requires random access to the FITS files, which in turn requires that the input and output data be copied to intermediate storage buffers in memory. As will be demonstrated later, in spite of this extra processing overhead the tiled image Rice algorithm can still compress images several times faster than the host GZIP program.
3. THE EFFECT OF NOISE ON LOSSLESS IMAGE COMPRESSION
The fundamental principle that limits the amount of lossless image compression is the well known fact that noise (that is, entropy) is inherently incompressible. In this section we quantitatively demonstrate how the amount of noise in an image can be measured and used to predict the lossless image compression ratio.
In order to study the effect of noise on image compression in a large sample of images, we need an algorithm that can reliably estimate the amount of noise in any given image without manual interaction or iterations. Such an algorithm was recently developed by Stoehr et al. (2007) for measuring the signal-to-noise ratio (S/N) in spectroscopic data. It is based on the median absolute deviation (MAD) method of quantifying the variations in a sample of values, which is less sensitive to the presence of outlying values than other simple statistical measures such as the standard deviation. To help mitigate the possible presence of correlations in the noise between adjacent pixels in an image, Stoehr et al. (2007) developed a series of higher-order MAD equations that use the differences between the values of every other pixel in each row of the image. In our study we have adopted their third-order MAD equation (as they did) as a good compromise between simplicity and accuracy. This MAD value for each row of the image is given by

    σ = (1.482602 / √6) × median( |2x_i − x_{i−2} − x_{i+2}| ),   (1)
where i is the vector index of the pixel within each row of the image, and x is the value of the indicated pixel. The median value is computed over all the pixels in the row. (A C-language implementation of this algorithm is available in our image compression software, which can be downloaded as described at the end of this article.) The use of the median in this formula is very effective at eliminating any outlying large deviations, for instance, close to the images of stars and galaxies in typical astronomical images. Also, this formula is insensitive to large-scale gradients in the mean background across the image. As a result, this formula provides a good measure of the pixel variations relative to the local background regions of the image, in between any localized brighter objects in the image.
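A minimal C version of this estimator for a single image row is sketched below (the implementation distributed with our software differs in its details): it forms the absolute third-order differences of every other pixel, takes their median, and applies the Gaussian scaling factor of equation 1.

    #include <math.h>
    #include <stdlib.h>

    static int cmp_double(const void *a, const void *b)
    {
        double d = *(const double *)a - *(const double *)b;
        return (d > 0) - (d < 0);
    }

    /* Estimated Gaussian sigma of the background noise in one row of n pixel
     * values (n must be at least 5); returns -1 on allocation failure. */
    double mad_row_sigma(const double *x, long n)
    {
        long i, m = 0;
        double sigma;
        double *dev = malloc((size_t)(n - 4) * sizeof(double));
        if (dev == NULL) return -1.0;

        for (i = 2; i < n - 2; i++)          /* 2nd difference of every other pixel */
            dev[m++] = fabs(2.0 * x[i] - x[i - 2] - x[i + 2]);

        qsort(dev, (size_t)m, sizeof(double), cmp_double);
        sigma = 1.482602 / sqrt(6.0) *       /* MAD-to-sigma factor for Gaussian noise */
                (m % 2 ? dev[m / 2] : 0.5 * (dev[m / 2 - 1] + dev[m / 2]));
        free(dev);
        return sigma;
    }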
It is easier to understand how noise affects image compression by considering a hypothetical image that has BITPIX bits per pixel (where BITPIX is usually 8, 16, or 32 in astronomical images) and where the lowest N bits of each pixel contain only noise (i.e., where each bit is randomly assigned a value of 0 or 1), and the higher order bits are all set to 0. A histogram of the integer pixel values in such an image will have a flat distribution ranging from 0 to 2^N − 1. Since the N noise bits in each pixel are incompressible, the maximum possible compression ratio that can be achieved, if all the remaining bits are infinitely compressed to 0, is given simply by

    R = BITPIX / Nbits,   (2)
where Nbits is the (average) number of noise bits per pixel in the image. Nbits is equivalent to the Shannon (1948) entropy expressed in bits (see the Appendix). In practice, no compression algorithm is perfectly efficient, so the achieved compression ratio is somewhat lower and can be written as

    R = BITPIX / (Nbits + K),   (3)

where K is the average number of extra bits per pixel in the compressed image and serves as a measure of the efficiency of the compression algorithm (K = 0 for an ideal algorithm).
Unlike in this hypothetical example, the noise in astronomical images is usually not neatly confined to the lowest N bits of each pixel. The actual noise distribution in the types of astronomical CCD images studied in this article is a complex sum of the Poisson photon noise plus multiple noise sources in the detector, such as from the charge transfer and read-out electronics. In practice, the sum of all these noise contributions near to the background level in a CCD image can be closely approximated by a Gaussian. (Fitting a Gaussian to the histogram of the pixel values is a common image-processing technique for measuring the mean background level.) If one assumes that the noise in the image pixels has a Gaussian distribution, with standard deviation σ, then, as we show in the Appendix, the equivalent number of noise bits per pixel is given by

    Nbits = log2(σ√12) = log2(σ) + 1.792,   (4)
and the expected compression ratio for this image is then given by

    R = BITPIX / (log2(σ) + 1.792 + K).   (5)
Thus, in summary, we can use the MAD equation 1 to estimate the noise in the background regions of an image, and then use equation 5 to predict how much that image can be losslessly compressed. For example, a 16-bit integer image with MAD σ = 18 contains about 6 bits per pixel of uncompressible noise and will compress by a factor of 2.3 when using a compression algorithm that has K = 1.
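The two steps collapse into a one-line prediction, sketched here in C under the same assumptions (K is the algorithm overhead measured in this study, for example about 1.2 bits per pixel for Rice):

    #include <math.h>

    /* Predicted lossless compression ratio from the measured background sigma,
     * the pixel bit length, and the algorithm overhead K in bits per pixel. */
    double predicted_ratio(double sigma, int bitpix, double K)
    {
        double nbits = log2(sigma) + 1.792;   /* equation 4 */
        return bitpix / (nbits + K);          /* equation 5 */
    }
    /* Example from the text: predicted_ratio(18.0, 16, 1.0) returns about 2.3 */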
3.1. Compression of Synthetic Images with Known Noise Properties
Before measuring the performance of the different compression algorithms on real astronomical images, it is instructive to first study the behavior of the algorithms on two sets of synthetic 16-bit and 32-bit integer images that by construction have noise properties that match the two cases discussed in § 3. In the first set of synthetic images, the lowest N bits of each integer pixel were randomly assigned a value of 0 or 1 and the upper bits are all set to 0. In the second set of synthetic images, the pixels were assigned values randomly selected from a Gaussian distribution, with σ ranging from 1.0 to 500. (We also added a constant offset to the pixels in these images to avoid negative values; however, we verified, as expected, that the magnitude of this offset has no effect on the performance of the compression algorithms.) The equivalent number of noise bits per pixel in this second set of images is given by equation 4.
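For readers who wish to reproduce this test, the following C helpers sketch one way to generate the two kinds of synthetic pixels; they are our own illustrative functions (using the C library rand generator and a Box-Muller transform), not the code used to make the figures.

    #include <math.h>
    #include <stdlib.h>

    /* Pixel with nbits random low bits and all higher bits zero.  rand() is
     * only guaranteed to supply 15 random bits, which suffices here; a better
     * generator should be used in practice. */
    int uniform_noise_pixel(int nbits)
    {
        return rand() & ((1 << nbits) - 1);
    }

    /* Pixel drawn from a Gaussian of standard deviation sigma (Box-Muller),
     * added to a constant offset so that negative values are avoided. */
    int gaussian_noise_pixel(double sigma, int offset)
    {
        double u1 = (rand() + 1.0) / (RAND_MAX + 2.0);   /* avoid log(0) */
        double u2 = (rand() + 1.0) / (RAND_MAX + 2.0);
        double g  = sqrt(-2.0 * log(u1)) * cos(6.283185307179586 * u2);
        return offset + (int)lround(sigma * g);
    }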
We then measured how the compression ratio of these synthetic images varies as a function of the number of noise bits (or equivalent noise bits) in the image when using the three general-purpose tiled-image compression algorithms, Rice, GZIP, and Hcompress. Instead of directly plotting the compression ratio, R, it is more informative to plot the reciprocal quantity, BITPIX/R (the average number of compressed bits per pixel), as a function of the number of noise bits per pixel, because in this coordinate frame the lines of constant K value have a slope = 1.0 and a Y-intercept = K. The results for the 16-bit integer synthetic images are shown in Figure 1 where the solid lines are derived from the images with uniformly distributed N bits of noise, and the circular or triangular points are derived from the other set of images with a Gaussian noise distribution. The lowest line and set of points is derived using Hcompress, the middle ones using Rice, and the upper ones using GZIP. The diagonal dotted lines represent the different constant K values, and the horizontal dashed lines show the image compression ratio that corresponds to the compressed bits per pixel scale on the Y axis.
In the case of the Hcompress and Rice algorithms, it can be seen that the points lie almost exactly on top of the corresponding line. This provides strong empirical confirmation of the horizontal offset of 1.792 bits (derived in the Appendix) that is used in equation 4 to convert the measured Gaussian σ into an equivalent number of uniformly distributed noise bits. It also shows that each of these two algorithms can be characterized by a nearly constant K value, independent of the noise level: about 0.8 bits per pixel for Hcompress and 1.2 bits per pixel for Rice.
It is apparent that GZIP behaves very differently from Rice or Hcompress in Figure 1. GZIP cannot be parameterized with a single value of K; instead, its effective K value ranges from about 2 to 5 bits per pixel, depending on the amount and distribution of the noise in the image. Unlike the other two algorithms, GZIP does not compress the two types of synthetic noise images equally well; it compresses the images in which the noise is confined to the lowest N bits better than the images with a Gaussian noise distribution. It is interesting that K appears to reach a maximum at Nbits = 8, which is where the noise propagates into the more significant byte of the 2-byte pixel values. We attribute most of these differences between GZIP and the other two algorithms to the fact that Hcompress and Rice treat each 16-bit pixel value as a single integer number, whereas GZIP treats each 8-bit byte as an independent datum.
The equivalent plot for the synthetic 32-bit integer images is shown in Figure 2. The relations for Rice and Hcompress are virtually identical to those for the 16-bit integers, and in particular, the K values are the same. As was the case with 16-bit images, GZIP behaves quite differently and has a variable K value that approaches 8 bits per pixel for the noisiest images (i.e., has a full byte per pixel of overhead compared to the ideal K = 0 algorithm).
4. COMPRESSION OF 16-BIT ASTRONOMICAL IMAGES
In this section we examine how well the different compression methods perform on real 16-bit integer astronomical images. The primary data set used in these tests is the set of images that were taken during the night of 2006 July 27–28 at Cerro Tololo Inter-American Observatory using the Mosaic CCD camera. This data set contains a variety of different types of images that are typically taken during an observing session, including 0 s exposure bias frame images; heavily exposed flat-field images; short exposures (10 s–30 s) of bright calibration stars; and longer exposures (500 s–600 s) containing randomly distributed images of stars, faint galaxies, and diffuse emission. These are typical of the types of images obtained by many sky-survey projects. The Mosaic camera contains eight individual CCD detectors, and each detector has two amplifiers that read out half of the chip each. Thus, every exposure with this camera results in a FITS file containing 16 image extensions that are each 1112 by 4096 pixels in size. In total, this data set consists of 102 FITS files containing 1632 separate FITS image extensions.
To complement this large homogeneous set of images taken with the Mosaic camera, we also included in our test sample a smaller set of 16-bit integer images taken with a variety of other instruments. First we included the suite of test images that were collected by Murtagh & Warmels (1989) for use in testing image-processing techniques. This sample of images has been used in many previous studies (Richmond & Ellman, http://www.tass-survey.org/richmond/rice/rice.html; Sabbey 1999; Grunler et al. 2006) and remains available from ftp://iraf.noao.edu/iraf/extern/focas.std.tar.Z. In order to further increase the diversity of images in our sample, we included nine other more recent deep-sky images obtained from the archives of the Hubble Space Telescope, ESO, and the Anglo-Australian Observatory.
We compressed and uncompressed each of these images using the Rice, GZIP, and Hcompress algorithms supported by our tiled-image compression software, and in each case recorded the compression ratio and the elapsed compression and uncompression CPU times. These same parameters were measured using the GZIP utility program on the host computer to externally compress and uncompress the images. These host GZIP tests were performed on a single FITS image extension instead of on the whole multiextension file, to be comparable with the tiled-image compression tests which also operate on a single image extension at a time. We also calculated the MAD pixel dispersion in the background regions of each image from equation 1 and the corresponding equivalent number of noise bits from equation 4.
4.1. Compression Ratio versus Noise
One of the most striking results of this study, as shown in Figure 3, is the very tight correlation between the compression ratio and the measured number of equivalent noise bits in the Mosaic camera images (plotted with small + symbols). The gray line going through these points is derived from the two sets of synthetic noise images discussed in § 3.1 and corresponds closely to the K = 1.2 line in Figure 1. (The compression ratio, R, is plotted here, rather than the reciprocal quantity that was used in the previous figures, because R is the quantity of more direct interest to most users.) The larger points in this figure are derived from the comparison sample of images, and show that they also generally follow the same relation as the Mosaic images, but with somewhat larger scatter.
The close agreement between the Mosaic camera images and the synthetic noise images in Figure 3 demonstrates that the presence of "objects" in the astronomical images (e.g., the stars and galaxies) has little impact on the compressibility of these images. This is also demonstrated by the continuity between the Mosaic camera bias and flat-field images (which contain no objects) and the images of the sky. This is simply a result of the fact that most of the pixels in the Mosaic camera sky images have values close to the background level and only a few percent of the total image area is significantly affected by the brighter objects. Thus, even if the algorithms do not compress the regions close to these bright objects very effectively, the overall compression ratio of the image will still mainly be determined by the compressibility of the background regions. Since our MAD noise estimation algorithm (eq. 1) measures the noise in the background regions, it is an excellent predictor of the overall image compression ratio of these images.
Obviously there can be exceptions to this general rule if the objects in the image occupy a significant fraction of the image area. This is the case for the two images in our comparison sample that are plotted with open circles in Figure 3. These are a pair of long and short exposures of the NGC 3201 globular cluster (the ngc0001.fits and ngc0002.fits files from the Murtagh & Warmels 1989 image suite) which contain an unusually dense pattern of star images. In cases like this, the overall image compression ratio can be significantly less than what would be predicted simply from the pixel variations in the local background regions between the star images.
It is also interesting to note in Figure 3 that the Mosaic camera images are segregated into three distinct groups that correspond to the bias frame images (containing the least amount of noise), the short exposures of calibration stars (the middle group), and the deeper exposures of the sky and flat fields (containing the most noise). This grouping is just a reflection of the fact that these three different types of images have distinctly different average pixel values, and the noise (mainly from Poisson statistics) scales roughly as the square root of that value.
The insert in Figure 3 shows a magnified section of the data in which it can be seen that the points tend to lie along distinct bands which correspond to the 16 different detectors in the Mosaic camera. The points from a single detector are tightly correlated and thus most of the scatter seen in this figure is due to the systematic differences between image detectors. We will come back to this effect in the discussion of the NEWFIRM images in the following section which show an even more pronounced banding pattern.
4.2. Comparison of Different Compression Algorithms
Figure 4 compares the compression ratios achieved by all three algorithms plotted as a function of the equivalent number of noise bits in the Mosaic camera CCD images. As can be seen, Rice and Hcompress achieve very similar compression ratios that are about 1.4 times greater than when using GZIP. The bottom panel shows that this ratio varies from about 1.5 for images with low to moderate amounts of noise, to about 1.3 for the noisiest images. The middle panel shows that Hcompress produces about 2% to 5% better compression than Rice, but as discussed below, this small gain is usually not worth the much higher required CPU times.
The upper solid curve in Figure 4 shows the theoretical maximum compression ratio, given by BITPIX/Nbits (eq. 2), of an ideal lossless algorithm that infinitely compresses all the nonnoise bits in the image (i.e., an algorithm with K = 0). The Rice and Hcompress algorithms have measured K values of 1.2 and 0.8 bits per pixel, respectively, and thus produce about 75% to 90% of the ideal amount of compression, depending on the noise level.
The relative compression and uncompression speeds1 of the different methods are shown in Figures 5 and 6. The Rice compression algorithm is 2–3 times faster than Hcompress (depending on the amount of noise) and 4–6 times faster than tiled-GZIP (or about 2–3 times faster than the host-GZIP utility, as explained in § 2). When uncompressing images, Rice is 2.5–3 times faster than Hcompress, and 1.6 to 2 times faster than tiled GZIP (or about the same speed as the host GUNZIP utility).
The mean compression ratios and the mean compression and uncompression CPU times for all 1632 Mosaic camera images are given in Table 1 for each of the compression methods. The CPU times are relative to the Rice algorithm. As a benchmark reference, a Linux PC with a 2.4 GHz AMD Opteron 250 dual core processor (using only one of the processors) can Rice-compress a 50 MB 16-bit integer FITS image in 1 s. Uncompressing this image also takes about 1 s.
4.3. Special Case: Data Masks
Data mask images are often used in processing environments as a means of flagging special conditions that affect the corresponding pixels in an associated astronomical image. The pixel values in a data mask contain essentially no noise, so the general relationship between noise and compression ratio as discussed previously does not apply. Instead of being limited by noise, the maximum compression ratio of a data mask is determined by internal limits within each compression method. With the Rice algorithm, for example, each block of 32 image pixels (= 64 bytes) can at most be compressed to a single 4-bit code value, for a maximum compression ratio of 128. Similarly, the maximum compression ratio is about 200 for GZIP and about 700 for both Hcompress and PLIO. Other factors, such as the image tile size, can further limit the compression ratio, but in practice, large compression ratios of 50 or more can often be achieved.
In many cases, choosing the algorithm that produces the very highest compression ratio is of secondary importance because the masks compress so well with any algorithm that the size is insignificant compared to the rest of the associated data set. The compression and uncompression speed of the algorithm can be a more significant consideration in this case. Our tests on a sample of data masks show that Rice and the IRAF PLIO algorithm are the fastest, but GZIP and Hcompress are less than a factor of 2 slower. Overall, PLIO provides the best combination of compression ratio and speed when compressing data masks, but the difference with the other algorithms is not great. In practice, it may be simplest to just use the same compression algorithm on the data mask as is used on the associated astronomical image.
5. COMPRESSION OF 32-BIT INTEGER IMAGES
Astronomical images in 32-bit integer format are less frequent than 16-bit integer images because few imaging detectors produce more than 16 bits per pixel of precision. One of the few such instruments in general use is the NEWFIRM near-infrared camera at the Kitt Peak National Observatory. In order to measure the performance of the different compression methods on 32-bit integer images, we used a sample of FITS images taken with this camera during the night of 2008 February 24–25. These images are similar to the Mosaic data set, and include bias frames, flat fields, short exposures of calibration stars, and longer exposures of the sky containing images of stars and faint galaxies. The NEWFIRM instrument contains a mosaic of four imaging detectors, each of which is 2112 by 2048 pixels in size. There are 447 NEWFIRM observations in our data sample, giving a total of 1788 separate images.
5.1. Comparison of Different Compression Algorithms
We repeated the same analysis as was done in the previous section on the 16-bit images to measure the compression ratios and the CPU times required to compress and uncompress each of the 32-bit integer NEWFIRM images with each different compression method. Figure 7 shows how the Rice and tiled-GZIP compression ratios depend on the measured number of equivalent noise bits in each image. (The points for the Hcompress algorithm have been omitted for clarity because they lie only slightly above the Rice points.) This figure is similar to the corresponding Figure 4 for 16-bit integer images, except that the compression ratios are twice as large, given the same amount of noise. This is to be expected from equation 3 and is a natural consequence of the fact that a 32-bit integer image is a factor of 2 larger than a similar 16-bit image, but if they both have the same equivalent number of noise bits per pixel then the compressed images will be identical in size when using algorithms like Rice and Hcompress that have a constant K value.
This 2:1 relationship in compression ratios does not hold for the GZIP algorithm because K is larger for 32-bit integer images, as can be seen by comparing Figures 1 and 2. Thus, the compression ratio of a 32-bit image when using GZIP is only about 1.6 times greater than that of a 16-bit image with the same equivalent number of noise bits.
As was also the case for the 16-bit astronomical images, the different types of images are segregated into separate regions of Figure 7 because the mean pixel value, and hence the amount of noise, is distinctly different. The clump of points with Nbits < 5 are the 0 s exposure "bias" calibration exposures, the points with 5 < Nbits < 7 are the short exposures of calibration stars, and the remaining points with larger noise values are the deep-sky images and the heavily exposed flat-field images.
The upper solid curve in Figure 7 shows the maximum possible lossless compression ratio that would be achieved by an ideal compression algorithm that has K = 0. Hcompress and Rice, with K values of 0.8 and 1.2, respectively, are within 75% to 90% of this theoretical limit, depending on the amount of noise, just as with 16-bit integer images.
Finally, one other prominent feature in Figure 7 is that the points are split into distinct bands that correspond to the four different detectors in the NEWFIRM camera. This is similar to the banding seen in the 16-bit Mosaic camera image, but on a larger scale. Unlike the CCD detectors in the Mosaic camera, which are closely matched in image quality, the four infrared imaging devices in the NEWFIRM camera have distinctive characteristics. One notable feature is that these images show faint streaks in the background, and the streaks run vertically in two of the chips and horizontally in the other two (as a result of the way the chips are oriented in the camera). Since our MAD algorithm calculates the noise on a row by row basis, the noise value is larger in the cases where the rows cut across the grain of the streak pattern. This causes a systematic displacement of the points from the different chips in the figure.
Naively, one would expect the displacements would fall along the gray line in the figure that is derived from synthetic images with uniformly or Gaussian-distributed noise, so that any change in the noise level causes a corresponding change in the compression ratio according to equation 5. But as can be seen, there are instances where two images taken with different detectors have the same equivalent number of noise bits but have systematically different compression ratios (i.e., there is a vertical displacement between the points from the different chips in the figure). This indicates that these images have properties that differ in some way from the assumptions that went into the derivation of equation 5. As a simple example, if one were to add a small constant value to all pixels in the even-numbered columns of an image, this would increase the dispersion in the pixel values along each row of the image, causing the compression ratio to decrease; however the calculated MAD noise value would remain unchanged because that calculation is based on the differences between the values of every other pixel in the row. More generally, any deviations from the assumed pure Gaussian distribution in the image pixel values can lead to systematic offsets in the calculated equivalent number of noise bits, or in the efficiency of the compression algorithm (i.e., the K value), or both. At some level, every physical device will show deviations from an ideal detector. These effects are relatively large in the NEWFIRM detectors, but the banding seen in the insert in Figure 3 indicates that they are also present in the 16-bit Mosaic camera image at a smaller level. This effect probably also accounts for some of the scatter seen in that figure in the comparison sample of images taken with other instruments. If one were to analyze a larger sample of images from each of those instruments, they would likely fall within a fairly narrow region in that figure.
A comparison of the CPU times required to compress and uncompress the 32-bit integer images with GZIP or Hcompress, relative to Rice, is shown in Figures 8 and 9. The average compression ratios and the relative compression and uncompression CPU times for all 1788 NEWFIRM images are also summarized in Table 2. As can be seen, the speed advantage of Rice over Hcompress or GZIP is even greater when compressing or uncompressing 32-bit images than with 16-bit images. Our benchmark Linux machine (2.4 GHz AMD Opteron 250 dual core processor) can Rice-compress a 90 MB 32-bit integer FITS image in about 1 s, and can uncompress the same image in about 1.2 s.
5.2. Special Case: Representing Floating Point Images as Scaled Integers
Instead of storing real-valued images using the 32-bit IEEE floating point number representation, a widely used FITS convention converts the floating point values into scaled integers, where the (approximate) floating point value is then given by

    value = BSCALE × (scaled integer) + BZERO,
where BSCALE and BZERO are linear scaling constants given as keywords in the header of the FITS image. This is technically a "lossy" compression technique that quantizes all the pixel values into a set of discrete levels, spaced at intervals of BSCALE. Ideally, the quantization levels should be spaced finely enough so as to not lose any scientific information in the image, but without recording each value with an excessive amount of precision.
Unfortunately, a common practice is to simply scale the image so that the pixel values span the full 32-bit integer range. This has the effect of greatly magnifying the dispersion in the scaled integer values, which makes them virtually incompressible.
In order to achieve higher compression, a better technique, as described in detail by White & Greenfield (1999), is to choose the BSCALE value so that the quantized levels are spaced at some reasonably small fraction of the noise in the image, such that

    BSCALE = σ / D,

where D specifies the number of quantization levels per σ of noise,
and where σ is calculated from equation 1. The number of noise bits per pixel that are preserved in this case, from equation 4, is simply log2(D) + 1.792. In order to achieve the best compression, data providers should choose the smallest value of D that still preserves the required scientific information in the compressed image. This can be determined by performing the same photometric and astrometric data analysis on both the original and the compressed version of the image. The amount of precision that needs to be retained of course depends on the particular application (e.g., quick-look or preview images need much less precision than those intended for full scientific analysis), but previous experiments (see, for example, Fig. 2 in White & Greenfield 1999) suggest that values of D in the range of 10 to 100 (about 5 to 8 equivalent noise bits per pixel) may be sufficient to retain all the scientifically useful information in the image.
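A sketch of this quantization step in C is shown below; the variable and function names are our own, and error handling and special cases (such as undefined pixels) are omitted.

    #include <math.h>

    /* Quantize one tile of floating point pixels into scaled integers using a
     * level spacing of sigma/D, where sigma is the MAD noise estimate of
     * equation 1.  The FITS convention is value = BSCALE*integer + BZERO. */
    void quantize_tile(const float *fpix, int *ipix, long n,
                       double sigma, double D, double bzero)
    {
        double bscale = sigma / D;           /* spacing of the quantization levels */
        long i;
        for (i = 0; i < n; i++)
            ipix[i] = (int)lround((fpix[i] - bzero) / bscale);
    }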
6. COMPRESSION OF FLOATING POINT IMAGES
Compressing astronomical images that have 32-bit floating point format pixel values presents special challenges. One difficulty is that many compression algorithms, including Rice and Hcompress, can by design only operate on integer data. Another problem is that in many cases the floating point image is derived from what was originally a 16-bit integer image (after applying various calibration operations such as bias subtraction, flat-fielding, sky subtraction, flux calibration, etc.) and the resulting expansion from a 16-bit to a 32-bit representation can leave the less significant bits of the mantissa effectively filled with uncompressible noise. One way to mitigate this effect is to artificially set some of the least significant bits in every pixel to zero, which serves to more coarsely quantize the pixel values. This is particularly effective when compressing the floating point image file with the GZIP utility, because this reduces the number of different byte patterns present in the file, thus increasing the compression efficiency of the algorithm. Some observatories, such as the Swift X-ray and Gamma-ray satellite, have used this quantization technique in their processing pipeline to significantly reduce the size of the GZIP-compressed files in their data archive.
Rather than directly quantize the floating point values, we use an equivalent quantization technique of converting the floating point values into scaled 32-bit integers as described in section 5.2. These integers are then compressed using the Rice algorithm. The linear scaling parameters are calculated independently for each tile (row) of the image so that the quantization levels are spaced at a user-specified fraction of the measured noise in the tile. This effectively discards some of the lower-order bits in the mantissa of the floating point values, which typically do not contain any significant information. Depending on how many bits are discarded, the image compression ratio of the scaled integer image can be dramatically increased. As is discussed in § 5.2, however, it is incumbent upon the user to determine the appropriate scaling level so as to not degrade the scientific usefulness of the image.
7. EFFECT OF TILING PATTERN ON COMPRESSION PERFORMANCE
There are a number of considerations in choosing an appropriate tiling pattern when compressing an image. First, the tile must be sufficiently large for the compression algorithm to operate efficiently. As a general guideline, the lower limit is about 500 pixels for the Rice algorithm and about 2000 pixels for GZIP. Below these levels, the compression time for the image and the size of the compressed file both begin to increase. The Hcompress algorithm is inherently different from Rice and GZIP in that the wavelet transform only operates on two-dimensional arrays of data. At a minimum it requires tiles containing at least four rows of the image, and it reaches near-maximum efficiency when the tiles contain about 16 rows. For this reason we adopted 16 rows of the image at a time as the default tiling pattern in our software when using Hcompress.
The other main consideration when choosing a tile size is how the software that reads the image will access the pixels. The two most common access methods used by astronomical software are either to read the entire array of pixels in the image into computer memory all at once, or to read the image sequentially one row at a time. In the first case, the specific tiling pattern makes little difference because the reading routine simply has to uncompress each and every tile in the image once and pass the array of uncompressed pixels back to the application program.
If the application program reads the image one row at a time, then the tiling pattern can have a major effect on the reading speed. If each tile contains multiple rows of the image (and in the extreme case, the whole image could be compressed as one big tile), then the FITS file reading routine must uncompress the whole tile in order to extract a single row. It would obviously be very inefficient to repeatedly uncompress the same tile each time the application program requests the next row of pixels. Instead, a recommended implementation strategy is to temporarily store the most recently accessed uncompressed tile in memory, so that it is still available in case the application program reads more pixels from that same tile. This caching technique adds some computational overhead, however, so in general the default single row tiling pattern is more efficient for applications that read an image row by row.
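The sketch below illustrates this single-tile cache in C for an image tiled by whole rows; the structure and the uncompress_tile decoder are hypothetical stand-ins for illustration, not part of the CFITSIO interface.

    #include <string.h>

    void uncompress_tile(long tile, int *buffer);   /* hypothetical tile decoder */

    typedef struct {
        long cached_tile;      /* index of the tile currently held, or -1 for none */
        long rows_per_tile;    /* number of image rows stored in each tile         */
        long row_len;          /* number of pixels in one row                      */
        int *buffer;           /* uncompressed pixels of the cached tile           */
    } tile_cache;

    /* Copy one image row into 'out', uncompressing a tile only on a cache miss. */
    void read_row(tile_cache *c, long row, int *out)
    {
        long tile = row / c->rows_per_tile;
        if (tile != c->cached_tile) {              /* cache miss: decode a new tile */
            uncompress_tile(tile, c->buffer);
            c->cached_tile = tile;
        }
        memcpy(out, c->buffer + (row % c->rows_per_tile) * c->row_len,
               (size_t)c->row_len * sizeof(int));
    }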
A third type of image access occurs in applications that read a rectangular "cutout" from a much larger compressed image. In this case it can be efficient to use a rectangular tile pattern that approximates the size of the typical cutout. Only those tiles that overlap the cutout region will then have to be uncompressed. This tiling pattern may be grossly inefficient, however, for software that accesses the image one row at a time, unless a fairly sophisticated caching mechanism is implemented to store all the uncompressed tiles along a row.
In summary, the default row by row tiling pattern (or 16 rows at a time in the case of Hcompress) should work well in most situations. The main exception is if the images are very small, in which case it may be more efficient to compress multiple rows, or the entire image, as a tile.
8. SUMMARY
In this article we have performed a detailed evaluation of various lossless compression techniques on a large sample of astronomical images that are representative of the types of images from many sky-survey projects. Using optimal compression methods on such images can make a large difference in the amount of disk space and network bandwidth needed to analyze, archive, and distribute the images. We focus on lossless compression techniques because they may be adopted by data providers without any risk of losing information in the data. Lossy compression techniques can provide higher compression and may be appropriate in some situations, but the data provider must ensure that the required amount of astrometric and photometric precision is retained.
As we show in § 3, the amount by which an image can be losslessly compressed basically depends on two variables: the amount of noise in the image, and the compression efficiency of the algorithm. The first quantity can be expressed as the average number of bits of noise, Nbits in each pixel value. By definition, these noise bits cannot be compressed; therefore the total number of noise bits in the image sets the lower limit on the size of the compressed file. The efficiency of a compression algorithm is a measure of how well, on average, it is able to compress the remaining nonnoise bits in each pixel; we represent this quantity with the letter K, in units of bits per pixel. The lossless compression ratio is then simply given by the number of bits per pixel (e.g., 16 or 32) divided by the sum of Nbits and K.
The noise in astronomical images is usually not neatly packed into the lowest Nbits of each pixel value (otherwise they could simply be discarded since they contain no useful information). We measure the equivalent number of noise bits per pixel from the fluctuations (i.e., noise) in the pixel values in the background regions of the image. We use a MAD algorithm that was originally developed to calculate the S/N in spectroscopic data to robustly estimate the background fluctuations in any image. As we show in the appendix, if one assumes that the pixel fluctuations have a Gaussian distribution (which is a reasonable assumption in the CCD images discussed here) with standard deviation σ, then the equivalent number of noise bits per pixel in that image is given by Nbits = log2(σ√12) = log2(σ) + 1.792.
In § 3.1 we constructed synthetic images containing these two different noise distributions (i.e., Nbits of noise and Gaussian-distributed noise) to verify that the general-purpose compression algorithms used in our study, Rice, Hcompress, and GZIP, actually behave as expected. We found that Rice and Hcompress, which operate on the numerical 16-bit or 32-bit integer value of each pixel, do indeed show the expected relationship between the amount of noise in the image and the compression ratio. In particular, we confirmed that the offset of 1.792 bits, derived in the Appendix, correctly converts the measured Gaussian σ into the equivalent number of uniformly distributed noise bits. GZIP, by contrast, which treats each 8-bit byte as an independent datum, cannot be characterized by a single K value.
We then compared the various compression methods on a large homogeneous sample of 16-bit integer astronomical images taken with the NOAO Mosaic CCD camera, as well as on a smaller diverse sample of images taken with other instruments. One of the striking results shown in Figure 3 is how closely the real astronomical images follow the same tight correlation between compression ratio and noise content as in the synthetic images. This demonstrates that the presence of the stars and faint galaxies in many of these CCD images has very little effect on the lossless compression ratio. This is because these objects cover only a small fraction of the image area, and the overall compression ratio of the image is mainly determined by the compressibility of the large majority of pixels with values close to the local background. Of course if the density of the objects in the image is large enough, it will have a negative impact on the compression ratio of the image, as was seen in the case of a couple of images of the central region of a rich globular cluster.
Another interesting result seen in Figure 3 is that the different major types of images (i.e., the bias frames, the short exposures of calibration stars, and deeper exposures of the sky and flat fields) all follow the same continuous relation between compression ratio and noise content. They are segregated into different regions in the figure simply because the average pixel value, and hence the noise, is distinctly different. This can have significant consequences when estimating the storage needs of a data archive, because the bias frame and short calibration exposures compress much better than the flat field and deep exposures of the sky.
The comparison of the compression ratios achieved by the different algorithms in Figure 4 shows that Rice and Hcompress produce 1.3–1.5 times better compression than GZIP on 16-bit integer astronomical images. Hcompress produces a few percent better compression than Rice, but it also requires about three times more CPU time. Rice is 2–3 times faster than the host-GZIP utility when compressing images, and has about the same uncompression speed as GUNZIP. Overall Rice provides the best combination of speed and compression efficiency of the algorithms studied here. Since the compression ratios produced by Rice and Hcompress are already within 75% to 90% of an ideal algorithm with K = 0, any further improvements to the compression algorithms will produce relatively little gain in the size of the compressed images.
While 32-bit integer astronomical images are relatively uncommon, our compression tests on one such set of images taken with the NEWFIRM camera show that Rice has an even larger performance advantage over GZIP in both speed and compression ratio than was the case with the 16-bit images. One interesting consequence of the fact that the compression ratio of 32-bit images is twice that of 16-bit images (assuming the images contain the same amount of noise), and hence the compressed files have the same size, is that there is no disk space penalty in storing the compressed images as 32-bit integers instead of 16-bit integers. This might be desired to allow for a greater range of pixel values during subsequent data-processing operations on that image.
One of the motivations of our work is to publicize and promote a better alternative to simply using the GZIP utility to compress FITS images, as is currently done by most astronomical data providers. The FITS tiled-image compression format is more efficient for accessing individual images within a multi-image FITS file, and for reading a small section from a larger image, and the Rice algorithm produces much better compression and is faster than GZIP in most cases. This new compression format is also supported by major astronomical software packages such as the CFITSIO library and the ds9 image display program. To further encourage the use of this compression method, two open-source image compression programs that were used in this study (called fpack and funpack) are publicly available for general use from the HEASARC web site (http://heasarc.gsfc.nasa.gov/fitsio/fpack). These programs run on all major computer platforms and are invoked in the command line, just like the GZIP and GUNZIP utilities, to compress or uncompress any FITS image file. Various options can be specified in the command line to control the programs. More information about using these utility programs is available in the companion user's guide.
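For reference, the basic command-line usage is as simple as the following; this reflects our understanding of the default behavior, and the companion user's guide documents the options that select the compression algorithm, tile size, and quantization parameters.

    fpack image.fits          (produces a Rice-compressed copy named image.fits.fz)
    funpack image.fits.fz     (restores an uncompressed copy of the original image)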
Note added in proof. While this manuscript was in press, we learned of an article by Watson (2002) that compares the performance of several lossless compression algorithms to the Shannon entropy limit. It also studies in greater detail a variation on the lossy quantization method we discuss in § 5.2 for achieving higher image compression ratios.
APPENDIX A: DERIVATION OF THE EQUIVALENT NUMBER OF NOISE BITS
- 1. For images with a Gaussian noise distribution (for instance, the readout floor of a CCD), we derive the equivalent number of noise bits. Start by assuming N bits of uniform noise, and average over the range of data numbers (x = DN) for the expected values of x and x^2:

    ⟨x⟩ = (1/2^N) Σ_{x=0}^{2^N−1} x = (2^N − 1)/2
    ⟨x^2⟩ = (1/2^N) Σ_{x=0}^{2^N−1} x^2 = (2^N − 1)(2^{N+1} − 1)/6.

Solve for the variance:

    σ^2 = ⟨x^2⟩ − ⟨x⟩^2 = (2^{2N} − 1)/12.

In the limit of large N,

    σ ≈ 2^N / √12.

Solving for N then gives

    N = log2(σ√12) = log2(σ) + 1.792.

The factor of √12 can be identified as the familiar analog-digital quantization noise (Janesick 2001). The same result also follows with continuous variables by integrating the second moment of a stepwise probability density symmetrically over 2^N quanta. This discrete derivation makes the nonlinear limiting behavior at small values of N evident.
- 2. Our quantity Nbits is equivalent to the Shannon entropy (really just a definition of terms). Entropy is a sum over all possible states of some discrete random variable (pixels in our case), and depends only on the probabilities of each state, not on their values:

    H = − Σ_i p_i log2(p_i).

For uniform noise, all states are equally probable, so p_i = 1/2^N and

    H = − Σ_i 2^{−N} log2(2^{−N}) = N Σ_i 2^{−N} = N.

The sum equals unity since it is over 2^N states, where N = Nbits; thus H = Nbits.
- 3. For a continuous random variable with probability density f(x), the differential entropy is

    h = − ∫ f(x) log2 f(x) dx.

There are caveats as with any integral, e.g., one must consider whether a solution exists in each case. More fundamentally, the differential case (as the name suggests) provides a measure of entropy that can be understood only relative to a particular coordinate frame, unlike the discrete entropy, which provides an absolute measure of randomness. The differential entropy of a Gaussian probability density function is (Shannon 1948; Cover & Thomas 1991; Roger & Arnold 1994)

    h_g = log2(σ √(2πe)) = log2(σ) + 2.047.
- 4. How does our equation 4 correspond to the previous result? To answer this, consider Massey's bound on the discrete entropy (as derived in Cover & Thomas 1991, referencing unpublished work by J. Massey, and independently by F. Willems). Starting with the fact that the Gaussian probability distribution has the maximum entropy for a given variance, the differential expression leads to a bound on the entropy of a discrete random variable (a similar inequality holds for all permutations):

    H ≤ (1/2) log2[ 2πe (σ^2 + 1/12) ].

As before, assume N bits of uniform noise such that p_i = 2^{−N}. Use the same expressions to sum the two series as in item (1) above, but with limits from 1 to 2^N (i.e., k = 2^N, since we are counting states here, not calculating statistical moments). The variance of these 2^N equally probable states is again

    σ^2 = (2^{2N} − 1)/12,

so that

    H ≤ (1/2) log2[ 2πe · 2^{2N}/12 ] = N + (1/2) log2(2πe/12) ≈ N + 1/4.

However, we have already shown the identity H = Nbits, such that

    Nbits ≤ Nbits + 1/4.

The units of bits for the entropy are fixed when choosing the base-2 logarithm. It is beyond the scope of the current work to ponder the origin of this one-quarter bit offset. It is sufficient to note that the inequality is satisfied, and further to attribute the difference of (1/2) log2(2πe/12) ≈ 1/4 bit to a coordinate transformation in the differential entropy.
- 5. In particular, to motivate equation 4, we can remove this same coordinate offset from the Gaussian differential entropy (h_g) to obtain a measure of the Shannon entropy (H_g):

    H_g = h_g − (1/2) log2(2πe/12) = log2(σ √(2πe)) − log2(√(2πe/12)).

Therefore,

    H_g = log2(σ√12) = log2(σ) + 1.792 = Nbits.

In summary, equation 4 was derived by calculating the variance of Nbits of uniform random noise. A coordinate transformation of the differential entropy of a Gaussian leads to the same equation. (A Gaussian maximizes the entropy for a given variance, suggesting this equivalence is an identity.) Given Nbits, one can estimate the variance. On the other hand, given the statistical variance of some random variable (easily estimated from the pixel values of an astronomical image), an estimate follows of the equivalent number of effective noise bits, that is, of the image's entropy in bits. With an estimate of entropy in hand, the compression factor is calculable.
- 6.Figures 1 and 2 demonstrate empirical agreement of synthetic Gaussian data with the Nbits relation for both 16-bit and 32-bit integer pixels. Figures 3 and 7 empirically confirm this relation for the Gaussian read-noise dominated background of real world optical and infrared data sets, and further, for both on-sky and calibration data products.
Footnotes
- 1
The timing measurements in this article are based on CPU time and not on the total elapsed processing time, which includes the time needed to access data on magnetic disk. The latter time should be larger; however, it is difficult to measure consistently because of the sophisticated data caching techniques used by modern computer systems.