
The Sloan Digital Sky Survey Data Transfer Infrastructure


© 2015. The Astronomical Society of the Pacific. All rights reserved. Printed in U.S.A.
Citation: Benjamin A. Weaver et al. 2015 PASP 127 397, DOI: 10.1086/680999

1538-3873/127/950/397

Abstract

The Sloan Digital Sky Survey (SDSS) has been active for approximately 15 years as of this writing. SDSS continues to produce large amounts of data, effectively daily. SDSS needs an effective system for data transfer and management that can operate essentially free of human intervention. In 2008, with the commencement of the third phase of SDSS, SDSS-III, new needs and opportunities motivated a fresh look at the data transfer infrastructure. We have constructed and are releasing a Python package, transfer, that contains all the automation needed for daily data transfer operations. This package has been tested and used successfully for several years. Significant portions of this code will continue to be used as SDSS transitions to its fourth phase, SDSS-IV.


1. Introduction

The Sloan Digital Sky Survey (SDSS, York et al. 2000) has been collecting imaging and spectroscopic data from the Sloan Foundation 2.5-m telescope (Gunn et al. 2006) at Apache Point Observatory (APO) since 2000. The resulting raw and reduced data have been made public in a series of data releases. See Abazajian et al. (2009) for a discussion of Data Release 7 (DR7) and references to previous data releases.

Starting in 2008, the SDSS entered its third phase, SDSS-III (Eisenstein et al. 2011). SDSS-III has produced three previous data releases, Data Release 8 (DR8, Aihara et al. 2011), Data Release 9 (DR9, Ahn et al. 2012), and Data Release 10 (DR10, Ahn et al. 2014). The final data release of SDSS-III, Data Release 12 (DR12, Alam et al. 2015), was made public at the end of 2014 and contains approximately 116 TB of data. SDSS-III consists of four surveys that use the telescope and attached instruments in different ways and at different times: the Apache Point Observatory Galactic Evolution Experiment (APOGEE; S. Majewski et al., in preparation), the Baryon Oscillation Spectroscopic Survey (BOSS, Dawson et al. 2013), the Multi-object APO Radial Velocity Exoplanet Large-area Survey (MARVELS; J. Ge et al., in preparation), and the second phase of the Sloan Extension for Galactic Understanding and Exploration (SEGUE-2; C. Rockosi et al., in preparation).

Early in the SDSS, raw data were written to digital tape at APO and express shipped to Fermilab for processing. However, by the time of the second phase of SDSS, SDSS-II, the bandwidth from APO (a microwave link) became sufficient to transfer all data via standard Internet transfers.

In SDSS-III, we have a fully automatic data transfer pipeline that daily copies data from APO over a microwave link and then bulk internet to Lawrence Berkeley National Laboratory (LBNL). The LBNL site is also frequently referred to as the Science Archive Server (SAS). Data processing centers at Princeton University (SEGUE-2), University of Utah (APOGEE),4 and University of Florida (MARVELS) copy data from LBNL as needed using their own methods. The processed data are automatically copied back to LBNL for long-term archiving and preparation for release. Data processing for the BOSS survey is colocated with the central data repository at LBNL. Both raw and reduced data are copied to a mirror facility at New York University (NYU) and backed up to tape. The mirror facility at NYU is sometimes referred to as the Science Archive Mirror (SAM), or simply "the Mirror." These operations are controlled by a unified software package called transfer, which is written in Python (van Rossum & Drake 2006; https://www.python.org).

The remainder of this article is arranged as follows. First, we describe the history of the development of the transfer package in § 2. In § 3, we describe the basic structure of the data transfer pipeline system. In §§ 4.1 and 4.2, we discuss details of the transfer of raw and reduced data, respectively. We compare to other data transfer systems and offer some general discussion in § 5.

2. Development Of The Pipeline

2.1. History

During the earliest phases of SDSS, the Internet connectivity to APO had insufficient bandwidth to handle the data produced each night, which was dominated by imaging data (Gunn et al. 1998). Instead, data were written to digital tape and express shipped daily to Fermilab for processing.

Around the time of the start of SDSS-II, 2005 September, a 10 Mbit/s microwave link became available. Data could then be transferred via standard Internet connections to Fermilab. Data volume was still dominated by imaging. To handle the data volume, data transfers would commence as soon as the raw data were finalized by the instrument control system. On certain highly productive imaging nights, data transfers would start to fall behind. That is, data taking would commence before the data transfer for the previous night had completed (C. Loomis, private communication). The data transfer infrastructure was handled by a set of shell scripts (written in the bash shell language), collectively known as the Get Apache Point (GAP) software suite (J. Hendry, private communication). The SDSS imaging camera contained 30 imaging CCDs, one for each of 5 filters in each of 6 columns. These columns, called "camcols" in SDSS terminology, allowed the imaging data to be naturally divided into six groups, each of which could be transferred in parallel, thus making the best use of the available bandwidth. A standard data transfer utility, rsync (Tridgell & Mackerras 1996, http://rsync.samba.org), was used to handle the low-level data transfer, the actual reading and writing of bytes. The necessary rsync commands were wrapped by the GAP scripts.

Throughout SDSS-I and -II, primary data processing took place at Fermilab, and reduced data were made available to the collaboration and the public directly from there. Thus, there was no need for additional data transfer infrastructure, because data merely had to be written and copied on disk servers internal to Fermilab.

2.2. Requirements

With the commencement of SDSS-III, a number of factors required significant changes to the data transfer infrastructure. These factors included:

  • 1.  
    Fermilab would no longer be the primary data processing center. Instead, SDSS-III adopted a decentralized data processing model. That is, raw data associated with the SDSS-III surveys would be processed at a number of institutions.
  • 2.  
    The observers stationed at APO wanted to reserve the available bandwidth to support night-time operations, e.g., by facilitating remote observation and support. Thus, it was no longer desirable to transfer data as soon as it became available.
  • 3.  
    The APO bandwidth was upgraded to 20 Mbit/s.
  • 4.  
    Although imaging continued into the early years of SDSS-III, the primary focus would be on a number of different spectroscopic instruments with different raw data outputs.
  • 5.  
    In addition to a primary data server, located at LBNL, there would be a mirror data server at NYU. This mirror would serve both unreleased and public SDSS data.

These factors naturally lead to a new set of requirements on the SDSS-III data transfer architecture:

  • 1.  
    Commence raw data transfer operations at a fixed time of day instead of whenever the data became available; complete raw data transfers before night-time observations start.
  • 2.  
    Track the transfer completion status of individual nights of observation.
  • 3.  
    Distribute raw data to institutions hosting the data transfer operations for the SDSS-III surveys; gather the processed data back to the central data repository at LBNL and copy this processed data to the mirror at NYU immediately after.
  • 4.  
    Flexibly support a wider variety of raw data file types compared to SDSS-I and -II.
  • 5.  
    Due to security restrictions, control all data transfer operations from LBNL; wherever possible, use ssh to initiate data transfers securely.
  • 6.  
    Make minimal assumptions about software installed at APO and at remote data reduction facilities; a generic Linux installation should be sufficient in most cases.
  • 7.  
    Support automated tape backups.

2.3. Conceptual Design and Initial Development

Development on the SDSS-III data transfer infrastructure commenced in 2008 April, shortly before the official start of SDSS-III. The initial development was informed by the GAP code, but did not reuse any of it. The earliest code was written in Perl (Christiansen et al. 2012, http://www.perl.org). By the time of the start of SDSS-III, this initial raw data transfer pipeline had been tested and deployed. At this time the bulk of the raw data was still imaging data, and the principle of transferring each camcol in parallel using rsync was maintained.

Additional Perl programs supported verification of data checksums, automated tape backup to the High-Performance Storage System (HPSS)5 at the National Energy Research Scientific Computing Center (NERSC, which is affiliated with LBNL), and copy to the mirror facility at NYU. An additional Perl program wrapped the individual stages of the data transfer pipeline. This program would be invoked daily at the appropriate time by the cron daemon on the control server at LBNL.

SEGUE-2 was the first of the SDSS-III surveys to produce reduced data. This processing took place at Princeton University. The most natural method for retrieving SEGUE-2 reduced data was to wait until a particular reduction run ("rerun") was complete, then transfer all data associated with that rerun. The data would be aggregated into tar files, which could be copied to LBNL and NYU with the bulk data transfer program bbcp (Hanushevsky et al. 2001, http://www.slac.stanford.edu/~abh/bbcp/).

MARVELS data processing followed a similar model to SEGUE-2. Transfers would take place after a rerun was complete. However, security policy at the University of Florida required data transfers to take place using a dedicated rsync server provided by the MARVELS collaboration. Fortunately, the data volume was sufficiently small to make this transfer tractable.

BOSS data processing was colocated with the central data repository at LBNL, so it was only necessary for the data transfer pipeline to copy the reduced data to the NYU mirror.

APOGEE data processing started somewhat later than the other surveys. Processing took place at the University of Virginia. The APOGEE raw and reduced data volume is significant. In addition, as a new instrument, there was considerable demand for the reduced data to be quickly available to the entire collaboration. Every day, new or changed files were identified using rsync in "dry run" mode. These files were bundled into a tar file, and bbcp was used for the actual transfer.

The SDSS-III data flow is represented in Figure 1. This flow diagram is independent of any specific software package. In other words, transfer was written to support this data flow, rather than the data flow being imposed by the design of the software.


Fig. 1.  Data flow diagram. "APO" represents raw data originating at Apache Point Observatory. "SAS" represents the Science Archive Server, the primary data repository at LBNL. "SAM" represents the Science Archive Mirror, the mirror repository at NYU. "HPSS" is the High-Performance Storage System, a massive tape back-up system at NERSC. "Data Processing" represents the remote data processing facilities at Florida, Princeton, Utah, and Virginia. All transfers are initiated by the SAS, except for transfers to the data processing facilities, which are initiated by the local data teams.

2.4. Current Development

The initial Perl version of the data transfer pipeline was successful, efficient, and highly automated. Human intervention was only required in infrequent cases of network outages, disk failures, and similar problems. However, after a few years of experience, a few general factors motivated a rewrite of the transfer pipeline:

  • 1.  
    In rare cases of failures, it was not always easy to tell that a failure had occurred.
  • 2.  
    Logging was done with ad hoc methods, making debugging failures more difficult than it had to be.
  • 3.  
    The deactivation of the SDSS imaging camera provided an opportunity for both a simplification and a generalization of the raw data transfer pipeline.

Perl was not being widely used by the SDSS collaboration, and the object-oriented and documentation features of Perl were not being employed, due to lack of familiarity. In effect, the Perl version simply replaced the bash shell with Perl. In the meantime, use of Python in the SDSS collaboration and in the astronomy community was increasing (see, e.g., Greenfield 2011). Thus, long-term maintainability was a key factor in choosing Python.

Development of the Python version, which would become the transfer package, followed the principle that the collaboration should not even notice the transition. Because all data transfer operations were initiated by a single server at LBNL, this was not a difficult requirement. In addition, the extensive Python built-in library would be exploited to the fullest extent possible, rather than relying on subshell calls to external programs. The documentation of the pipeline needed improvement, and Python's extensive support for self-documenting code was a desirable feature. The earliest development began in 2011 February, and full deployment took place in 2013 June. Approximately 20% of a full-time equivalent (FTE) effort was involved in the development.

3. Transfer Architecture

3.1. Overview of the Package

The transfer code is a set of Python modules organized into a Python package.6 The code requires Python 2.7, but is written with Python 3 in mind (see also § 5.2). The individual module files are organized by function. For example, code related to transferring APOGEE data is collected in the transfer.apogee module.

3.2. Core Functionality

Code that is used by all parts of the package is collected in the transfer.common module. Log generation is provided by the built-in logging library,7 which is configured by transfer.common.get_logger. transfer.common.system_call provides a wrapper around subprocess.Popen that interfaces with the logging infrastructure and that can terminate processes that run for an excessively long time.
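
As a rough illustration of this pattern (not the package's actual implementation), a wrapper around subprocess.Popen with logging and a time limit might look like the following sketch; the function names match the text, but the signatures and defaults are assumptions.

import logging
import subprocess
import time

def get_logger(name='transfer', filename='transfer.log'):
    """Configure and return a logger (simplified sketch)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    handler = logging.FileHandler(filename)
    handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(message)s'))
    logger.addHandler(handler)
    return logger

def system_call(command, logger, timeout=3600, poll=5):
    """Run an external command, log its output, and terminate it if it
    runs for more than timeout seconds."""
    logger.debug('Running: %s', ' '.join(command))
    proc = subprocess.Popen(command, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    elapsed = 0
    while proc.poll() is None:
        if elapsed > timeout:
            logger.error('Command exceeded %d s; terminating.', timeout)
            proc.terminate()
            break
        time.sleep(poll)
        elapsed += poll
    out, err = proc.communicate()
    if out:
        logger.debug(out.strip())
    if err:
        logger.warning(err.strip())
    return proc.returncode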

3.3. "Executable" Packages

Each survey has a module that defines a main() function. These functions are the entry points for executable scripts that are automatically constructed when the package is installed. In addition to the four main SDSS-III surveys described above, there is a module to support transfer of BOSS data to the Max Planck Institut für Astronomie (MPIA) in Heidelberg, Germany.
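
For illustration only, entry points of this kind are commonly declared through setuptools; apart from transfer.apogee, the script and module names below are placeholders rather than a list of the package's actual scripts.

# setup.py (sketch): each survey module exposes a main() function that
# setuptools turns into a command-line script at install time.
from setuptools import setup, find_packages

setup(
    name='transfer',
    packages=find_packages(),
    entry_points={
        'console_scripts': [
            # script name = module:function (names illustrative)
            'transfer_apogee = transfer.apogee:main',
            'transfer_boss = transfer.boss:main',
        ],
    },
)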

3.4. Additional Functionality

In addition to the main raw- and survey-data modules, transfer provides additional utilities that we list here briefly.

  • 1.  
    The transfer.alert module provides a simple warning system that runs on the Mirror server. This warning system runs daily as a cron job and is activated when it appears that raw data were not successfully transferred to the Mirror.
  • 2.  
    The transfer.bulk module provides a flexible system for transfer of large amounts of data.
  • 3.  
    The transfer.module module provides a script that can install Module files (Furlani 1991; Furlani & Osel 1996, http://modules.sourceforge.net) appropriate to the transfer package.
  • 4.  
    The transfer.version module provides convenient access methods for the transfer package's version number.
  • 5.  
    The subpackage transfer.sphinx provides some helper modules that make the construction and formatting of Sphinx documentation easier.8

3.5. Configuration

The transfer package follows a model where configuration data are strongly separated from code. This minimizes the need for changes to the code as minor details about the data change. Also, this allows us to distribute the code without revealing potentially sensitive network details. The configuration files are described in detail with the documentation that is distributed with the code. The configuration files are INI-like text files understood by the Python built-in package ConfigParser.9

As an example, the configuration file for transfer.raw looks like this:

[DEFAULT]
; Connect as this user.
user = sdssuser
; Base DNS domain.
domain = apache-point-observatory.org
; Default server at APO.
machine = sdssserver
; Use parallel streams.
multiple = False
; Use rsync compression.
compress = False
; Method to verify after transfer.
verify = SKIP
; Default disk at LBNL.
sas_copy = sdssraid1
; Default disk at NYU.
sam_copy = nyuraid1

[general]
; Environment variable that points to the staging area.
staging = STAGING_DATA
; Default number of parallel streams.
streams = 6
; Alter permissions during transfer.
permission = False
report_url = http://users.apo.nmsu.edu/obs-reports-25m/reports/
; Default server at NYU.
sam_machine = sdss-mirror-server.org
; Disk containing staging area at NYU.
sam_staging = nyuraid2

[spectro]
; Path to data at APO.
path = /data/spectro
multiple = True
verify = sha1sum --check
sas_copy = sdssraid2
; Copy data to the directory contained in this env variable.
env_copy = BOSS_SPECTRO_DATA

[mapper]
path = /export/home/mapper/scan
machine = platemapperserver
compress = True
env_copy = MAPPER_DATA

The details in this configuration file have been altered to protect sensitive network information. A full description of the configuration files is available at http://sdss.physics.nyu.edu/transfer/doc/configure.html.
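
As a sketch of how such a file is consumed, the Python 2.7 ConfigParser falls back to the [DEFAULT] section whenever a survey section does not override an option; the file name below is assumed for illustration.

from ConfigParser import SafeConfigParser  # configparser in Python 3

config = SafeConfigParser()
config.read('raw.ini')  # file name assumed for illustration

# Options not set in a section fall back to [DEFAULT].
print(config.get('spectro', 'path'))             # /data/spectro
print(config.get('spectro', 'machine'))          # sdssserver (from [DEFAULT])
print(config.getboolean('spectro', 'multiple'))  # True (overrides the default)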

3.6. External Dependencies

The transfer package has relatively few external dependencies outside of the Python standard library. NERSC provides two programs that interact with HPSS: hsi and htar. hsi provides general shell-like access to the HPSS file system, while htar is used to create a special type of TAR archive that resides on the HPSS file system. The latter is especially useful to SDSS. By aggregating many small files into a single large file, htar allows much more efficient use of the tape file system.

The transfer package also uses the bbcp program. This is used for low-level bulk data transfer over parallel channels to maximize data throughput. bbcp is a stand-alone program that can be activated over ssh connections, so it is relatively easy to install on systems that have no existing bulk data transfer infrastructure.

4. Data Description

4.1. Raw Data

4.1.1. Raw Data Description

A number of different data products are produced nightly at APO. Most of the data are related to one of the SDSS surveys, but there are additional engineering and metadata that some or all of the surveys need to process their data, or that simply need to be retained for archival purposes. Most of the data are in the form of FITS files. The average, peak and total raw data transfer over the course of SDSS-III are summarized in Table 1.

In SDSS-III the SDSS/BOSS spectrograph (Smee et al. 2013) was used for both the SEGUE-2 and BOSS surveys. Although the spectrograph was significantly upgraded for BOSS, the raw data format did not change substantially.

Currently, the APOGEE spectrograph (Wilson et al. 2010) produces the largest amount of data per night on average. APOGEE uses the New Mexico State University 1-m Telescope (Holtzman et al. 2010) to take additional spectra with the same instrument. The additional 1-m data are stored and transferred separately from the primary data from the APOGEE instrument, but there is basically no difference in the data format.

In addition to raw data, both the BOSS and APOGEE surveys perform quick reductions at APO for quality assurance purposes. The outputs of these quick reductions, known as "sos" for BOSS and "quickred" for APOGEE, are also transferred. Although it has now been deactivated, the MARVELS instrument also produced a combination of raw and quality assurance data.

The hand-plugged SDSS spectroscopic plates require a mechanism to convert the fiber number into position on the sky (i.e., the object from which the fiber is receiving photons). At APO, the "mapper" device performs this function. The data from this device are included in the daily transfer.

There are a few ancillary imaging systems that are included in the daily transfer. The "ircam" is an all-sky infrared camera that is used to detect the presence of clouds. It is completely separate from the Sloan telescope. The guide camera, or "gcam," is the essential part of the telescope guiding mechanism during spectroscopic observations. Finally, the engineering camera, or "ecam," a CCD (identical to the guide camera) mounted in a dedicated cartridge on a moveable stage, is used in telescope collimation, the creation of pointing models, and other telescope engineering tasks.

Historically, we also transferred data from the imaging camera (Gunn et al. 1998), the monitor telescope (Tucker et al. 2006), and some engineering log files. None of these have been actively transferred since the Python transfer package has been used.

Observing systems at APO are designed to handle network outages of order one month, though this has never actually happened. The disk space available at APO is more than sufficient to facilitate observing during such an outage. That is, it is not necessary for the data transfer system to operate every day in order for there to be enough disk space for ongoing operations.

4.1.2. Raw Data Transfers

Raw data transfers are handled by the transfer.raw module and in particular the transfer_raw class. The transfer_raw class provides a number of methods that handle the several phases of the raw data transfer. transfer_raw is a subclass of transfer_common in the transfer.common module. The super-class mostly deals with configuring logging.
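
The following skeleton, with a stand-in base class and illustrative method names, is only meant to convey this arrangement; the real classes contain considerably more logic.

import logging

class transfer_common(object):
    """Stand-in for the real base class, which mostly configures logging."""
    def __init__(self, section):
        self.section = section
        self.logger = logging.getLogger('transfer.' + section)

class transfer_raw(transfer_common):
    """Coordinate the nightly raw data transfer for one SJD (sketch)."""
    def __init__(self, section, sjd):
        super(transfer_raw, self).__init__(section)
        self.sjd = sjd

    def download(self):
        """Pull the night's files from APO in parallel rsync streams."""
        pass

    def verify(self):
        """Check the instrument-provided checksum files."""
        pass

    def backup(self):
        """Write an htar archive of this SJD to HPSS."""
        pass

    def archive(self):
        """Copy verified data to the permanent archive areas and the Mirror."""
        pass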

The end of a night of observing is formally defined by the start of a new SDSS Julian Day (SJD), and data transfer operations start each day approximately 30 minutes after this turnover. SJD is similar to the standard astronomical Modified Julian Day (MJD). Normally, MJD is defined in terms of the Julian Day (JD):

MJD = JD - 2400000.5.

However, this definition would have a new MJD start inconveniently near the beginning of a night of observing at APO. Therefore, at APO,

SJD = MJD + 0.3.

Furthermore, the offset of 0.3 is only exact when MJD is determined using International Atomic Time (TAI), instead of Coordinated Universal Time (UTC), a 35 s offset since 2012 June 30. In internal SDSS code and documents, MJD and SJD are used interchangeably, though hereinafter we will use SJD, since that is how the raw data are subdivided.
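
For illustration, the integer SJD labeling the current moment could be computed as follows; this sketch assumes the 0.3 day offset above and the 35 s TAI-UTC offset quoted in the text, and is not the package's actual implementation.

import math
import time

TAI_MINUS_UTC = 35.0  # seconds, valid from 2012 June 30 until the next leap second

def current_sjd():
    """Return the integer SJD for the current moment."""
    # Unix time 0 corresponds to MJD 40587.0 (1970-01-01T00:00:00 UTC).
    mjd_utc = time.time() / 86400.0 + 40587.0
    mjd_tai = mjd_utc + TAI_MINUS_UTC / 86400.0
    return int(math.floor(mjd_tai + 0.3))

print(current_sjd())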

The daily raw data transfer begins with a download phase. Each data type described above has a directory on a particular disk server at APO, and each directory is subdivided into SJD directories for each night. For each data type, the primary SDSS data server at LBNL establishes a connection to APO and obtains the list of files in the SJD directory, if any. The list of files is divided into six separate lists, and each list is assigned to an rsync connection. Six parallel streams, a number originally chosen to match the number of camcols in the imaging camera, have proved efficient for other data types as well. In Figure 2, we show the bandwidth use during a typical day's data transfer. On the LBNL end, data are initially stored in a staging area separate from other SDSS data.
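
A simplified sketch of this download phase is shown below; the host, user, and path names are placeholders, and the real pipeline adds logging, error handling, and per-data-type configuration.

import subprocess

def parallel_rsync(files, sjd, streams=6,
                   remote='sdssuser@apo.example.org:/data/spectro',
                   staging='/staging/spectro'):
    """Split the night's file list into chunks and run one rsync per chunk."""
    chunks = [files[i::streams] for i in range(streams)]
    procs = []
    for chunk in chunks:
        if not chunk:
            continue
        cmd = ['rsync', '--archive', '--files-from=-',
               '{0}/{1}/'.format(remote, sjd),
               '{0}/{1}/'.format(staging, sjd)]
        p = subprocess.Popen(cmd, stdin=subprocess.PIPE)
        p.stdin.write('\n'.join(chunk) + '\n')  # file names, one per line
        p.stdin.close()
        procs.append(p)
    return [p.wait() for p in procs]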


Fig. 2.  APO bandwidth use measured over a 1-day period. The horizontal axis is UTC time. This measurement was retrieved 2014-07-13T17:20:30; the vertical line at zero denotes 2014-07-13T00:00:00. Lighter solid gray is network traffic into APO, while outlined black is network traffic out of APO. The data transfer commencing at 2014-07-12T17:30:00 corresponds to SJD 56850. In total, 35 GB were transferred, of which 30 GB were raw APOGEE data. This was an approximately average night of observing, with half the night lost to cloudy weather.

Once download has completed without any detected data transport errors (such errors are rare), the data are verified by examining checksum files provided by the various instrument operations software packages. The checksums are typically MD5 or SHA1, as provided by the standard utilities md5sum and sha1sum, respectively. The choice of checksum is left up to the instrument operations teams. Checksum mismatch errors at this stage are very rare; so rare that it is difficult to estimate the rate, though one checksum mismatch error per year would be a conservative upper limit. That is not to say that there are no errors at this stage, just that file corruption is not the typical cause of them. It is much more common to discover missing or extraneous files at this stage; for example, where the checksum file lists a file that was removed from APO prior to download. Errors of this type occur at a rate of approximately one per month.
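
As an illustration of this kind of check (the pipeline itself relies on the checksum files written by the instrument software and the md5sum/sha1sum utilities), an equivalent verification of an md5sum-style file in pure Python might read as follows.

import hashlib
import os

def verify_checksums(checksum_file, directory):
    """Return lists of missing and mismatched files for one SJD directory.
    The checksum file is assumed to contain '<hexdigest>  <filename>' lines."""
    missing, mismatched = [], []
    with open(checksum_file) as f:
        for line in f:
            expected, name = line.split()
            path = os.path.join(directory, name)
            if not os.path.exists(path):
                missing.append(name)
                continue
            md5 = hashlib.md5()
            with open(path, 'rb') as data:
                for block in iter(lambda: data.read(1 << 20), b''):
                    md5.update(block)
            if md5.hexdigest() != expected:
                mismatched.append(name)
    return missing, mismatched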

After the verification stage, raw data are copied to the HPSS facility at NERSC. Data for each data type are consolidated into a single htar file for each SJD. This stage of the transfer is aware that NERSC performs periodic maintenance on HPSS, so it has the capability to wait until the maintenance is complete and HPSS is available again before continuing the backup.
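
A minimal sketch of this backup step follows; the paths are placeholders, the htar invocation is simplified, and the handling of the HPSS maintenance window is omitted.

import subprocess

def backup_to_hpss(data_type, sjd, staging='/staging', hpss_root='sdss/raw'):
    """Bundle one SJD of one data type into a single htar archive on HPSS."""
    archive = '{0}/{1}/{2}.tar'.format(hpss_root, data_type, sjd)
    local = '{0}/{1}/{2}'.format(staging, data_type, sjd)
    # htar -c creates the archive directly on the HPSS file system.
    return subprocess.call(['htar', '-cvf', archive, local])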

After backup, the raw data are ready to be copied to permanent archive areas on spinning disk. The data are unaltered in this process: data remain in SJD directories, and file system timestamps are preserved. At this time, data more than 30 days old are removed from the staging area.

The final stage of the raw data transfer is the copy to the Mirror facility. A dummy rsync command is used to identify files in the LBNL staging area that are new or have changed since the last copy to the Mirror. These files are aggregated into a single TAR archive, and this file is copied to the mirror with bbcp. After the copy is complete and the TAR file is expanded, another dummy rsync command is used to verify the transfer. If this is successful, data are copied to permanent archive areas on the mirror.
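
The dry-run/TAR/bbcp pattern can be sketched as follows, with placeholder hosts and paths; the expansion and re-verification performed on the Mirror side are omitted.

import subprocess
import tarfile

def copy_to_mirror(staging, mirror='sdssuser@mirror.example.org',
                   mirror_staging='/mirror/staging'):
    """Identify new or changed files, bundle them, and push them with bbcp."""
    # 1. Dry-run rsync: print the names of files that would be transferred.
    out = subprocess.check_output(['rsync', '--archive', '--dry-run',
                                   '--out-format=%n', staging + '/',
                                   '{0}:{1}/'.format(mirror, mirror_staging)])
    new_files = [name for name in out.splitlines() if not name.endswith('/')]
    if not new_files:
        return
    # 2. Aggregate the new files into a single TAR archive.
    with tarfile.open('mirror_update.tar', 'w') as tar:
        for name in new_files:
            tar.add('{0}/{1}'.format(staging, name), arcname=name)
    # 3. Bulk transfer with bbcp; expansion and verification follow on the Mirror.
    subprocess.check_call(['bbcp', 'mirror_update.tar',
                           '{0}:{1}/'.format(mirror, mirror_staging)])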

At the conclusion of the data transfer pipeline, new raw data exist in at least five places: The LBNL staging area, the permanent archive area for each survey at LBNL, the staging area at the Mirror facility, the permanent archive area for each survey at the Mirror facility, and HPSS.

The raw data transfer status is recorded in a special set of flat files that can be queried by a script provided by the transfer package, transfer_apo_status. This script can be used to alert data processing facilities that raw data are ready for transfer to the processing location. The data in these flat files are also collated into a webpage visible to the collaboration for easier reading. The checksum files provided at the time of raw data transfer constitute the list of files the data processing centers should expect to find.

At any point, serious errors will cause the pipeline to halt, and a warning message will be e-mailed to maintainers of the pipeline. Once the cause of the error is determined, the transfer of the SJD can be restarted. The pipeline is deliberately designed so that restarts produce no side effects.

In special situations, such as extended maintenance at LBNL, the data transfer pipeline can be run at the Mirror facility. Once the primary data facility is available again, data copied directly to the Mirror are copied back to the primary facility.

4.2. Reduced Data

The SDSS surveys process the raw data to produce images, spectra, and catalogs. This processing typically takes place at other institutions. Only the BOSS survey processes data on the same cluster system that contains the primary SDSS data server. Automation is required to transfer processed data back to the primary data server in preparation for public release. And even in the case of the BOSS survey, automation is required to copy processed data to the Mirror facility.

Processed data transfers take place on different schedules for different surveys. The type of schedule each survey uses may be divided into two categories, "daily" and "burst," which are defined below. APOGEE and BOSS transfers operate on a daily schedule; MARVELS and SEGUE-2 transfers operate on a burst schedule.

A daily transfer begins with a dummy rsync command that is used to identify files that are new or have changed since the previous daily transfer. This list of files is used to create a TAR archive, and the TAR file is transferred to the primary facility (if necessary) and the Mirror. During periods of intensive processing, e.g., when an entire data set is being reprocessed, temporary files may be removed between the rsync and TAR creation. This condition is detected and the transfer is postponed.

A burst transfer is initiated when a survey indicates reduced data are ready for transfer by adding a reduction code name (internally called a "rerun") to a particular file on the remote server containing the data. When a new rerun is detected, the entire reduced data set is downloaded at once. In the case of SEGUE-2, this is done similarly to daily transfers: a TAR archive is transferred using bbcp. The MARVELS survey provides a dedicated rsyncd server to handle the transfer. The volume of MARVELS reduced data is small enough that special parallel transfer handling is not required.

In both cases, the transfer scripts can detect when an instance of the same script is still running; i.e., the transfer has taken more than one day. When this condition is detected, the script immediately exits to avoid interfering with the script already running.
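
One common way to implement such a guard is a PID file, as in the sketch below; whether the transfer package uses exactly this mechanism is an internal detail.

import os
import sys

def already_running(pidfile='/var/run/transfer_daily.pid'):
    """Return True if a previous instance recorded in the PID file is still alive."""
    if os.path.exists(pidfile):
        with open(pidfile) as f:
            pid = int(f.read().strip())
        try:
            os.kill(pid, 0)   # signal 0 only checks that the process exists
            return True
        except OSError:
            pass              # stale PID file; the previous run has exited
    with open(pidfile, 'w') as f:
        f.write(str(os.getpid()))
    return False

if already_running():
    sys.exit(0)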

5. Discussion

5.1. Other Data Transfer Systems

The SDSS data transfer system was developed independently of other data transfer pipelines. In particular, since the Python transfer package was based on two previous software suites, the bash and the Perl versions, the development was driven more by backward-compatibility and satisfaction of existing requirements (see § 2.2) than consideration of new features and systems.

There are quite possibly as many data transfer pipelines as there are data-intensive scientific projects, but here we wish to mention a few other systems that are in active use by the astronomy community. All these systems address requirements that differ from SDSS's own requirements in ways that may be of interest to anyone seeking to implement a data transfer pipeline.

The NOAO Data Transport System (DTS; Fitzpatrick 2010a, b) is used by the Dark Energy Survey (DES) to transfer data from Chile to various institutions in the United States. The data pathway is considerably longer and potentially less reliable than any pathway within the United States. Also, the data are transferred effectively continuously, i.e., as soon as any processing by the instrument is complete, instead of only at particular times of day, though this is a matter of configuration, not architecture. DTS also supports multiple destinations, similar to how the SDSS pipeline forwards raw data on to HPSS and the mirror facility. However, DTS would break the requirement of minimal software installed at the data source; that is, DTS software must be installed on both ends of a transfer. SDSS-IV will be transferring data from Chile in the future, and the transfer package is already set up to handle this (see § 5.2).

Zampieri et al. (2009) describes the data transfer system for the European Southern Observatory (ESO). This system was under development at the exact same time as the old Perl version of the SDSS data transfer pipeline. Notably, this system uses bbcp for bulk data transfer. However, similar to DTS, this system requires software on both ends of the transfer. Also, it is designed to serve an entire observatory, whereas the transfer package is focused on the Sloan telescope.

Another similar system is the Gemini Science Archive transfer system (Melnychuk et al. 2005). This system uses Perl and bbftp, which is similar to bbcp. Again, this system requires software at both ends of the transfer.

In addition to software requirements, all of these systems interact with a database at some level. We consider it a strength of the SDSS data transfer system that it does not require a database for routine operations. Instead, files are simply retrieved from and placed in locations that are controlled from the configuration files. These locations are known to the collaboration and (when released) can be found by the general public. Also, the requirements placed on the files transferred are much simpler than the cases mentioned above. All files, raw and reduced, belong entirely and only to the SDSS collaboration, and all are released simultaneously at well-defined, infrequent times, so there is no need to carefully manage the ownership and release status of individual files.

In conclusion, we feel that the transfer package is easy to install (including prerequisites), configure, and deploy.

5.2. Future Development

As of 2014 July, SDSS has commenced the fourth phase of its operations, SDSS-IV. The transfer package is in continued use for raw data from APO. SDSS-IV has adopted a more centralized data processing model, not unlike SDSS-I and -II. Data processing will take place at the University of Utah. SDSS-IV will continue to support a Mirror site, so transfer operations will be more tightly focused on raw data and on mirroring all data.

In addition, transfer will support raw data transfers from the Irénée du Pont 2.5-meter telescope (Bowen & Vaughan 1973) at Las Campanas Observatory (LCO). This telescope will extend the APOGEE-2 survey into the southern hemisphere. We have already successfully tested the transfer package with full-sized data sets. The tests transferred data from LCO to LBNL—from Chile to the United States; future transfers will be from LCO to Utah. No significant code changes will be required; only a different configuration file is needed for this operation.

The SDSS collaboration has been making increasing use of Globus Online (Foster 2011; Allen et al. 2012; https://www.globus.org), a general purpose bulk data transfer system that grew out of the GridFTP infrastructure. This system provides reliable bulk transfer between "endpoints," which are GridFTP servers that have advertised themselves to Globus Online. Authentication is provided by a system of certificates. Typically, these certificates expire after a certain time, though this can be overridden for highly-trusted systems. Globus Online provides an ssh-based command-line interface that could, in principle, provide an additional data transfer backend to the transfer package, supplementing or replacing rsync and bbcp. However, the transfer package, or something equivalent, would still be required for verification, internal copies, managing tape backups, etc.

Currently, transfer requires Python 2.7. However, it is written with Python 3 in mind. Every module uses the recommended from __future__ imports to ease this transition.10 The decision to move to Python 3 will be left up to the SDSS-IV data transfer team.
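
For reference, the imports recommended by that guide take the following form; which subset each transfer module actually includes is an internal detail.

from __future__ import absolute_import, division, print_function
# Some projects also add unicode_literals, depending on their string handling.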

5.3. Code Release

To accompany this article, we are releasing the transfer code. Version 1.2.2 is the latest version at the time of this writing. We are releasing it under a 3-clause BSD-style license. The code is available at http://sdss.physics.nyu.edu/transfer. Because data transfers necessarily involve some sensitive network information, the configuration files that drive the Python code are not included with the code release. These files can be recreated as needed using the documentation that comes with the code.

The transfer package has been registered with the Astrophysics Source Code Library (ASCL, Nemiroff & Wallin 1999, http://ascl.net), and has been assigned ascl:1501.011.

SDSS-III is managed by the Astrophysical Research Consortium for the Participating Institutions of the SDSS-III Collaboration including the University of Arizona, the Brazilian Participation Group, Brookhaven National Laboratory, Carnegie Mellon University, University of Florida, the French Participation Group, the German Participation Group, Harvard University, the Instituto de Astrofisica de Canarias, the Michigan State/Notre Dame/JINA Participation Group, Johns Hopkins University, Lawrence Berkeley National Laboratory, Max Planck Institute for Astrophysics, Max Planck Institute for Extraterrestrial Physics, New Mexico State University, New York University, Ohio State University, Pennsylvania State University, University of Portsmouth, Princeton University, the Spanish Participation Group, University of Tokyo, University of Utah, Vanderbilt University, University of Virginia, University of Washington, and Yale University.

Funding for SDSS-III has been provided by the Alfred P. Sloan Foundation, the Participating Institutions, the National Science Foundation, and the U.S. Department of Energy Office of Science. The SDSS-III web site is http://www.sdss3.org/.

Footnotes

  • 4. Prior to DR12, APOGEE data were processed at the University of Virginia.

  • 5. Please see http://www.nersc.gov/systems/hpss-data-archive/.

  • 6. Please see https://docs.python.org/2/tutorial/modules.html.

  • 7. Please see https://docs.python.org/2/library/logging.html.

  • 8. Please see http://sphinx-doc.org.

  • 9. Please see https://docs.python.org/2/library/configparser.html.

  • 10. Please see https://docs.python.org/3/howto/pyporting.html#prevent-compatibility-regressions.
