Designing Scalable Cyberinfrastructure for Metadata Extraction in Billion-Record Archives: Paper - iPRES 2016 - Swiss National Library, Bern

Gregory Jansen; Richard Marciano; Smruti Padhy


Title

Language

English

Description (en)

We present a model and testbed for a curation and preservation infrastructure, "Brown Dog", that applies to heterogeneous and legacy data formats. "Brown Dog" is funded through a National Science Foundation DIBBs grant (Data Infrastructure Building Blocks) and is a partnership between the National Center for Supercomputing Applications at the University of Illinois and the College of Information Studies at the University of Maryland at College Park. In this paper we design and validate a "computational archives" model that uses the Brown Dog data services framework to orchestrate data enrichment activities at petabyte scale on a 100 million archival record collection. We show how this data services framework can provide customizable workflows through a single point of software integration. We also show how Brown Dog makes it straightforward for organizations to contribute new and legacy data extraction tools that will become part of their archival workflows, and those of the larger community of Brown Dog users. We illustrate one such data extraction tool, a file characterization utility called Siegfried, from development as an extractor, through to its use on archival data.

Author of the digital object

Gregory Jansen

Smruti Padhy

Richard Marciano

Publisher

Swiss National Library, Bern

Format

application/pdf

Size

1.2 MB

Licence Selected

CC BY-NC-SA 3.0 AT