a centre of expertise in data curation and preservation

Curation of Scientific Data:
                 Challenges for Repositories

                  Chris Rusbridge
           JISC Repositories Conference
              5 June 2007, Manchester
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5
UK: Scotland License, excluding content property of others. To view a copy of this license, (a) visit
http://creativecommons.org/licenses/by-nc-sa/2.5/scotland/ ; or (b) send a letter to Creative
Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.

     •   Audience?
     •   Science and digital curation
     •   Why are data important?
     •   What kinds of data?
     •   What to do with data?
     •   Repository options
     •   Changing practice

JISC Repositories 2007

     • I assume you are either…
         • A Repository Manager concerned about adding
           data to your collections of ePrints (most likely), or
         • A research data manager or other researcher,
           concerned about finding an appropriate repository
           to curate your data (possibly), or
         • Neither of the above, in the wrong room, just come
           in to get out of the sun…


  Digital Curation Centre Mission
        “The over-riding purpose of the DCC is to
        support and promote continuing improvement
        in the quality of data curation, and of
        associated digital preservation”


       “The Records of Science”
     • Data increasingly important as evidence
         • Key part of the scholarly record (public good)
             • Unrepeatable observations & experiments
             • Value for public money (eg OECD)
     • Experimental verifiability (the basis of science)
         • Would Chang retractions have been reduced if his first data
           were available?
                 CHANG, G., ROTH, C. B., REYES, C. L., PORNILLOS, O., CHEN, Y.-J. & CHEN, A. P. (2006)
                 Retraction of Pornillos et al., Science 310 (5756) 1950-1953. Retraction of Reyes and Chang,
                 Science 308 (5724) 1028-1031. Retraction of Chang and Roth, Science 293 (5536) 1793-1800.
                 Science Magazine, 314. http://www.sciencemag.org/cgi/content/full/314/5807/1875b

     • Allows additional interpretations
     • Legal and compliance (eg emerging RC mandates)


                OECD declaration
     • “…Work towards the establishment of access regimes
       for digital research data from public funding in
       accordance with the following objectives and principles:
         •   Openness
         •   Transparency
         •   Legal conformity
         •   Formal responsibility
         •   Professionalism
         •   Protection of intellectual property
         •   Interoperability
         •   Quality and security
         •   Efficiency
         •   Accountability”

Retaining research data means…
     • Data secure against loss (within group)
     • Communal repository (secure data store)
     • Re-usable, sharable information
     • As above, plus active curation (eg bio-
     • Long term preservation of information

     • Be clear what you are trying to do!


     … or the data trajectory is…
     • Hard drive → lost (crash)
     • Hard drive →DVD →Cardboard box →Loft
       →Skip/dumpster → lost

     • Sometimes this is a very bad thing
     • Sometimes these are the right options!

© Marita Bushell

         Long term bit storage…
     • A solved problem? Just requires well-
       understood good data management
     • Wrong! For very large datasets over very long
       time, there are significant problems…

                   BAKER, M., SHAH, M., ROSENTHAL, D. S. H., ROUSSOPOULOS, M., MANIATIS, P., GIULI, T.
                   J. & BUNGALE, P. (2006) A Fresh Look at the Reliability of Long-term Digital Storage. EuroSys
                   '06. Leuven, Belgium, ACM.


   How Well Must We Preserve?
   Keep a petabyte for a century
   – With 50% chance of remaining completely undamaged

   Consider each bit decaying independently
   – Analogy with radioactive decay

   That's a bit half-life of 10^18 years
   – One hundred million times the age of the universe

   That's a very demanding requirement
   – Hard to measure
   – Even very unlikely faults will matter a lot
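
The arithmetic behind the half-life figure can be checked with a short sketch. It assumes a petabyte of 10^15 bytes (8 × 10^15 bits), independent exponential decay of each bit, and an age of the universe of about 1.4 × 10^10 years; none of these constants is stated on the slide, and the function name is only illustrative.

```python
import math

BITS_PER_PETABYTE = 8 * 10**15   # assumption: 1 PB = 10**15 bytes
AGE_OF_UNIVERSE_YEARS = 1.4e10   # assumption: approximate value


def required_bit_half_life(n_bits, years, survival_probability):
    """Half-life each bit needs so that all n_bits survive `years`
    with the given probability, assuming independent exponential decay.

    P(all bits survive) = 2 ** (-n_bits * years / half_life)
    """
    return n_bits * years * math.log(2) / -math.log(survival_probability)


half_life = required_bit_half_life(BITS_PER_PETABYTE, 100, 0.5)
print(f"Required bit half-life: {half_life:.1e} years")
print(f"Ratio to age of universe: {half_life / AGE_OF_UNIVERSE_YEARS:.1e}")
```

Under these assumptions the required half-life is about 8 × 10^17 years, which is of the same order as the rounded 10^18 on the slide, and tens of millions of times the assumed age of the universe.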

Slide from David Rosenthal, LOCKSS

       What to do about curation
     • Build curation/reusability into science workflow
         • Curation begins before creation
         • What’s easy at first becomes (impossibly) hard later
         • Describe data (metadata schemas, “representation info”,
         • Keep experimental parameters (technical, who, what, when,
         • Keep ability to process
         • Keep data!


    What to do about curation - 2
     • Use standard/agreed formats for data
     • Make ownership & restrictions clear, &
       explain how to cite data
     • Offer for deposit in institutional or discipline
         • Appraisal and selection essential
         • Possible time-limited embargos
     • “Publish” data in support of articles


 Internet Archaeology: publication with


            Database as book…
     • Buneman (early pilot)
       work on IUPHAR
     • MySQL to XML
         • Historic to logical
     • XML via XSLT to LaTeX
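
The first step of such a pipeline (relational table to XML) might look roughly like the sketch below. It is not the IUPHAR pilot's code: it uses Python's built-in sqlite3 as a stand-in for MySQL, and the table name `receptor`, its columns, and the helper `table_to_xml` are hypothetical. The XSLT-to-LaTeX step is not shown, since it needs a separate XSLT processor.

```python
import sqlite3
import xml.etree.ElementTree as ET


def table_to_xml(connection, table):
    """Dump every row of `table` as <row> elements under a root named after the table."""
    cursor = connection.execute(f'SELECT * FROM "{table}"')
    columns = [description[0] for description in cursor.description]
    root = ET.Element(table)
    for row in cursor:
        record = ET.SubElement(root, "row")
        for name, value in zip(columns, row):
            # One child element per column; NULL becomes an empty element.
            ET.SubElement(record, name).text = "" if value is None else str(value)
    return root


# Hypothetical two-row table standing in for a curated database.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE receptor (id INTEGER, name TEXT)")
connection.executemany(
    "INSERT INTO receptor VALUES (?, ?)", [(1, "alpha"), (2, "beta")]
)

xml_text = ET.tostring(table_to_xml(connection, "receptor"), encoding="unicode")
print(xml_text)
```

The resulting XML could then be passed to an XSLT stylesheet that emits LaTeX, which is the second step the slide mentions.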


                  The StORe vision
     • Seamless transport from research data to research
       publications and vice versa
     • Bi-directional links proven in social science
       e-research but capable of export to other
     [Diagram labels: Source, Middleware]
Slide from Graham Pryor

  StORe survey: linkage value?
   The value of direct links from source to output data

                    University  University  PG          Contract    Independent
                    academic    research    student     researcher  researcher   Other       Totals
                    staff       assistant
   advantage        85          18          33          11          2            26          175
   Useful           78          9           41          5           4            9           146
   Interesting      24          4           5           3           0            5           41
   Of no interest   9           0           0           0           0            1           10
   Not sure         7           0           7           0           1            2           17
   Other            1           1           0           0           0            1           3
   Totals           204         32          86          19          7            44          392

   But: “researchers’ attitudes to enabling access depend to a large
   extent on whether they are behaving as producers or users of data”
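
As a quick check on the Totals column, the share of respondents who rated direct links positively can be worked out as follows. This is only a sketch: the grouping of the first two rows as "positive" is an assumption made here, not something stated in the survey, and the first row label is reproduced as it appears above.

```python
# Totals column from the table above (row labels as they appear there).
responses = {
    "advantage": 175,
    "Useful": 146,
    "Interesting": 41,
    "Of no interest": 10,
    "Not sure": 17,
    "Other": 3,
}

total = sum(responses.values())
# Assumption: count the first two rows as a positive view of linkage.
positive = responses["advantage"] + responses["Useful"]
share = positive / total

print(f"{positive} of {total} respondents = {share:.1%}")
```

On these figures, 321 of the 392 respondents (about 82%) fall into the first two rows.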

Slide from StORe project

        What to do about data (3)
     • Institutional repository managers
         • Make contact with emerging institutional data services
         • Start raising awareness of the need to curate rather than just
           dump data
         • Start thinking about the relationship of data to publications
           (especially e-theses)
         • Start thinking about the metadata needed to find and re-use
         • Make contact with key researchers
         • Start thinking about their data…


             What kinds of data?
     • Observations
         • eg UARS (Upper Atmosphere) Level 0: telemetry
         • UARS Level 1: measured physical parameters (post
     • Derived data
         • UARS Level 2: calculated geophysical? profiles
         • UARS Level 3: gridded, interpolated?
     • Combined data
     • Crafted data
         • Eg annotated gene/protein databases
     • Descriptive (meta)data


StORe: Source data formats
     CAD/GIS:                                       39
     Extensible markup language (XML):              35
     Database files (e.g. Access, MySQL):          117
     Flat files (e.g. FITS):                        66
     Hypertext markup language (HTML):              60
     Image files (e.g. .jpg, .tif, .bmp, .gif):    228
     Plain text (.txt):                            179
     Portable document format (.pdf):              156
     Rich text files (.rtf):                        53
     Spreadsheets (e.g. Excel/.xls):               220
     Statistical software:                          75
     Tables/catalogues:                            102
     Word processed files (e.g. Word/.doc):        220
     Other (please specify):                        76

Slide from StORe project

StORe: the other data formats?
     They said the 76 other formats included:
       +latex+.cc source code, .cif (crystallographic data),
       .pdb, .mtz, .pool, .root, .raw, .swf, .fla, .raw, .mpg,
       binary files, chemdraw cdx, xwin nmr files, .ps files,
       .fla, .swf, masslynx files, derived data in PAw-format
       ntuples, raw mass spectrometry data, X-ray
       diffraction data, kaleidagraphs, Atlas/ti hermeneutic
       unit files, C++/shell scripts, Fourier induction decay
       files, etc., etc., etc., etc………..

Slide from StORe project

StORe: the other data formats - more
  They also said such things as:
    “It is stored in a database, but nothing so simple as an
    Access file! It's one of the largest databases in the world!
    The format is Kanga/Root and previously was
    Objectivity. I think it's of the order of Picobytes in size.”
    “God preserve us from idiots who archive data in
    proprietary commercial formats (Excel spreadsheets and
    MS-word documents)!”

Slide from StORe project

  What are the reusability issues?
     • Data not neutral; highly contextual!
     • Hard to know the risks & pitfalls of a particular
     • Data not self-describing: hard to find
       appropriate data (but see Murray-Rust on
       Googling InChI etc)
     • Hard to “understand” data once found
         • Really need information, not data!
     • Hard to use data once understood


     • Data meaningless without context
         • Metadata of many kinds
         • Representation information… from data to
         • Linkage and connection between datasets
     • Provenance
         • Authenticity/integrity
         • Computational lineage
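
One common way to support the integrity point above is a fixity check: record a checksum when a dataset is deposited and recompute it later. The sketch below is illustrative only; the choice of SHA-256, the helper names `sha256_of_file` and `is_unchanged`, and the sample file are assumptions, not something the slides prescribe.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_checksum):
    """True if the file still matches the checksum recorded at deposit."""
    return sha256_of_file(path) == recorded_checksum


# Demonstration on a throwaway file (hypothetical dataset contents).
with tempfile.TemporaryDirectory() as workdir:
    dataset = os.path.join(workdir, "observations.csv")
    with open(dataset, "wb") as handle:
        handle.write(b"time,value\n0,1.5\n")

    recorded = sha256_of_file(dataset)          # stored at deposit time
    intact = is_unchanged(dataset, recorded)    # later audit, file untouched

    with open(dataset, "ab") as handle:         # simulate later damage
        handle.write(b"1,9.9\n")
    damage_detected = not is_unchanged(dataset, recorded)

print(intact, damage_detected)
```

A checksum only shows that the bits have not changed since it was recorded; it says nothing about who created the data or how, which is why the slide lists lineage separately.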


              Access and re-use
     • Ethics and rights control access
         • Weak in expressing this long-term
     • Collaboration tools
         • Annotation, discussion, review (see DART…)
         • Re-use leading to change and development
     • “Publication”
         • Not just in “print”
         • Underlying data should be “published”, too


           Data citation issues…
     • Citation for human readers and machine use cases
     • Granularity: database, record, item
     • Citation of changing objects
         • Version change (eg W3C practice: no version = latest, vs bibliographic:
           no version = first)
         • An efficient way to reference and access “archived” past states of
           more rapidly changing dataset, eg Genomics… datasets that result
           from the combined work of curators, or contain opinions or facts likely
           to change (work in progress, Buneman et al)
     • Standards are conflicting and immature (NLM best?)

     • Citation ESSENTIAL for motivating quality academic work on data
       management and curation
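
The version-change point above (an unversioned citation meaning "latest" under W3C practice but "first" under bibliographic practice) can be made concrete with a small sketch. The function name `resolve_citation_version` and the integer version numbers are hypothetical, chosen only to illustrate the two conventions.

```python
def resolve_citation_version(available, cited=None, convention="latest"):
    """Pick the dataset version a citation refers to.

    If the citation names a version, use it. Otherwise fall back on the
    convention: "latest" (W3C-style) or "first" (bibliographic-style).
    """
    if not available:
        raise ValueError("no versions available")
    ordered = sorted(available)
    if cited is not None:
        if cited not in ordered:
            raise KeyError(f"version {cited!r} is not archived")
        return cited
    if convention == "latest":
        return ordered[-1]
    if convention == "first":
        return ordered[0]
    raise ValueError(f"unknown convention: {convention!r}")


versions = [1, 2, 3]  # hypothetical archived states of one dataset
w3c_style = resolve_citation_version(versions, convention="latest")
bibliographic_style = resolve_citation_version(versions, convention="first")
explicit = resolve_citation_version(versions, cited=2)
print(w3c_style, bibliographic_style, explicit)
```

The same unversioned citation resolves to different states of the dataset under the two conventions, which is why a citation to a changing object needs to state its version explicitly.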


             Repository challenges
     • Data are different: you’ll need access to some domain
     • Appraisal/selection harder
     • Broader range of formats
         • Appropriate “standards” for longevity? XML-based?
     • What metadata are needed?
         •   Descriptive, to find the dataset
         •   Context and background
         •   Provenance
         •   “Representation information” to connect data to information
             (whatever gives meaning to data for the “designated
             community”)
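
The four kinds of metadata listed above could be held in a simple record that also reports which kinds are still missing for a deposit. This is a minimal sketch: the class name `DatasetMetadata`, its field names, and the example values are all hypothetical and do not correspond to any particular repository platform or schema.

```python
from dataclasses import dataclass, field, fields


@dataclass
class DatasetMetadata:
    """One slot per kind of metadata named on the slide (names are illustrative)."""
    descriptive: dict = field(default_factory=dict)          # to find the dataset
    context: dict = field(default_factory=dict)              # context and background
    provenance: dict = field(default_factory=dict)           # origin and history
    representation_info: dict = field(default_factory=dict)  # how to interpret the data

    def missing_kinds(self):
        """Names of the metadata kinds that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Hypothetical deposit with only two kinds of metadata supplied.
record = DatasetMetadata(
    descriptive={"title": "Example survey data"},
    provenance={"creator": "Example Lab"},
)
missing = record.missing_kinds()
print(missing)
```

A check like this could be run at deposit time to prompt the depositor for context and representation information while they are still easy to supply.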


        Repository challenges - 2
     • May distort your repository
         •   Size
         •   Number of objects
         •   Rate of deposit
         •   Nature of use
     • Databases may be dynamic
     • Databases may need to be accessed in situ
     • Rights and ethical limitations hard to describe and
     • Need to build links to publications (cf StORe)
     • Need to build discipline links across repositories…


        Repository challenges - 3
     • Is your platform suitable?
     • Most successful (ie older) data repositories
       are DIY
     • Data also held in repositories built on DSpace,
       EPrints and Fedora

Data from MIT DSpace Political Science

         Who does data curation?
     •   Individuals
     •   Departments or groups
     •   Institutions, often through libraries
     •   Communities
     •   Disciplines
     •   Publishers
     •   National services
     •   Other 3rd parties…


   Who are the curation players?


       Disciplinary repositories…
     • >900 Nucleic Acids datasets!
     • ESDS/UKDA and NERC data centres, but…
     • “AHRC Council has decided to cease funding the Arts
       and Humanities Data Service (AHDS) from March
       2008. […] Grant holders must make materials they
       had planned to deposit with the AHDS available in an
       accessible depository for at least three years after the
       end of their grant”
             • AHRC Press Release 14/05/2007
             • (Note petition at http://petitions.pm.gov.uk/AHDSfunding/)
         • Does not apply to Archaeology: ADS still funded?


        Institutional Repositories
     • OpenDOAR: only 5 Institutional Repositories claim to
       include datasets
         •   Bristol
         •   Cambridge
         •   Edinburgh
         •   Leicester
         •   Southampton
     • …and some of these seem doubtful on inspection!
         • … of course not all research data are “datasets”


                 Cultural change
     • If we build it, will they come? NO!!
     • Outreach important: communication with
       scientists and researchers is hard graft
     • Cultural change to new approach requires more:
         • Incentives, rewards and mandates
         • Successful exemplars (well publicised)
         • Discipline-oriented approach (one size does not fit all)


Need for advocacy?
       What functionality is missing from source repositories?

                     Academic staff  Research assistants  Postgraduates  Independent researchers
       None          9               2                    7
       Don’t use     7                                    10             1
       Lack of       3                                    4              2
       Don’t know    5               3                    13             1
       No reply      129             20                   45             13

Slide from StORe project

Need for advocacy?
       What functionality is missing from output repositories?

                     Academic staff  Research assistants  Postgraduates  Independent researchers
       None          3               2                    5              1
       Don’t use     1               1
       Lack of                                            2              1
       Don’t know                    2                    6              1
       No reply      123             15                   48             15

Slide from StORe project

Need for advocacy?
        “The majority of academics do not know
        what repositories are nor are they
        familiar with the issues around new
        means of dissemination”
        – UKOLN/Eduserv Foundation: Digital
        Repositories Roadmap: looking forward, April

Slide from StORe project

                               Thank you


