File Creation, Rendering
         and Formats

   Euan Cochrane, Archives New Zealand
                    &
Dirk von Suchodoletz, University of Freiburg

             Future Perfect 2012
                 26 March 2012
            Wellington, New Zealand
Contents
Euan
•Files, formats and their relationships to creating applications

•Files, formats and their relationships to rendering applications

Dirk
•Maintaining the ability to use older rendering applications

Euan
•Context and conclusions
Digital Preservation
• What is digital preservation?
Maintaining the full information content of digital objects
  [across time]
Maintaining the ability to render digital objects [across time]
“The goal of digital preservation is the accurate rendering of
  authenticated content over time”

• What is a file format?
“[pre-defined/particular] way that information is encoded for
   storage in a computer file”
File Creation and Formats
• In 2007, over 90% of HTML documents did not conform
  to standards

• Microsoft Office 2007 (and possibly 2010) creates ODS
  files differently to most open source office suites.

• Microsoft Office 2007 and 2010 create Microsoft Office
  97-2003 formatted files differently to Microsoft Office
  97-2003
Format Standards are Often
Ambiguous or not Available
• The JPEG standard specifies an end of image
  marker but not an end of file marker –
  different applications write the end of the file differently


• LibreOffice 3.5 (14 February 2012) now
  “supports” Visio file import. This support
  is based on reverse engineering, as the
  format standard is not publicly
  available. It is not complete.
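
As a rough illustration of the JPEG point above, the following Python sketch reports how many bytes a file carries after its last end-of-image (EOI) marker, the two bytes 0xFF 0xD9. The function name is hypothetical, and searching for the last occurrence of the marker is a heuristic assumed here for brevity: trailing data that happens to contain the same two bytes would be miscounted, so a full segment parser would be needed for reliable results.

```python
from typing import Optional

EOI = b"\xff\xd9"  # JPEG end-of-image marker


def bytes_after_eoi(data: bytes) -> Optional[int]:
    """Return the number of bytes that follow the last EOI marker.

    Returns None when the data contains no EOI marker at all.
    Heuristic only: the JPEG segment structure is not parsed.
    """
    pos = data.rfind(EOI)
    if pos == -1:
        return None
    return len(data) - (pos + len(EOI))
```

Running this over files written by different applications (read in binary mode) would show whether some of them append padding or other data after the image ends.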
“Rendering Matters” Research
• Compared the rendering of ~100 files on old software
  running on old hardware (the “control”) to:

  1.   LibreOffice version 3.3.0
  2.   Microsoft Office 2007
  3.   WordPerfect Office X5
  4.   Control Software running on emulated hardware
Cochrane von Suchodoletz File Creation, Rendering and Formats
Summary Research Results
• [The choice of] Rendering [Environment] Matters

• MS-Office 2007 was a better rendering tool for the old
  files than either LibreOffice or WordPerfect Office

• The use of particular attributes/features in office files is
  inconsistent but most are used at least once.

• At least one “odd”/rare attribute/feature is included in
  most office files
Original Environments (OE)

    Original creating application best candidate to render
    documents properly


    Proprietary format knowledge embedded in the
    application


    One environment renders all objects of a certain type


    Keeping original software (and hardware) environments
    has impact on preservation and access workflows
Components of Access through OE

    Emulators for different computer architectures


    Software archive of all required applications, operating
    systems, additional components like fonts, codecs


    Workflows on object ingest


    Access systems for end users
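
One way to picture how these components fit together is a small lookup from object type to the original environment that renders it. The Python sketch below is illustrative only: the class names, field names and any entries registered in it are assumptions made for this sketch, not part of any existing tool.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class OriginalEnvironment:
    """Hypothetical description of one original environment (OE)."""
    emulator: str                    # emulator for the required architecture
    operating_system: str            # operating system held in the software archive
    application: str                 # original creating/rendering application
    extras: Tuple[str, ...] = ()     # additional components, e.g. fonts or codecs


class EnvironmentRegistry:
    """Maps an object type to the single environment used to render it."""

    def __init__(self) -> None:
        self._by_type: Dict[str, OriginalEnvironment] = {}

    def register(self, object_type: str, env: OriginalEnvironment) -> None:
        self._by_type[object_type] = env

    def lookup(self, object_type: str) -> Optional[OriginalEnvironment]:
        # None signals that the software archive has no environment yet.
        return self._by_type.get(object_type)
```

A lookup that returns None is the signal that the software archive needs extending before objects of that type can be served.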
Emulators

    Wide range available for all relevant
    computer architectures


    Many Open Source


    Not yet DP aware – long term
    availability to be secured


    DP community should seek more
    influence
Software Archive

    Preserve the relevant software components and
    operational knowledge
Necessary Workflows


    The Freiburg digital preservation group leads the
    state-sponsored, two-year bwFLA project


    The bwFLA project provides access to complex, interactive
    digital objects


    Provide extended ingest workflows with feedback loop
Extended Ingest Workflow

    Make use of the donor's expertise to collect complete
    information and components
         
             Extend software archive if necessary
         
             Add necessary technical metadata
         
             Record knowledge on object handling


    Let the donor check and sign-off the rendering results
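
The feedback loop in this workflow can be sketched as a record that is only complete once the software archive covers the object and the donor has approved the rendering. This is a minimal Python sketch under assumed names; the class and its fields are hypothetical and do not describe the bwFLA implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class IngestRecord:
    """Hypothetical per-object record built up during extended ingest."""
    object_id: str
    technical_metadata: Dict[str, str] = field(default_factory=dict)
    handling_notes: List[str] = field(default_factory=list)    # knowledge on object handling
    missing_software: List[str] = field(default_factory=list)  # gaps in the software archive
    donor_signed_off: bool = False

    def ready_for_archive(self) -> bool:
        # Complete only when no software is missing and the donor
        # has checked and approved the rendering result.
        return not self.missing_software and self.donor_signed_off
```

While `ready_for_archive()` returns False, the workflow loops back: the archive is extended or the rendering is corrected, and the donor checks again.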
Access Workflows

    Provide a reading room system or extension
        – Pre-configure emulator to the OE required by the
            object
        – Prepare the inclusion of the object into the original
            environment
        – Automate the startup of the OE
        – Provide the user information and hints on how to
            interact with the OE & automate parts of this
        – Allow or disallow, to a certain degree, saving results
            from the original environments or capturing certain
            states (e.g. using screenshots)
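
The access steps listed above can be sketched as a function that turns an environment, an object and an access policy into an ordered session plan. The Python below is a hypothetical sketch: the names `AccessPolicy` and `plan_session` and the step wording are assumptions for illustration, and no real emulator is driven.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AccessPolicy:
    """Hypothetical policy for what a reading room user may take away."""
    allow_save: bool = False
    allow_screenshots: bool = True


def plan_session(environment: str, object_path: str, policy: AccessPolicy) -> List[str]:
    """Return the ordered steps for one reading room session."""
    steps = [
        f"configure emulator for {environment}",
        f"attach {object_path} to the environment",
        "start the environment automatically",
        "show interaction hints to the user",
    ]
    # The policy decides whether results may leave the environment.
    if policy.allow_save:
        steps.append("enable saving of results")
    else:
        steps.append("disable saving of results")
    if policy.allow_screenshots:
        steps.append("enable screenshot capture")
    return steps
```

Keeping the policy separate from the environment description lets the same original environment be offered under different access conditions.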
Access System

    Many components already exist, developed by past DP
    projects

    Next step: Make them a usable “product”
Reading Room Access System

    Make emulation accessible to standard users, such as
    those in memory institutions


    Robust platform, extension to standard reading room
    systems


    Unified access to a wide range of different emulators +
    preconfigured environments
Context and Conclusions
• Making decisions about preservation strategies

• When to Normalise?

• Variation in format implementation doesn’t matter if you
  maintain a compatible rendering environment

• Variation in rendering across environments doesn’t matter if
  you maintain the “right” rendering environment

• There are practical options for maintaining rendering
  environments
Thank you

Editor's Notes

  1. Digital Preservation definition based on: http://public.ccsds.org/publications/archive/650x0b1.PDF
     File format definition from Wikipedia: http://en.wikipedia.org/wiki/File_format
  2. On HTML docs: Ian Hickson http://en.wikipedia.org/wiki/Ian_Hickson
     http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2007-February/009517.html
     “There are literally dozens if not hundreds of billions of documents already on the Web. A study of a sample of several billion of those documents with a test implementation of the HTML5 Parser specification that I did at Google put a very conservative estimate of the fraction of those pages with markup errors at more than 78%. When I tweaked it a bit to look at a few more errors, the number was 93%. And those are only core syntax errors -- it didn't count misuse of HTML, like putting a <p> element inside an <ol> element. If we required browsers to refuse those documents, then you couldn't browse over 90% of the Web. But consider -- if one browser showed error messages on half the Web, and another browser showed no errors and instead showed the Web roughly as the author intended. Which browser would the average person use? If we want to make HTML5 successful, we have to make sure the browser vendors pay attention to it. Any requirements that make their market share go down relative to browsers who aren't following the spec will immediately be ignored.”
     On MS Office 2007 & ODS: http://www.robweir.com/blog/2009/05/update-on-odf-spreadsheet-interoperability.html and http://www.robweir.com/blog/2009/05/follow-up-on-excel-2007-sp2s-odf.html
     On Office 2007 and 2010 writing Office 97-2003 files differently: This result was identified during internal research when developing the Office Dependency Discovery Tool: http://sourceforge.net/projects/officeddt/ The differences involve content being added to the files in XML form that seems to be ignored by the earlier versions of Office but used in some way by the more recent versions.
  3. JPEG: http://www.jpeg.org/public/jfif.pdf and http://en.wikipedia.org/wiki/JPEG#The_JPEG_standard
     LibreOffice and Visio: http://www.libreoffice.org/download/3-5-new-features-and-fixes/ and http://en.wikipedia.org/wiki/LibreOffice#Version_3.5
  4. Rendering Matters Report available here: http://bit.ly/zqEP8f