The document discusses file formats, rendering applications, and digital preservation. It summarizes research comparing the rendering of files on older software running on old hardware versus newer software. The research found that Microsoft Office 2007 was a better rendering tool for old files than LibreOffice or WordPerfect Office. Maintaining original rendering environments is important for digital preservation as different applications may render files inconsistently. This can be done through emulators, software archives, and workflows to ingest and provide access to files within their original environments.
Cochrane von Suchodoletz File Creation, Rendering and Formats
1. File Creation, Rendering
and Formats
Euan Cochrane, Archives New Zealand
&
Dirk von Suchodoletz, University of Freiburg
Future Perfect 2012
26 March 2012
Wellington, New Zealand
2. Contents
Euan
•Files, formats and their relationships to creating applications
•Files, formats and their relationships to rendering applications
Dirk
•Maintaining the ability to use older rendering applications
Euan
•Context and conclusions
3. Digital Preservation
• What is digital preservation?
Maintaining the full information content of digital objects
[across time]
Maintaining the ability to render digital objects [across time]
“The goal of digital preservation is the accurate rendering of
authenticated content over time”
• What is a file format?
“[pre-defined/particular] way that information is encoded for
storage in a computer file”
4. File Creation and Formats
• In 2007, over 90% of HTML documents did not conform
to standards
• Microsoft Office 2007
(and possibly 2010) create
ODS files differently to most
open source office suites.
• Microsoft Office 2007 and 2010 create Microsoft Office
97-2003 formatted files differently to Microsoft Office
97-2003
5. Format Standards are Often
Ambiguous or not Available
• The JPEG standard specifies an end of image
marker but not an end of file marker, so
different applications end their files differently
• LibreOffice 3.5 (14 February 2012) now
“supports” Visio file import. This support
is based on reverse engineering, as the
format standard is not publicly
available, and it is not complete
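As a minimal sketch of why the missing end of file marker matters, the Python below counts the bytes a file carries after its last end of image (EOI) marker. The function name and the synthetic byte strings in the usage note are illustrative assumptions, not part of the research. Real files may also contain the EOI byte pair inside embedded thumbnails or trailing data, so this is a heuristic only.

```python
SOI = b"\xff\xd8"  # JPEG start of image marker
EOI = b"\xff\xd9"  # JPEG end of image marker


def trailing_bytes_after_eoi(data: bytes) -> int:
    """Return the number of bytes after the last EOI marker.

    Returns -1 when no EOI marker is present at all.
    Hypothetical helper: it does not parse JPEG segments, it only
    searches for the marker bytes.
    """
    if not data.startswith(SOI):
        raise ValueError("not a JPEG stream: missing SOI marker")
    position = data.rfind(EOI)
    if position == -1:
        return -1
    return len(data) - (position + len(EOI))
```

Two files with identical image data can give different counts here, for example when one creating application pads the file after the EOI marker and another does not. Calling the function on `SOI + b"\x00\x01" + EOI` gives 0, and appending three padding bytes gives 3.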
6. “Rendering Matters” Research
• Compared the rendering of ~100 files on old software
running on old hardware (the “control”) to:
1. LibreOffice version 3.3.0
2. Microsoft Office 2007
3. WordPerfect Office X5
4. Control Software running on emulated hardware
16. Summary Research Results
• [The choice of] Rendering [Environment] Matters
• MS-Office 2007 was a better rendering tool for the old
files than either LibreOffice or WordPerfect Office
• The use of particular attributes/features in office files is
inconsistent but most are used at least once.
• At least one “odd”/rare attribute/feature is included in
most office files
17. Original Environments (OE)
Original creating application best candidate to render
documents properly
Proprietary format knowledge embedded in the
application
One environment renders all objects of a certain type
Keeping original software (and hardware) environments
has impact on preservation and access workflows
18. Components of Access through OE
Emulators for different computer architectures
Software archive of all required applications, operating
systems, and additional components such as fonts and codecs
Workflows on object ingest
Access systems for end users
19. Emulators
Wide range available for all relevant
computer architectures
Many Open Source
Not yet DP aware – long term
availability to be secured
DP community should seek more
influence
20. Software Archive
Preserve the relevant software components and
operational knowledge
21. Necessary Workflows
The Freiburg digital preservation group leads the state-
sponsored, two-year bwFLA project
The bwFLA project provides access to complex, interactive
digital objects
Provide extended ingest workflows with feedback loop
22. Extended Ingest Workflow
Make use of the donor's expertise to collect complete
information and components
Extend software archive if necessary
Add necessary technical metadata
Record knowledge on object handling
Let the donor check and sign-off the rendering results
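The ingest steps above could be recorded along the following lines. This is a sketch under stated assumptions: the class, field and function names are hypothetical and are not taken from the bwFLA project, and the software archive is modelled as a plain set of component names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class IngestRecord:
    """Hypothetical record of one object passing through extended ingest."""
    object_id: str
    required_components: List[str]  # applications, OS, fonts, codecs
    technical_metadata: Dict[str, str] = field(default_factory=dict)
    handling_notes: List[str] = field(default_factory=list)
    donor_signed_off: bool = False


def missing_components(record: IngestRecord, software_archive: Set[str]) -> List[str]:
    """List required components that the software archive does not yet hold."""
    return [c for c in record.required_components if c not in software_archive]


def ingest_complete(record: IngestRecord, software_archive: Set[str]) -> bool:
    """Ingest is complete once nothing is missing and the donor has signed off."""
    return record.donor_signed_off and not missing_components(record, software_archive)
```

The feedback loop shows up as two conditions: the archive is extended until `missing_components` returns an empty list, and the donor's sign-off on the rendering result is recorded separately.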
23. Access Workflows
Provide a reading room system or extension
– Pre-configure emulator to the OE required by the
object
– Prepare the inclusion of the object into the original
environment
– Automate the startup of the OE
– Provide the user information and hints on how to
interact with the OE & automate parts of this
– Allow or disallow, to a certain degree, saving results
from the original environment or capturing certain states
(e.g. using screenshots)
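One way to picture the access steps listed above is as a small planning function that turns an object's recorded OE into an ordered list of actions. This is a sketch only: the names `OriginalEnvironment` and `plan_access_session`, the emulator lookup table and the wording of each step are assumptions made for illustration, and no real emulator is invoked.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class OriginalEnvironment:
    """Hypothetical description of the OE an object requires."""
    name: str
    architecture: str
    disk_image: str


def plan_access_session(
    object_path: str,
    environment: OriginalEnvironment,
    emulators: Dict[str, str],
    allow_save: bool = False,
) -> List[str]:
    """Return the ordered steps for one reading room session."""
    if environment.architecture not in emulators:
        raise LookupError(f"no emulator for {environment.architecture}")
    emulator = emulators[environment.architecture]
    steps = [
        f"configure {emulator} with {environment.disk_image}",
        f"attach {object_path} to {environment.name}",
        f"start {environment.name} automatically",
        "show interaction hints to the user",
    ]
    # The save policy is decided per session rather than hard-coded.
    steps.append("allow saving results" if allow_save else "disallow saving results")
    return steps
```

Keeping the plan as data rather than running it directly lets a reading room system show, log or adjust the steps before any emulator is started.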
24. Access System
Many components already exist, developed by past DP
projects
Next step: Make them a usable “product”
25. Reading Room Access System
Make emulation accessible to standard users, such as those in
memory institutions
Robust platform, extension to standard reading room
systems
Unified access to a wide range of different emulators +
preconfigured environments
26. Context and Conclusions
• Making decisions about preservation strategies
• When to Normalise?
• Variation in format implementation doesn’t matter if you
maintain a compatible rendering environment
• Variation in rendering across environments doesn’t matter if
you maintain the “right” rendering environment
• There are practical options for maintaining rendering
environments
Digital Preservation definition based on: http://public.ccsds.org/publications/archive/650x0b1.PDF
File format definition from Wikipedia: http://en.wikipedia.org/wiki/File_format
On HTML docs: Ian Hickson http://en.wikipedia.org/wiki/Ian_Hickson
http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2007-February/009517.html
“There are literally dozens if not hundreds of billions of documents already on the Web. A study of a sample of several billion of those documents with a test implementation of the HTML5 Parser specification that I did at Google put a very conservative estimate of the fraction of those pages with markup errors at more than 78%. When I tweaked it a bit to look at a few more errors, the number was 93%. And those are only core syntax errors -- it didn't count misuse of HTML, like putting a <p> element inside an <ol> element. If we required browsers to refuse those documents, then you couldn't browse over 90% of the Web. But consider -- if one browser showed error messages on half the Web, and another browser showed no errors and instead showed the Web roughly as the author intended. Which browser would the average person use? If we want to make HTML5 successful, we have to make sure the browser vendors pay attention to it. Any requirements that make their market share go down relative to browsers who aren't following the spec will immediately be ignored.”
On MS Office 2007 & ODS: http://www.robweir.com/blog/2009/05/update-on-odf-spreadsheet-interoperability.html
And http://www.robweir.com/blog/2009/05/follow-up-on-excel-2007-sp2s-odf.html
On Office 2007, 2010 writing Office 97-2003 files differently: This result was identified during internal research when developing the Office Dependency Discovery Tool: http://sourceforge.net/projects/officeddt/
The differences involve content being added to the files in XML form that seems to be ignored by the earlier versions of Office but used in some way by the more recent versions.
JPEG: http://www.jpeg.org/public/jfif.pdf
And here: http://en.wikipedia.org/wiki/JPEG#The_JPEG_standard
LibreOffice and Visio: http://www.libreoffice.org/download/3-5-new-features-and-fixes/
And here: http://en.wikipedia.org/wiki/LibreOffice#Version_3.5
Rendering Matters Report available here: http://bit.ly/zqEP8f