
1

One Core Preservation
System for All your Data.
No Exceptions!
Marco Klindt, Kilian Amrhein
Zuse Institute Berlin (ZIB)
November 3, 2015

Frameworks for Digital Preservation

2

One Core Preservation System for
All your Data.
Some assembly required.

3

Why?
•  Berlin funds digitization projects
for cultural heritage institutions (LAM)
•  Servicecenter for Digitization (digiS) @ ZIB
supports these projects with training and
technical solutions
•  Sustainability demands digital
preservation service for digitization
outputs

4

ZIB what?
•  Zuse Institute Berlin
Research Institute for
Applied Mathematics and
Computer Science
•  Namesake Konrad Zuse:
an inventor of the computer
(Z1, 1938, 22-bit floating point
processing... [German view])

5

National Tier 2 Supercomputer
•  Konrad

6

Supercomputing Storage
•  Tape libraries
(2x StorageTek 8500 Enterprise grade)
•  Climate-controlled, fire-resistant vault
•  ~ 100 PB (Petabyte = 10^15 Bytes)
•  400 TB (net) reserved for LAM

7

Facility Map
[Figure: facility map showing Data Vault, Supercomputer, and Cooling]

8

Preservation is hard
•  Digital Preservation as well
•  Not feasible for smaller Institutions
•  Provide Preservation as a Service utilizing
ZIB infrastructure and expertise

9

Even as a service
•  Community effort (learn from each other)
•  Depends on multiple Communities:
– Preservation
– IT
– Cultural Heritage

10

Architectural Requirements
1.  Self-contained, self-documented
Information Packages (Intellectual
Entities)
2.  Anticipate obsolescence of formats,
software tools, hardware, and
organisation
3.  Loosely coupled Components with
defined Responsibilities
4.  Use community (OSS) tools and
standards

11

Open (Source)
•  Do not reinvent the Wheel (it still remains
hard to maintain)
•  Data Workflow (iRODS): Chapel Hill, USA
•  Ingest Workflow (Archivematica): Canada
•  Access/Management Workflow (Fedora/Islandora): USA/Canada
•  Glue Code: in-house

12

Architectural Design (Overview)
[Diagram: pipeline from deposit to access]
•  Preingest / Deposit: Submission Manifest (Contract #, Submission ID, ...), Binary Payload, Descriptive Metadata (DublinCore (DC), LIDO, MODS, EAD, ...)
•  Ingest (Archivematica): SIP, AIP, DIP; FPR; Microservices: identification, characterisation, normalization
•  Data Management (iRODS)
•  Archival Storage (Online & Tape)
•  Management / Access (Fedora/Islandora): Landing Pages, Repository, Admin Access Compound Object, Content Access Compound Object
•  Package contents: Submission PDI, Mapped DC, Content Information, PDI (PREMIS), Content Description, Derivatives

13

Preingest
[Diagram: architecture overview with the Preingest / Deposit stage highlighted]

14

Data Agnostic Service
•  Problem:
Highly heterogeneous data
(formats and metadata)
+
We accept
everything (digital)

15

Deposit/Transfer Components
•  Administrative Information
(Submission Manifest)
•  Descriptive Information
(Metadata Formats DC, MODS, EAD, LIDO)
•  Content Information
(Binary or textual data)
•  Context Information
(Submission Documentation)
[Figure: icons with item counts 5x, 1x, 1x, 314,159x, 271,828x and file names 000000, 000001]

16

Deposit/Transfer Components
•  Administrative Information (Submission Manifest): To find Stuff (the archive)
•  Descriptive Information (Metadata): To find Stuff (the depositor)
•  Content Information (Binary or textual data): The Stuff
•  Context Information (Submission Documentation): Stuff (maybe useful for users)

17

Preingest: SIP Creation
[Diagram: Content Information, made up of Content (text, XML, binary files) and Submission Documentation (unstructured data: text, emails, ...)]
•  Normalize Structure
•  Primary Binary/Textual
Content

•  Submission
Documentation

18

Descriptive Metadata
•  Original description
in domain specific
Metadata formats
•  Community standards
[Diagram: Content Information (Content, Submission Documentation) plus Metadata in DC, EAD, LIDO, or MODS]

19

Submission Metadata
•  Submission Manifest
•  YAML or METS
–  Rights Information
–  Contract and Contact
Information of
Depositor
–  ...
•  (nearly) complete SIP 
[Diagram: SIP containing the Submission Manifest (DublinCore Administrative Description Information), Metadata (DC, EAD, LIDO, MODS), and Content Information (Content, Submission Documentation)]

20

MD Mapping
•  Subset MD Mapping to
Dublin Core
–  Metadata Object
Description Schema
(MODS)
–  Encoded Archival
Description (EAD)
–  Lightweight Information
Describing Objects
(LIDO)

•  SIP now ready for
Ingest. 
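
The subset mapping to Dublin Core can be sketched as a small crosswalk table. This is a minimal illustration, not the mapping used at ZIB: the four field pairs in `MODS_TO_DC`, the function name `map_mods_to_dc`, and the sample record are assumptions chosen for the example. Only the MODS namespace URI and the standard-library XML calls are taken as given.

```python
import xml.etree.ElementTree as ET

# Namespace of MODS version 3 records.
MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Illustrative subset of a MODS -> Dublin Core crosswalk (assumed for this sketch).
MODS_TO_DC = {
    "title": "mods:titleInfo/mods:title",
    "creator": "mods:name/mods:namePart",
    "date": "mods:originInfo/mods:dateIssued",
    "description": "mods:abstract",
}


def map_mods_to_dc(mods_xml: str) -> dict:
    """Return a dict of DC field name -> list of values found in the MODS record."""
    root = ET.fromstring(mods_xml)
    dc = {}
    for dc_field, path in MODS_TO_DC.items():
        values = [
            el.text.strip()
            for el in root.findall(path, MODS_NS)
            if el.text and el.text.strip()
        ]
        if values:  # fields without a source value are simply left out
            dc[dc_field] = values
    return dc


# Made-up sample record, only for demonstration.
SAMPLE_MODS = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Portrait of a Lady</title></titleInfo>
  <name><namePart>Unknown photographer</namePart></name>
  <originInfo><dateIssued>1910</dateIssued></originInfo>
</mods>"""

mapped = map_mods_to_dc(SAMPLE_MODS)
```

Because only a subset is mapped, the original MODS, EAD, or LIDO record would still need to travel inside the SIP next to the mapped DC.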
[Diagram: SIP with descriptive metadata (DC, EAD, LIDO, MODS) mapped to DublinCore Description Information (DI), alongside the Submission Manifest (DublinCore Administrative Description Information) and Content Information]

21

SIP Rejection
•  We only require
Submission Manifest
(Administrative)
Metadata and DI
•  If incomplete -> Reject
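
The rejection rule can be sketched as a completeness check that runs before ingest. This is a hedged illustration: the `Sip` class, the manifest keys in `REQUIRED_MANIFEST_FIELDS`, and the function names are hypothetical, loosely based on the manifest items named on the slides (contract, submission ID, rights, depositor contact).

```python
from dataclasses import dataclass, field

# Hypothetical manifest keys; the real manifest schema is not shown in the slides.
REQUIRED_MANIFEST_FIELDS = ("contract", "submission_id", "depositor_contact", "rights")


@dataclass
class Sip:
    manifest: dict = field(default_factory=dict)        # administrative metadata
    descriptive_dc: dict = field(default_factory=dict)  # description information (DI)


def check_sip(sip: Sip) -> list:
    """Collect every reason why the SIP is incomplete (empty list = complete)."""
    problems = []
    for key in REQUIRED_MANIFEST_FIELDS:
        if not sip.manifest.get(key):
            problems.append(f"manifest field missing: {key}")
    if not sip.descriptive_dc:
        problems.append("description information (DI) missing")
    return problems


def accept_or_reject(sip: Sip) -> tuple:
    """Reject if anything required is missing, otherwise accept."""
    problems = check_sip(sip)
    return ("reject", problems) if problems else ("accept", [])
```

Returning all problems at once, instead of stopping at the first, lets a depositor fix a rejected submission in one round.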
[Diagram: minimal SIP with DublinCore Administrative Description Information, Content Information, and DublinCore Description Information (DI)]

22

Ingest
[Diagram: architecture overview with the Ingest stage (Archivematica, FPR, microservices) highlighted]

23

•  Automated Ingest Workflow System
–  Identification of File Formats
–  Characterization (Technical MD Extraction)
–  Normalization (Migration on Ingest)
–  Creation of Access Derivatives
–  Fixity Hashes

•  Microservices
and
Best-Practice Tools

24

•  Every transformation based upon Rules in
Format Policy Registry (FPR)
•  One workflow (No Exceptions!)
•  One FPR Ruleset (No Exceptions!)
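
The idea of one ruleset driving every transformation can be sketched as a lookup from identified format to rules. This is only an illustration: `FprRule`, `FPR_RULES`, `plan_transformations`, the format pairs, and the command names are all hypothetical. Archivematica's actual Format Policy Registry is maintained inside Archivematica and is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FprRule:
    purpose: str        # "preservation" or "access"
    target_format: str  # format the file is normalized to
    command: str        # illustrative name of the tool invocation


# Hypothetical single ruleset, keyed by the identified source format.
FPR_RULES = {
    "image/jpeg": [FprRule("preservation", "image/tiff", "convert-to-tiff")],
    "audio/mpeg": [
        FprRule("preservation", "audio/x-wav", "convert-to-wav"),
        FprRule("access", "audio/mpeg", "copy"),
    ],
}


def plan_transformations(identified_format):
    """Return the rules for a format; an empty list means the file is kept as-is."""
    return FPR_RULES.get(identified_format or "", [])
```

Files whose format is unidentified or has no rule fall through to the empty list, so the same code path handles every deposit without special cases.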

25

•  Technical MD (PREMIS)

•  Identification:
–  Known Knowns
–  Known Unknowns
–  No Unknown Unknowns

(after D. Rumsfeld via Matthew Addis)
[Diagram: the SIP (DublinCore Administrative Description Information, Content Information, DublinCore Description Information (DI)) becomes an AIP by adding Preservation Description Information (PDI) in PREMIS and normalized binary and text files]

26

Preservation Levels
•  Preservation level is perceived
not assigned:
– Passive (Known Unknown)
– Active (Known Known)
•  Core Preservation Actions:
– Re-Identification
– Migration scheduling based on
FPR changes
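
One possible reading of "perceived, not assigned" is that the level is computed from the technical metadata and the current rules each time it is asked for. The sketch below assumes that reading; the function names, the plain-dict ruleset (source format -> target format), and the example formats are hypothetical.

```python
def preservation_level(identified_format, ruleset):
    """Active if the format is identified and covered by a rule, otherwise passive."""
    if identified_format is not None and identified_format in ruleset:
        return "active"
    return "passive"


def aips_to_migrate(aip_formats, old_ruleset, new_ruleset):
    """Return the ids of AIPs holding a format whose rule is new or has changed."""
    changed = {
        fmt for fmt, target in new_ruleset.items() if old_ruleset.get(fmt) != target
    }
    return sorted(
        aip_id for aip_id, formats in aip_formats.items() if changed & set(formats)
    )


# Example: a rule for MP3 is added to the ruleset.
old_rules = {"image/jpeg": "image/tiff"}
new_rules = {"image/jpeg": "image/tiff", "audio/mpeg": "audio/x-wav"}
holdings = {"aip-1": ["image/jpeg"], "aip-2": ["audio/mpeg", "image/jpeg"]}
```

In this reading, an AIP moves from passive to active without being touched: only the ruleset changes, and the changed rule is what schedules the migration.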
[Diagram: Beholder: Rules + Technical MD + Schedule]

27

Data Management
[Diagram: architecture overview with the Data Management (iRODS) and Archival Storage (Online & Tape) stages highlighted]

28

•  Federated Object Store
– Object = File + Metadata (incl. Checksum)
– Rule-based Synchronization across abstracted
physical storage locations
– 1 AIP = 1 Object
•  Horizontal replication across
data-centers

29

Metadata
stored for
AIP objects

used for
Management
& Retrieval
AVUs	defined	for	dataObj	bac186cd-4d11-48ac-bb1d-4ab2cd7593cc.tar:	
attribute:	uuid	
value:	bac186cd-4d11-48ac-bb1d-4ab2cd7593cc	
units:	
----		
attribute:	producerID	
value:	DE-MUS-019910	
units:		
----	
attribute:	submissionID	
value:	DE-MUS-019910-201505131006	
units:		
----	
attribute:	checksum	
value:	sha2:E4dMTd7/J4z9qg36CSjSzdXXIa4ltgAak+MKfSuPKww=	
units:		
----	
attribute:	lastFixityCheck	
value:	2015-05-13T08:06:16Z	
units:		
----	
attribute:	type	
value:	AIP	
units:		
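
A listing like the one above can be turned into a dictionary and used for a fixity check. This is a sketch under assumptions: the checksum value appears to follow the iRODS convention of a `sha2:` prefix followed by a base64-encoded SHA-256 digest, and the helper names `parse_avus`, `sha2_checksum`, and `fixity_ok` are made up for this example.

```python
import base64
import hashlib


def parse_avus(listing: str) -> dict:
    """Parse 'attribute: / value: / units:' groups into {attribute: value}."""
    avus = {}
    attribute = None
    for line in listing.splitlines():
        key, sep, value = line.partition(":")  # split on the first colon only
        if not sep:
            continue  # separator lines such as '----'
        key, value = key.strip(), value.strip()
        if key == "attribute":
            attribute = value
        elif key == "value" and attribute is not None:
            avus[attribute] = value
            attribute = None
    return avus


def sha2_checksum(data: bytes) -> str:
    """Format a SHA-256 digest as 'sha2:' + base64, matching the listing's style."""
    digest = hashlib.sha256(data).digest()
    return "sha2:" + base64.b64encode(digest).decode("ascii")


def fixity_ok(data: bytes, avus: dict) -> bool:
    """Compare the recomputed checksum with the one stored in the metadata."""
    return avus.get("checksum") == sha2_checksum(data)
```

For a multi-gigabyte AIP tarball the digest would be computed in chunks from a file handle rather than from bytes in memory; the comparison stays the same.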
[Diagram: AIP with DublinCore Administrative Description Information, Content Information, DublinCore Description Information (DI), Preservation Description Information (PDI) in PREMIS, and normalized binary and text files]

30

Hierarchical Storage
Management (HSM)
•  Vertical replication:
– Staging Area (Disk Cache)
– Automatic Archiving onto
two Tapes in different
Tape Libraries
– Checksums per data block
•  Staging time offline -> online:
< 90 s (nearline)

31

Management/Access
[Diagram: architecture overview with the Management/Access stage (Fedora/Islandora) highlighted]

32

•  Access and Management System
•  Dark Archive 
– only for Admins and Depositors
•  One Object (AIP) – Two Views:
– Admin Access View (us)
– Content Access View (them)

33

Content Access
Compound Object
(CACO)

Mapped
Descriptive
Metadata
[Diagram: AIP (DublinCore Administrative Description Information, Content Information, Description Information (DI), Preservation Description Information (PDI) in PREMIS, normalized binary and text files, original text/XML/binary files, submission documentation) next to its Content Access Compound Object (CACO)]

34

CACO
[Diagram: AIP and CACO as on the previous slide, now with Derivatives (access copies)]

35

CACO

36

Admin Access
Compound Object
(AACO)

Administrative
Description for
Access and
Management
[Diagram: AIP (DublinCore Administrative Description Information, Content Information, Description Information (DI), Preservation Description Information (PDI) in PREMIS, normalized binary and text files, original text/XML/binary files, submission documentation) next to its Admin Access Compound Object (AACO)]

37

AACO
[Diagram: AIP and AACO as on the previous slide, now with a Reference linking the AACO and the AIP]

38

AACO
Admin Access to:
•  iRODS Metadata
•  Data Object (AIP)

39

AIP Data Model
[Diagram: complete data model]
•  Submission: Submission Manifest; Metadata (DC, EAD, LIDO, MODS); Content (text, XML, binary files); Documentation (unstructured data: text, emails, ...)
•  AIP: DublinCore Administrative Description Information, Content Information, Description Information (DI) mapped to DC, Preservation Description Information (PDI) in PREMIS, normalized binary and text files
•  AACO (Admin Access Compound Object): Reference to the AIP
•  CACO (Content Access Compound Object): Derivatives (access copies)

40

Preservation Actions
[Diagram: architecture overview with a Migration step added]

41

AIP Reingest
[Diagram: architecture overview showing AIP reingest and Migration through Ingest (Archivematica)]
•  Sponsored
Feature (Q1/2016)
•  Allows for
– Metadata addition
– Re-Identification
– Re-Normalization

42

Contingency – Exit Strategy
•  Archivematica: only need to find a new
ingest workflow
•  iRODS: use the filesystem
•  Islandora: reingest into another
repository
•  Organisation: Self-contained
AIP (transformation required)
•  No Strategy against Evil, yet!

43

[Diagram: architecture overview]
•  Single Pipeline
•  OAIS-aligned
•  Modular
•  Data Deposits: Cultural heritage and research data
w/ binary content, descriptive metadata, and submission information.
•  Policies:
> control metadata mapping
> normalization through Format Policy Registry (FPR)
•  Microservices: identification, characterisation, normalization
