Linked Data for Libraries: Experiments between Cornell, Harvard and Stanford

Linked Data for Libraries:
Experiments between
Cornell, Harvard and
Stanford
Simeon Warner (Cornell University)
SWIB15, Hamburg, Germany
2015-11-24

LD4L project team
Cornell
• Dean Krafft
• Jon Corson-Rikert
• Lynette Rayle
• Rebecca Younnes
• Jim Blake
• Steven Folsom
• Muhammad Javed
• Brian Lowe*
• Simeon Warner
Harvard
• Randy Stern
• Paul Deschner
• Jonathan Kennedy
• David Weinberger*
• Paolo Ciccarese*
Stanford
• Tom Cramer
• Rob Sanderson
• Naomi Dushay
• Darren Weber
• Lynn McRae
• Philip Schreur
• Nancy Lorimer
• Joshua Greben
* no longer with institution

Linked Data for Libraries (LD4L)
• Nearing the end of a two-year $999k grant to
Cornell, Harvard, and Stanford
• Partners have worked together to assemble
ontologies and data sources that provide
relationships, metadata, and broad context for
Scholarly Information Resources
• Leverages existing work by both the VIVO project
and the Hydra Partnership
• Vision: Create a LOD standard to exchange all that
libraries know about their resources

LD4L goals
• Free information from existing library system silos to
provide context and enhance discovery of scholarly
information resources
• Leverage usage information about resources
• Link bibliographic data about resources with academic
profile systems and other external linked data sources
• Assemble (and where needed create) a flexible,
extensible LD ontology to capture all this information
about our library resources
• Demonstrate combining and reconciling the assembled
LD across our three institutions

LD4L working assumptions
• Trying to do conversion and relation work at scale,
with full sets of enterprise data
o Almost 30 million bibliographic records (Harvard:
13.6M, Stanford and Cornell: roughly 8M each)
• Trying to understand the pipeline / workflows that
will be needed for this
• Looking to build useful, value-added services on top
of the assembled triples

Bibliographic
Data
• MARC
• MODS
• EAD
Person Data
• CAP, FF,
VIVO
• ORCID
• ISNI
• VIAF, LC
Usage Data
• Circulation
• Citation
• Curation
• Exhibits
• Research
Guides
• Syllabi
• Tags
LD4L data sources

LD4L Workshop
https://twitter.com/us_imls/status/573235622237892609

LD4L Workshop
• February, 2015 at Stanford
• 50 attendees doing leading work in linked data
related to libraries, from around the world
• Review & vet the LD4L work done to date
o Use cases
o Ontology
o Technology
o Prototypes
• Plot development moving forward
Workshop details: https://wiki.duraspace.org/x/i4YOB

Topics
• Curation of Linked Data
• Techniques & Technology
o Entity resolution (strings to things)
o Reconciliation (things to things)
o Converters & validators
• New Uses, Use Cases & Services (Why?)
• Community (Who?)

Workshop Recommendations
• Our goal should be that others outside the library
community use the linked data that we produce
• We must create applications that let people do
things they couldn’t do before – don’t talk about
linked data, talk about what we will be able to do
• Local original assertions (new vs. copy cataloging)
should use local URIs even when global URIs exist
• Look to LD to bring together
physically/organizationally dispersed but related
collections
• Libraries must create a critical mass of shared linked
data to ensure efficiency and benefit all of us

Use Cases
https://wiki.duraspace.org/x/u4eNAw

LD4L Use Case Clusters
1. Bibliographic +
curation data
2. Bibliographic + person
data
3. Leveraging external
data including
authorities
4. Leveraging the deeper
graph (via queries or
patterns)
5. Leveraging usage data
6. Three-site services,
e.g. cross-site search
42 raw use cases
12 refined use cases
in 6 clusters…

UC1.1 - Build a virtual collection
Goal: allow librarians and patrons to create and
share virtual collections by tagging and optionally
annotating resources
• Implementations
o Cornell
o Stanford

15
New “Archery” collection
created, has no items
Select “Home” to search
Cornell catalog

16
Select item of
interest from search

17
From the “Add to virtual
collection” drop list,
select “Archery”

18
Book added to “Archery” collection
Behind the scenes: App used content-negotiation to get
MARCXML (no RDF yet...), converted to LD4L ontology
and added to Aggregation based on ORE ontology

19
Now search in the
Stanford catalog

20
No close integration so
have to copy URI from the
browser address bar

21
Click “+ Add External
Resource” under the virtual
collection title Archery in the
header of the main content
area of the page

22
Paste in URI, “Save changes”

23
Book from Stanford catalog
added to “Archery” collection
Behind the scenes: App gets
data from Stanford, converts
to LD4L and adds to ORE
Aggregation

24
Find item in interest in
Cornell VIVO

25
In VIVO there is a good
semweb URI which supports
RDF representations

26
Same process to “+ Add
External Resource”
Behind the scenes: App can
get RDF directly but still
needs to map to LD4L
ontology

UC1.2 - Tag scholarly information resources to
support reuse
Goal: provide librarians tools to create and manage
larger online collections of catalog resources
• Implementation
o More automation
o Batch processes as well as individual editing
o At Cornell plan to use this to replace current
mechanisms for selecting subset collections for
subject libraries. Key is separation of tags (as
annotations) from core catalog data

28
Free text tags supported for each item
Tags saves as Open Annotation with
motivation oa:tagging

UC 2.1 - See and search on works by people to
discover more works and better understand people
Goal: link catalog search results to researcher
networking systems to provide current articles, courses
• Implementation
o Adding VIVO URIs to MARC records for thesis
advisors
o Adding links to VIVO records linking back to
faculty works and their students’ theses
o Raises important issues about URI stability

Thesis Advisors and VIVO
Cornell Technical Services is including thesis advisors
in MARC records using NetIDs from the Graduate
school database
e.g., 700 1 ‡a Ceci, Stephen John ‡e thesis advisor ‡0
Advisors are looked up against VIVO to get URIs for the
faculty members

Relation added to VIVO,
link goes back to catalog

UC4.1 - Identifying related works
Goal: find additional resources beyond those directly
related to any single work using queries or patterns, as
for example changes in illustrations over a series of
editions of a work
• Implementation
o Explored by modeling non-MARC metadata from Cornell Hip
Hop Flyer collection using LinkedBrainz
o Availability of data will influence richness of discoverable
context

Hip Hop flyers
494 flyers, each flyer
describes an event/s
Events can have a
known venue.
Multiple flyers refer
to same venue.
Each event can have
anywhere from 1-20
(plus) performers

Pilot: Linking Hip Hop flyer metadata to
MusicBrainz/LinkedBrainz data
• Model non-MARC metadata from Cornell Hip Hop
Flyer Collection in RDF
o Test LD4L BIBFRAME for describing flyers
originally catalogued using ARTstor’s Shared Shelf
o Use Getty Art & Architecture Thesaurus to create
bf:Work sub-classes
o Test the use of other ontologies for describing
other entities including Event ontology and
Schema.org
• Use of URIs for performers to recursively discover
relationships to other entities via dates, events,
venues, graphic designers, work types and
categories

MusicBrainz
LinkBrainz is RDF
from MusicBrainz
Connects out to
Dbpredia and
broader LOD graph

Reconciling mo:Release with bf:Audio

Takeaways
• Able to map large parts of our metadata to
RDF using multiple ontologies to discover
more relationships to more entities (still
some mapping and reconciliation work to do)
• Largely predicated on manual workflows for
preprocessing, URI lookups, and unstable
software for RDF creation
• Need more URIs for both linking to and
linking from in order to take advantage of
queries and patterns

Assembling* the
LD4L Ontology
* Note “Assembling” not “Creating”

BIBFRAME1 basic
entities and
relationships
http://bibframe.org/vocab-model/
• Creative work
• Instance
• Authority
• Annotation

A number of issues with BIBFRAME1
Some linked data best practices highlighted in the
Sanderson report:
• Clarify and limit scope
• Use URIs in place of strings (identification of the resource itself
vs. resource description)
• Reuse existing vocabularies and relate new terms to existing
ones
• Only define what matters (and inverse relationships do)
• Remove authorities as entities in favor of real world URIs
• Reuse the Open Annotation ontology vs. reinventing the wheel
 Use BIBFRAME where possible,
mix in other ontologies

Linked Data for Libraries: Experiments between Cornell, Harvard and Stanford

Use foaf:Person and
foaf:Organization (subclasses of
foaf:Agent) instead of
BIBFRAME1 classes because we
want identities not authorities,
and to reuse common
vocabularies

Using schema:Event and
prov:Location to explore
particular use case of
model for Afrika
Bambaataa collection

Photo: James Cridland https://www.flickr.com/photos/jamescridland/613445810

Cross institutional StackScore
• Builds on StackScore work at Harvard
• Have computed anonymous scores at Cornell
• Represent scores as annotations on Works/Instances
Open issues:
• Best ways to
calculate?
• Cross institutional
normalization?
• How to integrate
with UX?

Normalizing StackScores
Data: https://github.com/ld4l/ld4l-cul-usage
Shared normalization has about
0.001% (1 in 100,000) items for
each of the top scores (ie.
around 100 from each
institution)
Vast majority of items have
lowest StackScore.
Is this useful?

Plumbing
Photo: Tony Hisgett https://www.flickr.com/photos/hisgett/3365087837

LD4L data transformationMARC
XML
Pre-
processor
MARC
XML
LC MARC to
BIBFRAME
BF RDF
(disjoint)
Post-
processor
LD4L
LOD
MARC21
OCLC
works

XML
Pre-
processor
MARC
XML
LC MARC to
BIBFRAME
BF RDF
(disjoint)
Post-
processor
LD4L
LOD
MARC21
OCLC
works
Clean data, normalize
local practices

MARC
XML
Pre-
processor
MARC
XML
LC MARC to
BIBFRAME
BF RDF
(disjoint)
Post-
processor
LD4L
LOD
MARC21
OCLC
works
LD4L data transformation
Unmodified LC converter:
https://github.com/lcnetdev/marc2
bibframe

MARC
XML
Pre-
processor
MARC
XML
LC MARC to
BIBFRAME
BF RDF
(disjoint)
Post-
processor
LD4L
LOD
MARC21
OCLC
works
LD4L data transformation
Match up
• Worldcat ids for instances
• Authorized names for persons
• FAST subject headings
• Strings to ids by rules
BF -> LD4L ontology
OCLC data to combine works

XML
Pre-
processor
MARC
XML
LC MARC to
BIBFRAME
BF RDF
(disjoint)
Post-
processor
LD4L
LOD
Proﬁles (VIVO/
CAP/FF)
Dbpedia
VIAF
ORCID
…
MARC21
OCLC
works

Future processing challenges
• Join with VIVO/CAP/Profiles data as a coherent,
richer local authority picture
• Extend to full variety of different types of catalog
records
• Address issues of entity resolution and linking in the
real world for works, people, organizations, events,
places, and subjects
• Integrate with other linked data sources via common
global identifiers and shared ontologies

Triplestores – Very small load (1)

Triplestores – Very small load (2)
BANG!

Triplestores – Slightly larger load (3)

Triplestores – Billion triple loads
1 billion triples loaded in
~1day, small machine
Will try 3 billion (all three
catalogs) on large AWS
instance

Triplestores - AllegroGraph @ Stanford
• Using AllegroGraph developer license with 500M triple limit
o would have loaded full 650M triples in absence of limit
• Running on 64GB machine, 4 cores, 2 threads
o java settings: -Xms24G –Xmx 52G
• Divided data from 7.5M bibliographic records into 75 RDF/XML
files, 1GB each
• Average 1h to process and load each -> 2k triples/sec but
expect that is dominated by RDF/XML parse
• Total 3 day load time – painful but repeatable
• Lots of web management tools, visualization, full-text indexing,
user and permission handling
• SPARQL relatively fast and even ones with very large result sets
complete given time
[Thanks to Joshua Greben @ Stanford for summary]

From triplestore to index
• Goal of triplestore load is to be able to merge data
and analyze
• Do NOT expect end user performance => build Solr
index
• Initial tests suggested 2-weeks to build Solr index for
just Cornell data
• Will use large AWS instance to build Blacklight index
for Cornell + Harvard + Stanford data
• Various optimization possibilities but try simple
approach first

Bibliographic Data
• MARC
• MODS
• EAD
Person
Data
• CAP,
FF,
VIVO
• ORCID
• VIAF,
LC
Usage
Data
LD4L data sources revisited
Looking to relate three classes of data from across
three different institutions. Different progress on
different fronts, most with bibliographic data

Project Outcomes
• Open source extensible LD4L ontology compatible
with VIVO ontology, BIBFRAME, and other existing
library LOD efforts
• Open source LD4L semantic editing and display
• Implementation of virtual collections compatible
with Project Hydra using ActiveTriples
• Demonstration Blacklight search across multiple
LD4L instances
• Draft LOD dumps of augmented catalog data from
Harvard, Stanford and Cornell

More Info: http://ld4l.org
Code: https://github.com/ld4l
Data (soon): http://draft.ld4l.org
Project team outside the now-demolished
Myer Library, Stanford, Summer 2014

Linked Data for Libraries: Experiments between Cornell, Harvard and Stanford

More Related Content

Linked Data for Libraries: Experiments between Cornell, Harvard and Stanford

Editor's Notes