This presentation discusses how structured data, together with semantically indexed and mined entities in semi-structured and unstructured data, contribute to research beyond libraries, especially in digital humanities. It aims to explore opportunities and strategies to use, reuse, share, and effectively elaborate the smart data, generated or to be generated, in libraries.
1. Subject Access, Smart Data, and
Digital Humanities
– Finding Unlimited Opportunities through
their Intersections
Marcia Lei Zeng
Kent State University
Keynote @ IFLA Classification & Indexing Satellite Conference
2016 August 11-12, Columbus, OH, USA
http://marciazeng.slis.kent.edu/
2. Outline
• I. Background
• II. Subject Access -- Finding the Unlimited
Opportunities
• III. The Importance of Knowledge
Organization Systems (KOS) for Effective
Subject Access
M Zeng - IFLA Classification & Indexing
Satellite Conference 2016
8/11/16
3. I. Background
What do I mean…
1) Subject access
– in the context of today’s environments
2) Smart data
– in the context of Big Data
3) Digital Humanities
– in the context of heritage institutions’ data
4. What is happening around us?
• The 2nd generation of the Web:
the Semantic Web
– Search engines’ involvement,
– maturing of Linked Data technologies,
– non-traditional databases
• “Big Data”
– Government funding opportunities,
– Blooming of the ‘data analytics’ profession
• Modern AI (artificial intelligence)
– Machine-learning
– Contextual computing
• Participatory culture
– Social media
– Engaging end-users in the workflow
https://www.w3.org/2002/Talks/www2002-w3ct-swintro-em/slide7-0.html
5. Source: Nova Spivack, Radar Networks; John Breslin, DERI; & Mills Davis, Project10X,
2007, 2008 Copyright MILLS•DAVIS
Web 1.0: connecting information and getting on the net.
Web 2.0: connecting people — putting the “I” in user interface, and the “we” into Webs of social participation.
Web 3.0: connecting knowledge -- representing meanings, connecting knowledge, and putting these to work in ways that make our experience of the internet more relevant, useful, and enjoyable.
Web 4.0: connecting intelligence -- connecting intelligences in a ubiquitous Web where both people and things reason and communicate together.
6. Big data
• Volume (data quantity)
• Velocity (data speed)
• Variety (data types & nature)
• Variability (data consistency)
• Veracity (data quality)
• Complexity
Source: Kobielus, James. 2016. The Evolution of Big Data to Smart Data. Keynote at Smart Data Online 2016.
Source: Big Data. Wikipedia.
SAS Institute Inc. [2014]. Big Data:
What it is and why it matters.
Smart Data
= Ability to achieve big
insights from such data
at any scale, great or small.
7. Why Smart Data
• “However, in its raw form, data is just like crude oil; it needs to be refined and
processed in order to generate real value. Data has to be cleaned, transformed, and
analyzed to unlock its hidden potential.” (TiECON East. Data is new oil.)
• Once tamed through organizing and integrating processes, large volumes of
unstructured, semi-structured, and structured data are turned into “smart data” that
reflect the research priorities of a particular discipline or field.
• Smart data inquiries can then be used to provide comprehensive analyses and
generate new products and services.
Sources:
Gardner, D, 2012.
Prithwis Mukerjee, 2014
Schöch, 2013.
TiECON East, 2014.
8. What can we do to avoid an asthma episode?
Real-time health signals from personal level (e.g., Wheezometer, NO in breath,
accelerometer, microphone), public health (e.g., CDC, Hospital EMR), and population level
(e.g., pollen level, CO2) arriving continuously in fine grained samples potentially with missing
information and uneven sampling frequencies.
Variety, Volume, Veracity, Velocity, Value
What risk factors influence asthma control?
What is the contribution of each risk factor?
semantics
WHY Big Data to Smart Data: Asthma example
Slide from: Sheth, Amit. 2014. Transforming Big Data into Smart Data: Deriving Value via harnessing Volume, Variety and Velocity
using semantics and Semantic Web.
Understanding relationships between
health signals and asthma attacks
for providing actionable information
9. Data (in humanities)
Big data
• unstructured
• messy
• implicit
• relatively large in volume
• varied in form
Smart data
• semi-structured or structured
• clean
• explicit and enriched (raw data + markup, annotations and metadata)
• relatively small in volume
• of limited heterogeneity
• The creation involves human agency & demands time
• The process of modeling the data is essential
Compiled based on Schöch, Christof. 2013. Big? Smart?
Clean? Messy? Data in the humanities. Journal for Digital
Humanities. 2(3)
What about LAMs?
10. LAM data examples
Structured
• National bibliographies
• Catalogs
• Special collection portals
• Registries
• Metadata for datasets
• …
Semi-structured
• Text Encoding Initiative (TEI) files
• Finding Aids
• Value added/tagged resources
• Unstructured portion within metadata descriptions
Unstructured
• Digitized materials, textual or non-textual
• Original information-bearing objects
• Documents in all kinds of formats
• …
• Data from Web crawling that need to be cleaned
• … …
“Smart Data” emphasizes the
organizing and integrating processes
from unstructured data to structured
and semi-structured data, to make the
big data smarter.
- Schöch, 2013
11. Digital humanities:
• the field is still expanding,
• the definitions are being debated, and
• the multifaceted landscape is yet to be fully
understood.
• Most agree that initiatives and activities in digital
humanities are at the intersection between the
humanities and digital information technology.
• The field applies big data mathematical research
techniques to the description and analysis of cultural
objects—including art, literature, and technological
artifacts themselves.
Image source: Katherine Hayles
http://dtc-wsuv.org/wp/dtc375-scodi/katherine-hayles/
• Svensson, P. 2010. The Landscape of Digital Humanities. Digital
Humanities Quarterly. 4(1).
• Svensson, P. 2009. Humanities Computing as Digital Humanities.
Digital Humanities Quarterly. 3(3)
12. Advanced technologies now allow researchers:
(under the umbrella of Big Data and the Semantic Web)
• to access and reuse large volumes of diverse data,
• to discover patterns and connections formerly hidden
from view,
• to reconstruct the past,
• to discover impacts in real and virtual environments, and
• to bring the complex intricacies of innovations to light,
all as never before.
Image source: http://goo.gl/a4gZsd
Image source: Schöch, 2013.
13. Think:
• What kind of data did the project use?
Data sources:
• Freebase (now Wikidata)
• Union List of Artist Names (ULAN®)
• Allgemeines Künstlerlexikon/ Artists of the World
Schich, M. et al. 2014. “A Network Framework of Cultural History.”
Science, 345(6196), 558-562.
Nature Video. (2014, July 31).
Charting culture.
https://www.youtube.com/watch?v=4gIhRkCcD4U
14. Advanced technologies now allow researchers:
(under the umbrella of Big Data and the Semantic Web)
• to access and reuse large volumes of diverse data,
• to discover patterns and connections formerly hidden
from view,
• to reconstruct the past,
• to discover impacts in real and virtual environments, and
• to bring the complex intricacies of innovations to light,
all as never before.
Data provided by LAMs and cultural heritage institutions are
treasures for all humanities researchers.
Trending:
• From machine-readable to machine-understandable data
• From machine-readable to machine-actionable data
• Accurate (error-free) data in the processes of interlinking, citing, transferring, rights-permission, use and reuse, etc.
• One-to-many uses and high-efficiency data processing
http://goo.gl/a4gZsd
15. Digital humanities – Librarian Survey Results, December 2015
http://americanlibrariesmagazine.org/2016/01/04/special-report-digital-humanities-libraries/
Source: http://americanlibrariesmagazine.org/wp-content/uploads/2016/01/digital-humanities-faculty.pdf
16. Digital humanities – Faculty Survey Results, December 2015
Source: http://americanlibrariesmagazine.org/wp-content/uploads/2016/01/digital-humanities-faculty.pdf
17. [Figure: LAM data examples chart with Structured, Semi-structured, and Unstructured groups]
II. Subject Access
-- Finding the Unlimited Opportunities
18. Figure. Overview of Relationships (draft)
Source: http://www.ifla.org/files/assets/cataloguing/frbr-lrm/frbr-lrm_20160225.pdf + revision draft
FRBR-Library Reference Model (LRM)
- World-wide review version
RES: “Any entity in the universe of discourse”
19. Three Perspectives
-- from the creation of the structured data
based on Rose, Gillian. 2013. Visual Methodologies, 3rd Ed.
1. Production
2. Content
3. Audiences’ receiving interests: liked, cited, researched, tagged, searched, shared, followed, time spent, …
• Index
• Markup
• Ontology
• Knowledge base
• Metadata
  • Descriptive
  • Administrative
  • Structural/technical
Image sources:
http://www.mahalo.com/how-to-understand-perspective-in-drawing/
http://judithlondono.com/
http://www.smrfoundation.org/nodexl/
20. OCLC WorldCat Works
– 197 Million Nuggets of Linked Data
-- Since 2014
Figure 1.1: A bibliographic description as a record and a graph.
Read more: Godby, Wang, Mixter. 2015. Library Linked Data in the Cloud – OCLC’s Experiments with New Models of Resource Description. ISBN 9781627052191.
WorldCat Linked Data
https://www.oclc.org/developer/develop/linked-data.en.html
“The bibliographic metadata found
in WorldCat contains a rich set of
objects that can be represented in
linked data.”
Linking THINGS, not strings.
Access through the linked things.
21. BibFrame-based
Read: https://www.denverlibrary.org/blog/rachel-f/dpl-announces-linked-data-launch 2015-06
Try: http://labs.libhub.org/denverpl/
Linking THINGS, not strings.
Access through the linked things.
23. Go to http://dbpedia.org/page/Lois_Mai_Chan and follow knownFor and about
Linking THINGS, not strings.
Access through the linked things.
24. The connected structured data for THINGs from Perspectives #2 and #3:
Linking THINGS, not strings.
Access through the linked things.
25. [Figure: LAM data examples chart with Structured, Semi-structured, and Unstructured groups]
There are many hidden access
points that can bring in much
richer information and
knowledge through LAM data.
26. Big text in humanities
“Big text”
– The text version of “big data”
– Where?
• special collections,
• archives,
• oral histories,
• annual reports,
• provenance indexes,
• inventories,
• … etc.
– How?
• Fact mining, analytics
– What is needed?
Tools
• to ‘mine’ the text,
• to manage extracted
entities as new access
points, and
• to connect with the
outside data.
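One way to picture the second need above, managing extracted entities as new access points, is an inverted index from each entity to the documents in which it was found. The sketch below assumes a fact-mining tool has already returned (document, entity, type) tuples; the identifiers and entities are invented for illustration and do not come from any real extraction run.

```python
from collections import defaultdict

# Hypothetical output of a fact-mining tool: (document id, entity, entity type).
# None of these values come from a real finding aid.
extracted = [
    ("finding-aid-01", "Columbus, Ohio", "City"),
    ("finding-aid-01", "Jane Doe", "Person"),
    ("finding-aid-02", "Jane Doe", "Person"),
    ("finding-aid-02", "Example Steel Company", "Company"),
]


def build_access_points(rows):
    """Map each (entity, type) pair to the sorted list of documents mentioning it."""
    index = defaultdict(set)
    for doc_id, entity, entity_type in rows:
        index[(entity, entity_type)].add(doc_id)
    return {key: sorted(docs) for key, docs in index.items()}


access_points = build_access_points(extracted)
for (entity, entity_type), docs in sorted(access_points.items()):
    print(f"{entity} [{entity_type}]: {', '.join(docs)}")
```

Each key of `access_points` can then serve as a search entry, and could later be linked to an outside identifier once a human has verified the match.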
Source: MarkLogic webinar
http://www.marklogic.com/webinars/
27. 2) Taking archival finding aids as an example
[Figure: the Three Perspectives diagram (1 Production, 2 Content, 3 Audiences’ receiving interests)]
Image source: http://alelemuseum.tripod.com/Archives.html
Image source: https://libraries.usc.edu/article/inside-usc-libraries-grand-avenue-library
28. • Finding aids
• Provide detailed descriptions of a collection's component parts,
• summarize the overall scope of the content,
• convey details about the individuals and organizations involved,
• list box and folder headings.
• The ‘subject’ access is to the whole archive (=Perspective #1)
• Few provide access to the ‘things’ contained in the contents through index terms.
[Images from a finding aid: title page, content page, and index terms page.]
Image source: https://libraries.usc.edu/article/inside-usc-libraries-grand-avenue-library
29. Finding Aids
semantic analysis
using ontology-based tools
by KSU SLIS LOD-LAM team
http://lod-lam.slis.kent.edu/SemanticAnalysis.html
• 45 archival finding aids
• drawn from 16 repositories
• From OpenCalais: extracted 8,096 entities and 336
suggested social tags
OpenCalais and COGITO are
• semantic analysis/fact-mining tools,
• taxonomy and ontology-supported,
• backed by machine learning and natural language processing.
31. Structured data produced by Calais (RDF/XML)
http://www.opencalais.com/
34. The same approach can be used for oral history collections
for enhancing access to the contents of oral history transcript files.
• Currently many are managed at collection level only.
• Only some have deep indexes, with great quality.
• The indexes usually exist in a ‘back-of-the-book’ style and stay within downloadable PDF files.
• The indexes could be used in the collection's subject searching.
• Indexed THINGs could be linked to external resources.
[image of a back-of-the-book style index to an oral history transcript]
[image of a page of the oral history transcript]
35. The same approach can be used for library catalogs
Tool used: OpenCalais
Note: Used only to assist extraction; a human cleaning process is still needed.
36. The same approach can be used for museum object labels and descriptions
COGITO, OpenCalais
http://www.metmuseum.org/toah/works-of-art/2010.312/
38. 3) Taking non-textual objects as examples
[Figure: the Three Perspectives diagram (1 Production, 2 Content, 3 Audiences’ receiving interests)]
39. Portrait of Marcus Aurelius
Online Coins of the Roman Empire (OCRE) - Ontology based, knowledge base
http://numismatics.org/ocre/results
• Modeling in an ontology (formed in classes,
properties, relationships)
• Following Linked Data principles
• Using RDF triples for entities
• Querying in SPARQL language
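The triple model and pattern-based querying listed above can be illustrated without any RDF library. This is a minimal sketch, not OCRE's actual data model: the `ex:` identifiers and property names are made up, and `match` only imitates a single SPARQL triple pattern in which `None` plays the role of a variable.

```python
# Each statement is a (subject, predicate, object) triple.
# All identifiers below are hypothetical, not real OCRE or Nomisma URIs.
triples = [
    ("ex:coin1", "ex:portrait", "ex:marcus_aurelius"),
    ("ex:coin1", "ex:material", "ex:gold"),
    ("ex:coin2", "ex:portrait", "ex:marcus_aurelius"),
    ("ex:coin2", "ex:material", "ex:silver"),
    ("ex:coin3", "ex:portrait", "ex:hadrian"),
]


def match(graph, s=None, p=None, o=None):
    """Return every triple that fits the pattern; None acts as a wildcard."""
    return [
        (ts, tp, to)
        for ts, tp, to in graph
        if (s is None or ts == s)
        and (p is None or tp == p)
        and (o is None or to == o)
    ]


# Roughly: SELECT ?coin WHERE { ?coin ex:portrait ex:marcus_aurelius }
coins = [s for s, _, _ in match(triples, p="ex:portrait", o="ex:marcus_aurelius")]
print(coins)
```

A real knowledge base would use full URIs and a SPARQL engine, but the principle is the same: questions are asked as patterns over linked statements rather than as string searches.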
40. Online Coins of the Roman Empire (OCRE)
http://numismatics.org/ocre/
• Using SPARQL queries to find
• Output as CSV files
• Auto-visualizing using Fusion Tables
• Just needs a few seconds
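The query-to-CSV workflow above can be sketched with the Python standard library. The SPARQL 1.1 Protocol lets a client send a query as the `query` parameter of a GET request and ask for CSV results through the `Accept: text/csv` header. The endpoint URL, the `ex:` vocabulary, and the sample response below are placeholders for illustration, not OCRE's real endpoint or data, and nothing is sent over the network.

```python
import csv
import io
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint and vocabulary; substitute a real SPARQL service to run a live query.
ENDPOINT = "https://example.org/sparql"
QUERY = """
PREFIX ex: <https://example.org/vocab/>
SELECT ?coin ?material WHERE {
  ?coin ex:portrait ex:marcus_aurelius ;
        ex:material ?material .
}
"""


def build_request(endpoint, query):
    """Prepare (but do not send) a SPARQL Protocol GET request asking for CSV."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "text/csv"})


def parse_csv_results(text):
    """Turn a SPARQL CSV result document into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


request = build_request(ENDPOINT, QUERY)

# Invented sample of what a CSV response could look like.
sample_response = "coin,material\r\nex:coin1,gold\r\nex:coin2,silver\r\n"
rows = parse_csv_results(sample_response)
print(rows)
```

To run a live query, the prepared `request` could be passed to `urllib.request.urlopen` and the decoded response body handed to `parse_csv_results`; the resulting rows are then ready for a spreadsheet or visualization tool.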
41. Online Coins of the Roman Empire (OCRE)
http://numismatics.org/ocre/
42. Deep Image Annotation
http://www.synaptica.com/oasis/
43. Clarke, David. 2015. Deep image annotation and Knowledge Organization. ISKO-UK 2015.
/content/deep-image-annotation-and-knowledge-organization
44. Clarke, David. 2015. Deep image annotation and Knowledge Organization. ISKO-UK 2015.
/content/deep-image-annotation-and-knowledge-organization
45. IIIF Image API
See API specifications at: http://iiif.io/technical-details.html
International Image Interoperability Framework
Sanderson, Rob. 2014. Open Repositories 2014: Crowdsourced Transcription via IIIF, slide 9.
API= application programming interface, a set of routines, protocols, and
tools for building software applications.
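As a concrete illustration of such an API, the IIIF Image API addresses an image, or a region of it, through a URI of the form `{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. The helper below is a sketch that only assembles such a URI; the base URL and identifier are hypothetical, and the `max` size keyword assumes Image API 2.1 or later (earlier versions use `full`).

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Assemble an IIIF Image API request URI from its path segments."""
    # The identifier is percent-encoded so that any "/" inside it
    # is not mistaken for a path separator.
    encoded_id = quote(identifier, safe="")
    return f"{base.rstrip('/')}/{encoded_id}/{region}/{size}/{rotation}/{quality}.{fmt}"


# Hypothetical server and identifier, for illustration only.
whole_image = iiif_image_url("https://example.org/iiif", "coins/123")
# Region as x,y,width,height in pixels; size "200," means 200 pixels wide.
detail = iiif_image_url("https://example.org/iiif", "coins/123",
                        region="100,100,400,400", size="200,")
print(whole_image)
print(detail)
```

Because a region of an image can be addressed by URI in this way, an annotation or index term can point at a specific detail of an object rather than at the whole image.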
46. III. The Importance of
Knowledge Organization Systems (KOS)
for Effective Subject Access
Fundamental KOS Approaches
1. Eliminating ambiguity
2. Controlling synonyms or equivalents
3. Making explicit semantic relationships between/among concepts
   – Hierarchical relationships
   – Hierarchical + other associative relationships
4. Presenting relationships between/among concepts as well as properties of concepts
Various Types of KOS
See full picture at http://nkos.slis.kent.edu/KOS_taxonomy.htm
47. Figure. FRBR-LRM Overview of Relationships (draft)
Source: http://www.ifla.org/files/assets/cataloguing/frbr-lrm/frbr-lrm_20160225.pdf + revision draft
Dealing with The Problem of Semantic Conflicts
(inconsistencies in terminology and meanings)
48. Dealing with The Problem of Information Overload
Traditional Filters -- “Filter-out”
• Site (physical or digital) organization and navigation support
• Advanced search functions
• “Umbrella” structures of classification and taxonomy from which to
extend content
• Browsing support—hierarchical structures
Beyond traditional filters -- “Filter-forward”
• Browsing and Filtering to the Front -- Using Faceted Structure
• Connecting Things via Semantic Relations
• Enabling Rediscovery
– Data mining, semantic analysis, machine-learning through expert
feedback, machine reasoning
• LOD KOS Datasets become Knowledge Bases
– obtaining special graphs or datasets for very complicated questions, and
– revealing unknown relationships (e.g., http://vocab.getty.edu/queries#Top-level_Subjects)
• From Machine-readable to Machine-understandable/processable
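The idea of connecting things via semantic relations can be sketched as query expansion over a tiny thesaurus-like KOS: an entry term is first mapped to its preferred term (synonym control), and the search is then widened to all narrower concepts (hierarchical relationships). The vocabulary below is invented for illustration and is not drawn from any published KOS.

```python
# Toy vocabulary: entry terms (synonyms) point to preferred terms,
# and each preferred term lists its narrower terms. All terms are hypothetical.
USE = {"coins": "Numismatics", "currency": "Money"}
NARROWER = {
    "Money": ["Numismatics", "Paper money"],
    "Numismatics": ["Roman coins", "Greek coins"],
}


def preferred(term):
    """Resolve an entry term to its preferred term (synonym control)."""
    return USE.get(term.lower(), term)


def expand(term):
    """Return the preferred term plus all narrower terms, depth-first."""
    result, stack, seen = [], [preferred(term)], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        result.append(current)
        # Reverse so that narrower terms are visited in their listed order.
        stack.extend(reversed(NARROWER.get(current, [])))
    return result


print(expand("currency"))
```

A search for the entry term "currency" would thus also retrieve items indexed under narrower concepts, which is one simple way a KOS supports browsing and filtering forward instead of only filtering out.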
49. Fact: The Increasing Need for KOS
In the BARTOC registry (http://bartoc.org/)
KOS registered: 1836
In the Datahub (https://datahub.io/)
LOD KOS registered: 1251
(about half are ontologies)
(2016.05.27 data)
(2016.03.15 data)
50. Conclusion
Initiatives in digital humanities have demonstrated a paradigm shift in how cultural heritage materials can be searched, mined, displayed, taught, and analyzed utilizing digital technologies.
Data provided by LAMs and cultural heritage institutions are treasures for all humanities researchers.
When subject access, smart data, and digital humanities interact, the opportunities for effective and innovative services and contributions can be endless.
Let’s embrace the new and changing concepts and make these happen.
Thank you!
51. References
• Gardner, D. 2012. An ocean of data [Introduction]. In: Smolan, R., Erwitt, J. (eds.) The human face of big
data, pp. 14-17. Sausalito, CA: Against All Odds Productions.
• Joshi, Kunal. 2013. Big data, data science & fast data. http://www.slideshare.net/kunaljoshi111/big-data-data-science-fast-data
• Kobielus, James. 2016. The Evolution of Big Data to Smart Data. Keynote at Smart Data Online 2016.
• Rose, Gillian. 2013. Visual Methodologies, 3rd. Edition. SAGE Publications Ltd.
• Sanderson, Rob. 2014. Open Repositories 2014: Crowdsourced Transcription via IIIF, slide 9.
http://www.slideshare.net/azaroth42/open-repositories-2014-crowdsourced-transcription-via-iiif
• SAS Institute Inc. [2014]. Big Data: What it is and why it matters. http://www.sas.com/big-data/
• Schöch, Christof. 2013. Big? Smart? Clean? Messy? Data in the humanities. Journal for Digital Humanities.
2(3): 2-13. http://journalofdigitalhumanities.org/2-3/big-smart-clean-messy-data-in-the-humanities/
• Sheth, Amit. 2014. Transforming Big Data into Smart Data: Deriving Value via harnessing Volume, Variety
and Velocity using semantics and Semantic Web. Keynote at 30th IEEE International Conference on Data
Engineering (ICDE) 2014.
• Svensson, Patrik. 2010. The landscape of digital humanities. Digital Humanities Quarterly. 4(1)
http://digitalhumanities.org/dhq/vol/4/1/000080/000080.html
• Svensson, Patrik. 2009. Humanities computing as digital humanities. Digital Humanities Quarterly. 3(3)
http://digitalhumanities.org/dhq/vol/3/3/000065/000065.html
• TiECON East. 2014. Data is new oil. http://www.tieconeast.org/2014/big-data-analytics