Core webinar updated 30-05-2020

Dr Mayank Trivedi
University Librarian & Senate Member
CORE Ambassador
The Maharaja Sayajirao University of Baroda
Date : 30th May, 2020
Email : librarian-hml@msubaroda.ac.in

 Open access digital archive on open source software.
 A managed, persistent way of making research,
learning and teaching content with continuing value
both discoverable and accessible.
 Repositories can be subject or institutional in their
focus.
 Putting content into an institutional repository
enables staff and institutions to manage and preserve
it, and therefore derive maximum value from it.
 A repository can support research, learning, and
administrative processes.
 The university is building a repository to house the
scholarly output of staff called DORA De Montfort
Open Research Archive. Alternatively a suitable
subject based repository may be available.

 Clifford Lynch, Executive Director, Coalition for Networked
Information, stated.
 “In my view, a university-based institutional
repository is a set of services that a university
offers to the members of its community for the
management and dissemination of digital materials
created by the institution and its community
members. It is most essentially an organizational
commitment to the stewardship of these digital
materials, including long-term preservation where
appropriate, as well as organization and access or
distribution.”
 ARL: A Bimonthly Report, no. 226 (February 2003)

 Centered around a university (other academic
institution) and contain items which are the
scholarly output of that institution.
 A collection of (digital) objects, in a variety of
formats Include works of various degrees of
scholarly authority and from various stages in the
process of scholarly inquiry. In addition to
published works, an IR may include preprints,
theses & dissertations, images, data sets,
working papers, course material, or anything else
a contributor deposits.
 Typically motivated by a commitment to open
access.

 Institutions are logical implementers of repositories because they can
take responsibility for:
– Centralizing a distributed activity
– Framework and Infrastructure
– Permanence that can sustain changes
– Stewardship of Digital assets
– Preservation policy for long term access
– Provide central digital showcase for the research, teaching and
scholarship of the institution
 Institutional repositories were originally developed to provide a
solution for the collection, preservation and dissemination of the
research output created at universities and research institutions.
 The importance of such repositories was quickly recognized since they
provided the means for the institutions to showcase their academic
work.
 Using open source software gave the ability to developers
worldwide to build custom features and functionalities for the
systems.

Institutional Repositories
 Are organized around a
particular institutional
community
 Often are dependent upon the
voluntary contribution of
materials by scholars for the
content in their collection
 Are mainly repositories and
therefore may only offer
limited user services
E.g. Dspace@MIT, IR@MSU
Digital Libraries
 May be built around any
number of organizing
principles (often topic, subject,
or discipline)
 Are the product of a deliberate
collection development policy
 Typically include an important
service aspect (reference and
research assistance,
interpretive content, or special
resources.)
E.g. NDLI

 Content is deposited in a repository – by
content creator, owner etc.
 Repository architecture manages the content
and the metadata
 Repository software offers a minimum set of
basic services – put, get, search
 Repository must be sustainable, trusted, well
supported and well-managed

 Faculty
 Pre-prints, post-prints, research
findings, working papers, technical
reports, conference papers
 Multimedia, videos, teaching materials,
learning objects
 Data sets (scientific, demographic, etc.)
and other ancillary research material
 Web-based presentations, exhibits, etc.
 Journals and journal articles
 Books and book chapters
 Technical reports
 Working papers
 Standards
 Theses and dissertations
 Reviews
 Components (e.g. graphs, figures,
audio, video files)
 Departmental and research center
newsletters and bulletins
 Papers in support of grant applications
 Statistical reports
 Surveys
 Students
 Theses and dissertations
 Projects and portfolios
 Awarded research
 Performances and recitals
 University
 Annual Reports
 Old Question Papers
 Curriculum
 Course Modules

What is open access (OA)?
 Many definitions – a report from the Joint Information
Systems Committee (JISC) in the UK of 2006 stated:
 The Open Access research literature is composed of free,
online copies of peer-reviewed journal articles and
conference papers as well as technical reports, theses and
working papers. In most cases there are no licensing
restrictions on their use by readers. They can therefore be
used freely for research, teaching and other purposes.
(http://www.jisc.ac.uk/publications/publications/pub_ope
naccess_v2.aspx)
 An Open access institutional repository is that repository
where are contents are freely available for use.

 There are various misunderstandings about
Open Access. It is not self-publishing, nor a
way to bypass peer-review and publication,
nor is it a kind of second-class, cut-price
publishing route. It is simply a means to
make research results freely available online
to the whole research community.

 Gold OA - uses a funding model that does not
charge readers or their institutions for access
e.g. Ariadne, Dlib Magazine and First Monday
 Green OA - authors publish papers in one of
the 25,000 or so refereed journals in all
disciplines and then self archive these papers
in open access/digital/institutional
repositories.

Core webinar updated 30-05-2020

Open Archives Initiative-
Protocol for Metadata
Harvesting (OAI-PMH)
Digital Library

 An institutional repository is a tangible indicator of
research output of a university – thus increasing its
visibility, prestige and public value
 Repository content is readily searchable – both locally and
globally
 Can be used as a marketing tool for the institution
 Allows an institution to manage its Intellectual Property
Rights appropriately
 An institutional repository benefits the institution, the
researcher and anyone interested in scholarly outputs.
 From an institutional perspective it provides a record of
scholarly activity taking place within the university.
 For a researcher, it creates stable and reliable records of
your work, managed and stored in ways which meet
international technical standards.
 Each item in the repository has a unique Internet address
(called a Handle) and it can be found easily on major
search engines.
 And finally, increased access to scholarly knowledge is a
benefit to all.

 Scholarly communication
 Storing learning materials and coursework
 Managing collections of research documents
 Preserving digital materials for the long term
 Knowledge management
 Electronic publishing
 Research assessment exercise
 Collaboration tool

 For researchers
 Showcase your institute’s
output
 Increases citation for authors
 24-hour access through any
web-enabled device
 Life’s work in one location
 Satisfies funder’s mandates
 Persistent URLs
 For librarians
 Provides new ways for
archiving & preserving valuable
work
 Time-saving and cost-effective
 Help to identify trends
 Reduce duplication of records
 For the university
 An effective marketing tool
 Increase the visibility,
reputation and prestige
 Greater interdisciplinary
research
 Enhanced funding
 Facilitates gathering data such
as publications for
Assessments
 For the global community
 Free access of scholarly
information
 Taxpayers fund a large amount
of scientific research
 Developing countries
 Increase public knowledge
 Gain access to a wide variety of
materials

 Increase in citations, impact and usage
(useful for research evaluations)
 Increase in public research profile – both for
the individual as well as the institution
 Preservation of research outputs from the
institution

 Steps to Building an IR
 1. Justify the relevance to the institution and
contributors
 2. Develop a policy framework. How will we find this
content and what will we do with it?
 3. Build the infrastructure
 Bonus: Get institutional support and a mandate.
 IR Technology
 IR software(Dspace)
 OAI-PMH harvesting protocol/software (Free)
 Intel/Pentium servers for IR
 Linux/Red Hat OS, MySQL/PostGre DBMS,
 Apache/Tomcat web server, Perl/Java (Free)

 Core issues
◦ • Policy Decisions
◦ • Organizational Issues
◦ • Cultural Issues
 Key Issues:
◦ Faculty buy-in
◦ Submission polices
◦ Copyright issues
◦ Deposit types
◦ Metadata
◦ OAI-PMH compliant systems
◦ Specialized staff
◦ Outreach and Liaison services

 Scope : Reinforce the repository’s active support for the
institution’s mission, values and goals
- Identify/build a context in which the repository is
necessary
- Multidiscipline / single subject /Entire research
output/database for each functional unit
 Types of documents
◦ Single database for different types /single one
 Software: OSS like DSpace or GNU Eprints or develop own
 Research Deposit Types: Thesis, Journal articles, Preprints,
Reports, Conference papers, Book Chapter, etc
 Resources: Human, IT, Funding
 Stake holders: Library, Each Department, Institute as a whole
 Services : Focus on building services not collections

 Deposit options
- Researcher self deposit and /or assisted deposit
 Metadata quality
- Ensuring quality and rich metadata is labour intensive
 Digitization : Born digital / Scanning
 File formats : Accept all, Only PDF and/or other, Conversion
 Only full text database and/or Bibliographic
 Copyright : RoMeO Publishers Copyright policies
 Quality assurance: Peer review, Editing
 Deposit Agreement and Use Agreement
- Depositor’s declaration: Non-exclusive license
- Copyright/Patent/Trademarks
- Repository’s rights and responsibilities: Distribute, Store,
Migrate, Copy Rearrange, Remove
- Use Agreement: Copy, Distribute, Display, Share, Author
credit

 Advocacy
- Sensitive to organizational culture and
background
- Community size
- Strategy: stakeholders, management
committees
 Copyright
- Concern of researchers, Legal department
 Positioning
- Library/Institute Website

Produced by Berkeley Electronic Press
(bepress), focused on maintaining
scholarly output. Not open source.
Developed at the University of Southampton (UK).
Widely considered to be the least complex of the
major repository software platforms.
Developed at Cornell and University of
Virginia. Based on a framework known as
the Flexible Extensible Digital Object and
Repository Framework.
Designed by MIT and Hewlett-Packard to manage
the intellectual output of research institutions and
provide for long-term preservation.

 Significant subject repositories include many using e-Prints or DSpace
software:
 ArXiv - http://www.arxiv.cornell.edu/ (physics, mathematics, nonlinear
 science and computer science)
 Cogprints - http://cogprints.ecs.soton.ac.uk/ (Cognitive sciences
 including psychology, neuroscience, linguistics and other related areas)
 CiteSeer - http://citeseer.nj.nec.com/cs (computer science)
 HTP Prints - http://htpprints.yorku.ca/ (History and theory of psychology)
 PubMedCentral - http://www.pubmedcentral.nih.gov/ (US National
Library of Medicine's digital archive of life sciences journal literature)
 PhilSci Archive - http://philsci-archive.pitt.edu/ (philosophy of science)
 E-LIS - http://eprints.rclis.org/ (library and information science)
 RePEc (Research Papers in Economics)

256
96
82
74
72
66
57
52
49
48
14.69
5.51
4.7
4.25
4.13
3.79
3.27
2.98
2.81
2.75
0
50
100
150
200
250
300
Usage of IR During COVID-19
Total Searches :- 1743
Searches % of Total

 The OpenDOAR service provides a quality-
assured listing of open access repositories
around the world.
 OpenDOAR staff harvest and assign metadata to
allow categorisation and analysis to assist the
wider use and exploitation of repositories.
 Each of the repositories has been visited by
OpenDOAR staff to ensure a high degree of
quality and consistency in the information
provided.
 OpenDOAR is maintained by SHERPA consortium
staff at the University of Nottingham, UK

 Aims to monitor overall growth in the number
of eprint archives and to maintain a list of GNU
EPrints sites (http://roar.eprints.org)
 Available from Southampton University, UK
 Data gathered automatically via OAI-PMH
 Also ROAR Materials Archiving Policies –
ROARMAP - 181 Institutional repositories
(including The M S University, Baroda,
INFLIBNET, Bharathidasan University in India)

 Nearly 94 Institutions
◦ Dspace – 54
◦ Eprints - 21
◦ Greenstone – 12
◦ Inhouse - 7
 Leading IRs
 IISc, ISI, NAL, NCL, NIO, RRI, DU, IITs

 Preserving natural/cultural heritages
 For promoting academic research
 Enabling public access to legacy collections
 Technological change
 Siginificant increase in the overall volume of research
 Increasing need for archival and access to unpublished information
 Increasing demand to access knowledge objects from anywhere at
anytime (24x7)
 Increasing uncertainty over who will handle the preservation and
archiving of digital scholarly research materials.
 Born digital Content(e-only)
 Research Data Management(RDM)
Need for Digital Preservation

 The maintenance of digital materials over the
long-term with a view to ensuring its
continued accessibility.
 It ensures that the digital resources are
stored correctly and maintained adequately in
the online world, such that they are available
consistently for use over time.
 “Long-term” includes timescales of decades
or even centuries.

 Primary Sources of
Information
 Electronic Conferences
 Courseware/Tutorials/G
uides/Manuals
 Electronic Journals
 Patents
 Electronic Preprints and
E-prints
 Projects (Ongoing and
Completed)
 Science/Research News
 Software
 Standards
 Technical Reports
 Electronic Theses and
Dissertation
 Secondary source of Information
-Databases, Data sets and
Collections
- Abstracting and Indexing
Databases (Bibliographic
Databases)
- Citation Databases
- Digital Collections (Images,
Audio, Video)
- Equipment/Product Catalogues
- Scientific Data sets (Numeric,
Property and Structural
Databases)
- Library Catalogues (including
Union Catalogues)
- Virtual Libraries

 Tertiary Source of Information
◦ Electronic Books, Online Book Selling and Print-on-Demand,
Pocket PCs
◦ E-Books on the web
◦ Reference Sources
◦ Dictionaries
◦ Electronic Encyclopaedia
◦ Biographies
◦ Acronyms and Abbreviations
◦ Thesauri and Subject Headings
◦ Handbooks and Manuals
◦ Maps
◦ Employment / Career Sources
◦ Funding / Grants Sources
◦ People/ Experts/Scientist Directories

 Website:
http://core.ac.uk
 Twitter: @oacore

• An aggregation service with more than a thousand data providers
• CORE aggregates open access journals and repositories (subject
and institutional)
• The collected data are available to everyone via the CORE search
engine
• Supports text and data mining of the aggregated content via the
API and the Data Dumps
• Provides a recommendation tool for repositories and journal
software
• Seamless access to the world’s open access research papers
• Aggregate all open access articles across relevant data sources
worldwide, enrich this content and provide seamless access to it
through a set of data services …

 World’s largest open access articles collection
 49TB of data
 3,648 Data providers
 ~ 50k Journals
 > 131,302,881 Metadata
 > 93,348,838 Abstracts
 > 11,250,039 Hosted full texts
 > 78,781,218 Links to full text

 Institutional repositories (IRs) serve a number
of purposes; such collecting and curating
digital outputs, providing statistics, research
excellence, etc.
 The primary goal of repositories is to open
and disseminate research outputs to a
worldwide audience (Crow, 2002) – SPARC’s
position paper on the case for institutional
repositories.

 “For the repository to provide access to the broader
research community, users outside the university must be
able to find and retrieve information from the repository.
Therefore, institutional repository systems must be able to
support interoperability in order to provide access via
multiple search engines and other discovery tools.
 An institution does not necessarily need to implement
searching and indexing functionality to satisfy this
demand:
◦ it could simply maintain and expose metadata, allowing other
services to harvest and search the content. This simplicity lowers
the barrier to repository operation for many institutions, as it only
requires a file system to hold the content and the ability to create
and share metadata with external systems.”

 “Each individual repository is of limited value for
research: the real power of Open Access lies in the
possibility of connecting and trying together
repositories, which is why we need interoperability.
 In order to create a seamless layer of content through
connected repositories from around the world, Open
Access relies on interoperability, the ability for
systems to communicate with each other and pass
information back and forth in a usable format.
 Interoperability allows us to exploit today's
computational power so that we can aggregate, data
mine, create new tools and services, and generate
new knowledge from repository content.’’ [COAR
manifesto]

• OAI-PMH
• ResourceSync
• Bespoke APIs of Publisher
CORE Pipeline
• Metadata download, extraction and harmonisation
• Full text download
• Text extractions and sections extraction
• Metadata validation
• Thumbnails generation
• References and citation contexts extraction
• API enrichment (e.g. finding DOIs, linking to other systems)
• Document type classification
• Deduplication
• Indexing
• Exposing (data dumps, API, ResourceSync)

 OAI-PMH supports representing metadata in
multiple formats, but at a minimum
repositories must be able to return records
with metadata expressed in the Dublin Core
format (OAI-PMH v2.0, 2008)
 If repositories want to satisfy the SPARC
guidelines (Crow, 2002), they must provide a
link to the content as part of the exposed
metadata.

 The OAI-PMH specification states on this topic
that: “The nature of a resource identifier is
outside the scope of the OAIPMH.
 To facilitate access to the resource associated
with harvested metadata, repositories should use
an element in metadata records to establish a
linkage between the record (and the identifier of
its item) and the identifier (URL, URN, DOI, etc.)
of the associated resource.
 The mandatory Dublin Core format provides the
identifier element that should be used for this
purpose.”

 OAI-PMH metadata harvesting
 Locating full-text
 Focused crawling (to locate full-texts)
 Focused crawling (driven by citation analysis)
 Aggregations need access to content, not just
metadata!
 Certain metadata types can be created only at the
level of the aggregation
 Certain metadata can be changing in time
 Ensuring content:
◦ accessibility • availability • validity • quality

 CORE Portal
 CORE Mobile
 CORE Plugin
 CORE API
 Repository Analytics

 • Content
• 7M records
 • 230 repositories
 • 402k full-texts
 • 1TB of data
 • 40GB large index
 • 35 million RDF triples in the CORE LOD repository
 • Started: February 2011
 • Budget: 140k£

 Link to the CORE search engine from your library
 Search the CORE engine to help library users
 Let your users know about CORE
 Discover content in the CORE search engine
 Use the CORE API https://core.ac.uk/services#api
and Dataset to apply TDM practices
https://core.ac.uk/services#dataset
 AS A REPOSITORY MANAGER
◦ install the CORE Recommender and register for the
Repository Dashboard to manage your content in CORE
 AS A DEVELOPER OR COMPANY
◦ subscribe to our API, Dataset or FastSync services

CORE has already more than 1,000 data providers
https://core.ac.uk/dataproviders
 Join CORE as a data provider https://core.ac.uk/join
 Ask your library to list the CORE search engine in
the library pages
 When you become a CORE data provider request
access to the CORE Dashboard
https://core.ac.uk/services#dashboard
 Install the CORE recommender in your repository
https://core.ac.uk/services#reccomender

 Extend the current metadata enrichment service of the CORE
system
 Classify research papers according to their subject
 Development of APIs to increase interoperability and reusability
 Develop good practice for publishing content in OARs
 Clustering content based on different semantic relations,
associating research areas/topics with each cluster
 Developing citation networks and analysing relationships with
semantic similarity of resources
 Developing formal methods for analysing the impact of
publications and authors
 Mining trends as temporal relations of topics
 Analysing inter-topic relationships
 To provide a platform for the delivery of Open Access content
aggregated from multiple sources and to deliver a wide range of
services on top of this aggregation.
 A nation-wide aggregation system that will improve the
discovery of publications stored in British Open Access
Repositories (OARs).

 Objective :
… the COnnecting REpositories (CORE) project aims
to facilitate the access and navigation across relevant
scientific papers, stored in British Open Access
repositories using Natural Language Processing and
Linked Data …
 - 2.5 Million RDF triples and growing - 143 British
Open Access repositories (more than 1,900
worldwide) - More than 50k full-text papers and
growing (more than 3 million worldwide)
 Schema
 - MuSim ontology (similarity) - Bibo ontology - OAI
Linked Data repository

 Recommender :A leading research papers recommendation
Solution - A plugin for repositories, journal systems and web
interfaces to suggest similar articles, supporting users in
discovering similar articles
 Discovery : One-click access to free copies of research papers
whenever you hit the paywall- CORE Discovery assists users in
finding freely accessible copies of research papers that are
behind a paywall. Access millions of full text open access papers.
 Repository Dashboard : Manage your repository outputs and
ensure they are visible to the world. A service designed for our
data providers. It provides an online interface offering valuable
technical information and statistics to content providers.
 With its analytical tools, CORE aims to contribute to the cultural change
and promotion of open access, supporting a variety of stakeholders,
such as researchers, libraries, software developers, funders and more.

 Faceted searching
 What you find is what you can download
 Host the full text content
 CORE is not here to replace Google Scholar, It
serve a different community
 >22 million yearly unique visitors
 The world’s largest aggregator of open
access research papers
 CORE provides free and seamless access to
millions of research articles aggregated from
thousands of Open Access data providers,
such as repositories and journals.
 >60 registered repositories

 Recommending relevant content to users
from across the CORE’s content
 Recommender plugin for repositories
 15 repositories using it, including Cambridge
University

 Free online availability of research outputs (not metadata of
research outputs) is the main goal of open access (OA).
 Repositories are one of the recommended ways to achieve this
(BOAI,2002).
 Despite large amount of OA content already available online
(Laakso & Bjork, 2012), OA content is not necessarily easily
discoverable (Morrisson, 2012; Konkiel, 2012).
 Often available, but difficult to find …
 Inhibiting the OA impact, accessibility, discoverability, reuse
 Discoverability of OA content on the Web can be dramatically
increased by adopting two simple principles!
 From Open Access Metadata to Open Access Content
 Two Principles for Increased Visibility of Open Access Content

 It is beyond human capacities to read all
scientific literature
 Example use cases in which CORE is applied:
- Improving discovery
- Plagiarism detection
-Question answering in science
-Literature based discovery
-Fact checking and detection of misinformation
-Analysing research trends
-Finding experts in a particular domain
-Research evaluation and scientometrics
-Exploratory and visual search

 Main source of data for Naver Academic
 IRIS.AI: A start up of 20 people entirely
developing tools to help researchers be more
productive. Highly dependent on CORE API.
 ProQuest: CORE supplies data about OA
publications to improve discovery of OA
content in Primo and Summon (new)
additional discovery collaborations on the
way.
 Providing a key API for OA Button

Programmable
access
-API
-Dataset
Researchers,
Developers,
Companies
Transaction
access
-CORE Search
-CORE
Recommender
Researchers,
Students, Life
long learners
Analytical
Information
Access
-CORE
Dashboard
Data
Providers,
Funders
Source:
http://www.dlib.org/dlib/november12/knoth/11knoth.html

 The institutional repository is like a roach motel. Data goes in, but it doesn’t come
out.” (Salo, 2008)
 Lower accessibility of papers (we have them, but cannot find them)
 Text-mining, Cannot monitor growth
 Loosing a strong argument for the adoption of OA!
 Repository systems are currently highly used in education and research
environments; hence they provide a solid base for a global infrastructure of
research.
 However most of the platforms used in the repositories are using out of date
technology which was designed two decades ago.
 For this reason, repositories have not yet reached their full potential and they are
mainly used for publishing the research output of the institutions.
 Working Group to “identify the core functionalities for the next generation of
repositories. These functionalities include more web-friendly architectures,
embedding repositories into the workflow of researchers, open peer review and
quality assessment of content, and better impact and usage measures.”


 According to COAR the exact vision for Next Generation
Repositories is
 “to position repositories as the foundation for a distributed,
globally networked infrastructure for scholarly communication,
on top of which layers of value-added services will be deployed,
thereby transforming the system, making it more research-
centric, open to and supportive of innovation, while also
collectively managed by the scholarly community”.
 These new functionalities aim to transform repositories into
the foundation for a distributed, globally networked
infrastructure for scholarly communication.
 Layers of added value services can be deployed on this
infrastructure which make a repository more research
oriented, innovative, while providing open peer review and
quality assessment of the content and better impact and
usage measures

 Proportion of OA content that can be harvested is fairly low in comparison to
metadata
 Inhibiting the benefits of OA - Interoperability
 Two principles:
1) Dereferencable identifiers - Open Repositories provide open access to content
and not just meatadata
2) Machine access – Open Repositories should provide free access to content (for
anybody, but mainly researchers)
• Compliance validation tools are needed
 The rapid growth of OA content provides great opportunity for text-mining.
 Aggregations need to aggregate content, not just metadata.
 Aggregations should serve the needs of different user groups including researchers
who need access to data. CORE aims to support them.
 Global world’s largest aggregator of open access papers
 A suite of data services for a variety of stakeholders
 Large user base
 Moving fast from research to a product, Always innovate.
 Excellent value for money.

 ……..Thank You
 Stay Safe and Take Care…
 PPTs will be available on :
https://www.slideshare.net/DrTrivedi1
https://www.slideshare.net/mayanktrivedi2
1/presentations

Core webinar updated 30-05-2020

More Related Content

Core webinar updated 30-05-2020