This document provides an outline of Lecture 2 from the course GEO 802, Data Information Literacy. It discusses various portals and repositories for publishing and finding data, including discipline-specific repositories, as well as directories and indexes of repositories. It also covers data journals and venues for publishing datasets to get them cited. Finally, it lists some exercises for students to find relevant data repositories in their fields and to explore search tools and open data portals.
1. GEO 802, Data Information Literacy
Fall 2020 – Lecture 2
Gary Seitz, MA
2. Lesson 2 Outline
Portals for data publication
Data Repositories
Discipline-related repositories
Open Data from organizations
Luis Prado from The Noun Project
7. Check Registry of Research Data Repositories
www.re3data.org:
Can you find data repositories in your field?
List 5 data repositories, where you think you could find
data for your thesis.
Exercise 2.1
8. Open Access Directory: Data Repositories
Launched in 2008 and hosted by the Graduate School of Library and
Information Science at Simmons College, the Open Access
Directory is a wiki that links to over 50 open data
repositories in the disciplines of archaeology, biology,
chemistry, environmental sciences, geology, geosciences
and geospatial data, marine sciences, medicine and
physics, as well as multidisciplinary open data
repositories.
Directories of repositories
9. Zenodo
• A research data repository. It was created by OpenAIRE and CERN to
provide a place for researchers to deposit datasets
• Has integration with GitHub to make code hosted in GitHub citable
• Provides secure archiving and citability, including digital object
identifiers (DOIs)
• Easy access
• Disadvantage: No curation, no quality control
Data repositories
10. Dryad
• International repository of data underlying scientific and medical
publications
• All data files are associated with a published article and are made
available for reuse under the terms of a Creative Commons Zero waiver
• Began charging submission fees in September 2013
• Data in Dryad receive a permanent, unique digital object identifier (DOI)
Data repositories
12. figshare
• Repository for data and files (figures, datasets, images, audio and video, articles
(including preprints), posters, software and file sets)
• Advantages: items are assigned a DOI; researchers can publish negative
data; download statistics for hosted materials are tracked and in turn serve
as a source for altmetrics; partnership with PLOS
• Disadvantage: operated by Macmillan (Nature)
Data repositories
16. GitHub, Protocols.io, DataCite
GitHub: a web-based Git version control repository and Internet hosting
service. It offers all of the distributed version control and source code
management (SCM) functionality of Git and adds its own features. It
provides access control and several collaboration features such as bug
tracking, feature requests, task management, and wikis for every project.
Protocols.io: an up-to-date open access repository of science methods and a
collaborative protocol-centered platform to find and share life science
protocols.
DataCite: a leading global non-profit organisation that provides persistent
identifiers (DOIs) for research data and other research outputs.
Organizations within the research community join DataCite as members to be
able to assign DOIs to all their research outputs.
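Because a DOI is a persistent identifier, it can be turned into a clickable link by prefixing it with the https://doi.org/ resolver. The following is a minimal Python sketch of that step. The function name doi_to_url and the DOI 10.1234/example.5678 are hypothetical placeholders, not taken from the lecture.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"

# Common ways a DOI is written in citations and repository records.
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(doi: str) -> str:
    """Return a resolver URL for a DOI given as a bare DOI, 'doi:' form, or URL."""
    cleaned = doi.strip()
    for prefix in KNOWN_PREFIXES:
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    # Every DOI starts with the directory indicator "10."
    if not cleaned.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    # Keep "/" literal; percent-encode characters that are unsafe in a URL path.
    return DOI_RESOLVER + quote(cleaned, safe="/")


if __name__ == "__main__":
    # Hypothetical DOI, for illustration only.
    print(doi_to_url("doi:10.1234/example.5678"))
```

Normalising DOIs this way makes it easier to compare records for the same dataset that were copied from different repositories.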
Data repositories
17. ROAR, ICSU World Data System, Research Data Australia
ROAR: The aim of ROAR is to promote the development of open access by
providing timely information about the growth and status of repositories
throughout the world.
ICSU World Data System (WDS): WDS aims to facilitate scientific research by
coordinating and supporting trusted scientific data services for the
provision, use, and preservation of relevant datasets, while strengthening
their links with the research community.
Research Data Australia: helps you find, access, and reuse data for research
from over one hundred Australian research organisations, government
agencies, and cultural institutions.
Data repositories
18. Try to find data for your thesis in these repositories
Exercise 2.2
Repository Results Remarks
Zenodo
Dryad
Figshare
GitHub
Research Data Australia
ROAR
ICSU
19. Ecology
Long Term Ecological Research (LTER)
https://portal.lternet.edu/nis/home.jsp
EcoTrends: http://www.ecotrends.info/
Ecological Society of America (ESA) Data Registry and Archive:
http://data.esa.org/esa/style/skins/esa/index.jsp
Knowledge Network for Biocomplexity (KNB):
https://knb.ecoinformatics.org/index.jsp
Oceanographic Data Repositories: provides access to several
oceanographic data repositories created by the US Joint Global Ocean
Flux Study and US Global Ocean Ecosystem Dynamic programs.
Global Biodiversity Information Facility: http://www.gbif.org/
Discipline-related repositories
20. Life and Biological Sciences
Biogeographic Information and Observation System
(BIOS).
Protein DataBank - Experimentally determined structures
for macromolecules (protein and nucleic acids). The site
includes search and visualization tools
TreeBase: http://treebase.org/treebase-web/home.html
Discipline-related repositories
21. Environmental and Geosciences
Marine Geoscience Data System (MGDS): A data portal, hosted at the
Lamont-Doherty Earth Observatory (Columbia University)
National Climatic Data Center (NCDC) : Meteorology and
paleoclimatology
National Oceanographic Data Center (NODC): World-wide marine
environmental and ecosystem data
National Snow and Ice Data Center (NSIDC): Cryospheric datasets
from ground field research and satellites
DataONE (Data Observation Network for Earth):
https://search.dataone.org/data
Kompetenzzentrum Forschungsdaten: https://www.komfor.net/data-portal.html
Polar Data Catalogue: https://www.polardata.ca/
Discipline-related repositories
22. Environmental and Geosciences
DASH (University Corporation for Atmospheric Research & National
Center for Atmospheric Research)
WDC (World Data Center for Climate): https://cera-www.dkrz.de/WDCC/ui/cerasearch/
Climate Data at the National Center for Atmospheric Research:
https://www.earthsystemgrid.org/home.html
ENES (European Network for Earth System Modelling):
https://verc.enes.org/
EarthChem - EarthChem operates and maintains a suite of data
systems and data collections that provide access to a wide variety of
solid earth data
Discipline-related repositories
23. Environmental and Geosciences
Atmospheric radiation measurement data: focuses on obtaining
continuous measurements and providing data products that
promote the advancement of climate models.
CUAHSI: a list of web portals and/or websites with data or links to
data on water resources. The portals generally provide data that
are at a minimum national in scope, and many of the portals offer
global data.
British Atmospheric Data Centre (BADC) - Data Centre for the
Atmospheric Sciences
KNMI Climate Explorer: http://climexp.knmi.nl/
USGS Water Data for the Nation: https://waterdata.usgs.gov/nwis
PANGAEA Data Publisher for Earth & Environmental Science
http://www.pangaea.de/
Discipline-related repositories
24. GIS and Geography
GeoCommons.com GIS file repository and finding tool
Federal Geographic Data Committee - Provides access to the
National Spatial Data Infrastructure (NSDI) Clearing House
Network and the geodata.gov portal
http://inspire-geoportal.ec.europa.eu/ : The INSPIRE Geoportal
is the central European access point to the data provided by EU
Member States and several EFTA countries under the INSPIRE
Directive.
Geoportal, geodata from Germany
http://www.geoportal.de/
Geodatenkatalog : https://wiki.gdi-de.org/display/gdk
Discipline-related repositories
25. Remote Sensing
GEOSS data portal: http://www.geoportal.org
CEOS: Data & Tools of the Committee on Earth Observation
Satellites
Discipline-related repositories
26. Chemistry
ORNL DAAC for Biogeochemical Dynamics - The Oak Ridge National
Laboratory Distributed Active Archive Center for biogeochemical
dynamics is one of the NASA Earth Observing System Data and
Information System
Cambridge Structural Database - small molecule crystal structures
ChemSpider - free-to-access collection of chemical structures and
their associated information
eCrystals - x-ray crystallographic data
PubChem - NCBI's repository of bioactivity/bioassay data and
information for "small" molecules (i.e. not macromolecular). Both
text-based and structure-based search tools are provided
Discipline-related repositories
27. Social Sciences
ICPSR (Inter-university Consortium for Political and Social
Research) at the University of Michigan.
Dataverse Network is a collection of social science research
data contained in virtual data archives called "dataverses".
FORS: Swiss Centre of Expertise in the Social Sciences.
FORS conducts large national and international surveys and
offers data and research information services to researchers
and academic institutions.
SSOAR : Social Science Open Access Repository
Discipline-related repositories
28. Exercise 2.3
Look through discipline-related repositories in your field.
• Have a close look at the records to see the ways repositories
have made their records discoverable and accessible. List
positive and negative aspects of the search in those
repositories.
• Can you already find data that you could use?
Save one dataset that you might be able to use.
Discipline-related repositories
29. DataSearch
As of June 2016, they are (completely or partially) indexing the following content
sources:
a) Tables, figures and supplementary data associated with papers in ScienceDirect, arXiv
b) EarthChem Portal, Dryad, ICPSR, Harvard Dataverse, Mendeley Data, NeuroElectro,
PANGAEA and ThermoML
Data Search Machine
30. Google Dataset Search
Data Search Machine
How well does Google Dataset Search work, given your knowledge of and
experience with the data repositories you have looked at?
31. Exercise
How well do these two search engines work, given your knowledge of and experience
with the data repositories you have looked at?
Can you find again the data you retrieved from the repositories?
Data Search Machine
39. CODATA Data Science Journal
The CODATA Data Science Journal is a peer-reviewed, open access, electronic journal, publishing papers
on the management, dissemination, use and reuse of research data and databases across all research
domains, including science, technology, the humanities and the arts.
Data papers & data journals
40. Data in Brief
Data papers & data journals
42. Exercise 2.4
A list of further data journals is here:
https://www.wiki.ed.ac.uk/display/datashare/Sources+of+dataset+peer+review
Have a look at the “About” page of one of these journals.
1. What didn’t you expect to see?
2. What do you think is the advantage of publishing your
data in a dedicated data journal?
3. What could be the advantage for the progress of
science and for the public?
Data papers & data journals
44. Data Citation Index (II)
• started in October 2012
• There are about 380 Repositories in DCI
• Cross-disciplinary, with a main focus on science; 50% of the
data are from medicine
• Linked with the bibliographic record in Web of
Science
• Links peer-reviewed articles with the underlying
research data
• Uniform metadata schema
66. Exercise 2.5
1. Have a look at the Web of Science Data Citation Index. In what
way could this database be useful for your master's
thesis? Can you find something you can use?
2. Choose one or two of the open statistics sites that you would
like to look at more closely, just out of interest and/or for your
private life.
67. data.ac.uk/
“A landmark site for academia providing a single point of
contact for linked open data development.”
It not only provides access to the know-how and tools to discuss and
create linked data and data aggregation sites, but also enables access
to, and the creation of, large aggregated data sets providing powerful
and flexible collections of information.
69. Tips for searching for data (from the
Data Journalism Handbook)
• When searching for data, make sure that you
include both search terms relating to the
content of the data you’re trying to find as
well as some information on the format or
source that you would expect it to be in.
• Google and other search engines allow you to
search by file type.
http://datajournalismhandbook.org/1.0/en/getting_data_0.html
70. For example, you can look only for…
• Spreadsheets (by appending your search with ‘filetype:XLS
filetype:CSV’)
• Geodata (‘filetype:shp’)
• Database extracts (‘filetype:MDB, filetype:SQL, filetype:DB’).
• PDFs (‘filetype:pdf’).
• You can also search by part of a URL. Googling for
‘inurl:downloads filetype:xls’ will try to find all Excel files that
have “downloads” in their web address.
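The file-type and URL operators above can be combined with ordinary search terms programmatically. The following Python sketch builds one query string per file type, so it does not depend on how a search engine combines several filetype: operators. The function name queries_by_filetype and the search terms are hypothetical; only the filetype: and inurl: operators come from the slide.

```python
from typing import List, Optional, Sequence


def queries_by_filetype(
    terms: str,
    filetypes: Sequence[str],
    inurl: Optional[str] = None,
) -> List[str]:
    """Build one search query per file type, optionally restricted by URL part."""
    base = terms.strip()
    if not base:
        raise ValueError("search terms must not be empty")
    queries = []
    for filetype in filetypes:
        parts = [base]
        if inurl:
            # Restrict results to pages whose address contains this text.
            parts.append(f"inurl:{inurl}")
        # Strip a leading dot so that ".csv" and "csv" behave the same.
        parts.append(f"filetype:{filetype.lstrip('.').lower()}")
        queries.append(" ".join(parts))
    return queries


if __name__ == "__main__":
    # Hypothetical search terms, for illustration only.
    for query in queries_by_filetype("glacier mass balance", ["XLS", "CSV"], inurl="downloads"):
        print(query)
```

Running one query per format keeps each result list small and makes it obvious which format a hit came from.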
71. • Another popular trick is not to search for content
directly, but for places where bulk data may be
available.
• (if you find a single download, it’s often worth just
checking what other results exist for the same folder
on the web server). You can also limit your search to
only those results on a single domain name, by
searching for, e.g. ‘site:agency.gov’.
• For example, ‘site:agency.gov Directory Listing’ may
give you some listings generated by the web server
with easy access to raw files, while ‘site:agency.gov
Database Download’ will look for intentionally created
listings.
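The two tricks on this slide, checking the folder that contains a download and searching a single domain for bulk listings, can be sketched in Python as follows. The function names and the example URL are hypothetical; the site: operator and the phrases "Directory Listing" and "Database Download" are the ones quoted above.

```python
import posixpath
from typing import List
from urllib.parse import urlsplit, urlunsplit


def parent_folder(url: str) -> str:
    """Return the URL of the folder that contains the given download."""
    parts = urlsplit(url)
    folder = posixpath.dirname(parts.path)
    if not folder.endswith("/"):
        folder += "/"
    # Drop the query string and fragment; they belong to the single file.
    return urlunsplit((parts.scheme, parts.netloc, folder, "", ""))


def bulk_data_queries(domain: str) -> List[str]:
    """Build the two site-restricted queries suggested on the slide."""
    return [
        f"site:{domain} Directory Listing",
        f"site:{domain} Database Download",
    ]


if __name__ == "__main__":
    # Hypothetical download link, for illustration only.
    print(parent_folder("https://agency.gov/downloads/2019/data.xls"))
    for query in bulk_data_queries("agency.gov"):
        print(query)
```

Whether the parent folder is browsable depends on the web server's configuration, so the returned URL is only a candidate worth trying.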
Editor's Notes
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Please cite this work as “Whitmire, Amanda L. (2014). Research Data Management Curriculum, Lecture 15: Plan for Archiving and Preservation of Data. Oregon State University Libraries. Retrieved [date] from: http://guides.library.oregonstate.edu/grad521lectures.”
Slides attributed to the UKDA have the following citation: “Research Data Management Team, UK Data Archive, University Of Essex (2012). Managing and Sharing Data: Training Resources. UK Data Service. Retrieved 29 May, 2012 from: http://data-archive.ac.uk/media/335419/trainingresources.zip.”
Slides attributed to the DCC have the following citation: “Whyte, A. & Wilson, A. (2010). "How to Appraise and Select Research Data for Curation". DCC How-to Guides. Edinburgh: Digital Curation Centre. http://www.dcc.ac.uk/resources/how-guides.”
Slides attributed to DataONE have the following citation: “DataONE Education Module: Protecting Your Data: Backups, Archives, and Data Preservation. DataONE. Retrieved Jan. 5, 2014. From http://www.dataone.org/sites/all/documents/L06_DataProtectionBackups.pptx.”
Image credit: Surveying by Luis Prado from The Noun Project
“All journals with either integrated data submission or sponsored Data Publishing Charges are listed below. Our institutional sponsors fund submissions by their affiliated researchers. Note that for large data packages, submitters will be asked to pay an additional $15 for the first GB beyond 10GB and $10 for each GB thereafter.” I counted on 2/19/2014, and there were 64 journals listed as being integrated and/or sponsored entities.
“$80 per data package, payable by the submitter”
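The two fee statements quoted above can be combined into a small worked calculation. This Python sketch uses only the 2014 figures from these notes ($80 per package, $15 for the first GB beyond 10 GB, $10 for each GB thereafter). It rests on two assumptions that the notes do not state: that the base fee and the large-package surcharge are added together, and that a partial GB is charged as a full GB. The function name is hypothetical, and current Dryad pricing may differ.

```python
import math

BASE_FEE_USD = 80          # "$80 per data package"
INCLUDED_GB = 10           # size covered before the surcharge applies
FIRST_EXTRA_GB_USD = 15    # "first GB beyond 10GB"
FURTHER_GB_USD = 10        # "each GB thereafter"


def dryad_fee_2014(size_gb: float) -> int:
    """Estimate the 2014 submission fee in USD for a package of size_gb."""
    if size_gb < 0:
        raise ValueError("size must not be negative")
    # Assumption: any started GB beyond the included size counts as a full GB.
    extra_gb = max(0, math.ceil(size_gb - INCLUDED_GB))
    if extra_gb == 0:
        return BASE_FEE_USD
    # Assumption: the surcharge is paid on top of the base fee.
    return BASE_FEE_USD + FIRST_EXTRA_GB_USD + (extra_gb - 1) * FURTHER_GB_USD


if __name__ == "__main__":
    for size in (5, 11, 12.5):
        print(size, "GB:", dryad_fee_2014(size), "USD")
```

For a submitter whose institution or journal sponsors the base fee, only the surcharge part would apply, which is the case the first quotation describes.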
About figshare:
free
unlimited storage
take files of any format
all items receive DOIs
all content is CC0 (data) or CC-BY (other stuff)
reciprocal linking with PLOS, Nature, Faculty of 1000, more…
metadata: no rules or constraints (can of worms…)
Content under the “Health Sciences” category: formal datasets, posters, a blog post (?!), a citizen science dataset, PowerPoint presentation, etc.
We have completely or partially indexed the following:
Dryad
EarthChem Portal from The Interdisciplinary Earth Data Alliance (IEDA) :
Geochemistry of Rocks of the Oceans and Continents (GEOROC)
MetPetDB
The North American Volcanic and Intrusive Rock Database (NAVDAT)
PetDB
U.S. Geological Survey (USGS) Mineral Resources National Geochemical Database (MR NGDB)
Harvard Dataverse
The Inter-university Consortium for Political and Social Research (ICPSR)
Mendeley Data
NeuroElectro
PANGAEA
ThermoML - Thermodynamic Research Center (TRC) at the National Institute of Standards and Technology (NIST)
Metadata from:
4TU.Centre of Research Data
Apollo - University of Cambridge
DataSpace - Princeton University
DSpace - University of Washington
LSHTM Data Compass - London School of Hygiene & Tropical Medicine
Médecins Sans Frontières (MSF)
Smithsonian
Zenodo
Tables, figures and supplementary data associated with papers from:
arXiv
ScienceDirect
What is a data paper? http://guides.library.oregonstate.edu/data-management-data-papers-journals
“Data papers facilitate the sharing of data in a standardized framework that provides value, impact, and recognition for authors. Data papers also provide much more thorough context and description than datasets that are simply deposited to a repository (which may have very minimal metadata requirements).”
“Data papers thoroughly describe datasets, and do not usually include any interpretation or discussion (an exception may be discussion of different methods to collect the data, e.g.).”
Earth System Science Data (ESSD) is an international, interdisciplinary journal for the publication of articles on original research data(sets), furthering the reuse of high (reference) quality data of benefit to Earth System Sciences. The editors encourage submissions on original data or data collections which are of sufficient quality and potential impact to contribute to these aims.
Geoscience Data Journal provides an Open Access platform where scientific data can be formally published, in a way that includes scientific peer-review. Thus the dataset creator attains full credit for their efforts, while also improving the scientific record, providing version control for the community and allowing major datasets to be fully described, cited and discovered.
An online-only journal, GDJ publishes short data papers cross-linked to – and citing – datasets that have been deposited in approved data centres and awarded DOIs. The journal will also accept articles on data services, and articles which support and inform data publishing best practices.
broad range of geoscience disciplines, including, but not limited to: Weather and Climate; Oceanography; Atmospheric and Ocean Chemistry; Cryosphere; Biosphere, Land Surface and Geology, Hydrology, Geochemistry, Geophysics, Planetary and Space Sciences.
Biodiversity Data Journal (BDJ) is a community peer-reviewed, open-access, comprehensive online platform, designed to accelerate publishing, dissemination and sharing of biodiversity-related data of any kind. All structural elements of the articles – text, morphological descriptions, occurrences, data tables, etc. – will be treated and stored as DATA, in accordance with the Data Publishing Policies and Guidelines of Pensoft Publishers.
taxonomic, floristic/faunistic, morphological, genomic, phylogenetic, ecological or environmental data
Scientific Data is a peer-reviewed, open-access journal for descriptions of scientifically valuable datasets, and research that advances the sharing and reuse of scientific data.
broad range of natural science disciplines, including, but not limited to, data from the life, biomedical and environmental science
The Journal of Open Psychology Data (JOPD) features peer reviewed data papers describing psychology datasets with high reuse potential. Data papers may describe data from unpublished work, including replication research, or from papers published previously in a traditional journal.
Any kind of psychology data is acceptable, including from correlational, descriptive and experimental research, e.g. case studies, computer simulations, experimental results, interviews and surveys, neuroimaging data, etc.
geoscientific model descriptions, from statistical models to box models to GCMs; new parameterizations or technical aspects of running models such as the reproducibility of results;
new methods for assessment of models, including work on developing new metrics for assessing model performance and novel ways of comparing model results with observational data;
papers describing new standard experiments for assessing model performance or novel ways of comparing model results with observational data;
model experiment descriptions, including experimental details and project protocols;
full evaluations of previously published models.
The model was the cooperation between PANGAEA and Elsevier
- Subject focus: 80% natural sciences, 18% social sciences, 2% humanities
- Categories are to be understood hierarchically: repositories contain data studies, data studies contain datasets [??]
- Pure institutional publication servers are not included, nor are metadata catalogues that point to data archived elsewhere