Welcome!

The guide you are reading contains:

  • a high-level introduction to the Fatcat catalog and software
  • a bibliographic style guide for editors, also useful for understanding metadata found in the catalog
  • technical details and guidance for use of the catalog's public REST API, for developers building bots, services, or contributing to the server software
  • policies and licensing details for all contributors and downstream users of the catalog

What is Fatcat?

Fatcat is an open bibliographic catalog of written works. The scope of works is somewhat flexible, with a focus on published research outputs like journal articles, pre-prints, and conference proceedings. Records are collaboratively editable, versioned, available in bulk form, and include URL-agnostic file-level metadata.

Both the Fatcat software and the metadata stored in the service are free (in both the libre and gratis sense) for others to share, reuse, fork, or extend. See Policies for licensing details, and Sources for attribution of the foundational metadata corpuses we build on top of.

Fatcat is currently used internally at the Internet Archive, but interested folks are welcome to contribute to its design and development, and we hope to ultimately crowd-source corrections and additions to bibliographic metadata, and receive direct automated feeds of new content.

You can contact the Archive by email at webservices@archive.org, or the author directly at bnewbold@archive.org.

High-Level Overview

This section gives an introduction to:

  • the goals of the project, and how it relates to the rest of the Open Access and archival ecosystem
  • how catalog data is represented as entities and revisions with full edit history, and how entities are referred to and cross-referenced with identifiers
  • how humans and bots propose changes to the catalog, and how these changes are reviewed
  • the major sources of bulk and continuously updated metadata that form the foundation of the catalog
  • a rough sketch of the software back-end, database, and libraries
  • roadmap for near-future work

Project Goals and Ecosystem Niche

The Internet Archive has two primary use cases for Fatcat:

  • Tracking the "completeness" of our holdings against all known published works. In particular, allow us to monitor progress, identify gaps, and prioritize further collection work.
  • Be a public-facing catalog and access mechanism for our open access holdings.

In the larger ecosystem, Fatcat could also provide:

  • A work-level (as opposed to title-level) archival dashboard: what fraction of all published works are preserved in archives? KBART, CLOCKSS, Portico, and other preservation networks don't provide granular metadata
  • A collaborative, independent, non-commercial, fully-open, field-agnostic, "completeness"-oriented catalog of scholarly metadata
  • Unified (centralized) foundation for discovery and access across repositories and archives: discovery projects can focus on user experience instead of building their own catalog from scratch
  • Research corpus for meta-science, with an emphasis on availability and reproducibility (metadata corpus itself is open access, and file-level hashes control for content drift)
  • Foundational infrastructure for distributed digital preservation
  • On-ramp for non-traditional digital works (web-native and "grey literature") into the scholarly web

Scope

What types of works should be included in the catalog?

The goal is to capture the "scholarly web": the graph of written works that cite other works. Any work that is both cited more than once and cites more than one other work in the catalog is likely to be in scope. "Leaf nodes" and small islands of intra-cited works may or may not be in scope.

Fatcat does not include any fulltext content itself, even for clearly licensed open access works, but does have verified hyperlinks to fulltext content, and includes file-level metadata (hashes and fingerprints) to help identify content from any source. File-level URLs with context ("repository", "publisher", "webarchive") should make Fatcat more useful for both humans and machines to quickly access fulltext content of a given mimetype than existing redirect or landing page systems. So another factor in deciding scope is whether a work has "digital fixity" and can be contained in immutable files or can be captured by web archives.

References and Previous Work

The closest overall analog of Fatcat is MusicBrainz, a collaboratively edited music database. Open Library is a very similar existing service, which exclusively contains book metadata.

Wikidata seems to be the most successful and actively edited/developed open bibliographic database at this time (early 2018), including the WikiCite conference and related Wikimedia/Wikipedia projects. Wikidata is a general purpose semantic database of entities, facts, and relationships; bibliographic metadata has become a large fraction of all content in recent years. The focus there seems to be linking knowledge (statements) to specific sources unambiguously. Potential advantages Fatcat has are a focus on a specific scope (not a general-purpose database of entities) and a goal of completeness (capturing as many works and relationships as rapidly as possible). With so much overlap, the two efforts might merge in the future.

The technical design of Fatcat is loosely inspired by the git branch/tag/commit/tree architecture, and specifically inspired by Oliver Charles' "New Edit System" blog posts from 2012.

There are a number of proprietary, for-profit bibliographic databases, including Web of Science, Google Scholar, Microsoft Academic Graph, aminer, Scopus, and Dimensions. There are excellent field-limited databases like dblp, MEDLINE, and Semantic Scholar. Large, general-purpose databases also exist that are not directly user-editable, including the OpenCitation corpus, CORE, BASE, and CrossRef. We do not know of any large (more than 60 million works), open (bulk-downloadable with permissive or no license), field agnostic, user-editable corpus of scholarly publication bibliographic metadata.

Further Reading

"From ISIS to CouchDB: Databases and Data Models for Bibliographic Records" by Luciano G. Ramalho. code4lib, 2013. https://journal.code4lib.org/articles/4893

"Representing bibliographic data in JSON". github README file, 2017. https://github.com/rdmpage/bibliographic-metadata-json

"Citation Style Language", https://citationstyles.org/

"Functional Requirements for Bibliographic Records", Wikipedia article, https://en.wikipedia.org/wiki/Functional_Requirements_for_Bibliographic_Records

OpenCitations and I40C http://opencitations.net/, https://i4oc.org/

Data Model

Entity Types and Ontology

Loosely following "Functional Requirements for Bibliographic Records" (FRBR), but removing the "manifestation" abstraction, and favoring files (digital artifacts) over physical items, the primary bibliographic entity types are:

  • work: representing an abstract unit of creative output. Does not contain any metadata itself; used only to group release entities. For example, a journal article could be posted as a pre-print, published on a journal website, translated into multiple languages, and then re-published (with minimal changes) as a book chapter; these would all be variants of the same work.
  • release: a specific "release" or "publicly published" version of a work. Contains traditional bibliographic metadata (title, date of publication, media type, language, etc). Has relationships to other entities:
    • child of a single work (required)
    • multiple creator entities as "contributors" (authors, editors)
    • outbound references to multiple other release entities
    • member of a single container, for example a journal or book series
  • file: a single concrete, fixed digital artifact; a manifestation of one or more releases. Machine-verifiable metadata includes file hashes, size, and detected file format. Verified URLs link to locations on the open web where this file can be found or has been archived. Has relationships:
    • multiple release entities that this file is a complete manifestation of (almost always a single release)
  • fileset: a list of multiple concrete files, together forming a complete release manifestation. Primarily intended for datasets and supplementary materials; could also contain a paper "package" (source file and figures).
  • webcapture: a single snapshot (point in time) of a webpage or small website (multiple pages) which are a complete manifestation of a release. Not a landing page or page referencing the release.
  • creator: persona (pseudonym, group, or specific human name) that has contributed to one or more releases. Not necessarily one-to-one with a human person.
  • container (aka "venue", "serial", "title"): a grouping of releases from a single publisher.

Note that, compared to many similar bibliographic ontologies, the current one does not have entities to represent:

  • physical artifacts, either generically or specific copies
  • funding sources
  • publishing entities
  • "events at a time and place"

Each entity type has its own relations and fields (captured in a schema), but there are also generic operations and fields common across all entities. The API for creating, updating, querying, and inspecting entities is roughly the same regardless of type.

Identifiers and Revisions

A specific version of any entity in the catalog is called a "revision". Revisions are generally immutable (do not change and are not editable), and are not normally referred to directly. Instead, persistent "fatcat identifiers" (ident) can be created, which "point to" a single revision at a time. This distinction means that entities referred to by an identifier can change over time (as metadata is corrected and expanded). Revision objects do not "point" back to specific identifiers, so they are not the same as a simple "version number" for an identifier.

Identifiers also have the ability to be merged (by redirecting one identifier to another) and "deleted" (by pointing the identifier to no revision at all). All changes to identifiers are captured as an "edit" object. Edit history can be fetched and inspected on a per-identifier basis, and any changes can easily be reverted (even merges/redirects and "deletion").

"Work in progress" or "proposed" updates are staged as edit objects without updating the identifiers themselves.

Controlled Vocabularies

Some individual fields have additional constraints, either in the form of pattern validation ("values must be upper case, contain only certain characters"), or membership in a fixed set of values. These may include:

  • license and open access status
  • work "types" (article vs. book chapter vs. proceeding, etc)
  • contributor types (author, translator, illustrator, etc)
  • human languages
  • identifier namespaces (DOI, ISBN, ISSN, ORCID, etc; but not the identifiers themselves)

Other fixed-set "vocabularies" become too large to easily maintain or express in code. These could be added to the backend databases, or be enforced by bots (instead of the system itself). These mostly include externally-registered identifiers or types, such as:

  • file mimetypes
  • identifiers themselves (DOI, ORCID, etc), by checking for registration against canonical APIs and databases

Global Edit Changelog

As part of the process of "accepting" an edit group, a row is written to an immutable, append-only table (which internally is a SQL table) documenting each identifier change. This changelog establishes a monotonically increasing version number for the entire corpus, and should make interaction with other systems easier (eg, search engines, replicated databases, alternative storage backends, notification frameworks, etc.).
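As a sketch of how an external system (eg, a search indexer or replica) might follow the changelog, the snippet below polls for recent entries; the /changelog route and the entry field names are assumptions based on the description above, not a stable contract.

    import requests

    API = "https://api.fatcat.wiki/v0"  # assumed public REST API base

    # Fetch a handful of recent changelog entries.
    entries = requests.get(f"{API}/changelog", params={"limit": 5}).json()

    for entry in entries:
        # Each accepted edit group gets exactly one monotonically
        # increasing changelog index number.
        print(entry["index"], entry["timestamp"], entry["editgroup_id"])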

Workflow

Basic Editing Workflow and Bots

Both human editors and bots should have edits go through the same API, with humans using either the default web interface, client software, or third-party integrations.

The normal workflow is to create edits (or updates, merges, deletions) on individual entities. Individual changes are bundled into an "edit group" of related edits (eg, correcting authorship info for multiple works related to a single author). When ready, the editor "submits" the edit group for review. During the review period, human editors vote and bots can perform automated checks. During this period the editor can make tweaks if necessary. After some fixed time period (one week?) with no changes and no blocking issues, the edit group would be accepted if no merge conflicts have been created by other edits to the same entities. This process balances editing labor (reviews are easy, but optional) against quality (cool-down period makes it easier to detect and prevent spam or out-of-control bots). More sophisticated roles and permissions could allow certain humans and bots to push through edits more rapidly (eg, importing new works from a publisher API).
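To make this workflow concrete, here is a hedged sketch of a small bot session against the REST API: create an edit group, stage one entity update inside it, then submit the group for review. The endpoint paths, the editgroup_id and submit parameters, and the token mechanism are assumptions for illustration only; consult the API reference before building on them.

    import requests

    API = "https://api.fatcat.wiki/v0"           # assumed public REST API base
    headers = {"Authorization": "Bearer TOKEN"}  # hypothetical API token

    # 1. Create an editgroup to bundle related edits together.
    eg = requests.post(f"{API}/editgroup", headers=headers,
                       json={"description": "fix container for one release"}).json()
    editgroup_id = eg["editgroup_id"]

    # 2. Stage an update to a single release entity within that editgroup.
    ident = "hypothetical_release_ident"
    release = requests.get(f"{API}/release/{ident}").json()
    release["container_id"] = "hypothetical_container_ident"
    requests.put(f"{API}/release/{ident}", headers=headers,
                 params={"editgroup_id": editgroup_id}, json=release)

    # 3. Submit the editgroup for review; nothing touches the catalog (or the
    #    changelog) until the editgroup is accepted.
    requests.put(f"{API}/editgroup/{editgroup_id}", headers=headers,
                 params={"submit": "true"}, json=eg)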

Bots need to be tuned to have appropriate edit group sizes (eg, daily batches, instead of millions of works in a single edit) to make human QA review and reverts manageable.

Data provenance and source references are captured in the edit metadata, instead of being encoded in the entity data model itself. In the case of importing external databases, the expectation is that special-purpose bot accounts are used, and that they tag timestamps and external identifiers in the edit metadata. Human editors can leave edit messages to clarify their sources.

A style guide and discussion forum are intended to be hosted as separate stand-alone services for editors to propose projects and debate process or scope changes. These services should have unified accounts and logins (OAuth?) for consistent account IDs across all services.

Reference Graph (refcat)

In Summer 2021, the first version of a reference graph dataset, named "refcat", was released and integrated into the fatcat.wiki web interface. The dataset contains billions of references between papers in the fatcat catalog, as well as partial coverage of references from papers to books, to websites, and from Wikipedia articles to papers. This is a first step towards identifying links and references between scholarly works of all types preserved in archive.org.

The refcat dataset can be downloaded in JSON lines format from the archive.org "Fatcat Database Snapshots and Bulk Metadata Exports" collection, and is released under a CC-0 license for broad reuse. Acknowledgement and attribution for both the aggregated dataset and the original metadata sources is strongly encouraged (see below for provenance notes).

References can be browsed on fatcat.wiki on an "outbound" ("References") and "inbound" ("Cited By") basis for individual release entities. There are also special pages for Wikipedia articles ("outbound", such as Internet) and Open Library books ("inbound", such as The Gift). JSON versions of these pages are available, but do not yet represent a stable API. The backend reference graph is available via the Elasticsearch API under the fatcat_ref index, but the schema and semantics of this index are also not yet stable.
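For example, a direct query against that index might look like the sketch below; the search host, index name, and field names (source_release_ident, target_release_ident, match_provenance) follow the descriptions in this section but are explicitly not a stable contract, so treat them as assumptions.

    import requests

    ES = "https://search.fatcat.wiki"  # assumed public Elasticsearch endpoint

    # Outbound references for one (hypothetical) source release.
    resp = requests.get(
        f"{ES}/fatcat_ref/_search",
        params={"q": "source_release_ident:hypothetical_release_ident", "size": 10},
    )
    for hit in resp.json()["hits"]["hits"]:
        ref = hit["_source"]
        print(ref.get("match_provenance"), ref.get("target_release_ident"))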

How It Works

Raw reference data comes from multiple sources (see "provenance" below), but has the common structure of a "source" entity (which could be a paper, Wikipedia article, etc) and a list of raw references. There might be duplicate references for a single "source" work coming from different providers (eg, both Pubmed and Crossref reference lists). The goal is to match as many references as possible to the "target" work being referenced, creating a link from source to target. If a robust match is not found, the "unmatched" reference is retained and displayed in a human readable fashion if possible.

Depending on the source, raw references may be a simple "raw" string in an arbitrary citation style; may have been parsed or structured in fields like "title", "year", "volume", "issue"; might include a URL or identifier like an arxiv.org identifier; or may have already been matched to a specific target work by another party. It is also possible the reference is vague, malformed, mis-parsed, or not even a reference to a specific work (eg, "personal communication"). Based on the available structure, we might be able to do a simple identifier lookup, or may need to parse a string, or do "fuzzy" matching against various catalogs of known works. As a final step we take all original and potential matches, verify the matches, and attempt to de-duplicate references coming from different providers into a list of matched and unmatched references as output. The refcat corpus is the output of this process.

Two dominant modes of reference matching are employed: identifier-based matching and fuzzy matching. Identifier-based matching currently works with DOIs, arXiv identifiers, PMIDs, PMCIDs, and ISBNs. Fuzzy matching employs a scalable way to cluster documents (with pluggable clustering algorithms). For each cluster of match candidates we run a more extensive verification process, which yields a match confidence category, ranging from weak, through strong, to exact. Strong and exact matches are included in the graph.
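The sketch below illustrates the general shape of the verification step, not the actual fuzzycat rules: given a reference candidate and a target release, it compares normalized titles and publication years and emits one of the confidence categories named above. All thresholds and helper names are assumptions for illustration.

    import re

    def normalize(title: str) -> str:
        # Lower-case and strip punctuation/whitespace before comparison.
        return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

    def verify(candidate: dict, target: dict) -> str:
        """Return a match confidence category: 'exact', 'strong', or 'weak'."""
        a = normalize(candidate.get("title", ""))
        b = normalize(target.get("title", ""))
        if not a or not b:
            return "weak"
        years_close = (
            candidate.get("year") is not None
            and target.get("year") is not None
            and abs(candidate["year"] - target["year"]) <= 1
        )
        if a == b and years_close:
            return "exact"
        if (a == b) or ((a in b or b in a) and years_close):
            return "strong"
        return "weak"

    # Only "strong" and "exact" candidates would be included in the graph.
    print(verify({"title": "On Fatcat.", "year": 2021},
                 {"title": "On Fatcat", "year": 2021}))  # -> exact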

All the code for this process is available open source:

  • refcat: batch processing and matching pipeline, in Python and Go
  • fuzzycat: Python verification code and "live" fuzzy matching

Metadata Provenance

The provenance for each reference in the index is tracked and exposed via the match_provenance field. A fatcat- prefix on the value means that the reference came through the refs metadata field stored in the fatcat catalog, but originally came from the indicated source. In the absence of a fatcat- prefix, the reference was found, updated, or extracted at indexing time and is not recorded in the release entity metadata.

Specific sources:

  • crossref (and fatcat-crossref): citations deposited by publishers as part of DOI registration. Crossref is the largest single source of citation metadata in refcat. These references may be linked to a specific DOI; contain structured metadata fields; or be in the form of a raw citation string. Sometimes they are "complete" for the given work, and sometimes they only include references which could be matched/linked to a target work with a DOI.
  • fatcat-datacite: same as crossref, but for the Datacite DOI registrar.
  • fatcat-pubmed: references, linked or not linked, from Pubmed/MEDLINE metadata
  • fatcat: references in fatcat where the original provenance can't be inferred (but could be manually found by inspecting the release edit history)
  • grobid: references parsed out of full-text PDFs using GROBID
  • wikipedia: citations extracted from Wikipedia (see below for details)

Note that sources of reference metadata which have formal licensing restrictions, even CC-BY or ODC-BY licenses as used by several similar datasets, are not included in refcat.

Current Limitations and Known Issues

The initial Summer 2021 version of the index has a number of limitations. Feedback on features and coverage is welcome! We expect this dataset to be iterated on regularly, as there are a few dimensions along which the dataset can be improved and extended.

The reference matching process is designed to eventually operate in both "batch" and "live" modes, but currently only "batch" output is in the index. This means that references from newly published papers are not added to the index in an ongoing fashion.

Fatcat "release" entities (eg, papers) are matched from a Spring 2021 snapshot. References to papers published after this time will not be linked.

Wikipedia citations come from the dataset Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia, by Singh, West, and Colavizza. This is a one-time corpus based on a May 2020 snapshot of English Wikipedia only, and is missing many current references and citations. Additionally, only direct identifier lookups (eg, DOI matches) are used, not fuzzy metadata matching.

Open Library "target" matches are based on a snapshot of Open Library works, and are matched either ISBN (extracted from citation string) or fuzzy metadata matching.

Crossref references are extracted from a January 2021 snapshot of Crossref metadata, and do not include many updates to existing works.

Hundreds of millions of raw citation strings ("unstructured") have not been parsed into a structured form for fuzzy matching. We plan to use GROBID to parse these citation strings, in addition to the current use of GROBID parsing for references from fulltext documents.
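For reference, parsing a single raw citation string with a locally running GROBID service looks roughly like the sketch below. The /api/processCitation route and citations form field follow GROBID's documented REST service, but behavior varies across versions, so verify against the documentation for the version in use.

    import requests

    GROBID = "http://localhost:8070"  # local GROBID service (assumed)

    raw = "Brown, A. (2017). Example Title. Journal of Examples, 12(3), 45-67."
    resp = requests.post(f"{GROBID}/api/processCitation", data={"citations": raw})

    # GROBID returns the parsed citation as a TEI XML fragment with structured
    # elements for title, authors, year, container title, and pages.
    print(resp.text)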

The current GROBID parsing used version 0.6.0. Newer versions of GROBID have improved citation parsing accuracy, and we intend to re-parse all PDFs over time. Additional manually-tagged training datasets could improve GROBID performance even further.

In a future update, we intend to add Wayback (web archive) capture status and access links for references to websites (distinct from references to online journal articles or books). For example, references to an online news article or blog post would indicate the closest (in time, to the "source" publication date) Wayback captures to that web page, if available.

References are only displayed on fatcat.wiki, not yet on scholar.archive.org.

There is no current or planned mechanism for searching, sorting, or filtering article search results by (inbound) citation count. This would require resource-intensive transformations and continuous re-indexing of search indexes.

It is unclear how the batch-generated refcat dataset and API-editable release refs metadata will interact in the future. The original refs may eventually be dropped from the fatcat API, or at some point the refcat corpus may stabilize and be imported in to fatcat refs instead of being maintained as a separate dataset and index. It would be good to retain a mechanism for human corrections and overrides to the machine-generated reference graph.

Sources

The core metadata bootstrap sources, by entity type, are:

  • releases: Crossref metadata, with DOIs as the primary identifier, and PubMed (central), Wikidata, and CORE identifiers cross-referenced
  • containers: munged metadata from the DOAJ, ROAD, and Norwegian journal list, with ISSN-Ls as the primary identifier. ISSN provides an "ISSN to ISSN-L" mapping to normalize electronic and print ISSN numbers.
  • creators: ORCID metadata and identifiers.

Initial file metadata and matches (file-to-release) come from earlier Internet Archive matching efforts, in particular efforts to extract bibliographic metadata from PDFs (using GROBID) and fuzzy-match it (with conservative settings) against Crossref metadata.

The intent is to continuously ingest and merge metadata from a small number of large (~2-3 million or more records) general-purpose aggregators and catalogs in a centralized fashion, using bots, and then support volunteers and organizations in writing bots to merge high-quality metadata from field- or institution-specific catalogs.

Provenance information (where the metadata comes from, or who "makes specific claims") is stored in edit metadata in the data model. Value-level attribution can be achieved by looking at the full edit history for an entity as a series of patches.

Implementation

The canonical backend datastore exposes a microservice-like HTTP API, which could be extended with gRPC or GraphQL interfaces. The initial datastore is a transactional SQL database, but this implementation detail is abstracted by the API.

As little "application logic" as possible should be embedded in this back-end; as much as possible would be pushed to bots which could be authored and operated by anybody. A separate web interface project talks to the API backend and can be developed more rapidly with less concern about data loss or corruption.

A cronjob will create periodic database dumps, both in "full" form (all tables and all edit history, removing only authentication credentials) and "flattened" form (with only the most recent version of each entity).

One design goal is to be linked-data/RDF/JSON-LD/semantic-web "compatible", but not necessarily "first". It should be possible to export the database in a relatively clean RDF form, and to fetch data in a variety of formats, but internally Fatcat is not backed by a triple-store, and is not tied to any specific third-party ontology or schema.

Microservice daemons should be able to proxy between the primary API and standard protocols like ResourceSync and OAI-PMH, and third party bots can ingest or synchronize the database in those formats.

Fatcat Identifiers

Fatcat identifiers are semantically meaningless fixed-length random numbers, usually represented in case-insensitive base32 format. Each entity type has its own identifier namespace.

128-bit (UUID size) identifiers encode as 26 characters (but note that not all such strings decode to valid UUIDs), and in the backend can be serialized in UUID columns:

work_rzga5b9cd7efgh04iljk8f3jvz
https://fatcat.wiki/work/rzga5b9cd7efgh04iljk8f3jvz

In comparison, 96-bit identifiers would have 20 characters and look like:

work_rzga5b9cd7efgh04iljk
https://fatcat.wiki/work/rzga5b9cd7efgh04iljk

and 64-bit:

work_rzga5b9cd7efg
https://fatcat.wiki/work/rzga5b9cd7efg

Fatcat identifiers can be used to interlink between databases, but are explicitly not intended to supplant DOIs, ISBNs, handles, ARKs, and other "registered" persistent identifiers for general use.
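A minimal sketch of the round trip between UUIDs and the 26-character identifier form, assuming standard RFC 4648 base32 with the padding stripped and lower-casing applied (which matches the shape of the examples above, but verify against the actual implementation):

    import base64
    import uuid

    def uuid_to_ident(u: uuid.UUID) -> str:
        # 16 bytes -> 26 base32 characters; the trailing "======" padding is stripped.
        return base64.b32encode(u.bytes).decode("ascii").lower().rstrip("=")

    def ident_to_uuid(ident: str) -> uuid.UUID:
        # Restore padding to a multiple of 8 characters before decoding.
        return uuid.UUID(bytes=base64.b32decode(ident.upper() + "======"))

    u = uuid.uuid4()
    ident = uuid_to_ident(u)              # 26 characters, like the examples above
    assert ident_to_uuid(ident) == u
    print(f"work_{ident}")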

Internal Schema

Internally, identifiers are lightweight pointers to "revisions" of an entity. Revisions are stored in their complete form, not as a patch or difference; if comparing to distributed version control systems (for managing changes to source code), this follows the git model, not the mercurial model.

The entity revisions are immutable once accepted; the editing process involves the creation of new entity revisions and, if the edit is approved, pointing the identifier to the new revision. Entities cross-reference between themselves by identifier not revision number. Identifier pointers also support (versioned) deletion and redirects (for merging entities).

Edit objects represent a change to a single entity; edits get batched together into edit groups (like "commits" and "pull requests" in git parlance).

SQL tables look something like this (with separate tables for each entity type, a la work_revision and work_edit):

entity_ident
    id (uuid)
    current_revision (entity_revision foreign key)
    redirect_id (optional; points to another entity_ident)
    is_live (boolean; whether newly created entity has been accepted)

entity_revision
    revision_id
    <all entity-style-specific fields>
    extra: json blob for schema evolution

entity_edit
    timestamp
    editgroup_id (editgroup foreign key)
    ident (entity_ident foreign key)
    new_revision (entity_revision foreign key)
    new_redirect (optional; points to entity_ident table)
    previous_revision (optional; points to entity_revision)
    extra: json blob for provenance metadata

editgroup
    editor_id (editor table foreign key)
    description
    extra: json blob for provenance metadata

An individual entity can be in the following "states", from which the given actions (transition) can be made:

  • wip (not live; not redirect; has rev)
    • activate (to active)
  • active (live; not redirect; has rev)
    • redirect (to redirect)
    • delete (to deleted)
  • redirect (live; redirect; rev or not)
    • split (to active)
    • delete (to deleted)
  • deleted (live; not redirect; no rev)
    • redirect (to redirect)
    • activate (to active)

"WIP, redirect" or "WIP, deleted" are invalid states.

Additional entity-specific columns hold actual metadata. Additional tables (which reference both entity_revision and entity_id foreign keys as appropriate) represent things like authorship relationships (creator/release), citations between works, etc. Every revision of an entity requires duplicating all of these associated rows, which could end up being a large source of inefficiency, but is necessary to represent the full history of an object.

Roadmap

Contributions to implement the following would be helpful:

  • spam/abuse mitigation
  • "work aggolomeration" interfaces, for merging related releases under the same work
  • import (bulk and/or continuous updates) for more metadata sources
  • better handling of work/release distinction in, eg, search results and citation counting
  • de-duplication (via merging) for all entity types
  • matching improvements, eg, for references (citations), contributions (authorship), work grouping, and file/release matching
  • internationalization of the web interface (translation to multiple languages)
  • accessibility review of user interface

Possible breaking API and schema changes:

  • new entity type for research institutions, to track author affiliation. Use the new (2019) ROR identifier/registry
  • container nesting, or some method to handle conferences (event vs. series) and other "series" or "group" containers

Other longer term projects could include:

  • bi-directional synchronization with other user-editable catalogs, such as Wikidata
  • generic tagging of entities. Needs design/scoping; a separate service? editor-specific? tag by slugs, free-form text, or Wikidata entities? "delicious for papers"? Something as an alternative to traditional hierarchical categorization.

Known Issues

  • changelog index may have gaps due to PostgreSQL sequence and transaction roll-back behavior

Unresolved Questions

How to handle translations of, eg, titles and author names? To be clear, not translations of works (which are just separate releases), these are more like aliases or "originally known as".

Should contributor/author affiliation and contact information be retained? It could be very useful for disambiguation, but we don't want to build a huge database for "marketing" and other spam.

Can general-purpose SQL databases like Postgres or MySQL scale well enough to hold several tables with billions of entity revisions? Right from the start there are hundreds of millions of works and releases, many of which have dozens of citations, many authors, and many identifiers, and then we'll have potentially dozens of edits for each of these. This multiplies out to `1e8 * 2e1 * 2e1 = 4e10`, or 40 billion rows in the citation table. If each row was 32 bytes on average (uncompressed, not including index size), that would be 1.3 TByte on its own, larger than common SSD disks. I do think a transactional SQL datastore is the right answer. In my experience locking and index rebuild times are usually the biggest scaling challenges; the largely-immutable architecture here should mitigate locking. Hopefully few indexes would be needed in the primary database, as user interfaces could rely on secondary read-only search engines for more complex queries and views.

There is a tension between focus and scope creep. If a central database like Fatcat doesn't support enough fields and metadata, then it will not be possible to completely import other corpuses, and this becomes "yet another" partial bibliographic database. On the other hand, accepting arbitrary data leads to other problems: sparseness increases (we have more "partial" data), potential for redundancy is high, humans will start editing content that might be bulk-replaced, etc.

Cataloging Style Guide

Language and Translation of Metadata

The Fatcat data model does not include multiple titles or names for the same entity, or even a "native"/"international" representation as seems common in other bibliographic systems. This most notably applies to release titles, but also to container and publisher names, and likely other fields.

For now, editors must use their own judgment over whether to use the title of the release as listed in the work itself.

This is not to be confused with translations of entire works, which should be treated as an entirely separate release.

External Identifiers

"Fake identifiers", which are actually registered and used in examples and documentation (such as DOI 10.5555/12345678) are allowed (and the entity should be tagged as a fake or example). Non-registered "identifier-like strings", which are semantically valid but not registered, should not exist in Fatcat metadata in an identifier column. Invalid identifier strings can be stored in "extra" metadata. Crossref has blogged about this distinction.

Editgroups and Meta-Meta-Data

Editors are expected to group their edits into semantically meaningful editgroups of a reasonable size for review and acceptance. For example, merging two creators and updating related releases could all go in a single editgroup. Large refactors, conversions, and imports, which may touch thousands of entities, should be split into reasonably sized editgroups; extremely large editgroups may cause technical issues, and make review unmanageable. 50 edits is a decent batch size, and 100 is a good upper limit (and may be enforced by the server).

Common Entity Fields

All entities have:

  • extra (dict, optional): free-form JSON metadata

The "extra" field is an "escape hatch" to include extra fields not in the regular schema. It is intended to enable gradual evolution of the schema, as well as accommodating niche or field-specific content. Reasonable care should be taken with this extra metadata: don't include large text or binary fields, hundreds of fields, duplicate metadata, etc.

All full entities (distinct from revisions) also have the following fields:

  • state (string, read-only): summarizes the status of the entity in the catalog. One of a small number of fixed values, see vocabulary below.
  • ident (string, Fatcat identifier, read-only): the Fatcat entity identifier
  • revision (string, UUID): the current revision record that this entity ident points to
  • redirect (string, Fatcat identifier, optional): if set, this entity ident has been redirected to the redirect one. This is a mechanism of merging or "deduplicating" entities.
  • edit_extra (dict, optional): not part of the bibliographic schema, but can be included when creating or updating entities using the API, and the contents of this field will be included in the entity's edit history.

state Vocabulary

  • active: entity exists in the catalog
  • redirect: the entity ident exists in the catalog, but is a redirect to another entity ident.
  • deleted: an entity with the ident did exist in the catalog previously, but it was deleted. The ident is retained as a "tombstone" record (aka, there is a record that an entity did exist previously).
  • wip ("Work in Progress"): an entity identifier has been created as part of an editgroup, but that editgroup has not been accepted yet into the catalog, and there is no previous/current version of the entity.

Container Entity Reference

Fields

  • name (string, required): The title of the publication, as used in international indexing services. Eg, "Journal of Important Results". Not necessarily in the native language, but also not necessarily in English. Alternative titles (and translations) can be stored in "extra" metadata (see below)
  • container_type (string): eg, journal vs. conference vs. book series. Controlled vocabulary is described below.
  • publication_status (string): whether actively publishing, never published anything, or discontinued. Controlled vocabulary is described below.
  • publisher (string): The name of the publishing organization. Eg, "Society of Curious Students".
  • issnl (string): an external identifier, with registration controlled by the ISSN organization. Registration is relatively inexpensive and easy to obtain (depending on world region), so almost all serial publications have one. The ISSN-L ("linking ISSN") is one of either the print (issnp) or electronic (issne) identifiers for a serial publication; not all publications have both types of ISSN, but many do, which can cause confusion. The ISSN master list is not gratis/public, but the ISSN-L mapping is.
  • issne (string): Electronic ISSN ("ISSN-E")
  • issnp (string): Print ISSN ("ISSN-P")
  • wikidata_qid (string): external linking identifier to a Wikidata entity.

extra Fields

  • abbrev (string): a commonly used abbreviation for the publication, as used in citations, following the ISO 4 standard. Eg, "Journal of Polymer Science Part A" -> "J. Polym. Sci. A"
  • acronym (string): acronym of publication name. Usually all upper-case, but sometimes a very terse, single-word truncated form of the name (eg, a pun).
  • coden (string): an external identifier, the CODEN code. 6 characters, all upper-case.
  • default_license (string, slug): short name (eg, "CC-BY-SA") for the default/recommended license for works published in this container
  • original_name (string): native name (if name is translated)
  • platform (string): hosting platform: OJS, wordpress, scielo, etc
  • mimetypes (array of string): formats that this container publishes all works under (eg, 'application/pdf', 'text/html')
  • first_year (integer): first year of publication
  • last_year (integer): final year of publication (implies that container is no longer active)
  • languages (array of strings): ISO codes; the first entry is considered the "primary" language (if that makes sense)
  • country (string): ISO abbreviation (two characters) for the country this container is published in
  • aliases (array of strings): significant alternative names or abbreviations for this container (not just capitalization/punctuation)
  • region (string, slug): continent/world-region (vocabulary is TODO)
  • discipline (string, slug): highest-level subject area (vocabulary is TODO)
  • urls (array of strings): known homepage URLs for this container (first in array is default)
  • issnp (deprecated; string): Print ISSN; deprecated now that there is a top-level field
  • issne (deprecated; string): Electronic ISSN; deprecated now that there is a top-level field

Additional fields used in analytics and "curation" tracking:

  • doaj (object)
    • as_of (string, ISO datetime): datetime of most recent check; if not set, not actually in DOAJ
    • seal (bool): has DOAJ seal
    • work_level (bool): whether work-level publications are registered with DOAJ
    • archive (array of strings): preservation archives
  • road (object)
    • as_of (string, ISO datetime): datetime of most recent check; if not set, not actually in ROAD
  • kbart (object)
    • lockss, clockss, portico, jstor etc (object)
      • year_spans (array of arrays of integers (pairs)): year spans (inclusive) for which the given archive has preserved this container
      • volume_spans (array of arrays of integers (pairs)): volume spans (inclusive) for which the given archive has preserved this container
  • sherpa_romeo (object):
    • color (string): the SHERPA/RoMEO "color" of the publisher of this container
  • doi: TODO: include list of prefixes and which (if any) DOI registrar is used
  • dblp (object):
    • prefix (string): prefix of dblp keys published as part of this container (eg, 'journals/blah' or 'conf/xyz')
  • ia (object): Internet Archive specific fields
    • sim (object): same format as kbart preservation above; coverage in microfilm collection
    • longtail (bool): is this considered a "long-tail" open access venue
  • publisher_type (string): controlled vocabulary

For KBART and other "coverage" fields, we "over-count" on the assumption that works with "in-progress" status will soon actually be preserved. Elements of these arrays are either an integer (meaning that single year is preserved), or an array of length two (meaning everything between the two numbers (inclusive) is preserved).
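A small sketch of how such a coverage array can be expanded into a set of preserved years, following the mixed integer / pair encoding described above:

    def expand_year_spans(spans):
        """Expand coverage like [[1995, 1999], 2003] into a set of years."""
        years = set()
        for span in spans:
            if isinstance(span, int):
                years.add(span)          # a single preserved year
            else:
                start, end = span
                years.update(range(start, end + 1))  # inclusive on both ends
        return years

    print(sorted(expand_year_spans([[1995, 1999], 2003])))
    # -> [1995, 1996, 1997, 1998, 1999, 2003]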

container_type Vocabulary

  • journal
  • proceedings
  • conference-series
  • book-series
  • blog
  • magazine
  • trade
  • test

publication_status Vocabulary

  • active: ongoing publication of new releases
  • suspended: publication has stopped, but may continue in the future
  • discontinued: publication has permanently ceased
  • vanished: publication has stopped, and public traces have vanished (eg, publisher website has disappeared with no notice)
  • never: no works were ever published under this container
  • one-time: releases were all published as a one-time event. For example, a single instance of a conference, or a fixed-size book series

Creator Entity Reference

Fields

  • display_name (string, required): Full name, as will be displayed in user interfaces. Eg, "Grace Hopper"
  • given_name (string): Also known as "first name". Eg, "Grace".
  • surname (string): Also known as "last name". Eg, "Hopper".
  • orcid (string): external identifier, as registered with ORCID.
  • wikidata_qid (string): external linking identifier to a Wikidata entity.

See also "Human Names" sub-section of style guide.

extra Fields

All are optional.

  • also-known-as (list of objects): additional names that this creator may be known under. For example, previous names, aliases, or names in different scripts. Can include any or all of display_name, given_name, or surname as keys.

Human Names

Representing names of human beings in databases is a fraught subject, and there is extensive background reading available on the pitfalls.

Particularly difficult issues in the context of a bibliographic database include:

  • the non-universal concept of "family" vs. "given" names and their relationship to first and last names
  • the inclusion of honorary titles and other suffixes and prefixes to a name
  • the distinction between "preferred", "legal", and "bibliographic" names, or other situations where a person may not wish to be known under the name they are commonly referred to by
  • language and character set issues
  • different conventions for sorting and indexing names
  • the sprawling world of citation styles
  • name changes
  • pseudonyms, anonymous publications, and fake personas (perhaps representing a group, like Bourbaki)

The general guidance for Fatcat is to:

  • not be a "source of truth" for representing a persona or human being; ORCID and Wikidata are better suited to this task
  • represent author personas, not necessarily 1-to-1 with human beings
  • balance the concerns of readers with those of the author
  • enable basic interoperability with external databases, file formats, schemas, and style guides
  • when possible, respect the wishes of individual authors

The data model for the creator entity has three name fields:

  • surname and given_name: needed for "aligning" with external databases, and to export metadata to many standard formats
  • display_name: the "preferred" representation for display of the entire name, in the context of international attribution of authorship of a written work

Names do not necessarily need to be expressed in a Latin character set, but also do not necessarily need to be in the native language of the creator or the language of their notable works.

Ideally all three fields are populated for all creators.

It seems likely that this schema and guidance will need review.

File Entity Reference

Fields

  • size (integer, positive, non-zero): Size of file in bytes. Eg: 1048576.
  • md5 (string): MD5 hash in lower-case hex. Eg: "d41efcc592d1e40ac13905377399eb9b".
  • sha1 (string): SHA-1 hash in lower-case hex. Not technically required, but the most-used of the hash fields and should always be included. Eg: "f013d66c7f6817d08b7eb2a93e6d0440c1f3e7f8".
  • sha256: SHA-256 hash in lower-case hex. Eg: "a77e4c11a57f1d757fca5754a8f83b5d4ece49a2d28596889127c1a2f3f28832".
  • urls: An array of "typed" URLs. Order is not meaningful, and may not be preserved.
    • url (string, required): Eg: "https://example.edu/~frau/prcding.pdf".
    • rel (string, required): Eg: "webarchive", see vocabulary below.
  • mimetype (string): Format of the file. If XML, specific schema can be included after a +. Example: "application/pdf"
  • content_scope (string): for situations where the file does not simply contain the full representation of a work (eg, fulltext of an article, for an article-journal release), describes what that scope of coverage is. Eg, entire issue, corrupt file. See vocabulary below.
  • release_ids (array of string identifiers): references to release entities that this file represents a manifestation of. Note that a single file can contain multiple release references (eg, a PDF containing a full issue with many articles), and that a release will often have multiple files (differing only by watermarks, or different digitizations of the same printed work, or variant MIME/media types of the same published work).
  • extra (object with string keys): additional metadata about this file
    • path: filename, with optional path prefix. path must be "relative", not "absolute", and should use UNIX-style forward slashes, not Windows-style backward slashes
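Because the hashes are the primary machine-verifiable handle on a file, a common operation is looking up a file entity by SHA-1. A hedged sketch, assuming a /file/lookup route with a sha1 query parameter (check the API reference for the exact lookup endpoints):

    import hashlib
    import requests

    API = "https://api.fatcat.wiki/v0"  # assumed public REST API base

    # Hash a local PDF the same way file entities store it: lower-case hex SHA-1.
    with open("paper.pdf", "rb") as f:
        sha1 = hashlib.sha1(f.read()).hexdigest()

    resp = requests.get(f"{API}/file/lookup", params={"sha1": sha1})
    if resp.status_code == 200:
        fe = resp.json()
        print("file ident:", fe["ident"], "releases:", fe.get("release_ids"))
    else:
        print("no file entity found with that SHA-1")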

URL rel Vocabulary

  • web: generic public web sites; for http/https URLs, this should be the default
  • webarchive: full URL to a resource in a long-term web archive
  • repository: direct URL to a resource stored in a repository (eg, an institutional or field-specific research data repository)
  • academicsocial: academic social networks (such as academia.edu or ResearchGate)
  • publisher: resources hosted on publisher's website
  • aggregator: fulltext aggregator or search engine, like CORE or Semantic Scholar
  • dweb: content hosted on distributed/decentralized web protocols, such as dat:// or ipfs:// URLs

content_scope Vocabulary

This same vocabulary is shared between file, fileset, and webcapture entities; not all the fields make sense for each entity type.

  • if not set, assume that the artifact entity is valid and represents a complete copy of the release
  • issue: artifact contains an entire issue of a serial publication (eg, issue of a journal), representing several releases in full
  • abstract: contains only an abstract (short description) of the release, not the release itself (unless the release_type itself is abstract, in which case it is the entire release)
  • index: index of a journal, or series of abstracts from a conference
  • slides: slide deck (usually in "landscape" orientation)
  • front-matter: non-article content from a journal, such as editorial policies
  • supplement: usually a file entity which is a supplement or appendix, not the entire work
  • component: a sub-component of a release, which may or may not be associated with a component release entity. For example, a single figure or table as part of an article
  • poster: digital copy of a poster, eg as displayed at conference poster sessions
  • sample: a partial sample of the entire work. eg, just the first page of an article. distinct from truncated
  • truncated: the file has been truncated at a binary level, and may also be corrupt or invalid. distinct from sample
  • corrupt: broken, mangled, or corrupt file (at the binary level)
  • stub: any other out-of-scope artifact situations, where the artifact represents something which would not link to any possible in-scope release in the catalog (except a stub release)
  • landing-page: for webcapture, the landing page of a work, as opposed to the work itself
  • spam: content is spam. articles, webpages, or issues which include incidental advertisements within them are not counted as spam

Fileset Entity Reference

Fields

  • manifest (array of objects): each entry represents a file

    • path (string, required): relative path to file (including filename)
    • size (integer, required): in bytes
    • md5 (string): MD5 hash in lower-case hex
    • sha1 (string): SHA-1 hash in lower-case hex
    • sha256 (string): SHA-256 hash in lower-case hex
    • mimetype (string): Content type in MIME type schema
    • extra (object): any extra metadata about this specific file. all are optional
      • original_url: live web canonical URL to download this file
      • webarchive_url: web archive capture of this file
  • urls: An array of "typed" URLs. Order is not meaningful, and may not be preserved. These are URLs for the entire fileset, not individual files.

    • url (string, required): Eg: "https://example.edu/~frau/prcding.pdf".
    • rel (string, required): Eg: "archive-base", "webarchive".
  • release_ids (array of string identifiers): references to release entities

  • content_scope (string): for situations where the fileset does not simply contain the full representation of a work (eg, all files in dataset, for a dataset release), describes what that scope of coverage is. Uses same vocabulary as File entity.

  • extra (object with string keys): additional metadata about this group of files, including upstream platform-specific metadata and identifiers

    • platform_id: platform-specific identifier for this fileset

URL rel types

Any ending in "-base" implies that a file path (from the manifest) can be appended to the "base" URL to get a file download URL. Any "bundle" implies a direct link to an archive or "bundle" (like .zip or .tar) which contains all the files in this fileset

  • repository or platform or web: URL of a live-web landing page or other location where content can be found. May or may not be machine-reachable.
  • webarchive: web archive version of repository landing page
  • repository-bundle: direct URL to a live-web "archive" file, such as .zip, which contains all of the individual files in this fileset
  • webarchive-bundle: web archive version of repository-bundle
  • archive-bundle: file archive version of repository-bundle
  • repository-base: live-web base URL/directory from which file path can be appended to fetch individual files
  • archive-base: base URL/directory from which file path can be appended to fetch individual files

Web Capture Entity Reference

Fields

Warning: This schema is not yet stable.

  • cdx (array of objects): each entry represents a distinct web resource (URL). First is considered the primary/entry. Roughly aligns with CDXJ schema.
    • surt (string, required): sortable URL format
    • timestamp (string, datetime, required): ISO format, UTC timezone, with the "Z" suffix required, with second (or finer) precision. Eg, "2016-09-19T17:20:24Z". Wayback timestamps (like "20160919172024") should be converted naively (see the conversion sketch after this list).
    • url (string, required): full URL
    • mimetype (string): content type of the resource
    • status_code (integer, signed): HTTP status code
    • sha1 (string, required): SHA-1 hash in lower-case hex
    • sha256 (string): SHA-256 hash in lower-case hex
  • archive_urls: An array of "typed" URLs where this snapshot can be found. Can be wayback/memento instances, or direct links to a WARC file containing all the capture resources. Often will only be a single archive. Order is not meaningful, and may not be preserved.
    • url (string, required): Eg: "https://example.edu/~frau/prcding.pdf".
    • rel (string, required): Eg: "wayback" or "warc"
  • original_url (string): base URL of the resource. May reference a specific CDX entry, or may be in normalized form.
  • timestamp (string, datetime): same format as CDX line timestamp (UTC, etc). Corresponds to the overall capture timestamp. Can be the earliest of CDX timestamps if that makes sense
  • content_scope (string): for situations where the webcapture does not simply contain the full representation of a work (eg, HTML fulltext, for an article-journal release), describes what that scope of coverage is. Eg, landing-page if it doesn't contain the full content. Landing pages are out-of-scope for fatcat, but if they were accidentally imported, they should be marked as such so they aren't re-imported. Uses same vocabulary as File entity.
  • release_ids (array of string identifiers): references to release entities
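The Wayback timestamp conversion mentioned in the cdx fields above can be done naively, as in this sketch:

    from datetime import datetime, timezone

    def wayback_to_iso(ts: str) -> str:
        # "20160919172024" -> "2016-09-19T17:20:24Z" (UTC assumed)
        dt = datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
        return dt.strftime("%Y-%m-%dT%H:%M:%SZ")

    print(wayback_to_iso("20160919172024"))  # -> 2016-09-19T17:20:24Z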

Release Entity Reference

Fields

  • title (string, required): the display title of the release. May include subtitle.
  • subtitle (string): intended to be used primarily with books, not journal articles. The subtitle may also be appended to the title instead of populating this field.
  • original_title (string): the full original language title, if title is translated
  • work_id (fatcat identifier; required): the (single) work that this release is grouped under. If not specified in a creation (POST) action, the API will auto-generate a work.
  • container_id (fatcat identifier): a (single) container that this release is part of. When expanded the container field contains the full container entity.
  • release_type (string, controlled set): represents the medium or form-factor of this release; eg, "book" versus "journal article". Not necessarily the same across all releases of a work. See definitions below.
  • release_stage (string, controlled set): represents the publishing/review lifecycle status of this particular release of the work. See definitions below.
  • release_date (string, ISO date format): when this release was first made publicly available. Blank if only year is known.
  • release_year (integer): year when this release was first made publicly available; should match release_date if both are known.
  • withdrawn_status (optional, string, controlled set):
  • withdrawn_date (optional, string, ISO date format): when this release was withdrawn. Blank if only year is known.
  • withdrawn_year (optional, integer): year when this release was withdrawn; should match withdrawn_date if both are known.
  • ext_ids (key/value object of string-to-string mappings): external identifiers. At least an empty ext_ids object is always required for release entities, so individual identifiers can be accessed directly.
  • volume (string): optionally, stores the specific volume of a serial publication this release was published in. type: string
  • issue (string): optionally, stores the specific issue of a serial publication this release was published in.
  • pages (string): the pages (within a volume/issue of a publication) that this release can be looked up under. This is a free-form string, and could represent the first page, a range of pages, or even prefix pages (like "xii-xxx").
  • version (string): optionally, a string that distinguishes this release version from others. Generally a number, software-style version, or other short/slug string, not a freeform description. Book "edition" descriptions can also go in an edition extra field. Often used in conjunction with external identifiers. If you're not certain, don't use this field!
  • number (string): an inherent identifier for this release (or work), often part of the title. For example, standards numbers, technical memo numbers, book series number, etc. Not a book chapter number however (which can be stored in extra). Depending on field or series-specific norms, the number may be stored here, in the title, or in both fields.
  • publisher (string): name of the publishing entity. This does not need to be populated if the associated container entity has the publisher field set, though it is acceptable to duplicate, as the publishing entity of a container may differ over time. Should be set for singleton releases, like books.
  • language (string, slug): the primary language used in this particular release of the work. Only a single language can be specified; additional languages can be stored in "extra" metadata (TODO: which field?). This field should be a valid RFC1766/ISO639 language code (two letters). AKA, a controlled vocabulary, not a free-form name of the language.
  • license_slug (string, slug): the license of this release. Usually a creative commons short code (eg, CC-BY), though a small number of other short names for publisher-specific licenses are included (TODO: list these).
  • contribs (array of objects): an array of authorship and other creator contributions to this release. Contribution fields include:
    • index (integer, optional): the (zero-indexed) order of this author. Authorship order has significance in many fields. Non-author contributions (illustration, translation, editorship) may or may not be ordered, depending on context, but index numbers should be unique per release (aka, there should not be "first author" and "first translator")
    • creator_id (identifier): if known, a reference to a specific creator
    • raw_name (string): the name of the contributor, as attributed in the text of this work. If the creator_id is linked, this may be different from the display_name; if a creator is not linked, this field is particularly important. Syntax and name order is not specified, but most often will be "display order", not index/alphabetical (in Western tradition, surname followed by given name).
    • role (string, of a set): the type of contribution, from a controlled vocabulary. TODO: vocabulary needs review.
    • extra (object, optional): additional context can go here. For example, author affiliation, "this is the corresponding author", etc.
  • refs (array of objects): references (aka, citations) to other releases. References can only be linked to a specific target release (not a work), though it may be ambiguous which release of a work is being referenced if the citation is not specific enough. IMPORTANT: release refs are distinct from the reference graph API. Reference fields include:
    • index (integer, optional): reference lists and bibliographies almost always have an implicit order. Zero-indexed. Note that this is distinct from the key field.
    • target_release_id (fatcat identifier): if known, and the release exists, a cross-reference to the Fatcat entity
    • extra (JSON, optional): additional citation format metadata can be stored here, particularly if the citation schema does not align. Common fields might be "volume", "authors", "issue", "publisher", "url", and external identifiers ("doi", "isbn13").
    • key (string): works often reference works with a short slug or index number, which can be captured here. For example, "[BROWN2017]". Keys generally supersede the index field, though both can/should be supplied.
    • year (integer): year of publication of the cited release.
    • container_title (string): if applicable, the name of the container of the release being cited, as written in the citation (usually an abbreviation).
    • title (string): the title of the work/release being cited, as written.
    • locator (string): a more specific reference into the work/release being cited, for example the page number(s). For web reference, store the URL in "extra", not here.
  • abstracts (array of objects): abstracts associated with this release; each object has the following fields:
    • sha1 (string, hex, required): reference to the abstract content (string). Example: "3f242a192acc258bdfdb151943419437f440c313"
    • content (string): The abstract raw content itself. Example: <jats:p>Some abstract thing goes here</jats:p>
    • mimetype (string): not formally required, but should effectively always get set. text/plain if the abstract doesn't have a structured format
    • lang (string, controlled set): the human language this abstract is in. See the lang field of release for format and vocabulary.
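
To make the field reference above concrete, a hypothetical and heavily abbreviated release entity (in API JSON form) might look like the following; all values below are made up for illustration:

{
    "ident": "aaaaaaaaaaaaarceaaaaaaaaai",
    "state": "active",
    "title": "An Example Article",
    "release_type": "article-journal",
    "release_stage": "published",
    "release_year": 2019,
    "language": "en",
    "ext_ids": {
        "doi": "10.1234/abcde.789"
    },
    "contribs": [
        {"index": 0, "raw_name": "Jane Doe", "role": "author"}
    ],
    "refs": [
        {"index": 0, "key": "BROWN2017", "title": "A Cited Work", "year": 2017}
    ],
    "abstracts": [
        {"mimetype": "text/plain", "lang": "en", "content": "Some abstract text."}
    ]
}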

External Identifiers (ext_ids)

The ext_ids object name-spaces external identifiers and makes it easier to add new identifiers to the schema in the future.

Many identifier fields must match an internal regex (string syntax constraint) to ensure they are properly formatted, though these checks aren't always complete or correct in more obscure cases.

  • doi (string): full DOI number, lower-case. Example: "10.1234/abcde.789". See section below for more about DOIs specifically
  • wikidata_qid (string): external identifier for Wikidata entities. These are integers prefixed with "Q", like "Q4321". Each release entity can be associated with at most one Wikidata entity (this field is not an array), and Wikidata entities should be associated with at most a single release. In the future it may be possible to associate Wikidata entities with work entities instead.
  • isbn13 (string): external identifier for books. ISBN-10 and other formats should be converted to canonical ISBN-13.
  • pmid (string): external identifier for the PubMed database. These are bare integers, but stored in a string format.
  • pmcid (string): external identifier for the PubMed Central database. These are integers prefixed with "PMC" (upper case), like "PMC4321". Versioned PMCIDs can also be stored (eg, "PMC4321.1"); whether versions should always be stored may need future clarification.
  • core (string): external identifier for the CORE open access aggregator. Not used much in practice. These identifiers are integers, but stored in string format.
  • arxiv (string): external identifier to a (version-specific) arxiv.org work. For releases, must always include the vN suffix (eg, v3).
  • jstor (string): external identifier for works in JSTOR which do not have a valid registered DOI.
  • ark (string): ARK identifier.
  • mag (DEPRECATED; string): Microsoft Academic Graph (MAG) identifier. As of December 2021, no entities in the catalog have a value for this field.
  • doaj (string): DOAJ article-level identifier
  • dblp (string): dblp article-level identifier
  • oai (string): OAI-PMH record id. Only use if no other identifier is available
  • hdl (string): handle.net identifier. While DOIs are technically handles, do not put DOIs in this field. Handles are normalized to lower-case in the catalog (server-side).

extra Fields

  • crossref (object), for extra crossref-specific metadata
    • subject (array of strings) for subject/category of content
    • type (string) raw/original Crossref type
    • alternative-id (array of strings)
    • archive (array of strings), indicating preservation services deposited
    • funder (object/dictionary)
  • aliases (array of strings) for additional titles this release might be known by
  • container_name (string) if not matched to a container entity
  • group-title (string) for releases within a collection/group
  • translation_of (release identifier) if this release is a translation of another (usually under the same work)
  • superceded (boolean) if there is another release under the same work that should be referenced/indicated instead. Intended as a temporary hint until proper work-based search is implemented. As an example use, all arxiv release versions except for the most recent get this set.
  • is_work_alias (boolean): if true, then this release is an alias or pointer to the entire work, or the most recent version of the work. For example, some data repositories have separate DOIs for each version of the dataset, then an additional DOI that points to the "latest" version/DOI.

release_type Vocabulary

This vocabulary is based on the CSL types, with a small number of (proposed) extensions:

  • article-magazine
  • article-journal, including pre-prints and working papers
  • book
  • chapter is allowed as chapters are frequently referenced and read independently of the entire book. The data model does not currently support linking a subset of a release to an entity representing the entire release. The release/work/file distinctions should not be used to group multiple chapters under a single work; a book chapter can be its own work. A paper which is republished as a chapter (eg, in a collection, or "edited" book) can have both releases under one work. The criterion for whether to "split" a book and have release entities for each chapter is whether the chapter has been cited/referenced as such.
  • dataset
  • entry, which can be used for generic web resources like question/answer site entries.
  • entry-encyclopedia
  • manuscript
  • paper-conference
  • patent
  • post-weblog for blog entries
  • report
  • review, for things like book reviews, not the "literature review" form of article-journal, nor peer reviews (see peer_review). Note review-book for book reviews specifically.
  • speech can be used for eg, slides and recorded conference presentations themselves, as distinct from paper-conference
  • thesis
  • webpage
  • peer_review (fatcat extension)
  • software (fatcat extension)
  • standard (fatcat extension), for technical standards like RFCs
  • abstract (fatcat extension), for releases that are only an abstract of a larger work. In particular, translations. Many are granted DOIs.
  • editorial (custom extension) for columns, "in this issue", and other content published alongside peer-reviewed content in journals. Many are granted DOIs.
  • letter for "letters to the editor", "authors respond", and sub-article-length published content. Many are granted DOIs.
  • stub (fatcat extension) for releases which have notable external identifiers, and thus are included "for completeness", but don't seem to represent a "full work".
  • component (fatcat extension) for sub-components of a full paper or other work. Eg, tables, or individual files as part of a dataset.

An example of a stub might be a paper that gets an extra DOI by accident; the primary DOI should be a full release, and the accidental DOI can be a stub release under the same work. stub releases shouldn't be considered full releases when counting or aggregating (though if technically difficult this may not always be implemented). Other things that can be categorized as stubs (which seem to often end up mis-categorized as full articles in bibliographic databases):

  • commercial advertisements
  • "trap" or "honey pot" works, which are fakes included in databases to detect re-publishing without attribution
  • "This page is intentionally blank"
  • "About the author", "About the editors", "About the cover"
  • "Acknowledgments"
  • "Notices"

All other CSL types are also allowed, though they are mostly out of scope:

  • article (generic; should usually be some other type)
  • article-newspaper
  • bill
  • broadcast
  • entry-dictionary
  • figure
  • graphic
  • interview
  • legislation
  • legal_case
  • map
  • motion_picture
  • musical_score
  • pamphlet
  • personal_communication
  • post
  • review-book
  • song
  • treaty

For the purpose of statistics, the following release types are considered "papers":

  • article
  • article-journal
  • chapter
  • paper-conference
  • thesis

release_stage Vocabulary

These roughly follow the DRIVER publication version guidelines, with the addition of a retracted status.

  • draft is an early version of a work which is not considered for peer review. Sometimes these are posted to websites or repositories for early comments and feedback.
  • submitted is the version that was submitted for publication. Also known as "pre-print", "pre-review", "under review". Note that this doesn't imply that the work was ever actually submitted, reviewed, or accepted for publication, just that this is the version that "would be". Most versions in pre-print repositories are likely to have this status.
  • accepted is a version that has undergone peer review and been accepted for publication, but has not gone through any publisher copy editing or re-formatting. Also known as "post-print", "author's manuscript", "publisher's proof".
  • published is the version that the publisher distributes. May include minor (grammatical, typographical, broken link, aesthetic) corrections. Also known as "version of record", "final publication version", "archival copy".
  • updated: post-publication significant updates (considered a separate release in Fatcat). Also known as "correction" (in the context of either a published "correction notice", or the full new version)
  • retraction for post-publication retraction notices (should be a release under the same work as the published release)

Note that in the case of a retraction, the original publication does not get release_stage retraction; only the retraction notice does. The original publication does get a withdrawn_status metadata field set.

When blank, indicates status isn't known, and wasn't inferred at creation time. Can often be interpreted as published, but be careful!

withdrawn_status Vocabulary

We don't know of an existing controlled vocabulary for things like retractions or other reasons for marking papers as removed from publication, so we invented our own. These labels should be considered experimental and subject to change.

Note that some of these will apply more to pre-print servers or publishing accidents, and don't necessarily make sense as a formal change of status for a print journal publication.

Any value at all indicates that the release should be considered "no longer published by the publisher or primary host", which could mean different things in different contexts. As some concrete examples: works are sometimes accidentally assigned duplicate DOIs; physics papers have been taken down in response to government order under national security justifications; papers have been withdrawn for public health reasons (above and beyond any academic-style retraction); entire journals may be found to be predatory and pulled from circulation; individual papers may be retracted by authors if a serious mistake or error is found; an author's entire publication history may be retracted in cases of serious academic misconduct or fraud.

  • withdrawn is generic: the work is no longer available from the original publisher. There may be no reason, or the reason may not be known yet.
  • retracted for when a work is formally retracted, usually accompanied by a retraction notice (a separate release under the same work). Note that the retraction itself should not have a withdrawn_status.
  • concern for when publishers release an "expression of concern", often indicating that the work is not reliable in some way, but not yet formally retracted. In this case the original work is probably still available, but should be marked as suspect. This is not the same as presence of errata.
  • safety for works pulled for public health or human safety concerns.
  • national-security for works pulled over national security concerns.
  • spam for content that is considered spam (eg, bogus pre-print or repository submissions). Not to be confused with advertisements or product reviews in journals.

contribs.role Vocabulary

  • author
  • translator
  • illustrator
  • editor

All other CSL role types are also allowed, though are mostly out of scope for Fatcat:

  • collection-editor
  • composer
  • container-author
  • director
  • editorial-director
  • editortranslator
  • interviewer
  • original-author
  • recipient
  • reviewed-author

If blank, indicates that the type of contribution is not known; this can often be interpreted as authorship.

More About DOIs

All DOIs stored in an entity column should be registered (aka, should be resolvable from doi.org). Invalid identifiers may be cleaned up or removed by bots.

DOIs should always be stored and transferred in lower-case form. Note that there are almost no other constraints on DOIs (and handles in general): they may have multiple forward slashes, contain whitespace, be of arbitrary length, etc. Crossref has a number of examples of such "valid" but frustratingly formatted strings.
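
As an illustration of the lower-casing norm, a rough normalization sketch in python (not the canonical cleanup logic used by Fatcat bots) might look like:

def clean_doi(raw):
    """Rough DOI normalization sketch: strip common resolver prefixes and lower-case.

    Real-world cleanup (given the messy-but-valid DOIs mentioned above) needs more care.
    """
    raw = raw.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "https://dx.doi.org/", "doi:"):
        if raw.lower().startswith(prefix):
            raw = raw[len(prefix):]
    raw = raw.lower()
    if not raw.startswith("10.") or "/" not in raw:
        return None
    return raw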

In the Fatcat ontology, DOIs and release entities are one-to-one.

It is the intention to automatically (via bot) create a Fatcat release for every Crossref-registered DOI from an allowlist of media types ("journal-article" etc, but not all), and it would be desirable to auto-create entities for in-scope publications from all registrars. It is not the intention to auto-create a release for every registered DOI. In particular, "sub-component" DOIs (eg, for an individual figure or table from a publication) aren't currently auto-created, but could be stored in "extra" metadata, or on a case-by-case basis.

Work Entity Reference

Works have no fields! They just group releases.

REST API

The Fatcat HTTP API is mostly a classic REST "CRUD" (Create, Read, Update, Delete) API, with a few twists.

A declarative specification of all API endpoints, JSON data models, and response types is available in OpenAPI 2.0 format. Code generation tools are used to generate both server-side type-safe endpoint routes and client-side libraries. Auto-generated reference documentation is, for now, available at https://api.fatcat.wiki.

All API traffic is over HTTPS; there is no HTTP endpoint, even for read-only operations. All endpoints accept and return only JSON serialized content.

Entity Endpoints/Actions

Actions could, in theory, be directed at any of:

entities (ident)
revision
edit

Top-level entity actions (resulting in edits):

create (new rev)
update (new rev)
delete
redirect
split (remove redirect)

On existing entity edits (within a group):

update
delete

An edit group as a whole can be:

create
submit
accept

Other per-entity endpoints:

lookup (by external persistent identifier)
match (by field/context; unimplemented)

Editgroups

All mutating entity operations (create, update, delete) accept a required editgroup_id query parameter. Editgroups (with contextual metadata) should be created before starting edits.

Related edits (to multiple entities) should be collected under a single editgroup, up to a reasonable size. More than 50 edits per entity type, or more than 100 edits total in an editgroup become unwieldy.

After creating and modifying the editgroup, it may be "submitted", which flags it for review by bot and human editors. The editgroup may be "accepted" (merged), or if changes are necessary the edits can be updated and re-submitted.

Sub-Entity Expansion

To reduce the need for multiple GET queries when looking for common related metadata, it is possible to include linked entities in responses using the expand query parameter. For example, by default the release model only includes an optional container_id field which points to a container entity. If the expand parameter is set:

https://api.fatcat.wiki/v0/release/aaaaaaaaaaaaarceaaaaaaaaam?expand=container

Then the full container model will be included under the container field. Multiple expand parameters can be passed, comma-separated.
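
With the python client library (fatcat-openapi-client), an equivalent expanded fetch might look like this sketch (the release identifier is just the example from above):

import fatcat_openapi_client

conf = fatcat_openapi_client.Configuration()
conf.host = "https://api.fatcat.wiki/v0"
api = fatcat_openapi_client.DefaultApi(fatcat_openapi_client.ApiClient(conf))

# fetch a release with the container (and file) entities included inline
release = api.get_release("aaaaaaaaaaaaarceaaaaaaaaam", expand="container,files")
print(release.container.name)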

Authentication and Authorization

There are two editor types: bots and humans. Additionally, either type of editor may have additional privileges which allow them to, eg, directly accept editgroups (as opposed to submitting edits for review).

All mutating API calls (POST, PUT, DELETE HTTP verbs) require token-based authentication using an HTTP Bearer token. New tokens can be generated in the web interface.

Autoaccept Flag

The autoaccept flag is currently only supported on batch creation (POST) of entities.

For all bulk operations, an optional editgroup query parameter overrides any editgroup parameters set on individual entities.

If the autoaccept flag is set and editgroup is not, a new editgroup is automatically created and used for all entities inserted. Note that this is different behavior from the "use current or create new" default behavior for regular creation.

Unfortunately, "true" and "false" are the only acceptable values for boolean rust/openapi2 query parameters.

QA Instance

The intent is to run a public "sandbox" QA instance of the catalog, using a subset of the full catalog, running the most recent development branch of the API specification. This instance can be used by developers for prototyping and experimentation, though note that all data is periodically wiped, and this endpoint is more likely to have bugs or be offline.

Search Index

The Elasticsearch indices used to power metadata search, statistics, and graphs on the fatcat web interface are exposed publicly at https://search.fatcat.wiki/. Third parties can make queries using the Elasticsearch API, which is well documented online and has client libraries in many programming languages.

A thin proxy (es-public-proxy) filters requests to avoid expensive queries which could cause problems for search queries on the web interface, but most of the Elasticsearch API is supported, including powerful aggregation queries. CORS headers are supported, meaning that queries can be made directly from web browsers.

There is a short delay between updates to the fatcat catalog (via the main API) and updates to the search index.
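
As a sketch of a simple query (using the python requests library against the fatcat_release index described below; this assumes the proxy passes through POST _search requests):

import requests

# simple full-text query against the public release search index (read-only)
resp = requests.post(
    "https://search.fatcat.wiki/fatcat_release/_search",
    json={
        "query": {"query_string": {"query": "coffee consumption"}},
        "size": 3,
    },
)
resp.raise_for_status()
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"].get("title"))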

Notable indices include:

  • fatcat_release: release entity metadata (schema)
  • fatcat_container: container entity metadata (schema)
  • fatcat_ref: reference graph (schema)
  • scholar_fulltext: scholar.archive.org full-text index (body text can be queried, but not downloaded or extracted from index) (schema)

Schemas for these indices can be fetched directly from the index (eg, https://search.fatcat.wiki/fatcat_release/_mapping), and are versioned in the fatcat git repository under fatcat:extra/elasticsearch/. They are a simplification and transform of the regular entity schemas, and include some synthesized fields (such as "preservation status" for releases). Note that the search schemas are likely to change over time, with less notice and weaker stability guarantees than the primary catalog API schema.

Bulk Exports

There are several types of bulk exports and database dumps folks might be interested in:

  • complete database dumps
  • changelog history with all entity revisions and edit metadata
  • identifier snapshot tables
  • entity exports

All exports and dumps get uploaded to the Internet Archive under the "Fatcat Database Snapshots and Bulk Metadata Exports" collection.

Complete Database Dumps

The simplest and most complete bulk export. Useful for disaster recovery, mirroring, or forking the entire service. The internal database schema is not stable, so not as useful for longitudinal analysis. These dumps will include edits-in-progress, deleted entities, old revisions, etc, which are potentially difficult or impossible to fetch through the API.

Public copies may have some tables redacted (eg, API credentials).

Dumps are in PostgreSQL pg_dump "tar" binary format, and can be restored locally with the pg_restore command. See ./extra/sql_dumps/ for commands and details. Dumps are on the order of 100 GBytes (compressed) and will grow over time.

Changelog History

These are currently unimplemented; would involve "hydrating" sub-entities into changelog exports. Useful for some mirrors, and analysis that needs to track provenance information. Format would be the public API schema (JSON).

All information in these dumps should be possible to fetch via the public API, including on a feed/streaming basis using the sequential changelog index. All information is also contained in the database dumps.

Identifier Snapshots

Many of the other dump formats are very large. To save time and bandwidth, a few simple snapshot tables can be exported directly in TSV format. Because these tables can be dumped in single SQL transactions, they are consistent point-in-time snapshots.

One format is per-entity identifier/revision tables. These contain active, deleted, and redirected identifiers, with revision and redirect references, and are used to generate the entity dumps below.

Other tables contain external identifier mappings or file hashes.

Release abstracts can be dumped in their own table (JSON format), allowing them to be included only by reference from other dumps. The copyright status and usage restrictions on abstracts are different from other catalog content; see the policy page for more context. Abstracts are immutable and referenced by hash in the database, so the consistency of these dumps is not as much of a concern as with other exports.

Unlike all other dumps and public formats, the Fatcat identifiers in these dumps are in raw UUID format (not base32-encoded), though this may be fixed in the future.
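
If you need to work with both forms, converting between them can be done with the python standard library; this sketch assumes the base32 form is a plain RFC 4648 encoding (lower-case, padding stripped) of the UUID bytes:

import base64
import uuid

def uuid_to_fcid(s):
    """Convert a raw UUID string to the 26-character base32 identifier form."""
    return base64.b32encode(uuid.UUID(s).bytes).decode("ascii").lower().rstrip("=")

def fcid_to_uuid(fcid):
    """Convert a base32-encoded fatcat identifier back to a standard UUID string."""
    raw = base64.b32decode(fcid.upper() + "======")
    return str(uuid.UUID(bytes=raw))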

See ./extra/sql_dumps/ for scripts and details. Dumps are on the order of a couple GBytes each (compressed).

Entity Exports

Using the above identifier snapshots, the Rust fatcat-export program outputs single-entity-per-line JSON files with the same schema as the HTTP API. These might contain the default fields, or be in "expanded" format containing sub-entities for each record.

Only "active" entities are included (not deleted, work-in-progress, or redirected entities).

These dumps can be quite large when expanded (over 100 GBytes compressed), but do not include history so will not grow as fast as other exports over time. Not all entity types are dumped at the moment; if you would like specific dumps get in touch!
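
Since each line of these files is a complete JSON entity in the API schema, processing them requires no special tooling. A minimal python sketch (the filename is hypothetical; actual dump files are named by date and entity type):

import gzip
import json

# iterate over a gzip-compressed release export, one JSON entity per line
with gzip.open("release_export_expanded.json.gz", "rt") as f:
    for line in f:
        release = json.loads(line)
        if release.get("release_type") == "article-journal":
            print(release.get("title"))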

Cookbook

These quickstart examples gloss over a lot of details in the API. The canonical API documentation (generated from the OpenAPI specification) is available at https://api.fatcat.wiki/redoc.

The first two simple cookbook examples here include full headers. Later examples only show python client library code snippets.

Lookup Fulltext URLs by DOI

Often you have a DOI or other paper identifier and want to find open copies of the paper to read. In fatcat terms, you want to lookup a release by external identifier, then sort through any associated file entities to find the best files and URLs to download. Note that the Unpaywall API is custom designed for this task and you should look into using that instead.

This is a read-only task and requires no authentication. The simple summary is to:

  1. GET the release lookup endpoint with the external identifier as a query parameter. Also set the hide parameter to elide unused fields, and the expand parameter to files to include related files in a single request.
  2. If you get a hit (HTTP 200), sort through the files field (an array) and for each file the urls field (also an array) to select the best URL(s).

The URL to use would look like https://api.fatcat.wiki/v0/release/lookup?doi=10.1088/0264-9381/19/7/380&expand=files&hide=abstracts,refs in a browser. The query parameters should be URL encoded (eg, the DOI / characters replaced with %2F), but almost all HTTP tools and libraries will do this automatically.

The raw HTTP request would look like:

GET /v0/release/lookup?doi=10.1088%2F0264-9381%2F19%2F7%2F380&expand=files&hide=abstracts%2Crefs HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: api.fatcat.wiki
User-Agent: HTTPie/0.9.8

And the response (with some headers removed and JSON body paraphrased):

HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 1996
Content-Type: application/json
Date: Tue, 17 Sep 2019 22:47:54 GMT
X-Frame-Options: SAMEORIGIN
X-Span-ID: caa70cff-967d-4429-96c6-71909738ab4c

{
    "ident": "3j36alui7fcwncbc4xdaklywb4", 
    "title": "LIGO sensing system performance", 
    "publisher": "IOP Publishing", 
    "release_date": "2002-03-19", 
    "release_stage": "published", 
    "release_type": "article-journal", 
    "release_year": 2002, 
    "revision": "2e36dfbe-9a4b-4917-95bb-f02b04f6b5d0", 
    "state": "active", 
    "work_id": "ejllv7xq4rgrrffpsf3prqurwq"
    "container_id": "j5iizqxt2rainmxg6nfmpg2ds4", 
    "contribs": [],
    "ext_ids": {
        "doi": "10.1088/0264-9381/19/7/380"
    }, 
    "files": [
        {
            "ident": "vmfyqb77r5gs3pkoekzfcjgsb4", 
            "mimetype": "application/pdf", 
            "release_ids": [
                "3j36alui7fcwncbc4xdaklywb4"
            ], 
            "revision": "66639928-d9e2-45e2-a883-36616d5b0a67", 
            "sha1": "54244fe8d35bff2db2a3ff946e60c194f68821ae", 
            "state": "active", 
            "urls": [
                {
                    "rel": "web", 
                    "url": "http://www.gravity.uwa.edu.au/amaldi/papers/Landry.pdf"
                }, 
                {
                    "rel": "webarchive", 
                    "url": "https://web.archive.org/web/20081011163648/http://www.gravity.uwa.edu.au/amaldi/papers/Landry.pdf"
                }
            ]
        }, 
        {
            "ident": "3ta26geysncdxlgswjoaiqlbyu", 
            "mimetype": "application/pdf", 
            "release_ids": [
                "3j36alui7fcwncbc4xdaklywb4"
            ], 
            "revision": "5c7a8cb0-4710-415a-93d5-d7cb6c42dfd1", 
            "sha1": "954c0fb370af7f72a0cb47505b8793e8e5e23136", 
            "state": "active", 
            "urls": [
                {
                    "rel": "webarchive", 
                    "url": "https://web.archive.org/web/20050624182645/http://www.gravity.uwa.edu.au/amaldi/papers/Landry.pdf"
                }, 
                {
                    "rel": "webarchive", 
                    "url": "https://web.archive.org/web/20091024040004/http://www.gravity.uwa.edu.au/amaldi/papers/Landry.pdf"
                }, 
                {
                    "rel": "web", 
                    "url": "http://www.gravity.uwa.edu.au/amaldi/papers/Landry.pdf"
                }
            ]
        }
    ]
}

An httpie and jq one-liner to grab the first URL would be:

http https://api.fatcat.wiki/v0/release/lookup doi==10.1088/0264-9381/19/7/380 expand==files hide==abstracts,refs | jq '.files[0].urls[0].url' -r

Using the python client library (fatcat-openapi-client), you might do something like:

import fatcat_openapi_client
from fatcat_openapi_client.rest import ApiException

conf = fatcat_openapi_client.Configuration()
conf.host = "https://api.fatcat.wiki/v0"
api = fatcat_openapi_client.DefaultApi(fatcat_openapi_client.ApiClient(conf))
doi = "10.1088/0264-9381/19/7/380"

try:
    r = api.lookup_release(doi=doi, expand="files", hide="abstracts,refs")
except ApiException as ae:
    if ae.status == 404:
        print("DOI not found!")
        raise SystemExit(1)
    else:
        raise ae

print("Fatcat release found: https://fatcat.wiki/release/{}".format(r.ident))

wayback_url = None
for f in r.files:
    if f.mimetype != 'application/pdf':
        continue
    # look through this file's URLs for a Wayback Machine copy
    for u in f.urls:
        if u.rel == 'webarchive' and '//web.archive.org/' in u.url:
            wayback_url = u.url
            break
    if wayback_url:
        break

if wayback_url:
    print("Wayback PDF URL: {}".format(wayback_url))
else:
    print("No Wayback PDF URL found")

A more advanced lookup tool would check for sibling releases under the same work and provide both alternative links ("no version of record available, but here is the pre-print") and notify the end user about any updates or retractions to the work as a whole.
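
A rough sketch of that follow-up step, assuming the work-releases endpoint (exposed as get_work_releases in the python client) and continuing from the lookup example above:

# `r` and `api` come from the lookup example above
siblings = api.get_work_releases(r.work_id)
for sibling in siblings:
    if sibling.ident == r.ident:
        continue
    print("Sibling release: https://fatcat.wiki/release/{} (stage: {}, withdrawn: {})".format(
        sibling.ident, sibling.release_stage, sibling.withdrawn_status))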

Creating an Entity

Let's use a container (journal) entity as a simple example of mutation of the catalog. This assumes you already have an editor account and API token, both obtained through the web interface.

In summary:

  1. Create (POST) an editgroup
  2. Create (POST) the container entity as part of editgroup
  3. Submit the editgroup for review
  4. (privileged) Accept the editgroup

See the API docs for full details of authentication.

To create an editgroup, the raw HTTP request (to https://api.fatcat.wiki/v0/editgroup) and response would look like:

POST /v0/editgroup HTTP/1.1
Accept: application/json, */*
Accept-Encoding: gzip, deflate
Authorization: Bearer AgEPZGV2LmZhdGNhdC53aWtpAhYyMDE5MDEwMS1kZXYtZHVtbXkta2V5AAImZWRpdG9yX2lkID0gYWFhYWFhYWFhYWFhYmt2a2FhYWFhYWFhYWkAAht0aW1lID4gMjAxOS0wMS0wOVQwMDo1Nzo1MloAAAYgnroNha1hSftChtxHGTnLEmM/pY8MeQS/jBSV0UNvXug=
Connection: keep-alive
Content-Length: 2
Content-Type: application/json
Host: api.fatcat.wiki
User-Agent: HTTPie/0.9.8

{}

HTTP/1.1 201 Created
Connection: keep-alive
Content-Length: 126
Content-Type: application/json
Date: Tue, 17 Sep 2019 23:25:55 GMT
X-Span-ID: cc016e0e-77ae-4ca0-b1da-b0a38e48a130

{
    "created": "2019-09-17T23:25:55.273836Z", 
    "editgroup_id": "aqhyo2ulmzfbrewn3rv7dhl65u", 
    "editor_id": "4vmpwdwxxneitkonvgm2pk6kya"
}

It is important to parse the response to get the editgroup_id. Next POST to https://api.fatcat.wiki/v0/editgroup/EDITGROUP_ID/container (with the editgroup_id substituted) and the JSON container entity as the body:

POST /v0/editgroup/aqhyo2ulmzfbrewn3rv7dhl65u/container HTTP/1.1
Accept: application/json, */*
Accept-Encoding: gzip, deflate
Authorization: Bearer AgEPZGV2LmZhdGNhdC53aWtpAhYyMDE5MDEwMS1kZXYtZHVtbXkta2V5AAImZWRpdG9yX2lkID0gYWFhYWFhYWFhYWFhYmt2a2FhYWFhYWFhYWkAAht0aW1lID4gMjAxOS0wMS0wOVQwMDo1Nzo1MloAAAYgnroNha1hSftChtxHGTnLEmM/pY8MeQS/jBSV0UNvXug=
Connection: keep-alive
Content-Length: 54
Content-Type: application/json
Host: api.fatcat.wiki
User-Agent: HTTPie/0.9.8

{
    "issnl": "1234-5678", 
    "name": "Journal of Something"
}

HTTP/1.1 201 Created
Connection: keep-alive
Content-Length: 181
Content-Type: application/json
Date: Tue, 17 Sep 2019 23:30:32 GMT
X-Span-ID: eb2f4243-ed43-4a21-bbf0-d653590fcfe2

{
    "edit_id": "ea203496-ecb9-45c7-ac50-3cb24cdbb58f", 
    "editgroup_id": "aqhyo2ulmzfbrewn3rv7dhl65u", 
    "ident": "g3kyxylxjbej7drf6apqpfkl6i", 
    "revision": "796429d2-44a4-4ece-a9b2-e80edcd4277a"
}

To submit an editgroup, use the update endpoint with the submit query parameter set to true. The body should be the editgroup object (as JSON), but is mostly ignored:

PUT /v0/editgroup/aqhyo2ulmzfbrewn3rv7dhl65u?submit=true HTTP/1.1
Accept: application/json, */*
Accept-Encoding: gzip, deflate
Authorization: Bearer AgEPZGV2LmZhdGNhdC53aWtpAhYyMDE5MDEwMS1kZXYtZHVtbXkta2V5AAImZWRpdG9yX2lkID0gYWFhYWFhYWFhYWFhYmt2a2FhYWFhYWFhYWkAAht0aW1lID4gMjAxOS0wMS0wOVQwMDo1Nzo1MloAAAYgnroNha1hSftChtxHGTnLEmM/pY8MeQS/jBSV0UNvXug=
Connection: keep-alive
Content-Length: 131
Content-Type: application/json
Host: api.fatcat.wiki
User-Agent: HTTPie/0.9.8

{
    "created": "2019-09-17T23:25:55.273836Z", 
    "editgroup_id": "aqhyo2ulmzfbrewn3rv7dhl65u", 
    "editor_id": "4vmpwdwxxneitkonvgm2pk6kya"
}

HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 168
Content-Type: application/json
Date: Tue, 17 Sep 2019 23:37:06 GMT
X-Span-ID: c0ac0406-83ce-4e07-a892-3f83c02ec207

{
    "created": "2019-09-17T23:25:55.273836Z", 
    "editgroup_id": "aqhyo2ulmzfbrewn3rv7dhl65u", 
    "editor_id": "4vmpwdwxxneitkonvgm2pk6kya", 
    "submitted": "2019-09-17T23:37:06.288434Z"
}

Lastly, if your editor account has the admin role, you can "accept" the editgroup using the accept endpoint:

POST /v0/editgroup/aqhyo2ulmzfbrewn3rv7dhl65u/accept HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Authorization: Bearer AgEPZGV2LmZhdGNhdC53aWtpAhYyMDE5MDEwMS1kZXYtZHVtbXkta2V5AAImZWRpdG9yX2lkID0gYWFhYWFhYWFhYWFhYmt2a2FhYWFhYWFhYWkAAht0aW1lID4gMjAxOS0wMS0wOVQwMDo1Nzo1MloAAAYgnroNha1hSftChtxHGTnLEmM/pY8MeQS/jBSV0UNvXug=
Connection: keep-alive
Content-Length: 0
Host: api.fatcat.wiki
User-Agent: HTTPie/0.9.8



HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 36
Content-Type: application/json
Date: Tue, 17 Sep 2019 23:40:21 GMT
X-Span-ID: cb4d66f0-9e67-4908-8dff-97489cc87ca2

{
    "message": "horray!", 
    "success": true
}

This whole exchange is, of course, much faster with the python library:

import fatcat_openapi_client
from fatcat_openapi_client.rest import ApiException

conf = fatcat_openapi_client.Configuration()
conf.host = "https://api.fatcat.wiki/v0"
conf.api_key_prefix["Authorization"] = "Bearer"
conf.api_key["Authorization"] = "AgEPZGV2LmZhdGNhdC53aWtpAhYyMDE5MDEwMS1kZXYtZHVtbXkta2V5AAImZWRpdG9yX2lkID0gYWFhYWFhYWFhYWFhYmt2a2FhYWFhYWFhYWkAAht0aW1lID4gMjAxOS0wMS0wOVQwMDo1Nzo1MloAAAYgnroNha1hSftChtxHGTnLEmM/pY8MeQS/jBSV0UNvXug="
api = fatcat_openapi_client.DefaultApi(fatcat_openapi_client.ApiClient(conf))

c = fatcat_openapi_client.ContainerEntity(
    name="Test Journal",
    issnl="1234-5678",
)
editgroup = api.create_editgroup(fatcat_openapi_client.Editgroup(description="my test editgroup"))
c_edit = api.create_container(editgroup.editgroup_id, c)
api.update_editgroup(editgroup.editgroup_id, editgroup, submit=True)

# only if you have permissions
api.accept_editgroup(editgroup.editgroup_id)

Updating an Existing Entity

It is important to ensure that edits/updates are idempotent, in this case meaning that if you ran the same script twice in quick succession, no mutation or update would occur the second time. This is usually achieved by always fetching entities just before an edit and checking that updates are actually necessary.

The basic process is to:

  1. Fetch (GET) or Lookup (GET) the existing entity. Check that edit is actually necessary!
  2. Create (POST) a new editgroup
  3. Update (PUT) the entity
  4. Submit (PUT) the editgroup for review

Python example code:

import fatcat_openapi_client
from fatcat_openapi_client.rest import ApiException

conf = fatcat_openapi_client.Configuration()
conf.host = "https://api.fatcat.wiki/v0"
conf.api_key_prefix["Authorization"] = "Bearer"
conf.api_key["Authorization"] = "AgEPZGV2LmZhdGNhdC53aWtpAhYyMDE5MDEwMS1kZXYtZHVtbXkta2V5AAImZWRpdG9yX2lkID0gYWFhYWFhYWFhYWFhYmt2a2FhYWFhYWFhYWkAAht0aW1lID4gMjAxOS0wMS0wOVQwMDo1Nzo1MloAAAYgnroNha1hSftChtxHGTnLEmM/pY8MeQS/jBSV0UNvXug="
api = fatcat_openapi_client.DefaultApi(fatcat_openapi_client.ApiClient(conf))

new_name = "Classical and Quantum Gravity"
c = api.get_container('j5iizqxt2rainmxg6nfmpg2ds4')
if c.name == new_name:
    print("Already updated!")
else:
    c.name = new_name
    editgroup = api.create_editgroup(fatcat_openapi_client.Editgroup(description="my test container editgroup"))
    c_edit = api.update_container(editgroup.editgroup_id, c.ident, c)
    api.update_editgroup(editgroup.editgroup_id, editgroup, submit=True)

Merging Duplicate Entities

Like other mutations, be careful that any merge operations do not clobber the catalog if run multiple times.

Summary:

  1. Fetch (GET) both entities. Ensure that merging is still required.
  2. Decide which will be the "primary" entity (the other will redirect to it)
  3. Create (POST) a new editgroup
  4. Update (PUT) the "primary" entity with any updated metadata merged from the other entity (optional), and the editgroup id set
  5. Update (PUT) the "other" entity with the redirect flag set to the primary's identifier.
  6. Submit (PUT) the editgroup for review
  7. Somebody (human or bot) with admin privileges will Accept (POST) the editgroup.

Python example code:

import fatcat_openapi_client
from fatcat_openapi_client.rest import ApiException

conf = fatcat_openapi_client.Configuration()
conf.host = "https://api.fatcat.wiki/v0"
conf.api_key_prefix["Authorization"] = "Bearer"
conf.api_key["Authorization"] = "AgEPZGV2LmZhdGNhdC53aWtpAhYyMDE5MDEwMS1kZXYtZHVtbXkta2V5AAImZWRpdG9yX2lkID0gYWFhYWFhYWFhYWFhYmt2a2FhYWFhYWFhYWkAAht0aW1lID4gMjAxOS0wMS0wOVQwMDo1Nzo1MloAAAYgnroNha1hSftChtxHGTnLEmM/pY8MeQS/jBSV0UNvXug="
api = fatcat_openapi_client.DefaultApi(fatcat_openapi_client.ApiClient(conf))

left = api.get_creator('iimvc523xbhqlav6j3sbthuehu')
right = api.get_creator('lav6j3sbthuehuiimvc523xbhq')

# check that merge/redirect hasn't happened yet
assert left.state == 'active' and right.state == 'active'
assert left.redirect is None and right.redirect is None
assert left.revision != right.revision

# decide to merge "right" into "left"
if not left.orcid:
    left.orcid = right.orcid
if not left.surname:
    left.surname = right.surname

editgroup = api.create_editgroup(fatcat_openapi_client.Editgroup(description="my test creator merge editgroup"))
left_edit = api.update_creator(editgroup.editgroup_id, left.ident, left)
right_edit = api.update_creator(editgroup.editgroup_id, right.ident,
    fatcat_openapi_client.CreatorEntity(redirect=left.ident))
api.update_editgroup(editgroup.editgroup_id, editgroup, submit=True)

Batch Create Entities

When importing large numbers (thousands) of entities, it can be faster to use the batch create operations instead of individual editgroup and entity creation. Using the batch endpoints requires care because the potential to pollute the catalog with bad entities (and the effort required to clean up) can be much larger.

These methods always require the admin role, because they are the equivalent of creation and editgroup accept.

It is not currently possible to do batch updates or deletes in a single request.

The basic process is:

  1. Confirm that input entities should be created (eg, using identifier lookups), and bundle into groups of 50-100 entities.
  2. Batch create (POST) a set of entities, with editgroup metadata included along with list of entities (all of a single type). Entire batch is inserted in a single request.
Python example code:

import fatcat_openapi_client
from fatcat_openapi_client import ReleaseEntity, ReleaseExtIds
from fatcat_openapi_client.rest import ApiException

conf = fatcat_openapi_client.Configuration()
conf.host = "https://api.fatcat.wiki/v0"
conf.api_key_prefix["Authorization"] = "Bearer"
conf.api_key["Authorization"] = "AgEPZGV2LmZhdGNhdC53aWtpAhYyMDE5MDEwMS1kZXYtZHVtbXkta2V5AAImZWRpdG9yX2lkID0gYWFhYWFhYWFhYWFhYmt2a2FhYWFhYWFhYWkAAht0aW1lID4gMjAxOS0wMS0wOVQwMDo1Nzo1MloAAAYgnroNha1hSftChtxHGTnLEmM/pY8MeQS/jBSV0UNvXug="
api = fatcat_openapi_client.DefaultApi(fatcat_openapi_client.ApiClient(conf))

releases = [
    ReleaseEntity(
      ext_ids=ReleaseExtIds(doi="10.123/456"),
      title="Dummy Release",
    ),
    ReleaseEntity(
      ext_ids=ReleaseExtIds(doi="10.123/789"),
      title="Another Dummy Release",
    ),
    # ... more releases here ...
]

# check that releases don't exist already; this could be a filter
for r in releases:
    existing = None
    try:
        existing = api.lookup_release(doi=r.ext_ids.doi)
    except ApiException as ae:
        assert ae.status == 404
    assert existing is None

# ensure our batch size isn't too large
assert len(releases) <= 100

editgroup = api.create_release_auto_batch(
    fatcat_openapi_client.ReleaseAutoBatch(
        editgroup=fatcat_openapi_client.Editgroup(
            description="my test batch",
        ),
        entity_list=releases,
    )
)

Import New Files Linked to Releases

Let's say you knew of many open access PDFs, including their SHA-1, size, and a URL:

10.123/456  7043946a7afe0ee32c9d4c22a9b3fc2ba6d34b42    7238    https://archive.org/download/open_access_files/456.pdf
10.123/789  350a8d5c6fac151ec2c81d4df5d58d14aeefc72f    1277    https://archive.org/download/open_access_files/789.pdf
10.123/900  9d9a9868a661b13c32fd38021addadb7b4a31122     166    https://archive.org/download/open_access_files/900.pdf
[...]

The process for adding these could be something like:

  1. For each row, check if file with SHA-1 exists; if so, skip
  2. For each row, lookup the release by DOI; if it doesn't exist, skip
  3. Transform into File entities
  4. Group entities into batches
  5. Submit batches to API

There are multiple ways to structure code to do this. You may want to look at the importer class under python/fatcat_tools/importers/common.py, and other existing import scripts in that directory for a framework to structure this type of import.

Here is a simpler example using only the python library:


# TODO: actually test this code

import sys
import fatcat_openapi_client
from fatcat_openapi_client import *
from fatcat_openapi_client.rest import ApiException

conf = fatcat_openapi_client.Configuration()
conf.host = "https://api.fatcat.wiki/v0"
conf.api_key_prefix["Authorization"] = "Bearer"
conf.api_key["Authorization"] = "AgEPZGV2LmZhdGNhdC53aWtpAhYyMDE5MDEwMS1kZXYtZHVtbXkta2V5AAImZWRpdG9yX2lkID0gYWFhYWFhYWFhYWFhYmt2a2FhYWFhYWFhYWkAAht0aW1lID4gMjAxOS0wMS0wOVQwMDo1Nzo1MloAAAYgnroNha1hSftChtxHGTnLEmM/pY8MeQS/jBSV0UNvXug="
api = fatcat_openapi_client.DefaultApi(fatcat_openapi_client.ApiClient(conf))

HAVE_ADMIN = False

def try_row(fields):
    # remove any extra whitespace
    fields = [f.strip() for f in fields]
    doi = fields[0]
    sha1 = fields[1]
    size = int(fields[2])
    url = fields[3]

    # check for existing file
    try:
        existing = api.lookup_file(sha1=sha1)
        print("File with SHA-1 exists: {}".format(sha1))
        return None
    except ApiException as ae:
        if ae.status != 404:
            raise ae

    # lookup release by DOI
    try:
        release = api.lookup_release(doi=doi)
    except ApiException as ae:
        if ae.status == 404:
            print("No existing release for DOI: {}".format(doi))
            return None
        else:
            raise ae

    fe = FileEntity(
        release_ids=[release.ident],
        sha1=sha1,
        size=size,
        urls=[FileUrl(rel="archive", url=url)],
    )
    return fe

def run(input_file):
    file_entities = []
    for line in input_file:
        fe = try_row(line.split('\t'))
        if fe:
            file_entities.append(fe)
    if not file_entities:
        print("Tried all lines, nothing to do!")
        return

    # TODO: iterate over fixed-size batches
    first_batch = file_entities[:100]

    # easy way: create as a batch if you have permission
    if HAVE_ADMIN:
        # note: assumes the file auto-batch endpoint (create_file_auto_batch / FileAutoBatch)
        editgroup = api.create_file_auto_batch(
            fatcat_openapi_client.FileAutoBatch(
                editgroup=fatcat_openapi_client.Editgroup(
                    description="my test batch",
                ),
                entity_list=first_batch,
            )
        )
        return

    # longer way: create one-at-a-time
    editgroup = api.create_editgroup(Editgroup(
        description="batch import of files-by-DOI. Data from XYZ",
        extra={
            # put the name of your script/project here
            'agent': 'tutorial_example_script',
        },
    ))

    for fe in first_batch:
        edit = api.create_file(editgroup.editgroup_id, fe)

    # submit for review
    api.update_editgroup(editgroup.editgroup_id, editgroup, submit=True)
    print("Submitted editgroup: https://fatcat.wiki/editgroup/{}".format(editgroup.editgroup_id))

    print("Done!")

if __name__=='__main__':
    if len(sys.argv) != 2:
        print("Pass input TSV file as argument")
        sys.exit(-1)
    with open(sys.argv[1], 'r') as input_file:
        run(input_file)


Contributing

Our aspiration is for this to be an open, collaborative project, with individuals and organizations of all sizes able to participate. There is not much structure or documentation on how volunteers can get started or be most helpful, but perhaps we can work together on that as well!

The best place to organize and coordinate right now is the gitter chatroom. Gitter is described as "for developers", but we use it for everybody, and you don't need an invitation.

Want to help out? Below are a few example roles you could play.

Anybody: Find Bugs, Suggest Improvements

The user sign-up and editing workflow on fatcat.wiki is currently pretty poor. How could this experience be improved and better documented? Specific ideas, suggestions and diagrams would be very helpful. You don't need to know how to program or about web technologies to contribute; hand drawings and example text can be sufficient.

Community Organizer: Partner and Volunteer Organizing

Are you passionate about Open Access and want to help build a community around preservation and universal access to knowledge? We could use help structuring an editing community, and communicating with partner projects like Wikidata to ensure we are not duplicating efforts.

A good example of a project to organize would be improving journal-level metadata in wikidata, including journal homepages, and linking to fatcat "container" entities.

Research Librarian: Identify Missing Content

If you have an interest in a specific scholarly field, you could give us feedback on how good of a job fatcat is doing preserving at-risk open access content. We know we have a lot of work to do, but both specific examples of missing publications and broader patterns of gaps are helpful to know about. Some missing content we know we don't have, but there are surely entire categories of in-scope content that we do not even know are missing!

Metadata Librarian: Schema Improvements

Are you an experienced wrangler of BibFrame, MARC, BibTeX, RDF, OAI-PMH, and Citation Style Language? Our data model and entity schemas are bespoke (sorry!) and designed to evolve over time. There might be related efforts and new controlled vocabularies we could adopt or align with, or small changes to the schema might enable new use cases. It could be as simple as identifying and prioritizing new external identifiers (PIDs) to allow. Let us know what we got right and what needs improvement!

Power Editor: Better Interfaces

Are you super experienced with data entry, editing, and corrections? Do you have ideas on how our interface could be improved, or what kinds of new interfaces and tools could be built to support effective editing? Our open API allows third-party interfaces to make edits on individuals' behalf, meaning new tools can be built for specific patterns of editing or user contribution.

Data Scientist: Wrangling and Visualization

We have hundreds of gigabytes of metadata to transform and normalize before importing, and already have a rich open dataset with millions of linked entities. Our elasticsearch analytics database has an open read-only endpoint (https://search.fatcat.wiki), which is used to power our coverage interface. What other interactive visualizations could be built? What tools should we be using to wrangle bibliographic metadata better and faster?

Author: Verify Metadata

Do you publish research documents, and want to ensure they are accessible to the broadest audience today and in the future? Like many academic search engines, you can add papers and link an author profile to specific publications. Unlike others, you can also ensure uploaded pre-prints and other open versions of your research are found and linked using the "save paper now" feature, and you can correct any errors made by publishers and bots.

Translation and Accessibility Advocate

Some of our web interfaces have existing internationalization infrastructure, and translations can be contributed directly.

Other projects need help getting translation infrastructure in place, and all of our projects could use review and recommendations for improvement by experts in web accessibility. For example, if you use a screen reader, feedback on which parts of our services are most difficult to use are very helpful.

Software Developer: Bot Wrangling

Fatcat is structured such that all changes to the catalog go through an open API. This includes human edits through the web interface, but the large majority of edits are made by bots. You could write a new bot to help...

  • review human edits (from the "reviewable" queue) to "lint" for typos, missing fields, or other problems, and then leave an annotation
  • harvest, transform, and import metadata from additional subject- and region-specific sources
  • find and clean-up patterns of poor or incorrect metadata already in the catalog

SQL Expert: Database Scaling

We have a large (500+ GByte) PostgreSQL database backing the catalog. This is working great so far, but we have concerns about how the catalog will scale further, especially if bots start making multiple updates per entity. You could review our SQL schema and recommend improvements, or give feedback and advice on how to switch to a distributed primary datastore.

Financial Supporter

Short on time? As a US 501(c)(3) non-profit, the Internet Archive always appreciates and makes good use of donations.

Software Contributions

Bugs and patches can be filed on Github at: https://github.com/internetarchive/fatcat

When considering making a non-trivial contribution, it can save review time and duplicated work to post an issue with your intentions and plan. New code and features must include unit tests before being merged, though we can help with writing them.

Editing Quickstart

This tutorial describes how to make edits to the Fatcat catalog using the web interface. We will add a new file to an existing release, then update the release to point at a different container. You can follow these directions on either the QA (NOTE: QA not available as of Spring 2021) or production public catalogs. You will:

  • create an editor account and log-in
  • create a new file entity
  • update an existing release entity
  • submit editgroup for review

First create an editor account and log-in. If you don't have an account with any of the existing federated log-in services (eg, Wikipedia, ORCID, Github), you can create a free Internet Archive account, confirm your email, and then log-in to Fatcat using that. You should see your username in the upper right-hand corner of every page when you are successfully logged in.

Next find the release's fatcat identifier for the paper we want to add a file to. You can search by title, or lookup a paper by an identifier (such as a DOI or arXiv ID). If the release you are looking for doesn't exist yet, you'll need to create a new one. All of these actions are linked from the Fatcat front page for each entity type.

The release fatcat identifier is the garbled looking string like hsmo6p4smrganpb3fndaj2lon4 which you can find under the title of the paper's entity page, and also in the URL. You'll need this identifier to link the file to the release.

Before creating a new file entity (or any entity for that matter), check that there isn't already an entity referencing the exact same file. Download the file (eg, PDF) that you want to add to your local computer, and calculate the SHA-1 hash of the file using a tool like sha1sum on the command line. If you aren't familiar with command line tools, you can upload to a free online service. The SHA-1 hash will look like de9aefc4522b385121e72faaee75bda9fbb8bf6e, and you can do a file lookup. If a file already exists, you could edit it to add new URLs (locations), or add/update any release links.
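
If you prefer python to command-line tools, a minimal way to compute the SHA-1 of a local file (the filename here is just a placeholder) is:

import hashlib

h = hashlib.sha1()
with open("paper.pdf", "rb") as f:
    # read in chunks so large files don't need to fit in memory
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)
print(h.hexdigest())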

Assuming a file entity doesn't already exist, go to create file. We will want to start a new "editgroup" for these changes. If you don't have any editgroups in progress, you can just enter a description sentence and a new one will be created; if you did have edits in progress, you'll need to select the "create new editgroup" option from the drop-down of your existing editgroups.

Enter the basic file metadata in the fields provided. The red-starred fields are required (size in bytes and SHA-1). Add a URL on the public web where the file can be found. It's best if PDFs are uploaded to repositories (eg, Zenodo) or hosted on the publisher's website. A second archival location can be added (eg, using the Wayback Machine's "save page now" feature), or you could skip this and wait for a bot to verify and archive the URL later. The left drop-down menu lets you set the "type" of each URL. Add the release identifier you found earlier to the "Releases" list.

Add a one-sentence description of your change, and submit the form. You will be redirected to a provisional ("work in progress") view of the new entity. Edits are not immediately merged into the catalog proper; they first need to be "submitted" and then accepted (eg, by a human moderator or robot).

Let's add a second edit to the same editgroup before continuing. The new file view should have a link to the release entity; follow that link, then click the "edit" button (either the tab or the blue link at the bottom of the infobox). This time, the most recent editgroup should already be selected, so you don't need to enter a description at the top. If there are any problems with basic metadata, go ahead and fix them, but otherwise skip down to the "Container" section and update the fatcat identifier ("FCID") to point to the correct journal. You can lookup journals by ISSN-L, or search by title. Add a short description of your change ("Updated journal to XYZ") and then submit.

You now have two edits in your editgroup. There should be links to the editgroup itself from the "work-in-progress" pages, or you can find all your editgroups from the drop-down link in the upper right-hand corner of every page (your username, then "Edit History"). The editgroup page shows all the entities created, updated, or deleted, and allows you to make tweaks (re-edit) or remove changes. If the release/container update you made was bogus (just as a learning exercise), you could remove it here. It's a good practice to group related edits into the same editgroup, but only up to 50 or so edits at a time (more than that becomes difficult to review).

If things look good, click the "submit" button on the editgroup page. This will mark your changes as "ready for review", and they will show up on the global reviewable editgroups list. If you change your mind, you can "unsubmit" the editgroup and make more changes. Humans and bots can make annotations to editgroups, recommending changes. At the current time there are no email or other update notifications, so you need to check in on annotations and other status manually.

When your changes have been reviewed, a moderator will "accept" them, and the entities will be updated in the catalog. Every accepted editgroup ends up in the changelog.

And then you're done, thanks for your contribution!

Norms and Policies

These social norms are explicitly expected to evolve and mature if the number of contributors to the project grows. It is important to have some policies as a starting point, but also important not to set these policies in stone until they have been reviewed.

See also the Code of Conduct and Privacy Policy.

Metadata Licensing

The Fatcat catalog content license is the Creative Commons Zero ("CC-0") license, which is effectively a public domain grant. This applies to the catalog metadata itself (titles, entity relationships, citation metadata, URLs, hashes, identifiers), as well as "meta-meta-data" provided by editors (edit descriptions, provenance metadata, etc).

The core catalog is designed to contain only factual information: "this work, known by this title and with these third-party identifiers, is believed to be represented by these files and published under such-and-such venue". As a norm, sourcing metadata (for attribution and provenance) is retained for each edit made to the catalog.

A notable exception to this policy is abstracts, for which no copyright claims or license is made. Abstract content is kept separate from core catalog metadata; downstream users need to make their own decision regarding reuse and distribution of this material.

As a social norm, it is expected (and appreciated!) that downstream users of the public API and/or bulk exports provide attribution, and even transitive attribution (acknowledging the original source of metadata contributed to Fatcat). As an academic norm, researchers are encouraged to cite the corpus as a dataset (when this option becomes available). However, neither of these norms are enforced via the copyright mechanism.

As a strong norm, editors should expect full access to the complete corpus and edit history, including all of their own contributions.

Immutable History

All editors agree to the licensing terms, and understand that their full public history of contributions is made irrevocably public. Edits and contributions may be reverted, but the history (and content) of their edits are retained. Edit history is not removed from the corpus on the request of an editor or when an editor closes their account.

In an emergency situation, such as non-bibliographic content getting encoded in the corpus by bypassing normal filters (eg, base64 encoding hate crime content or exploitative photos, as has happened to some blockchain projects), the ecosystem may decide to collectively, in a coordinated manner, expunge specific records from their history.

Documentation Licensing

This guide ("The Fatcat Guide") is licensed under the Creative Commons Attribution license.

Software Licensing

The Fatcat software project licensing policy is to adopt strong copyleft licenses for server software (where the majority of software development takes place), permissive licenses for client library and bot framework software, and CC-0 (public grant) licensing for declarative interface specifications (such as SQL schemas and REST API specifications).

Fatcat Code of Conduct

In this early stage of the project, this document is a work in progress. In particular there is no moderation team or policy for responding to concerns in online discussions. However, it is important to clarify norms and expectations as early as possible.

To contact the Internet Archive privately about conduct concerns or to report unacceptable behavior, you can email ethics@archive.org.

Overview

  • We are committed to providing a friendly, safe and welcoming environment for all, regardless of level of experience, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, ethnicity, age, religion, nationality, or other similar characteristic.

  • Please avoid using overtly sexual aliases or other nicknames that might detract from a friendly, safe and welcoming environment for all.

  • Please be kind and courteous. There’s no need to be mean or rude.

  • Respect that people have differences of opinion and that every design or implementation choice carries a trade-off and numerous costs. There is seldom a right answer.

  • Please keep unstructured critique to a minimum. If you have solid ideas you want to experiment with, make a fork and see how it works.

  • This Code of Conduct applies to all online project spaces (including the catalog itself, code repositories, mailing lists, chat rooms, forums, and comment threads), as well as any physical spaces such as conference gatherings or meetups.

  • All participants are expected to respect this code, regardless of their position or record of contributions to the project.

Unacceptable behavior

The following types of behavior are unacceptable in the Fatcat project, both online and in-person, and constitute code of conduct violations.

Abusive behavior

  • Harassment: including offensive verbal comments related to gender, sexual orientation, disability, physical appearance, body size, race, or religion, as well as sexual images in public spaces, deliberate intimidation, stalking, following, harassing photography or recording, inappropriate physical contact, and unwelcome sexual or romantic attention.

  • Threats: threatening someone physically or verbally. For example, threatening to publicize sensitive information about someone’s personal life.

Unwelcoming behavior

  • Blatant-isms: saying things that are explicitly racist, sexist, homophobic, etc. For example, arguing that some people are less intelligent because of their gender, race or religion. Subtle -isms and small mistakes made in conversation are not code of conduct violations. However, repeating something after it has been pointed out to you that you broke a social rule, or antagonizing or arguing with someone who has pointed out your subtle -ism is considered unwelcoming behavior, and is not allowed in the project.

  • Maliciousness towards other participants: deliberately attempting to make others feel bad, name-calling, singling out others for derision or exclusion. For example, telling someone they’re not a real programmer or that they don’t belong in the project.

  • Being especially unpleasant: for example, if multiple community members report annoying, rude, or especially distracting behavior.

  • Spamming, trolling, flaming, baiting or other attention-stealing behavior is not welcome.

About This Document

The Fatcat Code of Conduct is inspired by, and derived from, the codes of conduct of other open source and open knowledge communities.

Privacy Policy

It is important to note that this section is currently aspirational: the servers hosting early deployments of Fatcat are largely in a default configuration and have not been audited to ensure that these guidelines are being followed.

It is a goal for Fatcat to conduct as little surveillance of reader and editor behavior and activities as possible. In practical terms, this means minimizing the overall amount of logging and collection of identifying information. This is in contrast to submitted edit content, which is captured, preserved, and republished as widely as possible.

The general intention is to:

  • not use third-party tracking (via external browser-side requests or javascript)
  • collect aggregate metrics (overall hit numbers), but not log individual interactions ("this IP visited this page at this time")

Exceptions will likely be made:

  • temporary caching of IP addresses may be necessary to implement rate-limiting and debug traffic spikes
  • exception logging, abuse detection, and other exceptional situations

Some uncertain areas of privacy include:

  • should third-party authentication identities be linked to editor ids? what about the specific case of ORCID if used for login?
  • what about discussion and comments on edits? should conversations be included in full history dumps? should editors be allowed to update or remove comments?

For Publishers

This page addresses common questions and concerns from publishers of research works indexed in Fatcat, as well as the Internet Archive Scholar service built on top of it. The For Authors page has some information on updates and metadata corrections that is also relevant to publishers.

For help in exceptional cases, contact Internet Archive through our usual support channels.

Metadata Indexing

Many publishers will find that metadata records are already included in fatcat if they register persistent identifiers for their research works. This pipeline is based on our automated harvesting of DOI, Pubmed, dblp, DOAJ, and other metadata catalogs. This process can take some time (eg, days from registration), does not (yet) cover all persistent identifiers, and will only cover those works which get identifiers.

For publishers who find that they are not getting indexed in fatcat, our primary advice is to register ISSNs for venues (journals, repositories, conferences, etc), and to register DOIs for all current and back-catalog works. DOIs are the most common and integrated identifier in the scholarly ecosystem, and will result in automatic indexing in many other aggregators in addition to fatcat/scholar. There may be funding or resources available for smaller publishers to cover the cost of DOI registration, and ISSN registration is usually no-cost or affordable through national institutions.

We do not recommend that journal or conference publishers use general-purpose repositories like Zenodo to obtain no-cost DOIs for journal articles. These platforms are a great place for pre-publication versions, datasets, software, and other artifacts, but not for primary publication-version works (in our opinion).

If DOI registration is not possible, one good alternative is to get included in the Directory of Open Access Journals and deposit article metadata there. This process may take some time, but is a good basic indicator of publication quality. DOAJ article metadata is periodically harvested and indexed in fatcat, after a de-duplication process.

Fatcat does not yet support OAI-PMH as an identifier and mechanism for automated journal ingest, but we likely will in the future. This would particularly help publishers using the Open Journal System (OJS). Fatcat also does not yet support crawling journal sites and extracting bibliographic metadata from HTML tags.

Lastly, publishers could use the fatcat catalog web interface or API to push metadata records about their works programmatically. We don't know of any publishers actually doing this today.
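For example, before pushing a record, a publisher could check whether a given DOI is already represented in the catalog. A rough sketch against the public API (the DOI value is hypothetical; actually creating or updating entities additionally requires an authentication token and an editgroup, as described in the REST API section of this guide):

```python
import requests

API = "https://api.fatcat.wiki/v0"

# Check whether a work is already indexed, by DOI (hypothetical value).
resp = requests.get(f"{API}/release/lookup", params={"doi": "10.1234/example.5678"})
if resp.status_code == 200:
    release = resp.json()
    print("already indexed as release", release["ident"])
else:
    # A new record would be created through an editgroup, using an auth token;
    # see the REST API documentation for the exact endpoints and schema.
    print("not indexed; a record could be submitted via the web interface or API")
```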

Improving Automatic Preservation

In alignment with its mission, Internet Archive makes basic automated attempts to capture and preserve all open access research publications on the public web, at no cost. This effort comes with no guarantees around completeness, timeliness, or support communications.

Preservation coverage can be monitored through the journal-specific dashboards or via the coverage search interface.

There are a few technical things publishers can do to increase their preservation coverage, in addition to the metadata indexing tips above:

  • use the citation_pdf_url HTML meta tag, when appropriate, to link directly from article landing pages to PDF URLs (see the example after this list)
  • use simple HTML to represent landing pages and article content, and do not require Javascript to render page content or links
  • ensure that hosting server robots.txt rules are not preventing or overly restricting automated crawling
  • use simple, accessible PDF access links. Do not use time-limited or IP-limited URLs, require specific referrer headers, or use cookies to authenticate access to OA PDFs
  • minimize the number of HTTP redirects and HTML hops between DOI and fulltext content
  • paywalls, loginwalls, geofencing, and anti-bot measures are all obviously antithetical to open crawling and indexing
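As an illustration of the first tip above, an article landing page's HTML `<head>` might include a tag along the lines of `<meta name="citation_pdf_url" content="https://journal.example.org/article/123/download.pdf">`, where the URL (a hypothetical example) points directly at the freely accessible PDF.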

Publishers are also free to submit "Save Paper Now" requests, or edit the catalog itself either manually or in bulk through the API. If an individual work persistently fails to ingest, try running a "Save Page Now" request first from web.archive.org and verify that the content is available through Wayback replay, then submit the "Save Paper Now" request again.

Official Preservation

Internet Archive is developing preservation services for scholarly content on the web. Contact us at webservices@archive.org for details.

Existing web archiving services offered to universities, national libraries, and other institutions may already be appropriate for some publications. Check if your affiliated institutions already have an Archive-IT account or other existing relationship with Internet Archive.

Small publishers using Open Journal System (OJS) should be aware of the PKP preservation project.

For Authors

This page addresses common questions and concerns from individual authors of works indexed in Fatcat, as well as the Internet Archive Scholar service built on top of it.

For help in exceptional cases, contact Internet Archive through our usual support channels.

Updating Works

A frequent request from authors is to remove outdated versions of works.

The philosophy of the catalog is to go beyond "the version of record" and instead collect "the record of versions". This means that drafts, manuscripts, working papers, and other alternative versions of works can be fully included and differentiated using metadata in the catalog. Even in the case of retractions, expressions of concern, or other serious issues with earlier versions, it is valuable to keep out-of-date versions in the catalog. Corrected or updated versions will generally be preferred and linked to publicly, for example on scholar.archive.org. Outright removing content reduces context and can result in additional confusion for readers and librarians.

Because of this, it is strongly preferred to add new, updated content instead of requesting the removal of old, out-of-date content. Depending on the situation, this could involve creating a new post-publication release entity with the date of update and a status of updated or retracted; or a new pre-publication release; or crawling an updated PDF and adding it to an existing release entity.
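As a rough sketch of what the first option might look like, here is the general shape of a new post-publication release entity, expressed as a Python dictionary. Field names follow the release schema as we understand it, all identifier values are hypothetical placeholders, and the exact fields used should be checked against the entity style guide:

```python
# Sketch of a new post-publication ("updated") release, grouped under the same
# work entity as the original version. All identifier values are placeholders.
updated_release = {
    "work_id": "aaaaaaaaaaaaaaaaaaaaaaaaaa",  # same work as the original release
    "title": "Example Article (updated version)",
    "release_type": "article-journal",
    "release_stage": "updated",
    "release_date": "2021-06-01",
    "ext_ids": {"doi": "10.1234/example.5678.v2"},
}

# Alternatively, a retraction of the original version could be represented by
# setting something like {"withdrawn_status": "retracted"} on that release.
```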

Correcting Metadata

Sometimes the bibliographic metadata in fatcat is incorrect, incomplete, or out of date. This is a particularly sensitive subject when it comes to representing information about individuals. While we aspire to automating metadata updates and improvements as much as possible, often a human touch is best.

Any person can contribute to the catalog directly by creating an account and submitting changes for review. This includes, but is not limited to, authors or a person acting on their behalf submitting corrections. The editing quickstart is a good place to start. Please remember that corrections are considered part of the public record of the catalog and will be preserved even if a contributor later deletes their account. Editor usernames can be changed at any time.

Fatcat is in some sense a non-authoritative catalog, which means it is usually best if corrections are made in "upstream" sources first, or at the same time as they are made in fatcat. For example, updating metadata in publisher databases, repositories, or ORCiD in addition to fatcat.

Name Changes

The preferred workflow for author name changes depends on the author's sensitivity to having prior names accessible and searchable.

If "also known as" behvior is desirable, contributor names on the release record should remain unchanged (matching what the publication at the time indicated), and a linked creator entity should include the currently-preferred name for display.

If "also known as" is not acceptable, and the work has already been updated in authoritative publication catalogs, then the contributor name can be updated on release records as well.

See also the creator style guide.

Author Relation Completeness

creator records are not always generated when importing release records; the current practice is to create and/or link them if there is ORCiD metadata linking specific authors to a published work.

This means that author/work linkage is often very incomplete or non-existent. At this time we would recommend using other services like dblp.org or openalex.org for more complete (but possibly less accurate) author/work metadata.

Resolving Publication Disputes

Authorship and publication ethics disputes should generally be resolved with the original publisher first, then updated in fatcat.

Presentations

2020: Workshop On Open Citations And Open Scholarly Metadata 2020 - Fatcat (video on archive.org)

2019-10-25: FORCE2019 - Perpetual Access Machines: Archiving Web-Published Scholarship at Scale (video on youtube.com)

Blog Posts And Press

2021-03-09: blog.archive.org - Search Scholarly Materials Preserved in the Internet Archive

2020-09-17: blog.dshr.org - Don't Say We Didn't Warn You

2020-09-15: blog.archive.org - How the Internet Archive is Ensuring Permanent Access to Open Access Journal Articles

2020-02-18: blog.dshr.org - The Scholarly Record At The Internet Archive

2019-04-18: blog.dshr.org - Personal Pods and Fatcat

2018-10-03: blog.dshr.org - Brief Talk At Internet Archive Event

2018-03-05: blog.archive.org - Andrew W. Mellon Foundation Awards Grant to the Internet Archive for Long Tail Journal Preservation

Background / Bibliography

Brainard, Jeffrey. “Dozens of Scientific Journals Have Vanished from the Internet, and No One Preserved Them.” Science | AAAS. Last modified September 8, 2020. Accessed August 6, 2021. https://www.sciencemag.org/news/2020/09/dozens-scientific-journals-have-vanished-internet-and-no-one-preserved-them.
Chen, Xiaotian. “Embargo, Tasini, and ‘Opted Out’: How Many Journal Articles Are Missing from Full-Text Databases.” Internet Reference Services Quarterly 7, no. 4 (September 2002): 23–34.
Eve, Martin Paul, and Jonathan Gray, eds. Reassembling Scholarly Communications: Histories, Infrastructures, and Global Politics of Open Access. Cambridge, Massachusetts: The MIT Press, 2020.
Ito, Joichi. “Citing Blogs.” Joi Ito’s Web (2018). Accessed March 11, 2019. https://joi.ito.com/weblog/2018/05/28/citing-blogs.html.
Karaganis, Joe, ed. Shadow Libraries: Access to Knowledge in Global Higher Education. Cambridge, MA : Ottawa, ON: The MIT Press ; International Development Research Centre, 2018.
Khabsa, Madian, and C. Lee Giles. “The Number of Scholarly Documents on the Public Web.” PLOS ONE 9, no. 5 (May 9, 2014): e93949.
Knoth, Petr, and Zdenek Zdrahal. “CORE: Three Access Levels to Underpin Open Access.” D-Lib Magazine 18, no. 11/12 (November 2012). Accessed March 11, 2019. http://www.dlib.org/dlib/november12/knoth/11knoth.html.
Kwon, Diana. “More than 100 Scientific Journals Have Disappeared from the Internet.” Nature (September 10, 2020). Accessed August 6, 2021. https://www.nature.com/articles/d41586-020-02610-z.
Laakso, Mikael, Lisa Matthias, and Najko Jahn. “Open Is Not Forever: A Study of Vanished Open Access Journals.” Journal of the Association for Information Science and Technology 72, no. 9 (September 2021): 1099–1112.
Ortega, Jose Luis. Academic Search Engines: New Information Trends and Services for Scientists on the Web. Chandos information professional series. Philadelphia, PA: Elsevier, 2014.
Page, Roderic. “Notes on Bibliographic Metadata in JSON.” Last modified July 12, 2017. Accessed March 11, 2019. https://github.com/rdmpage/bibliographic-metadata-json.
Pettifer, S., P. McDermott, J. Marsh, D. Thorne, A. Villeger, and T.K. Attwood. “Ceci n’est Pas Un Hamburger: Modelling and Representing the Scholarly Article.” Learned Publishing 24, no. 3 (July 2011): 207–220.
Piwowar, Heather, Jason Priem, Vincent Larivière, Juan Pablo Alperin, Lisa Matthias, Bree Norlander, Ashley Farley, Jevin West, and Stefanie Haustein. “The State of OA: A Large-Scale Analysis of the Prevalence and Impact of Open Access Articles.” PeerJ 6 (February 13, 2018): e4375.
Ramalho, Luciano G. “From ISIS to CouchDB: Databases and Data Models for Bibliographic Records.” The Code4Lib Journal, no. 13 (April 11, 2011). Accessed March 11, 2019. https://journal.code4lib.org/articles/4893.
rclark1. “DOI-like Strings and Fake DOIs.” Website. Crossref. Accessed March 11, 2019. https://www.crossref.org/blog/doi-like-strings-and-fake-dois/.
Svenonius, Elaine. The Intellectual Foundation of Information Organization. First MIT Press paperback ed. Digital libraries and electronic publishing. Cambridge, Mass.: MIT Press, 2009.
Van de Sompel, Herbert, Robert Sanderson, Martin Klein, Michael L. Nelson, Bernhard Haslhofer, Simeon Warner, and Carl Lagoze. “A Perspective on Resource Synchronization.” D-Lib Magazine 18, no. 9/10 (September 2012). Accessed March 11, 2019. http://www.dlib.org/dlib/september12/vandesompel/09vandesompel.html.
Wright, Alex. Cataloging the World: Paul Otlet and the Birth of the Information Age. Oxford ; New York: Oxford University Press, 2014.
“Citation Style Language.” Citation Style Language. Accessed March 11, 2019. https://citationstyles.org/.
“Open Archives Initiative Protocol for Metadata Harvesting.” Accessed March 11, 2019. https://www.openarchives.org/pmh/.

About This Guide

This guide is generated from markdown text files using the mdBook tool. The source is mirrored on GitHub at https://github.com/internetarchive/fatcat.

Contributions and corrections are welcome! If you create a (free) account on GitHub you can submit comments and corrections as "Issues", or directly edit the source and submit "Pull Requests" with changes.

This guide is licensed under a Creative Commons Attribution (CC-BY) license, meaning you are free to redistribute, sell, and extend it without special permission, as long as you credit the original authors.