PDF Documents: A Primer For Data Curators: Portable Document Format PDF
PDF Documents: A Primer for Data Curators
Overview
Format: Portable Document Format (PDF)
Extension: .pdf
Structure: 7-bit ASCII file that consists of a subset of PostScript for layout and graphics along with a
font-embedding/replacement system and a structured storage system bundling embedded elements
and associated content into one file.
Versions: Recent versions: PDF 1.7 (ISO 32000-1:2008) does not include Adobe Extensions; however, PDF 2.0
(ISO 32000-2:2017) is a fully inclusive, open technology (PDF Association, 2017).
Version 1.4 (2001): RC4 encryption key lengths of 40-128 bits, embedded FDF 5.0
files, accessibility features, XMP metadata streams, and
importing content from other PDF documents.
Primary fields or areas of use: Ubiquitous use
Source and affiliation: Versions 1.0-1.6 were proprietary, developed and managed by Adobe Systems, which added new features
from 1993 to 2006. Versions 1.7 and onward are open standards managed by ISO.
Table of Contents
Overview
Table of Contents
Primer for Data Curators: PDF Documents
Description of Format
Overview
Features
Standards, Specifications, and Subsets
Typical Purposes and Functions
Data Description
Reporting Related Methods and Results
Data Storage and Sharing
Software for Viewing or Analyzing Data
PDF CURATED Checklist
Additional Resources
References
Primer for Data Curators: PDF Documents
by Peace Ossom-Williamson, Nicole Contaxis, Margaret Lam, and Adam Kriesberg
Description of Format
Overview
The Portable Document Format (PDF), created by Adobe Systems, is currently the de facto standard for fixed-format
electronic documents (Johnson, 2014). The format was developed for, and primarily used in, desktop publishing because it
allows for reliable and consistent display and printing regardless of the computer opening the document (Adobe Systems,
n.d.a). It was not widely used at first; adoption grew with the addition of functionality such as external hyperlinks
and the release of freely available software (Adobe Reader version 2.0 and onward, which became Acrobat Reader) (“History of the
Portable Document Format (PDF)”, 2018). PDF documents may be created natively, converted from other electronic
formats, or digitized from paper, microform, or other hard-copy formats. Content from the originating files, including
text, graphics, and spreadsheets, remains integrated into the document. PDF documents typically
contain a combination of vector graphics, text, and bitmap graphics (Adobe Systems, Inc., 2008). Some may contain
multimedia objects and other content. Because the format is so widely used for publication, PDF documents hold a
considerable body of important information globally and have become common for publishing data and related
files.
Features
As a format, the PDF is preferred for document sharing and e-publishing due to the following features:
● preservation of document fidelity independent of the housing or viewing device or platform,
● merging of content from diverse sources and file types into one self-contained document while maintaining the
integrity of all original source documents,
● digital signatures to certify authenticity,
● security and permissions to allow the creator to retain control of the document and associated rights,
● accessibility of content to those with disabilities,
● extraction and reuse of content for use with other file formats and applications, and
● electronic forms to gather data and integrate with business systems.
List adapted from https://www.adobe.com/content/dam/acom/en/devnet/pdf/PDF32000_2008.pdf, p. vii.
Subsets of the PDF standard include:
● PDF for Archive (PDF/A)
● PDF for Exchange (PDF/X)
● PDF for Engineering (PDF/E)
● PDF for Universal Access (PDF/UA)
● PDF for Healthcare (PDF/H)
● PDF for Variable and Transactional Printing (PDF/VT)
The preferred subset of the PDF format for preservation is PDF/A. PDF/A is designed
for long-term preservation and archiving and excludes features that would make the format unsuitable for
preservation, such as encryption, embedded audio or video objects, and copyrighted
fonts (Arms & Fleischhauer, 2019).
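A curator can run a rough first pass over a deposited file to see which version it declares, whether it claims PDF/A conformance, and whether it mentions encryption. The sketch below is a minimal illustration using only the Python standard library. The function names are hypothetical. It assumes that the PDF/A claim appears in uncompressed XMP metadata as a `pdfaid:part` entry and that encryption shows up as a literal `/Encrypt` key. It reports what a file claims, not whether the file actually conforms; a dedicated PDF/A validator is still needed for that.

```python
import re


def inspect_pdf_bytes(data: bytes) -> dict:
    """Collect surface-level preservation signals from raw PDF bytes."""
    # The header comment near the start of the file states the version, e.g. %PDF-1.7.
    header = re.search(rb"%PDF-(\d\.\d)", data[:1024])
    # PDF/A files declare their part in XMP metadata, either as an element
    # (<pdfaid:part>1</pdfaid:part>) or as an attribute (pdfaid:part="1").
    part = re.search(rb"pdfaid:part\W{1,3}(\d)", data)
    level = re.search(rb"pdfaid:conformance\W{1,3}([A-Za-z])", data)
    return {
        "header_version": header.group(1).decode("ascii") if header else None,
        "claimed_pdfa_part": part.group(1).decode("ascii") if part else None,
        "claimed_pdfa_level": level.group(1).decode("ascii").upper() if level else None,
        # An /Encrypt key signals encryption, which PDF/A does not allow.
        # Assumption: a plain byte search is good enough for a first pass.
        "mentions_encrypt": b"/Encrypt" in data,
    }


def inspect_pdf_file(path: str) -> dict:
    """Read a file from disk and inspect its bytes."""
    with open(path, "rb") as handle:
        return inspect_pdf_bytes(handle.read())


if __name__ == "__main__":
    # Tiny hand-made fragment (not a complete PDF) used only to exercise the checks.
    sample = (
        b"%PDF-1.4\n"
        b"<pdfaid:part>1</pdfaid:part><pdfaid:conformance>B</pdfaid:conformance>\n"
        b"%%EOF\n"
    )
    print(inspect_pdf_bytes(sample))
```

A file that reports a PDF/A part together with `mentions_encrypt` set to true is worth a closer look, since the two should not occur together.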
Data Description
Data description documents provide additional details about the data so that other users can understand how the data
were collected, defined, and structured, and how this dataset relates to other data. Data description
documents include data dictionaries, codebooks, and survey instruments. An easily editable version (e.g., TXT) is
recommended, along with a duplicate in PDF formatted for ease of reference. Machine-actionable files are
recommended to allow for reproducing and adapting surveys and codebooks. (See the “Reporting Related Methods and
Results” section for these uses.)
Recommended Components
● Creator(s) names, contact information, and affiliation.
● Description of the project.
● Description of the data files, with each file name listed along with its description and how the datasets/databases
relate to one another and to other existing datasets/databases.
● Specific definitions of all abbreviations, measurements, and any detail necessary for interpretation.
● For all the variables, the exact name as it appears in the dataset or database, its full description, data type, and
acceptable and null values.
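The variable-level component above can be drafted automatically and then completed by hand. The sketch below is one possible approach, not a prescribed method. It reads a CSV file's header and values, guesses a coarse data type for each variable, and records which null tokens were observed. The function names and the `NULL_TOKENS` set are assumptions made for illustration, and the descriptions are left blank because only the data creator can write them.

```python
import csv
import io

# Assumption: tokens treated as null; adjust to the conventions of the dataset.
NULL_TOKENS = {"", "NA", "N/A", "NULL"}


def is_null(value):
    """Treat missing cells and known null tokens as null."""
    return value is None or value.strip().upper() in NULL_TOKENS


def infer_type(values):
    """Guess a coarse data type from the non-null values of one column."""
    if not values:
        return "unknown"
    for caster, label in ((int, "integer"), (float, "decimal")):
        try:
            for value in values:
                caster(value)
            return label
        except ValueError:
            continue
    return "text"


def draft_data_dictionary(csv_text):
    """Return one draft entry per variable, in the order of the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    entries = []
    for name in reader.fieldnames or []:
        column = [row[name] for row in rows]
        entries.append({
            "variable": name,  # exact name as it appears in the dataset
            "description": "",  # to be written by the data creator
            "data_type": infer_type([v for v in column if not is_null(v)]),
            "null_values": sorted(
                {v.strip() for v in column if v is not None and is_null(v)}
            ),
        })
    return entries


def render(entries):
    """Format the draft as plain text, one variable per line."""
    lines = []
    for entry in entries:
        nulls = ", ".join(repr(n) for n in entry["null_values"]) or "none observed"
        description = entry["description"] or "TO BE WRITTEN"
        lines.append(
            f"{entry['variable']}: type={entry['data_type']}; "
            f"null values={nulls}; description={description}"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical three-row dataset used only to show the output shape.
    print(render(draft_data_dictionary("id,age,site\n1,34,A\n2,NA,B\n3,29.5,A\n")))
```

The inferred types and null tokens are only a starting point; acceptable value ranges and full descriptions still have to come from the people who collected the data.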
Common Types
● Codebooks and Data Dictionaries:
○ Additional Recommendations: www.dataone.org/best-practices/create-data-dictionary
○ Example: American Time Use Survey data dictionaries - www.bls.gov/tus/dictionaries.htm
○ Blank Template: data.nal.usda.gov/data-dictionary-blank-template
● README Files:
○ Additional Recommendations: data.research.cornell.edu/content/readme
○ Example: Implicit Association Test README - osf.io/s27xd
○ Blank Template: cornell.app.box.com/v/ReadmeTemplate
● Survey Instruments - usually exported from the program
○ Additional Recommendations:
res.mdpi.com/data/data-03-00045/article_deploy/data-03-00045-v2.pdf?filename=&attachment=1
(Section 3.1.3. Survey Instrument)
○ Example: Child Care Market Rate Survey instrument - doi.org/10.3886/ICPSR23262.v2
(Questionnaire.pdf)
Software for Viewing or Analyzing Data
Adobe offers the programs most commonly used with PDF documents: Acrobat Reader (only for viewing and signing
PDF documents), Acrobat Pro, InDesign, Illustrator, and Photoshop. However, since PDF is an open format,
thousands of other software tools can be used. Adobe Systems holds the PDF patents but licenses them for royalty-free use in
PDF software development.
The table below lists other commonly used tools:
Software | Uses* | Notes
PDFBox (Apache) | converting | not available for Mac OS; converts PDF documents to text, images, HTML, and other file types
pdf-parser (public domain) | reading, analyzing | extraction and analysis tool; handles corrupt and malicious PDF documents
Mozilla Firefox (MPL 2.0) | reading | built-in PDF document viewer (PDF.js) in web browser
Nitro PDF Reader (freeware) | reading, creating, converting | allows for limited editing (text highlighting, drawing lines, and measuring distances); extracts images from PDF documents
pdftk (GPL) | reading, creating, editing, analyzing, converting | command-line tools for manipulating PDF documents and filling PDF forms with FDF/XFDF data
Mobile applications allowing for reading PDF documents include Amazon Kindle app, Google Drive app, iBooks, and Hancom
Office Editor. Web tools include Smallpdf (conversion); PDFVue, A.nnotate, DigiSigner (reading, annotating, filling out forms,
signing); and Docstoc, Issuu, PDF.js, and PDFTron Systems (reading).
* Uses Definitions: creating - saving a document as .pdf, editing - editing a document that began as a .pdf and saving it as .pdf,
reading - opening and viewing .pdf files, converting - converting content from a .pdf file to another type, analyzing - analyzing
content in .pdf files.
Information on this page adapted from https://en.wikipedia.org/wiki/List_of_PDF_software
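Command-line tools such as pdftk can be scripted when many files need the same treatment. The sketch below is a hedged illustration and not part of the original primer. It builds argument lists for two pdftk operations, `dump_data` (report a document's metadata) and `fill_form` (fill a form from FDF/XFDF data), and runs them only if the executable is found on the system path. The helper names and file names are hypothetical.

```python
import shutil
import subprocess


def dump_data_command(pdf_path):
    """Arguments that ask pdftk to report a document's metadata on stdout."""
    return ["pdftk", pdf_path, "dump_data"]


def fill_form_command(form_pdf, data_file, output_pdf):
    """Arguments that ask pdftk to fill a PDF form from an FDF or XFDF file."""
    return ["pdftk", form_pdf, "fill_form", data_file, "output", output_pdf]


def run_if_available(command):
    """Run the command if its executable is installed; return stdout or None."""
    if shutil.which(command[0]) is None:
        return None  # tool not installed; nothing is executed
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Hypothetical file name; replace with a real deposit before running.
    report = run_if_available(dump_data_command("codebook.pdf"))
    print(report if report is not None else "pdftk is not installed")
```

Keeping command construction separate from execution makes it easy to log exactly what was run, which helps when documenting curation steps.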
PDF CURATED Checklist
The following checklist is adapted from the original Data Curation Network (2018) checklist to assist curators when
encountering diverse formats of digital objects. The modified version below includes considerations for the PDF
document in order to guide you when working with PDF files provided by research stakeholders in your organization. It
includes questions to ask and steps to take which will help ensure that documents meet expectations according to The
FAIR Data Principles and data curation best practices before bringing them into a curatorial or preservation environment.
First, determine the purpose of the PDF files in the larger research workflow and whether the
information in them would be more useful in a machine-readable format. If PDF is appropriate (or there is a duplicate in
another file type), refer to the steps below:
○ Accessible
❑ Retrievable via a standard protocol (e.g., HTTP).
❑ Free, open (e.g., download link).
❑ Embedded files comply with requirements of PDF/A-compliant attachments.
○ Interoperable
❑ Metadata formatted in a standard schema (e.g., Dublin Core).
❑ Metadata provided in a machine-readable format (e.g., an OAI feed).
○ Reusable
❑ Data include sufficient metadata about the data characteristics to reuse.
❑ Contact information displayed in case the author's direct assistance is needed.
❑ Clear indicators of who created, owns, and stewards the data.
❑ Data are released with clear data usage terms (e.g., a CC License).
● Document throughout curation activities
○ Accessioning & deposit records
❑ Names, dates, contact information, submission agreements, etc.
○ Repository collection metadata
○ Provenance logs
○ Service workflow
○ Preservation packaging
○ Any additional requirements at your institution
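For the Interoperable items in the checklist above, descriptive metadata for a PDF can be expressed in Dublin Core and serialized as XML so that machines can read it. The sketch below is a minimal example using the Python standard library. The `metadata` wrapper element and the function name are assumptions chosen for illustration; a repository will normally dictate its own wrapper and required elements. The rights value in the usage example is a hypothetical placeholder.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
# The fifteen elements of the simple Dublin Core element set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Serialize a dict of Dublin Core fields to an XML string.

    Values may be a single string or a list of strings (repeated elements).
    """
    root = ET.Element("metadata")  # assumption: generic wrapper element
    for name, value in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        for item in (value if isinstance(value, list) else [value]):
            ET.SubElement(root, f"{{{DC_NS}}}{name}").text = item
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(dublin_core_record({
        "title": "Primer for Data Curators: PDF Documents",
        "creator": [
            "Peace Ossom-Williamson", "Nicole Contaxis",
            "Margaret Lam", "Adam Kriesberg",
        ],
        "format": "application/pdf",
        "rights": "CC BY 4.0",  # hypothetical placeholder, not taken from the primer
    }))
```

Rejecting unknown element names early keeps typos such as "author" from silently producing metadata that no harvester will recognize.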
Additional Resources
1. The FAIR Data Principles (2016). Available at force11.org/group/fairgroup/fairprinciples
2. PDF/A Family, PDF for Long-term Preservation (2019). Available at
www.loc.gov/preservation/digital/formats/fdd/fdd000318.shtml
References
Adobe Systems, Inc. (n.d.a). About Adobe PDF. Retrieved from
https://acrobat.adobe.com/us/en/acrobat/about-adobe-pdf.html
Adobe Systems, Inc. (n.d.b). PDF Reference and Adobe Extensions to the PDF Specification. Retrieved from
https://www.adobe.com/devnet/pdf/pdf_reference.html
Adobe Systems, Inc. (2008). Document management - Portable document format - Part 1: PDF 1.7. Retrieved from
https://www.adobe.com/content/dam/acom/en/devnet/pdf/PDF32000_2008.pdf
Arms, C. R., & Fleischhauer, C. (2019). PDF/A-3, PDF for long-term preservation, use of ISO 32000-1, with
embedded files. Retrieved from https://www.loc.gov/preservation/digital/formats/fdd/fdd000360.shtml
Data Curation Network. (2018). Checklist of CURATED steps. Retrieved from
https://datacurationnetwork.org/resources-2
Dunning, A., de Smaele, M., & Böhmer, J. (2017). Are the FAIR Data Principles fair? International Journal of
Digital Curation, 12(2). Retrieved from https://doi.org/10.2218/ijdc.v12i2.567
History of the Portable Document Format (PDF). (2018, December 16). In Wikipedia. Retrieved from
https://en.wikipedia.org/w/index.php?title=History_of_the_Portable_Document_Format_(PDF)&oldid=873991566
Janée, G., Sawchuk, S., & Yoo, H. J. (2019). Microsoft Excel data curation primer. Data Curation Network Primers,
1. Retrieved from https://conservancy.umn.edu/handle/11299/202816
Johnson, D. (2014, February 17). The 8 most popular document formats on the web [Blog post]. Retrieved from
http://duff-johnson.com/2014/02/17/the-8-most-popular-document-formats-on-the-web
Johnston, L. (Ed.). (2017). Curating research data. Volume one: Practical strategies for your digital repository.
Chicago, Illinois: Association of College and Research Libraries, a division of the American Library
Association.
PDF Association. (2017, July 31). ISO 32000-2 (PDF 2.0). Retrieved from
https://www.pdfa.org/resource/iso-32000-2-pdf-2-0
Rivers, C. (2018, April 8). “Send me your data: PDF is fine” said no one ever (how to share your data effectively)
[Blog post]. Retrieved from
http://www.caitlinrivers.com/blog/send-me-your-data-pdf-is-fine-said-no-one-ever-how-to-share-your-data-effectively