
PDF Documents: A Primer For Data Curators: Portable Document Format PDF


 

 
 
PDF Documents: A Primer for Data Curators 

Overview 
Portable Document Format (PDF) 
 
Extension  .pdf 

MIME Type  application/pdf, application/x-pdf, application/acrobat, application/vnd.pdf, application/x-bzpdf,
application/x-gzpdf, text/pdf, text/x-pdf

Structure  7-bit ASCII file that consists of a subset of PostScript for layout and graphics along with a 
font-embedding/replacement system and a structured storage system bundling embedded elements 
and associated content into one file. 

Versions  Recent versions: PDF 1.7 (ISO 32000-1:2008) does not include Adobe Extensions; however, PDF 2.0 
(ISO 32000-2:2017) is a fully inclusive and open technology (PDF Association, 2017).

Notable Past Versions 

PDF Version (Year)  Significant Features Added  Acrobat Reader Version (No.)

1.0 (1993)  Hyperlinks, bookmarks  Carousel

1.2 (1996)  Interactive page elements (radio buttons, checkboxes), AcroForm and FDF  3.0

1.3 (2000)  Digital signatures; capture, conversion, and mapping functionality  4.0

1.4 (2001)  RC4 encryption key lengths 40-128 bits, embedded FDF files, accessibility features, XMP metadata streams, importing content from other PDF documents  5.0

1.5  XML FDF (XFDF)  6.0

1.6  OpenType font embedding, cross-document linking  7.0

Information adapted from https://en.wikipedia.org/wiki/History_of_the_Portable_Document_Format_(PDF)


Primary fields or areas of use  Ubiquitous use 

Source and affiliation  Versions 1.0-1.6 were proprietary, developed and managed by Adobe Systems, which added new
features from 1993 to 2006. Versions 1.7 and onward are open standards managed by ISO.  

Date created  October 29, 2019 

Created by  Peace Ossom-Williamson (peace@uta.edu), Nicole Contaxis (nicole.contaxis@nyulangone.org), 
Margaret Lam (mlam3@gmu.edu), Adam Kriesberg (akriesberg@gmail.com) 

Mentor: Jake Carlson (jakecar@umich.edu)  

Date updated and summary of changes made  

Suggested Citation: Ossom-Williamson, Peace, Contaxis, Nicole, Lam, Margaret, Kriesberg, Adam. 
(2019). PDF Data Curation Primer. Retrieved from the University of Minnesota Digital 
Conservancy. http://hdl.handle.net/11299/210210.  
 
This work was created as part of the “Specialized Data Curation” Workshop #2 held at Johns Hopkins 
University in Baltimore, MD on April 17-18, 2019. These workshops have been generously funded by 
the Institute of Museum and Library Services # RE-85-18-0040-18. See more primers at 
https://datacurationnetwork.org/.  
   


Table of Contents 
 
Overview 
Table of Contents 
Primer for Data Curators: PDF Documents 
Description of Format 
Overview 
Features 
Standards, Specifications, and Subsets 
Typical Purposes and Functions 
Data Description 
Reporting Related Methods and Results 
Data Storage and Sharing 
Software for Viewing or Analyzing Data 
PDF CURATED Checklist 
Additional Resources 
References 
 
 

   


Primer for Data Curators: PDF Documents 
by Peace Ossom-Williamson, Nicole Contaxis, Margaret Lam, and Adam Kriesberg 
 

Description of Format 
 

Overview 
The Portable Document Format (PDF) created by Adobe Systems is currently the de facto standard for fixed-format 
electronic documents (Johnson, 2014). This format was developed and primarily used for desktop publishing because it 
allows for reliable and consistent display and printing, regardless of the computer opening the document (Adobe Systems, 
n.d.a). The format was not widely used at first; adoption grew with the integration of increased functionality (external 
hyperlinks) and freely available software (Adobe Reader version 2.0 and onward, which became Acrobat Reader) (“History 
of the Portable Document Format (PDF)”, 2018). PDF documents may be created natively, converted from other electronic 
formats, or digitized from paper, microform, or other hard copy formats. The content of the originating files, including 
text, graphics, and spreadsheets, remains integrated into the document. PDF documents typically 
contain a combination of vector graphics, text, and bitmap graphics (Adobe Systems, Inc., 2008). Some may contain 
multimedia objects and other content. As a highly-used document publication format, PDF documents represent 
considerable bodies of important information globally and have become commonly used for publishing data and related 
files. 

Features 
As a format, the PDF is preferred for document sharing and e-publishing due to the following features: 
● preservation of document fidelity independent of the housing or viewing device or platform, 
● merging of content from diverse sources and file types into one self-contained document while maintaining the 
integrity of all original source documents, 
● digital signatures to certify authenticity, 
● security and permissions to allow the creator to retain control of the document and associated rights, 
● accessibility of content to those with disabilities, 
● extraction and reuse of content for use with other file formats and applications, and 
● electronic forms to gather data and integrate with business systems. 
 
List adapted from https://www.adobe.com/content/dam/acom/en/devnet/pdf/PDF32000_2008.pdf, p. vii. 
 

Standards, Specifications, and Subsets 


PDF versions, beginning with 1.7, are published by the International Organization for Standardization (ISO), with Adobe 
as one of the technical committee members. Full-function PDF documents conforming to ISO 32000-1 carry the PDF 
version number 1.7 (Adobe Systems, Inc., n.d.b). PDF standards published under ISO specifications are backward 
inclusive; therefore, the PDF 1.7 specification includes the functionality of versions 1.0 through 1.6, although some 
features are marked as deprecated. Features that Adobe removed from its own specification are not contained in ISO 
32000-1, and versions beginning with ISO 32000-2 (PDF 2.0) no longer include proprietary 
functionality. PDF documents conforming to ISO 32000-2 are known as “PDF 2.0 documents” or “PDF-2.0” (“History 
of the Portable Document Format (PDF)”, 2018).  
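
A PDF file declares its version in a header comment at the start of the file (for example, %PDF-1.7 or %PDF-2.0), so a curator can make a first guess at the version without opening a viewer. The following is a minimal sketch in Python rather than a validator. The function names are hypothetical, and the header value can be overridden by a Version entry in the document catalog (PDF 1.4 and later), which this sketch does not inspect.

```python
import re
from typing import Optional

# The header comment "%PDF-M.m" is expected at the start of the file.
# Some viewers tolerate a few leading bytes, so only the first 1024
# bytes are scanned here (an assumption of this sketch).
_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def read_pdf_header_version(data: bytes) -> Optional[str]:
    """Return the version declared in the header, e.g. '1.7', or None."""
    match = _HEADER.search(data[:1024])
    if match is None:
        return None
    return f"{match.group(1).decode()}.{match.group(2).decode()}"


def read_pdf_header_version_from_path(path: str) -> Optional[str]:
    """Convenience wrapper (hypothetical name) that reads only the file start."""
    with open(path, "rb") as handle:
        return read_pdf_header_version(handle.read(1024))
```

A result of None suggests the file is not a PDF or is damaged at the start; a result such as '1.4' only reports what the header claims.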


Subsets of the PDF standard include: 
 
● PDF for Archive (PDF/A) 
● PDF for Exchange (PDF/X) 
● PDF for Engineering (PDF/E) 
● PDF for Universal Access (PDF/UA) 
● PDF for Healthcare (PDF/H) 
● PDF for Variable and Transactional Printing (PDF/VT) 
 
The preferred subset of the PDF format for preservation is PDF/A. PDF/A is designed for long-term preservation and 
archiving and excludes features that would make the format unsuitable for preservation, such as encryption, embedded 
audio or video objects, and the use of copyrighted fonts (Arms & Fleischhauer, 2019). 

Typical Purposes and Functions 


PDF documents are used for a variety of purposes - the three most common being (1) data description, (2) reporting 
related methods and results, and (3) data storage and sharing. Below are recommendations along with examples and 
templates. 

Data Description 
Data description documents provide additional details relating to the data to allow other users to understand how the data 
were collected, defined, and structured and the relationships between this dataset and other data. Data description 
documents include data dictionaries, codebooks, and survey instruments. Providing an easily editable file (e.g., TXT) is 
recommended, along with a duplicate of the file in PDF, formatted for ease of reference. Machine-actionable files are 
recommended to allow for reproducing and adapting surveys and codebooks. (See the “Reporting Related Methods and 
Results” section for these uses.)  
 
Recommended Components 
● Creator(s) names, contact information, and affiliation. 
● Description of the project. 
● Description of the data files with each file name listed along with its description and how each dataset/database 
relates to one another and to other existing datasets/databases. 
● Specific definitions of all abbreviations, measurements, and any detail necessary for interpretation. 
● For all the variables, the exact name as it appears in the dataset or database, its full description, data type, and 
acceptable and null values. 
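
The variable-level component above can be drafted from a tabular data file before the creator fills in descriptions. The sketch below is one possible approach in Python using only the standard library. The function names, the dictionary keys, and the type labels are hypothetical, and the type guess is deliberately coarse (it treats codes such as "NA" as text), so the output is a starting point for the data creator rather than a finished data dictionary.

```python
import csv
import io
from typing import Dict, List


def infer_type(values: List[str]) -> str:
    """Guess a coarse data type from the non-blank values of one column."""
    non_blank = [v for v in values if v.strip() != ""]
    if not non_blank:
        return "unknown"
    for caster, label in ((int, "integer"), (float, "decimal")):
        try:
            for v in non_blank:
                caster(v)
            return label
        except ValueError:
            continue
    return "text"


def draft_data_dictionary(csv_text: str) -> List[Dict[str, object]]:
    """Build one skeleton entry per column, in the order of the header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    entries = []
    for name in reader.fieldnames or []:
        # Short rows yield None for missing cells; treat those as blanks.
        values = [(row[name] or "") for row in rows]
        entries.append({
            "variable": name,  # exact name as it appears in the data file
            "description": "TODO: supplied by the data creator",
            "data_type": infer_type(values),
            "blank_count": sum(1 for v in values if v.strip() == ""),
        })
    return entries
```

The blank count is included because the meaning of blank cells (null, not applicable, not collected) is one of the details a data dictionary should state explicitly.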

Common Types 
● Codebooks and Data Dictionaries:  
○ Additional Recommendations: www.dataone.org/best-practices/create-data-dictionary  
○ Example: American Time Use Survey data dictionaries - www.bls.gov/tus/dictionaries.htm 
○ Blank Template: data.nal.usda.gov/data-dictionary-blank-template 
● README Files:  
○ Additional Recommendations: data.research.cornell.edu/content/readme 
○ Example: Implicit Association Test README - osf.io/s27xd 
○ Blank Template: cornell.app.box.com/v/ReadmeTemplate 
● Survey Instruments - usually exported from the program 
○ Additional Recommendations: 
res.mdpi.com/data/data-03-00045/article_deploy/data-03-00045-v2.pdf?filename=&attachment=1 
(Section 3.1.3. Survey Instrument) 
○ Example: Child Care Market Rate Survey instrument - doi.org/10.3886/ICPSR23262.v2 
(Questionnaire.pdf)  

Reporting Related Methods and Results 


Other related files that are often stored as PDF documents are those describing or including the research methods or 
findings. These include protocols, figures, and the research manuscript or article itself. See below for examples. 
 
● Data collection methods 
○ Example 1: computational biology research steps - conservancy.umn.edu/handle/11299/176334 
(Methods.pdf) 
○ Example 2: systematic review protocol - 
datadryad.org/bitstream/handle/10255/dryad.211406/Search%20protocol.pdf 
 
● Charts, tables, and visualizations of findings 
○ Example: figures from a geohistorical immigrant study - doi.org/10.7910/DVN/8PY6Q6/0VO9FK 
 
Sharing the underlying data only within the publication, or only as graphs or charts, is common, but it makes reuse 
impractical or labor-intensive. “‘Send me your data: PDF is fine’ said no one ever” by Rivers (2013) details basic steps to 
better share these files along with the machine-actionable data. When provided as PDF documents alongside the data 
files, these supplemental files reporting methods and findings, including the manuscript itself, can assist with 
interpretation of the data and related findings. 
 

Data Storage and Sharing 


PDF documents are not recommended for sharing data because the format “restricts reuse by encapsulating otherwise useful data 
in this traditional publication format” (Johnston, 2017, p. 127), and data should be made available through “‘reuse-ready 
sharing,’ ‘fit-for-purpose sharing,’ or ‘source file sharing’” (p. 132). However, many tools only allow for export of the data 
in proprietary formats or in PDF. In these cases, PDF documents provided along with the proprietary file 
allow a larger number of users to view the data. 

   


Software for Viewing or Analyzing Data  
Adobe offers several programs commonly used with PDF documents: Acrobat Reader (only for viewing and signing 
PDF documents), Acrobat Pro, InDesign, Illustrator, and Photoshop. However, since PDF is an open format, 
thousands of other software tools can be used. Adobe Systems holds the PDF patents but licenses them for royalty-free 
use in PDF software development. 
 
The table below lists other commonly used tools:  
 
Software  Uses*  Notes 

Microsoft Office 2007 and later  creating   

Google Docs  creating, reading   

LibreOffice (GNU LGPLv3 / MPLv2.0)  reading, creating, converting   

PDFBox (Apache)  converting  not available for Mac OS; converts PDF documents to text, images, HTML, and other file types 

Pdf-parser (public domain)  reading, analyzing  extraction and analysis tool; handles corrupt and malicious PDF documents 

Google Chrome  reading, converting  built-in PDF document viewer in web browser; converts HTML to PDF via “print to PDF” functionality 

Mozilla Firefox (MPL 2.0)  reading  built-in PDF document viewer (PDF.js) in web browser 

Nitro PDF Reader (freeware)  reading, creating, converting  allows for limited editing (text highlighting, drawing lines, and measuring distances); extracts images from PDF documents 

Bluebeam Revu  reading, creating, editing, converting   

Nitro PDF Pro  reading, creating, editing, converting   

PDF Studio  reading, creating, editing   

pdftk (GPL)  reading, creating, editing, analyzing, converting  command-line tools for manipulating PDF documents and filling PDF forms with FDF/XFDF data 
 
Mobile applications allowing for reading PDF documents include Amazon Kindle app, Google Drive app, iBooks, and Hancom 
Office Editor. Web tools include Smallpdf (conversion); PDFVue, A.nnotate, DigiSigner (reading, annotating, filling out forms, 
signing); and Docstoc, Issuu, PDF.js, and PDFTron Systems (reading). 
 
* Uses Definitions: creating - saving a document as .pdf; editing - editing a document that began as a .pdf and saving it as .pdf; 
reading - opening and viewing .pdf files; converting - converting content from a .pdf file to another type; analyzing - analyzing 
content in .pdf files. 
 
Information on this page adapted from https://en.wikipedia.org/wiki/List_of_PDF_software 


PDF CURATED Checklist 
The following checklist is adapted from the original Data Curation Network (2018) checklist to assist curators when 
encountering diverse formats of digital objects. The modified version below includes considerations for the PDF 
document in order to guide you when working with PDF files provided by research stakeholders in your organization. It 
includes questions to ask and steps to take which will help ensure that documents meet expectations according to The 
FAIR Data Principles and data curation best practices before bringing them into a curatorial or preservation environment. 

First, determine the purpose of the PDF documents in the larger research workflow and whether the 
information in PDF could be more useful in a machine-readable format. If PDF is appropriate (or there is a duplicate in 
another file type), refer to the steps below: 

● Check data files and read documentation 


○ Files open as expected 
❑ Troubleshooting Issues: 
❑ Cannot open PDF file in browser/on computer: 
https://helpx.adobe.com/acrobat/kb/cant-open-pdf.html 
❑ Cannot open PDF file in Acrobat created from InDesign or Illustrator: 
https://helpx.adobe.com/indesign/kb/cannot-open-pdf-file-acrobat.html  
❑ Lack of embedded fonts renders PDF incorrectly (not relevant if transforming to PDF/A, see below 
for additional information): 
https://helpx.adobe.com/acrobat/kb/missing-or-garbled-text-converting.html  
❑ Find solutions to additional issues on the Adobe Support Community Forum: 
https://community.adobe.com/ 
❑ Use third-party services/tools to repair PDF files. Please note that these are examples of tools to 
work with PDF files and not an exhaustive list. You should experiment with these and/or other 
tools before implementing in your organization. 
❑ SysInfoTools PDF Recovery Tool: ​https://www.sysinfotools.com/recovery/pdf-recovery.php 
❑ Sejda Repair PDF: ​https://www.sejda.com/repair-pdf  
❑ Kernel for PDF Repair: ​https://www.nucleustechnologies.com/pdf-repair-tool.html 
❑ Other Issues __________ 
○ File does not have inappropriate protections or security features enabled preventing curation (not relevant if 
transforming to PDF/A, see below for additional information) 
❑ Troubleshooting Issues: 
❑ Remove password protection on PDF files if you have access to Acrobat Pro: 
https://acrobat.adobe.com/us/en/acrobat/how-to/unlock-pdf.html  
❑ Find information on security for PDFs on the Adobe website: 
https://helpx.adobe.com/acrobat/using/overview-security-acrobat-pdfs.html#overview_of_security_in_acrobat_and_pdfs  
○ Metadata quality is rich, accurate, and complete 
❑ Metadata has issues _________ 
○ Documentation Type (circle) 
❑ Readme / Codebook / Data Dictionary / Other: ________________________ 
❑ Missing/None 
❑ Needs work 
○ Human subjects data, if present 
❑ Request consent form / participation agreement 
 


 

● Understand the data (or try to) 


○ Organization of data well-structured 
○ Headers clearly defined 
❑ Define headers 
❑ Clarify use of “blanks” 
❑ Clarify units of measurement 
○ Quality control clearly defined 
❑ Unclear quality control 
❑ Update/add Methodology 
 
● Request missing information or changes 
○ Describe concerns, issues, and needed improvements to the data submission. 
○ If content is unfamiliar, recommend changes to the creator for reusability (Janée et al., 2019). 
 
● Augment the submission 
○ Discoverability sufficient 
❑ Recommend (circle one) full-text index / file rename / file reorder / file descriptions / zip files  
❑ Other ______________ 
○ Keywords Sufficient 
❑ Suggestions _______________ 
○ Linkages Sufficient 
❑ Link to report/paper 
❑ Link to related data sets 
❑ Link to source data 
❑ Link to other ____________ 
 
● Transform file formats 
○ Preferred file formats in use 
❑ Convert to PDF/A-3, compliant with ISO 32000-1.  
(PDF/A-3 allows for embedding files.) (Arms & Fleischhauer, 2019) 
❑ Convert embedded files if they include the informational content of the document. 
❑ Recommend conversion from PDF to _________   
(Depends on purpose - see “Typical Purposes and Functions” section) 
❑ Retain PDF along with original formats 
○ Software needed is readily available 
❑ Unclear version of software 
❑ Unclear software used 
○ Visualization of data easily accessible 
❑ Recommend graphical representation ____________ 
❑ Recommend web-accessible surrogate ____________ 
 
● Evaluate and rate the overall data record for FAIRness.  
(The rubric evaluating the FAIR principles is based on the scoring matrix by Dunning et al. (2017).) 
○ Findable 
❑ Metadata exceeds author/title/date. 
❑ Unique PID (DOI, Handle, PURL, etc.). 
❑ Discoverable via web search engines. 


 

○ Accessible 
❑ Retrievable via a standard protocol (e.g., HTTP). 
❑ Free, open (e.g., download link). 
❑ Embedded files comply with requirements of PDF/A-compliant attachments. 
○ Interoperable 
❑ Metadata formatted in a standard schema (e.g., Dublin Core). 
❑ Metadata provided in machine-readable format (OAI feed). 
○ Reusable 
❑ Data include sufficient metadata about the data characteristics to reuse. 
❑ Contact info displayed if the direct assistance of the author is needed. 
❑ Clear indicators of who created, owns, and stewards the data. 
❑ Data are released with clear data usage terms (e.g., a CC License). 
 
● Document throughout curation activities 
○ Accessioning & deposit records 
❑ Names, dates, contact information, submission agreements, etc. 
○ Repository collection metadata 
○ Provenance logs 
○ Service workflow 
○ Preservation packaging 
○ Any additional requirements at your institution 
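
For the Check and Transform steps above, a few byte-level heuristics can help sort incoming files before opening them in a viewer. The Python sketch below is an illustration under stated assumptions, not a validator. The function name and result keys are hypothetical. It looks for the %PDF- header, the %%EOF end marker, an /Encrypt key (which normally signals security settings), and a PDF/A claim in XMP metadata (the pdfaid:part property). A match on raw bytes can be coincidental, metadata may be compressed and therefore invisible to this scan, and a PDF/A claim in metadata does not prove conformance, so results should be confirmed with a dedicated tool.

```python
import re
from typing import Dict, Optional

# Matches both XMP serializations of the PDF/A identifier:
#   pdfaid:part="3"   and   <pdfaid:part>3</pdfaid:part>
_PDFA_PART = re.compile(rb"pdfaid:part\s*(?:=\s*[\x22\x27]|>)\s*(\d)")


def triage_pdf_bytes(data: bytes) -> Dict[str, Optional[object]]:
    """Heuristic checks on raw PDF bytes; every result needs confirmation."""
    pdfa = _PDFA_PART.search(data)
    return {
        # Header comment expected near the start of the file.
        "has_pdf_header": b"%PDF-" in data[:1024],
        # End-of-file marker expected near the end of the file.
        "has_eof_marker": b"%%EOF" in data[-1024:],
        # An Encrypt entry usually indicates passwords or permissions.
        "mentions_encrypt": b"/Encrypt" in data,
        # PDF/A part claimed in uncompressed XMP metadata, if any.
        "claimed_pdfa_part": pdfa.group(1).decode() if pdfa else None,
    }
```

A file with no header or no end marker is a candidate for the repair tools listed under troubleshooting, and a file that mentions /Encrypt is a candidate for the security review before any conversion to PDF/A.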

   

Additional Resources 
1. The FAIR Data Principles (2016). Available at force11.org/group/fairgroup/fairprinciples 
2. PDF/A Family, PDF for Long-term Preservation (2019). Available at 
www.loc.gov/preservation/digital/formats/fdd/fdd000318.shtml  

References 
Adobe Systems, Inc. (n.d.a). About Adobe PDF. Retrieved from 
https://acrobat.adobe.com/us/en/acrobat/about-adobe-pdf.html 
Adobe Systems, Inc. (n.d.b). PDF Reference and Adobe Extensions to the PDF Specification. Retrieved from 
https://www.adobe.com/devnet/pdf/pdf_reference.html 
Adobe Systems, Inc. (2008). Document management - Portable document format - Part 1: PDF 1.7. Retrieved from 
https://www.adobe.com/content/dam/acom/en/devnet/pdf/PDF32000_2008.pdf 
Arms, C. R., & Fleischhauer, C. (2019). PDF/A-3, PDF for long-term preservation, use of ISO 32000-1, with 
embedded files. Retrieved from https://www.loc.gov/preservation/digital/formats/fdd/fdd000360.shtml 
Data Curation Network. (2018). Checklist of CURATED steps. Retrieved from 
https://datacurationnetwork.org/resources-2 
Dunning, A., de Smaele, M., & Böhmer, J. (2017). Are the FAIR Data Principles fair? International Journal of 
Digital Curation, 12(2). Retrieved from https://doi.org/10.2218/ijdc.v12i2.567 
History of the Portable Document Format (PDF). (2018, December 16). In Wikipedia. Retrieved from 
https://en.wikipedia.org/w/index.php?title=History_of_the_Portable_Document_Format_(PDF)&oldid=873991566 
Janée, G., Sawchuk, S., & Yoo, H. J. (2019). Microsoft Excel data curation primer. Data Curation Network Primers, 
1. Retrieved from https://conservancy.umn.edu/handle/11299/202816 
Johnson, D. (2014, February 17). The 8 most popular document formats on the web [Blog post]. Retrieved from 
http://duff-johnson.com/2014/02/17/the-8-most-popular-document-formats-on-the-web 
Johnston, L. (Ed.). (2017). Curating research data. Volume one: Practical strategies for your digital repository. 
Chicago, Illinois: Association of College and Research Libraries, a division of the American Library 
Association. 
PDF Association. (2017, July 31). ISO 32000-2 (PDF 2.0). Retrieved from 
https://www.pdfa.org/resource/iso-32000-2-pdf-2-0 
Rivers, C. (2018, April 8). “Send me your data: PDF is fine” said no one ever (how to share your data effectively) 
[Blog post]. Retrieved from 
http://www.caitlinrivers.com/blog/send-me-your-data-pdf-is-fine-said-no-one-ever-how-to-share-your-data-effectively 
 
 
