Deriving HTML From PDF
This work is licensed under the Creative Commons Attribution 4.0 International License.
To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ or send a letter to Creative Commons, PO
Box 1866, Mountain View, CA 94042, USA.
PDF Association
Neue Kantstrasse 14
14057 Berlin, Germany
E-mail: copyright@pdfa.org
Web: www.pdfa.org
Each PDF Association member interested in a subject for which a TWG has been
established has the right to be represented on that committee. International
organizations, governmental and non-governmental, in liaison with the PDF Association,
also take part in the work. The PDF Association collaborates closely with the 3D PDF
Consortium and ISO on all matters of standardization.
The procedures used to develop this document and those intended for its maintenance
are described in the PDF Association's publication process.
Attention is drawn to the possibility that some of the elements of this document may be
the subject of patent rights. The PDF Association shall not be held responsible for
identifying any or all such patent rights. Details of any patent rights identified during the
development of the document will be in the Introduction.
Any trade name used in this document is information given for the convenience of users
and does not constitute an endorsement.
Table of Contents
Foreword
Introduction
References
1 Scope
2 Terms and definitions
3 Notation
4 Algorithm for deriving HTML from Tagged PDF
4.1 Technical context
4.2 Document handling
4.2.1 Head
4.2.2 The structure tree root
4.2.3 The ClassMap
4.2.4 Body
4.3 PDF structure elements
4.3.1 General
4.3.2 Common processing
4.3.3 Mapping PDF structure element types to HTML elements
4.3.4 Ensuring valid HTML
4.3.5 Special cases
4.3.6 Structure element properties
4.3.7 Attributes
4.4 Processing of a content element
4.4.1 Paths
4.4.2 Text
4.4.3 Image XObjects and inline images
4.4.4 Form XObjects
4.4.5 Shadings
4.4.6 Artifacts
4.4.7 Handling marked content sequences
4.4.8 Processing of an object reference (OBJR)
4.5 ECMAScript
In the modern world of small devices, IoT and connected systems, where interchange and
reuse of data is critical, it is reasonable to question the continued relevance of PDF’s core
value proposition. In particular, search engines, machine learning and artificial
intelligence systems focus on accessing information contained in documents over visual
representation. In other cases, document producers wish to deliver data in a form that is
suitable for automated processing while using a PDF file as a record for trust purposes.
End users want electronic documents that adapt smoothly to viewing on diverse small
devices.
By describing the algorithm that produces conforming HTML from a tagged PDF, this
document shows how well-tagged PDF documents, containing both traditional fixed-
layout content and the semantic structures leveraged by modern devices and software,
can be reliably and consistently reused as HTML to support better user experiences and
renew PDF’s value proposition.
HTML was chosen as a derivation target because HTML is consumed on all platforms and
supported by all major vendors. With small modifications, developers can use this
document to export content from well-tagged PDF to any format.
Author
Contributors
References
ISO 14289-1:2014, Document management — Electronic Document File Format Enhancement
for Accessibility — Part 1: Use of ISO 32000-1 (PDF/UA-1)
ISO 21757-1, Document management – ECMAScript for PDF – Part 1: Use of ISO 32000-2 (PDF
2.0)
NOTE 1 As of this writing, this document is at ISO's Committee Draft stage, and is available
only to accredited members of ISO TC 171 SC 2 WG 8, or to members of the PDF
Association.
ISO 32000-2: 20xx, Document management — Portable Document Format — Part 2: PDF 2.0
NOTE 2 This document uses the forthcoming dated revision of ISO 32000-2. This
document remains under development and is only available to accredited members of ISO
TC 171 SC 2 WG 8 or to members of the PDF Association. A Draft International Standard
(DIS) of this document should be available for purchase from ISO in the early summer of
2019.
HTML 5, http://www.w3.org/TR/html5/
1 Scope
This document describes an algorithm that produces conforming HTML5 from a well-
tagged PDF.
The term "well-tagged" is to be understood in the context of known best practices for
tagging, which require semantic appropriateness and recommend the best use of PDF
structure elements in diverse situations, among other practices.
This document identifies "well-tagged PDF" as those PDF files that conform to ISO 32000-
2, 14.8 "Tagged PDF", or ISO 14289-1 (PDF/UA-1).
The best results are achieved when tagged PDF files are both authored (by users) and
created (by software) with reuse in mind. In particular, the semantic structures defined in
Tagged PDF are fundamental to realizing the author’s intent in the derivation context.
Their presence as an accurate reflection of the author’s intent is the guarantor of an
expected user experience.
This document does not:
• provide adaptations for deriving PDF into HTML sub-structures (e.g., within a <div>)
• provide guidance for editing or modifying PDF files or HTML derived from PDF files
• provide guidance for addressing the security implications of derivation
• substitute for best-practice documents focusing on accessibility
deterministic process of conversion of Tagged PDF files into a syntactically valid HTML file
derived HTML
derived CSS
media type
a two-part identifier for file formats and format contents, also known as MIME type or
content type
processor
any software, hardware or other active agent that derives HTML from a tagged PDF file
tagged PDF
3 Notation
Key names are given in boldface, while values are given in italics.
In example pseudo-code, standard PDF structure element entries are given in angle
brackets (e.g., <Div>). The elements are not closed; instead, items contained within PDF
structure elements are enclosed by "{ }". Attributes are indicated using HTML conventions,
e.g. ‘<P lang="en-us">’. Remarks or special characters are shown in square brackets ([]).
EXAMPLE
<Figure alt="PDF icon"> {
<Caption> {
<P> [remark or notice]
<P> {relevant content}
}
}
4.2.1 Head
The first element created in the HTML output shall be a head element with four child
elements, title, meta, viewport and link.
The value of the title element shall be derived from the value of the dc:title metadata
value (if present) in the PDF’s document-level XMP. If the PDF does not have a dc:title
specified, the value of the title element in the HTML shall be derived from the PDF’s
filename.
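The title fallback described above can be sketched in Python as follows. This is a minimal illustration, not a normative implementation: the function name derive_title and its parameters are hypothetical, and two points are assumptions rather than requirements of this document, namely that an empty dc:title is treated as absent and that "derived from the filename" means the filename without its extension.

```python
import html
from pathlib import PurePath
from typing import Optional


def derive_title(dc_title: Optional[str], pdf_filename: str) -> str:
    """Return an HTML title element from dc:title, or from the filename."""
    if dc_title is not None and dc_title.strip():
        text = dc_title.strip()
    else:
        # Assumption: use the filename without its extension as the fallback.
        text = PurePath(pdf_filename).stem
    # Escape characters that are significant in HTML text content.
    return "<title>" + html.escape(text) + "</title>"
```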
All text shall be encoded using UTF-8. A meta element shall be added with attributes of:
EXAMPLE
<!DOCTYPE html>
<html>
<head>
The value of each entry in the class map dictionary is an attribute object dictionary or an
array of attribute object dictionaries. The processor shall identify attributes that map to
CSS properties as described in 4.3.7, "Attributes", and for each, create a CSS declaration in
the derived CSS using the dictionary key as the property and using the value of this key
(converted into a string using common methods) as the declaration value.
If, after iterating over all attribute object dictionaries for a given key in the class map
dictionary, no appropriate attributes are located, the processor may either remove the
selector or provide an empty property list.
EXAMPLE
/ParaStyle
[
<<
/O /Layout
/Color [0 0 1] %blue
/BorderColor [0 1 0] %green
/TextAlign /Justify
>>
<<
/O /CSS-2.00
/color /red
/font-family ("Times New Roman", Times, serif)
/font-size (12px)
>>
]
>>
CSS output
.HeadingStyle {
text-align: center; color: red;
font-family: Arial, Helvetica, sans-serif;
font-size: 40px;
}
.ParaStyle {
font-family: "Times New Roman", Times, serif;
font-size: 12px;
color: red; /*coming from the CSS-2.00 attribute object
dictionary and overrides the Color attribute defined in the
Layout attribute object dictionary*/
border-color: green; /*coming from the Layout attribute object
dictionary*/
}
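A possible way to generate the derived CSS from a class map is sketched below in Python. It is an illustration under stated assumptions: the class map is modelled as plain dictionaries, LAYOUT_TO_CSS is only a small subset of the Layout mapping in 4.3.7, the conversion of a three-number colour array to rgb() is one possible "common method", and a later attribute object dictionary is assumed to override an earlier one for the same property. The names class_map_to_css and layout_value are hypothetical.

```python
from typing import Dict, List, Union

# Assumed subset of the Layout attribute to CSS property mapping (see 4.3.7).
LAYOUT_TO_CSS = {
    "Color": "color",
    "BorderColor": "border-color",
    "TextAlign": "text-align",
}


def layout_value(value: object) -> str:
    """Convert a Layout attribute value to a CSS value string."""
    if isinstance(value, (list, tuple)) and len(value) == 3:
        # Assumption: a three-number array is an RGB colour in the range 0..1.
        r, g, b = (round(float(c) * 255) for c in value)
        return f"rgb({r}, {g}, {b})"
    # PDF names such as Justify become lowercase CSS keywords.
    return str(value).lower()


def class_map_to_css(class_map: Dict[str, Union[dict, List[dict]]]) -> str:
    """Create one CSS rule per class map key."""
    rules = []
    for class_name, entry in class_map.items():
        dictionaries = entry if isinstance(entry, list) else [entry]
        declarations: Dict[str, str] = {}
        for attrs in dictionaries:
            owner = str(attrs.get("O", ""))
            for key, value in attrs.items():
                if key == "O":
                    continue  # the owner key is never output
                if owner.startswith("CSS-"):
                    declarations[key] = str(value)
                elif owner == "Layout" and key in LAYOUT_TO_CSS:
                    declarations[LAYOUT_TO_CSS[key]] = layout_value(value)
        body = "".join(f"  {prop}: {text};\n" for prop, text in declarations.items())
        # A class without usable attributes yields an empty property list.
        rules.append(f".{class_name} {{\n{body}}}")
    return "\n".join(rules)
```

A processor that prefers to remove selectors without declarations, as permitted above, could skip the rule when the declarations dictionary is empty.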
4.2.4 Body
A body element shall be created immediately after the head element. If the Lang key is
present in the PDF’s document catalog dictionary, the lang attribute shall be added to the
body element with the value of the PDF document’s Lang entry.
EXAMPLE
<body lang="EN-US">
The children of the body element are created as described in 4.3, "PDF structure
elements".
If the PDF contains one or more elements in the Fields array of the document’s interactive
form dictionary, then a form element shall be created as a child of the body element with
an attribute, name, whose value shall be acroform.
EXAMPLE
All other interactive form elements in the document are derived to corresponding HTML
form fields. Each such HTML element shall refer to the acroform form element through its
form attribute in the derived HTML.
EXAMPLE
<input name="FirstName" form="acroform"/>
4.3.1 General
As described in ISO 32000-2, 14.7.2, PDF structure elements are constructed in a
hierarchical fashion, referred to as the structure tree. Processing of the structure tree shall
begin with the root element and proceed in a depth-first, pre-order traversal of each
element and its children. The root element is handled according to 4.2.2, "The structure
tree root".
NOTE The processing order for nodes specifically indicates pre-order for the depth-first
traversal which is more explicit than logical content order.
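The traversal order can be illustrated with a short Python sketch. The StructElem class is a hypothetical stand-in for a PDF structure element and carries only a type and its children; a real processor would also visit content items and object references in the same order.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class StructElem:
    """Hypothetical stand-in for a PDF structure element."""
    type: str
    kids: List["StructElem"] = field(default_factory=list)


def pre_order(elem: StructElem) -> Iterator[StructElem]:
    """Yield the element first, then each child subtree in order."""
    yield elem
    for kid in elem.kids:
        yield from pre_order(kid)
```

For a Document containing a Sect (with H1 and P) followed by a P, the elements are visited as Document, Sect, H1, P, P.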
4.3.2.2 When the PDF structure element does not use an explicit namespace
If the RoleMap entry is present in the structure tree root, and if it contains an entry
matching the structure type of the PDF structure element, the processor shall apply role
mapping – possibly transitively – until no further role mapping can be applied, as
described in ISO 32000-2, 14.8.6.2 "Role maps and namespaces". Based on the resulting
structure type – which by definition has to be a PDF 1.7 standard structure type for any
tagged PDF – the processor shall select corresponding HTML output (see 4.3.3, "Mapping
PDF structure element types to HTML elements").
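Transitive role mapping can be sketched in Python as follows. The RoleMap is modelled as a plain dictionary from structure type to structure type. The function name resolve_role is hypothetical, and the guard against circular role maps is an assumption added for robustness, not a behaviour defined by this document.

```python
from typing import Dict, Set


def resolve_role(struct_type: str, role_map: Dict[str, str]) -> str:
    """Follow RoleMap entries until no further mapping applies."""
    seen: Set[str] = set()
    current = struct_type
    while current in role_map and current not in seen:
        seen.add(current)  # assumption: stop if the role map is circular
        current = role_map[current]
    return current
```

With a RoleMap of InlineShape to Shape and Shape to Figure, resolve_role("InlineShape", ...) returns Figure, which is then used to select the HTML element.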
NOTE Extra data attributes with PDF structure types are a unified way to preserve
information from PDF, and might help HTML developers to understand and rely on the
original structure that would otherwise be lost during derivation.
A data-pdf-se-type attribute with value of the PDF standard structure type’s key name
shall be added to the HTML element.
EXAMPLE
HTML output
<img data-pdf-se-type="Figure"
data-pdf-se-type-original="InlineShape Shape" src="image.jpg"/>
A data-pdf-se-type attribute with value of the PDF standard structure type’s key name
shall be added to the HTML element.
If the PDF structure element uses the MathML namespace – as defined in ISO 32000-2,
14.8.6.3 "Other namespaces" – then the processor shall use its structure type directly as a
MathML element.
If the PDF structure element uses the HTML namespace, the processor may use its
structure type directly as the HTML element.
NOTE 1 Direct usage of the HTML namespace raises the same security concerns that apply
to HTML in general. See Annex A for additional guidance.
If the PDF structure element uses any other namespace – transitively, if applicable – the
processor shall apply role mapping until encountering a structure type that belongs to
one of the sets of structure types described above – PDF 1.7, PDF 2.0, MathML or optionally
HTML – and then determine the HTML element to use accordingly.
NOTE 2 This implies that not all role mappings on a given element are processed if one of
the defined sets is encountered first.
PDF 1.7     PDF 2.0           HTML element
Annot       Annot             –
Art         –                 article
–           Artifact          –
NOTE 2 The structure element is not output, nor is any of its content or
descendent elements (see 4.3.5.7, "NonStruct, Private and Artifact").
–           Aside             aside
BibEntry    –                 p
BlockQuote  –                 blockquote
Code        –                 code
–           DocumentFragment  div
–           Em                em
–           FENote            div
H           H                 h1..h6 / p
–           H7..Hn            p
Index       –                 section
L           L                 ul / ol / dl
LI          LI                li / div
Link        Link              a
NonStruct   –                 –
NOTE 3 The structure element is not processed, though content it contains is
processed normally. See 4.3.5.7, "NonStruct, Private and Artifact".
Note        –                 p
P           P                 p
Private     –                 –
NOTE 4 The PDF structure element is not output, nor is any of its content or
descendent elements. See 4.3.5.7, "NonStruct, Private and Artifact".
Quote       –                 q
Reference   –                 a
RB          RB                rb
RP          RP                rp
RT          RT                rt
–           Strong            strong
–           Sub               span
TD          TD                td
TH          TH                th
–           Title             div
TOC         –                 ol
TOCI        –                 li
TR          TR                tr
WT          WT                span
WP          WP                span
To achieve interoperable reuse of PDF content in syntactically valid HTML, the derivation
process has to account for these differences.
EXAMPLE
As shown below, direct derivation of the above example would not produce valid HTML
because the h1 element is not allowed as a descendant of the th element.
HTML output
<table>
<tr>
<th>
<h1>Heading inside TH</h1>
</th>
</tr>
</table>
PDF allows even more complex structures that don’t have a semantically equivalent
expression in HTML.
EXAMPLE
PDF allows tables to include captions which may themselves include tables:
<Table>{
<TR> {..}
<Caption> {
<Table> {..}
}
}
In HTML, by contrast, even though the caption element is allowed as a descendant of a
table element, the caption is required to be the first child of the table element and
cannot include another table as its descendant.
HTML output
<table>
<tr>..</tr>
<caption>
<table>..</table>
</caption>
</table>
ISO 32000-2, 14.8.4.2 "Nesting of standard structure elements" defines rules that apply to
standard PDF structure elements and the context in which they can be used.
Additionally, PDF structure elements with a type of Link or Form are special cases
according to 4.3.5.8, "Links and references" and 4.3.5.9, "Forms".
EXAMPLE
PDF
<H7 "O=ARIA-1.1" "role=heading" "aria-level=7" > { Heading 7 }
HTML output
<p role="heading" aria-level="7">Heading 7</p>
4.3.5.2 Caption
4.3.5.2.1 Captions of Figures and Formulas
If a Caption structure element is a direct child or an immediate sibling of a Figure or
Formula structure element, then it shall be mapped to the HTML element figcaption and
shall become the direct and first child of the corresponding HTML figure element.
If, using this method, a caption element containing a table or an ol, ul or dl element
becomes a child of another table element, then, to avoid invalid HTML, a processor may
decide to:
• move the table or ol/ul/dl sub-structure from within the Caption to immediately
follow the parent table, continuing to move it up the tree if it is not allowed to be
nested there, or
• derive all PDF structure elements to span if visual representation is more critical.
EXAMPLE
HTML output
<div>
<table>
<caption>
Some Text
</caption>
<tr> </tr>
</table>
<table> </table>
</div>
4.3.5.3 Lbl
4.3.5.3.1 Lbl within a LI (list item)
If deriving L to ol or ul, and if a child LI structure element contains a Lbl structure element
as its first child, then:
• the ul or ol elements derived from the parent L’s structure element shall have an
additional style attribute with value list-style-type:none;
• Lbl is mapped to span if it has only textual content (no other child structure
elements);
• Lbl is mapped to div if it contains other structure elements.
EXAMPLE
PDF
<L> {
<LI> {
<Lbl> { - }
<LBody> { text 1}
}
}
HTML output
<ul style="list-style-type:none;">
<li><span>-</span>text 1</li>
</ul>
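The handling of a text-only Lbl inside a list item can be sketched in Python as follows. The sketch is deliberately narrow: list items are modelled as (label, body) pairs of plain text, only the span case is covered, and the name derive_list is hypothetical.

```python
import html
from typing import List, Optional, Tuple

# Each list item is (label_text, body_text); label_text is None without a Lbl.
Item = Tuple[Optional[str], str]


def derive_list(items: List[Item], ordered: bool = False) -> str:
    """Derive an L with text-only Lbl children to ul or ol."""
    tag = "ol" if ordered else "ul"
    has_label = any(label is not None for label, _ in items)
    # Explicit labels replace the markers the browser would otherwise draw.
    style = ' style="list-style-type:none;"' if has_label else ""
    parts = [f"<{tag}{style}>"]
    for label, body in items:
        span = f"<span>{html.escape(label)}</span>" if label is not None else ""
        parts.append(f"<li>{span}{html.escape(body)}</li>")
    parts.append(f"</{tag}>")
    return "\n".join(parts)
```

Calling derive_list([("-", "text 1")]) reproduces the HTML output of the example above.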
Lbl is mapped to div if it contains one or more of the following structure elements:
Form, Figure, Formula or Caption as a direct child
Lbl is mapped to label otherwise. If the PDF 2.0 namespace is used, an additional
for attribute shall be added to the HTML label element (see 4.3.5.9.2, "Form field
processing for PDF 2.0 structure elements").
EXAMPLE
PDF
<P> {
<Figure> {
<Caption> {Figure Caption}
CONTENT [The actual image or illustration converted to
star.jpg during derivation]
}
}
HTML output
<p><span>Figure Caption</span><img src="star.jpg"/></p>
4.3.5.5 L (list)
4.3.5.5.1 L within L
If an L structure element is a direct child of an L structure element, then the child L
element shall be output to HTML as the direct child of a newly created li element.
EXAMPLE
PDF
<L "ListNumbering=Ordered"> {
<L> {
<LI> {Item 1.1}
}
<LI> {Item 2}
}
HTML output
<ol>
<li>
<ul>
<li> Item 1.1</li>
</ul>
</li>
<li>Item 2</li>
</ol>
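One way to apply this rule after the initial mapping is sketched below in Python. The Node class is a hypothetical, minimal model of the derived HTML tree, and the helper names are illustrative only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical minimal model of a derived HTML element."""
    tag: str
    children: List["Node"] = field(default_factory=list)
    text: str = ""


def wrap_nested_lists(node: Node) -> Node:
    """Return a copy in which a list directly inside ol/ul is wrapped in li."""
    fixed = []
    for child in node.children:
        child = wrap_nested_lists(child)
        if node.tag in ("ol", "ul") and child.tag in ("ol", "ul"):
            child = Node("li", [child])  # newly created li element
        fixed.append(child)
    return Node(node.tag, fixed, node.text)


def render(node: Node) -> str:
    """Serialize the tree without whitespace, for inspection."""
    inner = node.text + "".join(render(c) for c in node.children)
    return f"<{node.tag}>{inner}</{node.tag}>"
```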
LI to div
Lbl to dt
LBody to dd
EXAMPLE
PDF
<L "ListNumbering=Description"> {
<LI> {
<Lbl> {First}
}
<LI> {
<Lbl> {Second}
}
}
HTML output
<dl>
<div>
<dt>First</dt>
</div>
<div>
<dt>Second</dt>
</div>
</dl>
EXAMPLE
PDF
<Part> {
<P> {
<P> {
<P> {Actual content before the list}
<L "ListNumbering=Ordered">
<P> {Actual content after the list}
}
}
}
HTML output
<div>
<p><p><p>Actual content before the list</p></p></p>
<ol>. . . </ol>
<p><p><p>Actual content after the list</p></p></p>
</div>
4.3.5.6 TH
If any heading structure element (H, H1..Hn) is a child of a TH structure element then that
heading structure element shall be mapped to an HTML p element:
EXAMPLE
PDF
<Table>{
<TR>{
<TH> {
<H1> { Heading inside TH}
}
}
}
HTML output
<table>
<tr>
<th>
<p>Heading inside TH</p>
</th>
</tr>
</table>
If a Sect structure element is the child of a TH structure element, then all such Sect
structure elements shall be mapped to div in the output HTML.
EXAMPLE
PDF
<Table>{
<TR>{
<TH> {
<Sect> {
<Sect> {
<L> { list}
}
<P> {..}
}
}
}
}
HTML output
<table>
<tr>
<th>
<div>
<div>
<ol> … </ol>
</div>
<p> … </p>
</div>
</th>
</tr>
</table>
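The two TH rules above can be sketched as a single context check in Python. The function name tag_inside_th is hypothetical; the default HTML element is supplied by the caller from the general mapping; and, as an assumption based on the nested Sect example, the rules are applied to any element that has a TH among its ancestors rather than to direct children only.

```python
import re
from typing import Sequence

# Matches H, H1, H2, ... Hn.
HEADING = re.compile(r"H\d*")


def tag_inside_th(struct_type: str, ancestors: Sequence[str], default_tag: str) -> str:
    """Adjust the HTML element for structure elements located inside a TH."""
    if "TH" in ancestors:
        if HEADING.fullmatch(struct_type):
            return "p"  # headings are not allowed inside th
        if struct_type == "Sect":
            return "div"
    return default_tag
```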
PDF structure elements of type Private or of type Artifact shall not be output, nor shall
any of their content or descendent elements.
If a Link structure element is a direct child of a Reference structure element then the
processor shall output only one HTML element with href set from the annotation
dictionary represented by the Link.
4.3.5.9 Forms
Both the PDF 1.7 standard structure namespace and the PDF 2.0 standard structure
namespace support the inclusion of form fields in the logical structure. The definition of
the PDF structure element type Form, however, differs between the two namespaces.
Accordingly, PDF structure elements of type Form are not derived to HTML form elements
as such, as detailed in this subclause.
NOTE 1 HTML requires that form fields are always descendants of a form element, whereas
there is no notion of an equivalent structure element in the PDF 1.7 or PDF 2.0 standard
structure namespaces. Consequently, the HTML form element is inserted in a generic
fashion that ensures that any PDF structure element of type Form will always be derived to
an equivalent HTML form field that is a descendant of a form element.
NOTE 2 It is possible to use PDF structure elements and attributes in the HTML namespace
to define forms and form fields that translate more directly into HTML elements and
element structures. If form-related PDF structure elements from the PDF 2.0 standard
structure namespace or the PDF 1.7 standard structure namespace on one side and from
the HTML namespace on the other side were mixed inside the same document, the
conversion result could be inconsistent.
PDF structure elements of type Form as defined in the PDF 1.7 standard structure
namespace shall be processed as defined in 4.4.8.3, "Widget annotations".
If a PDF structure element of type Form has descendants that are structure elements of
type Lbl, these Lbl structure elements shall be created as label elements, as defined in
4.3.2.1, "Processing PDF structure elements". A for attribute shall be added to each label
element, whose value shall be the same as that of the id attribute of the HTML form field
element created according to 4.4.8.3, "Widget annotations".
EXAMPLE
PDF
<Form> {
<Lbl>{Last name:}
OBJR [widget annotation of single line text field]
}
HTML output
<label for="bd43-05d-11e7">Last name:</label>
<input id="bd43-05d-11e7" type="text" name="lastname">
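The pairing of label and form field through a shared identifier can be sketched in Python as follows. The sketch covers only a single-line text field; the function name and the "f-" prefix of generated identifiers are hypothetical, and the use of a random UUID is an assumption, since any identifier that is unique within the derived HTML would serve.

```python
import html
import uuid
from typing import Optional


def derive_labelled_text_field(label_text: str, field_name: str,
                               field_id: Optional[str] = None) -> str:
    """Emit a label and a text input that share the same id value."""
    if field_id is None:
        # Assumption: any document-unique value is acceptable as the id.
        field_id = "f-" + uuid.uuid4().hex
    fid = html.escape(field_id)
    return (
        f'<label for="{fid}">{html.escape(label_text)}</label>\n'
        f'<input id="{fid}" type="text" name="{html.escape(field_name)}">'
    )
```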
4.3.5.9.3 Form field processing for PDF structure elements from the HTML
namespace
When using form field related structure elements from the HTML namespace, no
processing as defined in 4.4.8.3, "Widget annotations" shall be carried out. All attributes
necessary for each HTML form field must be present as structure attributes in the HTML
namespace.
When using form field related structure elements from the HTML namespace, structure
elements of type form shall be present as necessary to ensure that all form fields in the
derived HTML are descendants of a form element as required by HTML.
4.3.6.1 General
If the structure element dictionary contains an ID entry, its value shall be used as the value
of the id attribute on the HTML element.
If a structured destination (see ISO 32000-2, 12.3.2.3) references the structure element
dictionary and that dictionary does not contain an ID entry, then a unique identifier value
(generated in an implementation-dependent manner) shall be used as the value of the id
attribute on the HTML element.
NOTE 1 This id is used when the Link annotation with the structure destination is
processed.
If the PDF structure element has any classes of attributes (via the C key in the structure
element dictionary), then those classes shall be used as the value for an attribute class on
the HTML element. If C is an array, then the value of the class attribute shall be
constructed as a concatenation of classes separated by a space character. Additionally, the
processor shall output attributes that map to HTML properties associated with the classes
according to 4.3.7.2, "Deriving structure attributes to HTML attributes".
If the PDF structure element has an A key in its structure element dictionary, then its
attributes shall be handled as described in 4.3.7, "Attributes", and shall be output as
attributes of the HTML element or as inline styling properties.
NOTE 2 It is important to process classes of attributes before the attributes. ISO 32000-2
14.7.6.2 requires that if both the A and C entries are present and a given attribute is
specified by both, the one specified by the A entry takes precedence.
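The class attribute construction and the precedence of A over C can be sketched in Python as follows. Attribute sets are modelled as plain dictionaries that have already been mapped to HTML or CSS names, and the function names are hypothetical.

```python
from typing import Dict, List, Union


def class_attribute(c_entry: Union[str, List[str]]) -> str:
    """Build the value of the HTML class attribute from the C entry."""
    names = [c_entry] if isinstance(c_entry, str) else list(c_entry)
    return " ".join(names)


def merge_attributes(class_attrs: List[Dict[str, str]],
                     own_attrs: Dict[str, str]) -> Dict[str, str]:
    """Apply class attributes first, then let attributes from A override them."""
    merged: Dict[str, str] = {}
    for attrs in class_attrs:  # attributes reached through the C entry
        merged.update(attrs)
    merged.update(own_attrs)   # attributes from the A entry take precedence
    return merged
```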
4.3.6.2 Lang
If the structure element dictionary contains a Lang entry and if the entry’s value is not an
empty string, then its value shall be used as the value of the lang attribute on the HTML
element.
EXAMPLE
PDF
<P> {
Dru {
<Span "ActualText=c">{k-}
}
ker
}
HTML output
<p>Dru<span>c</span>ker</p>
EXAMPLE
PDF
<Figure "Alt=six-point star"> {
CONTENT [The actual image or illustration converted to
star.jpg during derivation]
}
HTML output
<figure><img alt="six-point star" src="star.jpg"/> </figure>
EXAMPLE
PDF
<P> {
<Span "E=Doctor"> {Dr.}
Jones
}
HTML output
<p><span><abbr title="Doctor">Dr.</abbr></span> Jones </p>
4.3.7 Attributes
Additional information is often associated with individual PDF structure elements through
the use of structure attributes. In some cases, the presence of a specific attribute changes
the selected HTML element, but in most cases PDF structure element attributes are
mapped to HTML attributes or CSS properties.
4.3.7.1 General
Only those standard structure attributes specifically referenced in this document shall be
processed. Additional format-specific attributes and owners may be present, and the
processor may decide to output them.
The O key (see ISO 32000-2, Table 376, "Standard structure attribute owners") and its
value shall not be output. If the O key has a corresponding value of NSO, then the NS key
and its value shall not be output.
Whenever an array of attributes is defined the processor shall process attributes in the
following sequence:
NOTE 2 When deriving attribute values from PDF to HTML or CSS, the necessary
conversion to lowercase shall be applied, and only those valid in HTML shall be processed.
A style attribute for the HTML element shall be created and all CSS declarations in the
current PDF structure element shall be concatenated into a string, delimited by
semicolons as necessary, and the string shall be used as the value of the style attribute.
The attributes ContinuedList and ContinuedFrom shall not be processed into HTML
unless an implementation is provided (e.g., equivalent CSS or JavaScript) to
accommodate their semantics.
NOTE To achieve equivalent effects in HTML, the author can provide equivalent CSS or
JavaScript mechanisms.
ColSpan colspan
RowSpan rowspan
Headers headers
NOTE The mapping of the Headers attribute relies
on the fact that existing ID attributes for PDF
structure elements are mapped to the id attribute of
the th or td elements derived from TH or TD
structure elements.
Scope scope
Short abbr
Table 3: Mapping standard layout attributes of Table structure elements to CSS properties
TBorderStyle border-style
TPadding padding
EXAMPLE
PDF
<Table> {
<TR> {
<TH "RowSpan=2" "TBorderStyle=Dotted"> { Age }
<TH "ColSpan=2" "TBorderStyle=Dotted"> { Names}
}
<TR> {
<TH> { John }
<TH> { Bob }
}
<TR> {
<TH> { 25-30 }
<TD> { 100 }
<TD> { 500 }
}
}
HTML output
<table>
<tr>
<th style="border-style:dotted;" rowspan="2">Age</th>
<th style="border-style:dotted;" colspan="2">Names</th>
</tr>
<tr><th>John</th><th>Bob</th></tr>
<tr><th>25-30</th><td>100</td><td>500</td></tr>
</table>
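The derivation of a table cell, combining HTML attributes with a style attribute built from CSS declarations delimited by semicolons, can be sketched in Python as follows. The sketch is limited to ColSpan, RowSpan and TBorderStyle; the remaining attributes of the tables above need value conversions that are not shown here. The function name derive_cell is hypothetical.

```python
import html
from typing import Dict

# Subsets of the attribute mappings given in the tables above.
HTML_ATTRS = {"ColSpan": "colspan", "RowSpan": "rowspan"}
CSS_PROPS = {"TBorderStyle": "border-style"}


def derive_cell(tag: str, attrs: Dict[str, object], text: str) -> str:
    """Derive a th or td element with its HTML attributes and inline style."""
    declarations = [
        f"{CSS_PROPS[key]}:{str(value).lower()};"
        for key, value in attrs.items() if key in CSS_PROPS
    ]
    out = []
    if declarations:
        # All CSS declarations are concatenated into one style attribute.
        out.append(("style", "".join(declarations)))
    out.extend(
        (HTML_ATTRS[key], str(value))
        for key, value in attrs.items() if key in HTML_ATTRS
    )
    rendered = "".join(f' {name}="{html.escape(value)}"' for name, value in out)
    return f"<{tag}{rendered}>{html.escape(text)}</{tag}>"
```

Calling derive_cell("th", {"RowSpan": 2, "TBorderStyle": "Dotted"}, "Age") reproduces the first header cell of the example above.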
• If the TextPosition attribute is Sup, the PDF structure element shall map to sup.
• If the TextPosition attribute is Sub, the PDF structure element shall map to sub.
"Table 4: Mapping layout standard structure attribute owner to CSS properties" shows the
mapping from the standard layout attribute to CSS properties that shall be used by the
processor when deriving PDF structure element types to corresponding HTML elements.
BackgroundColor background-color
BorderColor border-color
BorderStyle border-style
BorderThickness border-width
Padding padding
Color color
SpaceBefore (interpreted)
SpaceAfter (interpreted)
StartIndent (interpreted)
EndIndent (interpreted)
TextIndent text-indent
TextAlign text-align
TPadding padding
LineHeight line-height
BaselineShift baseline-shift
TextDecorationColor text-decoration-color
TextDecorationType text-decoration
RubyAlign ruby-align
RubyPosition ruby-position
4.3.7.7 HTML
If the value of the O key of an attribute object dictionary begins with the (case-sensitive)
string "HTML-", then the dictionary shall be considered as containing HTML attributes and
processed according to 4.3.7.2, "Deriving structure attributes to HTML attributes".
4.3.7.8 CSS
If the value of the O key of an attribute object dictionary begins with the (case-sensitive)
string "CSS-", then this dictionary shall be considered as containing CSS attributes and
processed according to 4.3.7.2, "Deriving structure attributes to HTML attributes".
EXAMPLE
PDF
<H1 "O=CSS-3.00" "color=red" "font-size=12px" > { Heading 1 }
HTML output
<h1 style="color: red; font-size: 12px;">Heading 1</h1>
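The owner-based dispatch for the "HTML-" and "CSS-" prefixes can be sketched in Python as follows. Attribute object dictionaries are modelled as plain dictionaries with string values, other owners are simply skipped, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def split_by_owner(attr_dicts: List[Dict[str, str]]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Sort attributes into HTML attributes and CSS declarations by owner."""
    html_attrs: Dict[str, str] = {}
    css_decls: Dict[str, str] = {}
    for attrs in attr_dicts:
        owner = attrs.get("O", "")
        # The prefix comparison is case-sensitive.
        if owner.startswith("HTML-"):
            target = html_attrs
        elif owner.startswith("CSS-"):
            target = css_decls
        else:
            continue  # other owners are not handled in this sketch
        for key, value in attrs.items():
            if key != "O":  # the O key itself is never output
                target[key] = value
    return html_attrs, css_decls


def style_value(css_decls: Dict[str, str]) -> str:
    """Concatenate CSS declarations into the value of a style attribute."""
    return " ".join(f"{prop}: {value};" for prop, value in css_decls.items())
```

For the H1 example above, style_value yields "color: red; font-size: 12px;".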
4.3.7.10 Others
Processing of attributes with any other value of the O key is implementation dependent
and therefore beyond the scope of this document. To achieve consistent output,
implementations should not override attributes defined in ISO 32000-2.
Where visual fidelity is important (infographics, charts, etc.), a processor shall process
content items as a group, either by rasterizing all items and incorporating the result
as a single raster image or by converting them to SVG and including the output in the HTML.
An example of such usage is the content elements within a Figure structure element.
For general purposes, each content element object type shall be processed
according to the provisions of this subclause.
4.4.1 Paths
A processor should choose one of the following methods of handling a content element
that represents one or more path objects:
• rasterize the paths and then incorporate the result into the HTML as a single raster
image (see 4.4.3, "Image XObjects and inline images"), or
• convert the paths to SVG and include the result either directly in the HTML or via an
img element, or
• represent the paths as a canvas object.
If the paths are irrelevant to the reuse application the processor may decide not to output
path objects.
4.4.2 Text
The text of the structure content element shall be converted to UTF-8 (see 4.2.1, "Head"),
and derived as the content of the HTML element.
The manner in which image data is encoded in PDF differs in many respects from how
image data is encoded in file formats such as GIF, PNG or JPEG, or in SVG. When
converting from PDF image data to an OWP-supported file format, a processor should
choose the most suitable file format and should take into account the following aspects:
• the bit depth, whether by not using GIF or by using dithering or other mechanisms
• the colour appearance, whether by converting to a device colour space that matches
the rendering system’s or device’s characteristics or by embedding a suitable ICC
profile
• the compression, using lossy compression only if no additional loss of information is
incurred
• the effect of any Mask or SMask entries applicable to the image data in the PDF
Image XObjects that contain an ImageMask entry with a value of true shall be encoded
such that the current colour in the current graphic state is taken into account, and the
masking effect shall be represented appropriately in the file format to which the image is
converted.
If the processor is unable to convert the data, it shall place some form of placeholder
image, of the same logical (display) size, in the output HTML.
NOTE 2 This ensures that the HTML will at least lay out the same way as it would if the
image were present.
The value of the src attribute on the output img element shall be the URL to the image
data that the processor has prepared.
NOTE 3 Since the handling of the image data is implementation-dependent, the URL can
be any valid URL including absolute (with or without prefix) or data URLs (RFC 2397).
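As a non-normative sketch of the src and placeholder requirements, the following code builds an img element whose src is a data URL, and falls back to a placeholder of the same logical size when no converted data is available. The image object shape ({ mediaType, bytes, width, height }), the function names and the grey SVG placeholder are assumptions made for illustration. Buffer is the Node.js API; a browser-hosted processor would need a different base64 routine.

```javascript
// Builds a data URL (RFC 2397) from a media type and raw bytes.
function toDataUrl(mediaType, bytes) {
  return `data:${mediaType};base64,${Buffer.from(bytes).toString("base64")}`;
}

// Hypothetical placeholder: a plain grey SVG rectangle of the given size.
function placeholderDataUrl(width, height) {
  const svg =
    `<svg xmlns="http://www.w3.org/2000/svg" width="${width}" height="${height}">` +
    `<rect width="100%" height="100%" fill="#ccc"/></svg>`;
  return toDataUrl("image/svg+xml", Buffer.from(svg, "utf8"));
}

// image.bytes is assumed to be null when the PDF image data could not be converted.
function imgElement(image) {
  const src = image.bytes
    ? toDataUrl(image.mediaType, image.bytes)
    : placeholderDataUrl(image.width, image.height);
  // The same width and height are written in both cases so that layout is unaffected.
  return `<img src="${src}" width="${image.width}" height="${image.height}">`;
}
```

A processor that writes image files to disk or to a server would emit a relative or absolute URL instead of a data URL; only the value of src changes.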
4.4.5 Shadings
A processor should choose one of two methods of handling a content element that
represents a shading:
• rasterize the shading and then incorporate it into the HTML as a single raster
image as per 4.4.3, "Image XObjects and inline images", or
• process the shading as a vector element (path) and then handle it as per 4.4.1, "Paths".
If the shadings are irrelevant to the reuse application the processor may decide not to
output shadings.
4.4.6 Artifacts
The derivation algorithm intentionally ignores artifacts not contained in the structure tree
(see 4.3.5.7, "NonStruct, Private and Artifact").
HTML provides different types of elements for different types of form fields, such as
button, input, select and textarea, which are collectively referred to as HTML form fields.
Widget annotations that are invisible or hidden, have a width or a height of 0 (zero), or are
completely outside the CropBox – or in the absence of the CropBox, completely outside
of the MediaBox – of the page on which they are present, or are not present on any page,
shall be processed with the CSS property display set to none.
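The visibility test above can be sketched as follows. This is a non-normative illustration: widget and page are hypothetical plain objects holding the F, Rect, CropBox and MediaBox values as numbers and arrays, rectangles are assumed to be normalized as [llx, lly, urx, ury], and a widget that only touches the edge of the box is treated as being outside it.

```javascript
// Annotation flag values from the F entry (ISO 32000-2 annotation flags).
const FLAG_INVISIBLE = 1 << 0; // bit position 1
const FLAG_HIDDEN = 1 << 1; // bit position 2

// True when rect shares no area with box. Both are [llx, lly, urx, ury].
function isOutside(rect, box) {
  return (
    rect[2] <= box[0] || rect[0] >= box[2] ||
    rect[3] <= box[1] || rect[1] >= box[3]
  );
}

// Returns true when the derived HTML form field should get display: none.
// page is null when the widget is not present on any page.
function widgetDisplayNone(widget, page) {
  if (!page) return true;
  const flags = widget.F || 0;
  if (flags & (FLAG_INVISIBLE | FLAG_HIDDEN)) return true;
  const [llx, lly, urx, ury] = widget.Rect;
  if (urx - llx === 0 || ury - lly === 0) return true;
  // Use the CropBox when present, otherwise fall back to the MediaBox.
  const box = page.CropBox || page.MediaBox;
  return isOutside(widget.Rect, box);
}
```

A processor could then emit style="display: none" on the derived element whenever widgetDisplayNone returns true.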
If the derived HTML element is button, then inner HTML shall be created with
Check box button field | checkbox | If an Opt entry is present, map the applicable entry to the value attribute.
Single line text field | text | If the RichText flag is not set and RV is not present, map V to value
Choice field with Edit flag set | text | Map V to the value
If RichText flag is set and RV is present, inner HTML from RV entry shall
be created; otherwise create inner HTML from V entry
Combo
• Map the entries from the Opt entry of the form field to option inner HTML
As HTML attributes:
4.5 ECMAScript
To achieve an experience in HTML equivalent to that of processing forms in the PDF
context, the processor shall derive embedded ECMAScript into HTML JavaScript when
deriving Widget annotations into HTML form fields. ECMAScript for PDF (see ISO 21757-1)
defines the set of static and dynamic objects available to PDF.
For Embedded Files, the URL shall be the value of the UF entry from the associated file’s
file specification dictionary.
NOTE 2 This requirement ensures that resources and associated files can reliably refer to
each other, for example CSS referring to an image to be used as a background.
For URL References, the filename extension of the URL (see 4.6.2, "URL References") shall
be used in conjunction with "Table 9: Media types supported by embedded files" to
determine the media type of the associated file.
For embedded files, the media type shall be determined by the value of the Subtype key
of the embedded file stream dictionary that is the value of the EF key of the associated
file’s file specification dictionary.
"Table 9: Media types supported by embedded files" lists the known media types, their
filename extensions, what each represents and which of the following subclauses
provides more information about processing it.
If the file extension of the associated file is not one of the known extensions
corresponding to the media types specified in "Table 9: Media types supported by
embedded files", then the processor may process it or ignore it as it deems appropriate. A
processor may support additional filename extensions and/or media types beyond those
in the table.
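The two lookup rules can be sketched as below. This is a non-normative illustration: the associated-file object shape ({ kind, url } or { kind, subtype }) is hypothetical, the Subtype value is assumed to be already decoded from its PDF name form, and the extension map is an illustrative subset of well-known registrations rather than a reproduction of Table 9.

```javascript
// Illustrative subset only; a processor would use the entries of Table 9.
const KNOWN_MEDIA_TYPES = new Map([
  ["html", "text/html"],
  ["css", "text/css"],
  ["js", "text/javascript"],
  ["png", "image/png"],
  ["jpg", "image/jpeg"],
  ["svg", "image/svg+xml"],
  ["mml", "application/mathml+xml"],
]);

// Returns the media type of an associated file, or null when it is unknown.
function mediaTypeOf(af) {
  if (af.kind === "embedded") {
    // Embedded files: the Subtype of the embedded file stream decides.
    return af.subtype || null;
  }
  // URL References: use the filename extension of the last path segment,
  // ignoring any query string or fragment.
  const path = af.url.split(/[?#]/)[0];
  const lastSegment = path.slice(path.lastIndexOf("/") + 1);
  const dot = lastSegment.lastIndexOf(".");
  if (dot < 0) return null;
  const ext = lastSegment.slice(dot + 1).toLowerCase();
  return KNOWN_MEDIA_TYPES.get(ext) ?? null;
}
```

A null result corresponds to the case where the processor may process or ignore the file as it deems appropriate.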
Table 9: Media types supported by embedded files
• If the value of the AFRelationship key in the associated file’s file specification
dictionary is Alternative, then the associated file serves as a replacement and all
children of the structure element shall be ignored.
• If the value of the AFRelationship key in the associated file’s file specification
dictionary is Supplement, then the associated file serves as a supplement and, after
processing the associated file, the processor shall continue with processing the children
of the structure element.
In both cases all requirements for attribute processing (see 4.3.7, "Attributes") shall apply.
NOTE This enables an author to provide specific attributes on the output HTML elements
by having them present on the PDF structure element.
Associated files with a value other than Alternative or Supplement for the AFRelationship
key in the associated file’s file specification dictionary may be ignored; the processor shall
continue with children of the structure element.
Multiple associated files shall be processed in the order in which they are stored in the
array of the AF key.
For security reasons, processors may choose to mitigate risks by ignoring categories of
Associated Files.
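The relationship rules can be sketched as a planning step that runs before a structure element is derived. This is a non-normative illustration: the { relationship, mediaType } object shape and the ignoredMediaTypes policy set are hypothetical, and treating any Alternative entry as suppressing the children even when Supplement entries are also present is an assumption of this sketch.

```javascript
// Decides which associated files to process and whether the children of the
// structure element are still processed afterwards.
function planAssociatedFiles(afArray, ignoredMediaTypes = new Set()) {
  const toProcess = [];
  let processChildren = true;
  // Iterating the array directly preserves the order of the AF key.
  for (const af of afArray) {
    // Optional security policy: skip whole categories of associated files.
    if (ignoredMediaTypes.has(af.mediaType)) continue;
    if (af.relationship === "Alternative") {
      toProcess.push(af);
      processChildren = false; // replacement: children are ignored
    } else if (af.relationship === "Supplement") {
      toProcess.push(af); // children are processed after the file
    }
    // Any other relationship is skipped, which the text above permits.
  }
  return { toProcess, processChildren };
}
```

For example, a processor serving untrusted PDF files might pass new Set(["text/javascript"]) as the policy so that script files are never injected.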
4.6.4.2 HTML
If the associated file is a URL Reference, then the processor shall add a link element to
the head element of the HTML output, with attributes of rel (with a value of import) and href
(with a value that is the URL).
If the associated file is an Embedded File then the contents of the associated file’s
embedded file stream shall be added directly to the output HTML stream, taking the place
of the structure element that would normally have been generated.
NOTE This mechanism allows direct injection of an associated file of type HTML with
AFRelationship of Supplement into the output HTML stream. It is therefore expected that
the associated file is not a complete HTML file, but a portion that follows HTML syntax.
4.6.4.3 CSS
If the associated file is either a URL Reference or an Embedded File of type CSS, then the
processor shall add to the output HTML, immediately before the referencing HTML
element, a style element, whose contents shall consist of an @import declaration with a
value of the URL.
EXAMPLE
<style>@import url(specialtable.css);</style>
4.6.4.4 JavaScript
If the associated file is either a URL Reference or an Embedded File of type JavaScript,
then the processor may add to the output HTML, immediately after the referencing HTML
element’s closing tag, a script element with an attribute of src whose value is the URL and
no contents.
EXAMPLE
<script src="specialtable.js"> </script>
If the structure element with the associated file attached derives to script in the HTML
namespace (http://www.w3.org/1999/xhtml), then the HTML element shall be script. All
children of the structure element shall be ignored.
4.6.4.5 Images
To incorporate images into the HTML output, regardless of whether the associated file is a
URL Reference or an Embedded File, an img element shall be added to the HTML output
with a src attribute whose value is the URL.
4.6.4.6 SVG
To incorporate SVG into the HTML output, regardless of whether the associated file is a
URL Reference or an Embedded File, an img element shall be added to the HTML with an
attribute of src whose value is the URL. If the structure element has a BBox structure
attribute (of any owner or namespace), then the height and width of that BBox shall be
written out, respectively, as height and width attributes on the img element. These
height and width attributes should be determined as described in 4.4.3, "Image XObjects
and inline images".
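The markup rules of 4.6.4.3 to 4.6.4.6 can be sketched as a single dispatch function. This is a non-normative illustration: the { mediaType, url } object shape and the function names are hypothetical, the URL is assumed to have been validated already, and the width and height are taken as plain differences of the BBox coordinates, which may differ from the computation described in 4.4.3.

```javascript
// Escapes a value for use inside a double-quoted HTML attribute.
function escapeAttr(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Returns the HTML fragment for one associated file, or null if unhandled.
// bbox is an optional [llx, lly, urx, ury] array from the BBox attribute.
function markupFor(af, bbox) {
  switch (af.mediaType) {
    case "text/css":
      // Placed immediately before the referencing element. The URL is written
      // unescaped, as in the EXAMPLE, and is assumed to be trusted.
      return `<style>@import url(${af.url});</style>`;
    case "text/javascript":
      // Placed immediately after the referencing element's closing tag.
      return `<script src="${escapeAttr(af.url)}"></script>`;
    case "image/svg+xml": {
      let size = "";
      if (bbox) {
        const width = bbox[2] - bbox[0];
        const height = bbox[3] - bbox[1];
        size = ` width="${width}" height="${height}"`;
      }
      return `<img src="${escapeAttr(af.url)}"${size}>`;
    }
    default:
      if (af.mediaType.startsWith("image/")) {
        return `<img src="${escapeAttr(af.url)}">`;
      }
      return null;
  }
}
```

The caller remains responsible for placing each fragment at the position the relevant subclause requires.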
4.6.4.7 MathML
If the associated file is an Embedded File containing MathML then the contents of its
embedded file stream shall be added directly to the HTML output, taking the place of the
structure element that would normally have been output.
NOTE Since MathML is not supported by all user agents, a conforming processor may need
to take additional steps to ensure that it is presented as the author expected.
Deriving HTML from PDF files raises serious security concerns. PDF
structures may contain information that can take advantage of the derivation process and
embed malicious code into the derived HTML. One major concern is that PDF files
may contain such code, and the derivation process defined in this document does not
guarantee full control over the output HTML. In the case of a public service that allows users
to upload PDF files in order to experience them in HTML form through derivation, an attacker
could exploit the service by uploading a crafted PDF; derivation in itself does not prevent
the creation of malicious HTML.
• Embedded JavaScript could access a whole web page if the PDF is derived into a
<div>, facilitating the delivery of malicious information.
• JavaScript could access cookies.
It is therefore the responsibility of the developer to recognize security risks in each specific
implementation. When derivation is used in an enclosed environment where the developer
controls the HTML viewing system, the risk might be considered low. In cases such as
allowing users to upload arbitrary PDF files to be served as HTML to other users or systems,
the developer should apply stringent processing requirements.
It is not in the scope of this document to define precisely how PDF ECMAScript shall be
derived into JavaScript libraries for use with HTML. This annex provides guidance
and examples focusing on the most common functionality.
EXAMPLE The app object represents the application. In a desktop environment, the application
works with several open documents, available through the activeDocs property, and may require
interaction with the end user through the alert method. The desired functionality might be
different in an HTML environment: the activeDocs property could always return 1, and the
alert method could be implemented with the window.alert() or console.log() function.
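A minimal, non-normative sketch of such an app object follows. The createApp factory and its host parameter are hypothetical names introduced here; the sketch covers only the two members named in the EXAMPLE above.

```javascript
// Creates a stand-in for the PDF ECMAScript app object.
// host is any object with an alert function (for example window in a browser);
// when it is missing, messages go to the console instead.
function createApp(host) {
  return {
    // A single derived HTML page stands in for the set of open documents.
    get activeDocs() {
      return 1;
    },
    alert(message) {
      if (host && typeof host.alert === "function") {
        host.alert(String(message));
      } else {
        console.log(String(message));
      }
    },
  };
}

// In a browser this picks window.alert; elsewhere it falls back to console.log.
const app = createApp(typeof window !== "undefined" ? window : null);
```

Passing the host in, rather than referring to window directly, keeps the library usable in environments without a user interface, such as automated tests.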
Each HTML form field should have its own Field JavaScript object that mimics the source
ECMAScript object.
It is recommended to create a Field object only when the HTML form field is used or
required, and to create and maintain an array of all fields as appropriate. Fields are
identified by name as required by ISO 32000-2, 12.7.4.2 "Field Names".
EXAMPLE The following _init function is invoked when the HTML file has been loaded, by registering it as follows:
document.addEventListener("DOMContentLoaded", _init);
// Attaches the shared field_event handler (assumed to be defined elsewhere
// in the derived JavaScript library) to every input element.
function _init() {
  var elems = document.getElementsByTagName("input");
  for (var i = 0; i < elems.length; i++) {
    var e = elems[i];
    e.addEventListener("focus", field_event);
    e.addEventListener("change", field_event);
    e.addEventListener("click", field_event);
  }
}
One ECMAScript Field object may reference several widget annotations; the same
functionality shall be preserved in derivation to HTML:
• When ECMAScript changes a value, all HTML form fields with the same name shall
change their value.
• When one HTML form field is changed, the corresponding Field object is changed
together with all related HTML form fields.
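The two synchronization rules, together with on-demand creation of Field objects, can be sketched as follows. This is a non-normative illustration: the Field class, the fields registry and the function names are hypothetical, and the elements are treated as plain objects with name and value properties, so check boxes and radio buttons (which use checked) are not covered.

```javascript
// One Field object per field name; it may drive several HTML form fields.
class Field {
  constructor(name) {
    this.name = name;
    this.elements = []; // all HTML form fields sharing this name
    this._value = "";
  }
  attach(element) {
    this.elements.push(element);
  }
  get value() {
    return this._value;
  }
  set value(v) {
    // A script-side change is pushed to every related HTML form field.
    this._value = v;
    for (const el of this.elements) el.value = v;
  }
}

// Registry of fields, keyed by field name.
const fields = new Map();

// Creates the Field object only when it is first needed.
function getField(name) {
  if (!fields.has(name)) fields.set(name, new Field(name));
  return fields.get(name);
}

// To be called from a change handler: updates the Field object, which in
// turn updates all other HTML form fields with the same name.
function onElementChanged(element) {
  getField(element.name).value = element.value;
}
```

In a browser, each element would be registered with getField(element.name).attach(element) while the page is initialized, and onElementChanged would be called from the change event listener.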
The processor shall include all document-level ECMAScript methods as defined by the
JavaScript entry in the Names entry in the document catalog dictionary and ECMAScript
page-level events defined by the AA entry in the page dictionary.
When deriving the widget annotation, the processor shall expand the JavaScript library
with methods that are defined for each form field in the form field’s additional-actions
dictionary. See ISO 32000-2, Table 199: Entries in a form field’s additional-actions
dictionary.
NOTE 1 It is best practice to generate function names for each field’s method based on the field
identifier, which makes managing the invocation of functions as easy as possible.
Processors should keep all calculated fields in a separate array so that the do_calculation
method can be optimized.
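The naming practice and the separate array of calculated fields can be sketched as follows. This is a non-normative illustration: handlerName, registerCalculation and the handlers table are hypothetical names, the values argument is a plain object keyed by field name, and the trigger letter C is the calculate entry of the form field's additional-actions dictionary. Replacing characters in the field name can make two different names collide, which a real processor would need to guard against.

```javascript
// Builds a function name from a field identifier and a trigger key
// (K, F, V or C in the form field's additional-actions dictionary).
function handlerName(fieldName, trigger) {
  // Replace characters that are not valid in a JavaScript identifier.
  const safe = fieldName.replace(/[^A-Za-z0-9_$]/g, "_");
  return `field_${safe}_${trigger}`;
}

const handlers = {}; // generated functions, keyed by generated name
const calculatedFields = []; // only the fields that have a calculate action

function registerCalculation(fieldName, fn) {
  handlers[handlerName(fieldName, "C")] = fn;
  calculatedFields.push(fieldName);
}

// Recalculates only the calculated fields, in registration order,
// instead of scanning every field in the document.
function do_calculation(values) {
  for (const name of calculatedFields) {
    values[name] = handlers[handlerName(name, "C")](values);
  }
  return values;
}
```

For example, registering a calculation for a field named order.total stores it under the generated name field_order_total_C, and do_calculation then visits just that field.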
NOTE 2 An HTML form field always shows the formatted value, while the real value is preserved in
the Field object.
Bibliography
RFC 1738, Uniform Resource Locators (URL) (December, 1994) Internet Engineering Task
Force (IETF)