The Unicode Standard, Version 12.0
The Unicode Consortium
Mountain View, CA
Many of the designations used by manufacturers and sellers to distinguish their products are claimed
as trademarks. Where those designations appear in this book, and the publisher was aware of a trade-
mark claim, the designations have been printed with initial capital letters or in all capitals.
Unicode and the Unicode Logo are registered trademarks of Unicode, Inc., in the United States and
other countries.
The authors and publisher have taken care in the preparation of this specification, but make no
expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No
liability is assumed for incidental or consequential damages in connection with or arising out of the
use of the information or programs contained herein.
The Unicode Character Database and other files are provided as-is by Unicode, Inc. No claims are
made as to fitness for any particular purpose. No warranties of any kind are expressed or implied.
The recipient agrees to determine applicability of information provided.
© 2019 Unicode, Inc.
All rights reserved. This publication is protected by copyright, and permission must be obtained from
the publisher prior to any prohibited reproduction. For information regarding permissions, inquire
at http://www.unicode.org/reporting.html. For information about the Unicode terms of use, please
see http://www.unicode.org/copyright.html.
The Unicode Standard / the Unicode Consortium; edited by the Unicode Consortium. — Version
12.0.
Includes index.
ISBN 978-1-936213-22-1 (http://www.unicode.org/versions/Unicode12.0.0/)
1. Unicode (Computer character set) I. Unicode Consortium.
QA268.U545 2019
ISBN 978-1-936213-22-1
Published in Mountain View, CA
March 2019
Contents
Preface
  Why Unicode?
  What’s New?
  Organization of This Standard
  The Unicode Character Database
  Unicode Code Charts
  Unicode Standard Annexes
  Unicode Technical Standards and Unicode Technical Reports
  Updates and Errata
  Acknowledgements
1 Introduction
  1.1 Coverage
    Standards Coverage
    New Characters
  1.2 Design Goals
  1.3 Text Handling
    Characters and Glyphs
    Text Elements
2 General Structure
  2.1 Architectural Context
    Basic Text Processes
    Text Elements, Characters, and Text Processes
    Text Processes and Encoding
  2.2 Unicode Design Principles
    Universality
    Efficiency
    Characters, Not Glyphs
    Semantics
    Plain Text
    Logical Order
    Unification
    Dynamic Composition
    Stability
    Convertibility
  2.3 Compatibility Characters
    Compatibility Variants
    Compatibility Decomposable Characters
  2.4 Code Points and Characters
    Types of Code Points
Preface
This is The Unicode Standard, Version 12.0. It supersedes all earlier versions of the Unicode
Standard.
Why Unicode?
The Unicode Standard and its associated specifications provide programmers with a single
universal character encoding, extensive descriptions, and a vast amount of data about how
characters function. The specifications and data describe how to form words and break
lines; how to sort text in different languages; how to format numbers, dates, times, and
other elements appropriate to different languages; how to display languages whose written
form flows from right to left, such as Arabic and Hebrew, or whose written form splits,
combines, and reorders, such as languages of South Asia. These specifications include
descriptions of how to deal with security concerns regarding the many “look-alike” charac-
ters from alphabets around the world. Without the properties and algorithms in the Uni-
code Standard and its associated specifications, interoperability between different
implementations would be impossible, and much of the vast breadth of the world’s lan-
guages would lie outside the reach of modern software.
What’s New?
Unicode Version 12.0 adds 554 characters, for a total of 137,928 characters. Significant
updates include four new scripts, additions to support lesser-used languages and scholarly
work, and important symbol additions.
Support for Languages and Symbol Sets. The following four new scripts were added in
Version 12.0:
• Elymaic, used to write historic Achaemenid Aramaic in the southwestern por-
tion of modern-day Iran
• Nandinagari, historically used to write Sanskrit and Kannada in southern India
• Nyiakeng Puachue Hmong, used to write modern White Hmong and Green
Hmong languages in Laos, Thailand, Vietnam, France, Australia, and the
United States
• Wancho, used to write the modern Wancho language in India, Myanmar, and
Bhutan
Additional support for lesser-used languages and scholarly work was extended worldwide,
including:
• Miao script additions to write several Miao and Yi dialects in China
• Hiragana and Katakana small letters, used to write archaic Japanese
Acknowledgements
The Unicode Standard, Version 12.0 is the result of the dedication and contributions of
many people over several years. We would like to acknowledge the individuals whose con-
tributions were central to the design, authorship, and review of this standard. A complete
listing of acknowledgements can be found at:
http://www.unicode.org/acknowledgements/
Current editorial contributors can be found at:
http://www.unicode.org/consortium/edcom.html
Chapter 1
Introduction
The Unicode Standard is the universal character encoding standard for written characters
and text. It defines a consistent way of encoding multilingual text that enables the exchange
of text data internationally and creates the foundation for global software. As the default
encoding of HTML and XML, the Unicode Standard provides the underpinning for the
World Wide Web and the global business environments of today. Required in new Internet
protocols and implemented in all modern operating systems and computer languages such
as Java and C#, Unicode is the basis of software that must function all around the world.
With Unicode, the information technology industry has replaced proliferating character
sets with data stability, global interoperability and data interchange, simplified software,
and reduced development costs.
While taking the ASCII character set as its starting point, the Unicode Standard goes far
beyond ASCII’s limited ability to encode only the upper- and lowercase letters A through
Z. It provides the capacity to encode all characters used for the written languages of the
world—more than 1 million characters can be encoded. No escape sequence or control
code is required to specify any character in any language. The Unicode character encoding
treats alphabetic characters, ideographic characters, and symbols equivalently, which
means they can be used in any mixture and with equal facility (see Figure 1-1).
The Unicode Standard specifies a numeric value (code point) and a name for each of its
characters. In this respect, it is similar to other character encoding standards from ASCII
onward. In addition to character codes and names, other information is crucial to ensure
legible text: a character’s case, directionality, and alphabetic properties must be well
defined. The Unicode Standard defines these and other semantic values, and it includes
application data such as case mapping tables and character property tables as part of the
Unicode Character Database. Character properties define a character’s identity and behav-
ior; they ensure consistency in the processing and interchange of Unicode data. See
Section 4.1, Unicode Character Database.
Unicode characters are represented in one of three encoding forms: a 32-bit form (UTF-
32), a 16-bit form (UTF-16), and an 8-bit form (UTF-8). The 8-bit, byte-oriented form,
UTF-8, has been designed for ease of use with existing ASCII-based systems.
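As an informative illustration only (not part of the normative text), the following Python sketch encodes a single supplementary-plane character in each of the three encoding forms; the UTF-16 and UTF-32 results are shown big-endian, without a byte order mark.

    # Encode U+1F600, a character outside the BMP, in the three Unicode encoding forms.
    ch = "\U0001F600"

    utf8  = ch.encode("utf-8")      # four 8-bit code units:  f0 9f 98 80
    utf16 = ch.encode("utf-16-be")  # two 16-bit code units:  d83d de00 (a surrogate pair)
    utf32 = ch.encode("utf-32-be")  # one 32-bit code unit:   0001f600

    print(utf8.hex(), utf16.hex(), utf32.hex())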
The Unicode Standard is code-for-code identical with International Standard ISO/IEC
10646. Any implementation that is conformant to Unicode is therefore conformant to ISO/
IEC 10646.
The Unicode Standard contains 1,114,112 code points, most of which are available for
encoding of characters. The majority of the common characters used in the major lan-
guages of the world are encoded in the first 65,536 code points, also known as the Basic
Multilingual Plane (BMP). The overall capacity for more than 1 million characters is more
than sufficient for all known character encoding requirements, including full coverage of
all minority and historic scripts of the world.
1.1 Coverage
The Unicode Standard, Version 12.0, contains 137,928 characters from the world’s scripts.
These characters are more than sufficient not only for modern communication for the
world’s languages, but also to represent the classical forms of many languages. The stan-
dard includes the European alphabetic scripts, Middle Eastern right-to-left scripts, and
scripts of Asia and Africa. Many archaic and historic scripts are encoded. The Han script
includes 87,887 unified ideographic characters defined by national, international, and
industry standards of China, Japan, Korea, Taiwan, Vietnam, and Singapore. In addition,
the Unicode Standard contains many important symbol sets, including currency symbols,
punctuation marks, mathematical symbols, technical symbols, geometric shapes, dingbats,
and emoji. For overall character and code range information, see Chapter 2, General Struc-
ture.
Note, however, that the Unicode Standard does not encode idiosyncratic, personal, novel,
or private-use characters, nor does it encode logos or graphics. Graphologies unrelated to
text, such as dance notations, are likewise outside the scope of the Unicode Standard. Font
variants are explicitly not encoded. The Unicode Standard reserves 6,400 code points in
the BMP for private use, which may be used to assign codes to characters not included in
the repertoire of the Unicode Standard. Another 131,068 private-use code points are avail-
able outside the BMP, should 6,400 prove insufficient for particular applications.
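As a non-normative illustration, the private-use ranges referred to here are U+E000..U+F8FF in the BMP and the two supplementary private-use planes, U+F0000..U+FFFFD and U+100000..U+10FFFD; the short Python check below confirms the counts and the General Category value Co assigned to private-use code points.

    import unicodedata

    bmp_pua  = range(0xE000, 0xF8FF + 1)
    supp_pua = list(range(0xF0000, 0xFFFFD + 1)) + list(range(0x100000, 0x10FFFD + 1))

    print(len(bmp_pua))                        # 6400 private-use code points in the BMP
    print(len(supp_pua))                       # 131068 more in planes 15 and 16
    print(unicodedata.category(chr(0xE000)))   # Co -- the General Category for private use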
Standards Coverage
The Unicode Standard is a superset of all characters in widespread use today. It contains
the characters from major international and national standards as well as prominent
industry character sets. For example, Unicode incorporates the ISO/IEC 6937 and ISO/
IEC 8859 families of standards, the SGML standard ISO/IEC 8879, and bibliographic stan-
dards such as ISO 5426. Important national standards contained within Unicode include
ANSI Z39.64, KS X 1001, JIS X 0208, JIS X 0212, JIS X 0213, GB 2312, GB 18030, HKSCS,
and CNS 11643. Industry code pages and character sets from Adobe, Apple, Fujitsu, Hewl-
ett-Packard, IBM, Lotus, Microsoft, NEC, and Xerox are fully represented as well.
The Unicode Standard is fully conformant with the International Standard ISO/IEC
10646:2017, Information Technology—Universal Coded Character Set (UCS), known as the
Universal Character Set (UCS). For more information, see Appendix C, Relationship to
ISO/IEC 10646.
New Characters
The Unicode Standard continues to respond to new and changing industry demands by
encoding important new characters. As the universal character encoding, the Unicode
Standard also responds to scholarly needs. To preserve world cultural heritage, important
archaic scripts are encoded as consensus about the encoding is developed.
1.3 Text Handling
Text Elements
The successful encoding, processing, and interpretation of text requires appropriate defini-
tion of useful elements of text and the basic rules for interpreting text. The definition of text
elements often changes depending on the process that handles the text. For example, when
searching for a particular word or character written with the Latin script, one often wishes
to ignore differences of case. However, correct spelling within a document requires case
sensitivity.
The Unicode Standard does not define what is and is not a text element in different pro-
cesses; instead, it defines elements called encoded characters. An encoded character is rep-
resented by a number from 0 to 10FFFF₁₆, called a code point. A text element, in turn, is
represented by a sequence of one or more encoded characters.
Chapter 2
General Structure
This chapter describes the fundamental principles governing the design of the Unicode
Standard and presents an informal overview of its main features. The chapter starts by
placing the Unicode Standard in an architectural context by discussing the nature of text
representation and text processing and its bearing on character encoding decisions. Next,
the Unicode Design Principles are introduced—ten basic principles that convey the
essence of the standard. The Unicode Design Principles serve as a tutorial framework for
understanding the Unicode Standard.
The chapter then moves on to the Unicode character encoding model, introducing the
concepts of character, code point, and encoding forms, and diagramming the relationships
between them. This provides an explanation of the encoding forms UTF-8, UTF-16, and
UTF-32 and some general guidelines regarding the circumstances under which one form
would be preferable to another.
The sections on Unicode allocation then describe the overall structure of the Unicode
codespace, showing a summary of the code charts and the locations of blocks of characters
associated with different scripts or sets of symbols.
Next, the chapter discusses the issue of writing direction and introduces several special
types of characters important for understanding the Unicode Standard. In particular, the
use of combining characters, the byte order mark, and other special characters is explored
in some detail.
The section on equivalent sequences and normalization describes the issue of multiple
equivalent representations of Unicode text and explains how text can be transformed to use
a unique and preferred representation for each character sequence.
Finally, there is an informal statement of the conformance requirements for the Unicode
Standard. This informal statement, with a number of easy-to-understand examples, gives a
general sense of what conformance to the Unicode Standard means. The rigorous, formal
definition of conformance is given in the subsequent Chapter 3, Conformance.
2.1 Architectural Context
language depend upon the specific text process; a text element for spell-checking may have
different boundaries from a text element for sorting purposes. For example, in the phrase
“the quick brown fox,” the sequence “fox” is a text element for the purpose of spell-check-
ing.
In contrast, a character encoding standard provides a single set of fundamental units of
encoding, to which it uniquely assigns numerical code points. These units, called assigned
characters, are the smallest interpretable units of stored text. Text elements are then repre-
sented by a sequence of one or more characters.
Figure 2-1 illustrates the relationship between several different types of text elements and
the characters used to represent those text elements.
[Figure 2-1 shows examples of text elements, such as a syllable and the word “cat”, together with the characters used to represent them (“c”, “a”, “t”).]
The design of the character encoding must provide precisely the set of characters that
allows programmers to design applications capable of implementing a variety of text pro-
cesses in the desired languages. Therefore, the text elements encountered in most text pro-
cesses are represented as sequences of character codes. See Unicode Standard Annex #29,
“Unicode Text Segmentation,” for detailed information on how to segment character
strings into common types of text elements. Certain text elements correspond to what users
perceive as single characters. These are called grapheme clusters.
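As a rough, non-normative sketch of the distinction: the user-perceived character “ü” may be stored as two encoded characters, yet a grapheme cluster segmentation treats the pair as one text element. The example below uses the third-party Python regex module, whose \X pattern approximates extended grapheme clusters; it stands in for, and is not, the segmentation algorithm of Unicode Standard Annex #29.

    import regex   # third-party module; its \X pattern matches extended grapheme clusters

    s = "u\u0308"                        # "u" followed by U+0308 COMBINING DIAERESIS
    print(len(s))                        # 2 -- two encoded characters (code points) ...
    print(len(regex.findall(r"\X", s)))  # 1 -- ... but a single grapheme cluster, "ü"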
right in linear order. Thus one character code inside the computer corresponds to one log-
ical character in a process such as simple English rendering.
When designing an international and multilingual text encoding such as the Unicode Stan-
dard, the relationship between the encoding and implementation of basic text processes
must be considered explicitly, for several reasons:
• Many assumptions about character rendering that hold true for the English
alphabet fail for other writing systems. Characters in these other writing sys-
tems are not necessarily rendered visible one by one in rectangles from left to
right. In many cases, character positioning is quite complex and does not pro-
ceed in a linear fashion. See Section 9.2, Arabic, and Section 12.1, Devanagari,
for detailed examples of this situation.
• It is not always obvious that one set of text characters is an optimal encoding
for a given language. For example, two approaches exist for the encoding of
accented characters commonly used in French or Swedish: ISO/IEC 8859
defines letters such as “ä” and “ö” as individual characters, whereas ISO 5426
represents them by composition with diacritics instead. In the Swedish lan-
guage, both are considered distinct letters of the alphabet, following the letter
“z”. In French, the diaeresis on a vowel merely marks it as being pronounced in
isolation. In practice, both approaches can be used to implement either lan-
guage.
• No encoding can support all basic text processes equally well. As a result, some
trade-offs are necessary. For example, following common practice, Unicode
defines separate codes for uppercase and lowercase letters. This choice causes
some text processes, such as rendering, to be carried out more easily, but other
processes, such as comparison, to become more difficult. A different encoding
design for English, such as case-shift control codes, would have the opposite
effect. In designing a new encoding scheme for complex scripts, such trade-offs
must be evaluated and decisions made explicitly.
For these reasons, design of the Unicode Standard is not specific to the design of particular
basic text-processing algorithms. Instead, it provides an encoding that can be used with a
wide variety of algorithms. In particular, sorting and string comparison algorithms cannot
assume that the assignment of Unicode character code numbers provides an alphabetical
ordering for lexicographic string comparison. Culturally expected sorting orders require
arbitrarily complex sorting algorithms. The expected sort sequence for the same characters
differs across languages; thus, in general, no single acceptable lexicographic ordering
exists. See Unicode Technical Standard #10, “Unicode Collation Algorithm,” for the stan-
dard default mechanism for comparing Unicode strings.
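The following Python sketch, offered only as an informal illustration, contrasts naive code point ordering with a locale-tailored sort; it assumes a system on which the named locale is installed, and it uses locale.strxfrm merely as a stand-in for a full implementation of the Unicode Collation Algorithm.

    import locale

    words = ["apple", "zebra", "ängel"]

    # Code point order places "ängel" last, because U+00E4 "ä" sorts above U+007A "z".
    print(sorted(words))                       # ['apple', 'zebra', 'ängel']

    # A German collation treats "ä" like "a", so "ängel" sorts first;
    # a Swedish collation would instead keep "ä" after "z", as noted above.
    locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")   # assumed to be available
    print(sorted(words, key=locale.strxfrm))   # ['ängel', 'apple', 'zebra']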
Text processes supporting many languages are often more complex than they are for
English. The character encoding design of the Unicode Standard strives to minimize this
additional complexity, enabling modern computer systems to interchange, render, and
manipulate text in a user’s own script and language—and possibly in other languages as
well.
Character Identity. Whenever Unicode makes statements about the default layout behav-
ior of characters, it is done to ensure that users and implementers face no ambiguities as to
which characters or character sequences to use for a given purpose. For bidirectional writ-
ing systems, this includes the specification of the sequence in which characters are to be
encoded so as to correspond to a specific reading order when displayed. See Section 2.10,
Writing Direction.
The actual layout in an implementation may differ in detail. A mathematical layout system,
for example, will have many additional, domain-specific rules for layout, but a well-
designed system leaves no ambiguities as to which character codes are to be used for a
given aspect of the mathematical expression being encoded.
The purpose of defining Unicode default layout behavior is not to enforce a single and spe-
cific aesthetic layout for each script, but rather to encourage uniformity in encoding. In
that way implementers of layout systems can rely on the fact that users would have chosen
a particular character sequence for a given purpose, and users can rely on the fact that
implementers will create a layout for a particular character sequence that matches the
intent of the user to within the capabilities or technical limitations of the implementation.
In other words, two users who are familiar with the standard and who are presented with
the same text ideally will choose the same sequence of character codes to encode the text. In
actual practice there are many limitations, so this goal cannot always be realized.
2.2 Unicode Design Principles
Universality
The Unicode Standard encodes a single, very large set of characters, encompassing all the
characters needed for worldwide use. This single repertoire is intended to be universal in
coverage, containing all the characters for textual representation in all modern writing sys-
tems, in most historic writing systems, and for symbols used in plain text.
The Unicode Standard is designed to meet the needs of diverse user communities within
each language, serving business, educational, liturgical and scientific users, and covering
the needs of both modern and historical texts.
Despite its aim of universality, the Unicode Standard considers the following to be outside
its scope: writing systems for which insufficient information is available to enable reliable
encoding of characters, writing systems that have not become standardized through use,
and writing systems that are nontextual in nature.
Because the universal repertoire is known and well defined in the standard, it is possible to
specify a rich set of character semantics. By relying on those character semantics, imple-
mentations can provide detailed support for complex operations on text in a portable way.
See “Semantics” later in this section.
Efficiency
The Unicode Standard is designed to make efficient implementation possible. There are no
escape characters or shift states in the Unicode character encoding model. Each character
code has the same status as any other character code; all codes are equally accessible.
All Unicode encoding forms are self-synchronizing and non-overlapping. This makes ran-
domly accessing and searching inside streams of characters efficient.
By convention, characters of a script are grouped together as far as is practical. Not only is
this practice convenient for looking up characters in the code charts, but it makes imple-
mentations more compact and compression methods more efficient. The common punc-
tuation characters are shared.
Format characters are given specific and unambiguous functions in the Unicode Standard.
This design simplifies the support of subsets. To keep implementations simple and effi-
cient, stateful controls and format characters are avoided wherever possible.
Sequences such as “fi” may be displayed with two independent glyphs or with a ligature
glyph.
What the user thinks of as a single character—which may or may not be represented by a
single glyph—may be represented in the Unicode Standard as multiple code points. See
Table 2-2 for additional examples.
For certain scripts, such as Arabic and the various Indic scripts, the number of glyphs
needed to display a given script may be significantly larger than the number of characters
encoding the basic units of that script. The number of glyphs may also depend on the
orthographic style supported by the font. For example, an Arabic font intended to support
the Nastaliq style of Arabic script may possess many thousands of glyphs. However, the
character encoding employs the same few dozen letters regardless of the font style used to
depict the character data in context.
A font and its associated rendering process define an arbitrary mapping from Unicode
characters to glyphs. Some of the glyphs in a font may be independent forms for individual
characters; others may be rendering forms that do not directly correspond to any single
character.
Text rendering requires that characters in memory be mapped to glyphs. The final appear-
ance of rendered text may depend on context (neighboring characters in the memory rep-
resentation), variations in typographic design of the fonts used, and formatting
information (point size, superscript, subscript, and so on). The results on screen or paper
can differ considerably from the prototypical shape of a letter or character, as shown in
Figure 2-3.
[Figure 2-3 depicts a text rendering process mapping a sequence of characters in memory to rendered glyphs.]
For the Latin script, this relationship between character code sequence and glyph is rela-
tively simple and well known; for several other scripts, it is documented in this standard.
However, in all cases, fine typography requires a more elaborate set of rules than given
here. The Unicode Standard documents the default relationship between character
sequences and glyphic appearance for the purpose of ensuring that the same text content
can be stored with the same, and therefore interchangeable, sequence of character codes.
Semantics
Characters have well-defined semantics. These semantics are defined by explicitly assigned
character properties, rather than implied through the character name or the position of a
character in the code tables (see Section 3.5, Properties). The Unicode Character Database
provides machine-readable character property tables for use in implementations of pars-
ing, sorting, and other algorithms requiring semantic knowledge about the code points.
These properties are supplemented by the description of script and character behavior in
this standard. See also Unicode Technical Report #23, “The Unicode Character Property
Model.”
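For illustration (non-normative), Python’s standard unicodedata module exposes a subset of these Unicode Character Database properties:

    import unicodedata

    five = "\u0665"                         # U+0665 ARABIC-INDIC DIGIT FIVE
    print(unicodedata.name(five))           # ARABIC-INDIC DIGIT FIVE
    print(unicodedata.category(five))       # Nd -- Number, decimal digit
    print(unicodedata.bidirectional(five))  # AN -- Arabic Number (a directionality property)
    print(unicodedata.decimal(five))        # 5  -- its numeric property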
The Unicode Standard identifies more than 100 different character properties, including
numeric, casing, combination, and directionality properties (see Chapter 4, Character
Properties). Additional properties may be defined as needed from time to time. Where
characters are used in different ways in different languages, the relevant properties are nor-
mally defined outside the Unicode Standard. For example, Unicode Technical Standard
#10, “Unicode Collation Algorithm,” defines a set of default collation weights that can be
used with a standard algorithm. Tailorings for each language are provided in the Unicode
Common Locale Data Repository (CLDR); see Section B.3, Other Unicode Online
Resources.
The Unicode Standard, by supplying a universal repertoire associated with well-defined
character semantics, does not require the code set independent model of internationaliza-
tion and text handling. That model abstracts away string handling as manipulation of byte
streams of unknown semantics to protect implementations from the details of hundreds of
different character encodings and selectively late-binds locale-specific character properties
to characters. Of course, it is always possible for code set independent implementations to
retain their model and to treat Unicode characters as just another character set in that con-
text. It is not at all unusual for Unix implementations to simply add UTF-8 as another char-
acter set, parallel to all the other character sets they support. By contrast, the Unicode
approach—because it is associated with a universal repertoire—assumes that characters
and their properties are inherently and inextricably associated. If an internationalized
application can be structured to work directly in terms of Unicode characters, all levels of
the implementation can reliably and efficiently access character storage and be assured of
the universal applicability of character property semantics.
Plain Text
Plain text is a pure sequence of character codes; plain Unicode-encoded text is therefore a
sequence of Unicode character codes. In contrast, styled text, also known as rich text, is any
text representation consisting of plain text plus added information such as a language iden-
tifier, font size, color, hypertext links, and so on. For example, the text of this specification,
a multi-font text as formatted by a book editing system, is rich text.
The simplicity of plain text gives it a natural role as a major structural element of rich text.
SGML, RTF, HTML, XML, and TeX are examples of rich text fully represented as plain text
streams, interspersing plain text data with sequences of characters that represent the addi-
tional data structures. They use special conventions embedded within the plain text file,
such as “<p>”, to distinguish the markup or tags from the “real” content. Many popular
word processing packages rely on a buffer of plain text to represent the content and imple-
ment links to a parallel store of formatting data.
The relative functional roles of both plain text and rich text are well established:
• Plain text is the underlying content stream to which formatting can be applied.
• Rich text carries complex formatting information as well as text context.
• Plain text is public, standardized, and universally readable.
• Rich text representation may be implementation-specific or proprietary.
Although some rich text formats have been standardized or made public, the majority of
rich text designs are vehicles for particular implementations and are not necessarily read-
able by other implementations. Given that rich text equals plain text plus added informa-
tion, the extra information in rich text can always be stripped away to reveal the “pure” text
underneath. This operation is often employed, for example, in word processing systems
that use both their own private rich text format and plain text file format as a universal, if
limited, means of exchange. Thus, by default, plain text represents the basic, interchange-
able content of text.
Plain text represents character content only, not its appearance. It can be displayed in a
variety of ways and requires a rendering process to make it visible with a particular appear-
ance. If the same plain text sequence is given to disparate rendering processes, there is no
expectation that rendered text in each instance should have the same appearance. Instead,
the disparate rendering processes are simply required to make the text legible according to
the intended reading. This legibility criterion constrains the range of possible appearances.
The relationship between appearance and content of plain text may be summarized as fol-
lows:
Plain text must contain enough information to permit the text to be rendered legibly,
and nothing more.
The Unicode Standard encodes plain text. The distinction between plain text and other
forms of data in the same data stream is the function of a higher-level protocol and is not
specified by the Unicode Standard itself.
Logical Order
The order in which Unicode text is stored in the memory representation is called logical
order. This order roughly corresponds to the order in which text is typed in via the key-
board; it also roughly corresponds to phonetic order. For decimal numbers, the logical
order consistently corresponds to the most significant digit first, which is the order
expected by number-parsing software.
When displayed, this logical order often corresponds to a simple linear progression of
characters in one direction, such as from left to right, right to left, or top to bottom. In
other circumstances, text is displayed or printed in an order that differs from a single linear
progression. Some of the clearest examples are situations where a right-to-left script (such
as Arabic or Hebrew) is mixed with a left-to-right script (such as Latin or Greek). For
example, when the text in Figure 2-4 is ordered for display, the glyph that represents the
first character of the English text appears at the left. The logical start character of the
Hebrew text, however, is represented by the Hebrew glyph closest to the right margin. The
succeeding Hebrew glyphs are laid out to the left.
In logical order, numbers are encoded with most significant digit first, but are displayed in
different writing directions. As shown in Figure 2-5, these writing directions do not always
correspond to the writing direction of the surrounding text. The first example shows N’Ko,
a right-to-left script with digits that also render right to left. Examples 2 and 3 show
Hebrew and Arabic, in which the numbers are rendered left to right, resulting in bidirec-
tional layout. In left-to-right scripts, such as Latin and Hiragana and Katakana (for Japa-
nese), numbers follow the predominant left-to-right direction of the script, as shown in
Examples 4 and 5. When Japanese is laid out vertically, numbers are either laid out verti-
cally or may be rotated clockwise ninety degrees to follow the layout direction of the lines,
as shown in Example 6.
The Unicode Standard precisely defines the conversion of Unicode text from logical order
to the order of readable (displayed) text so as to ensure consistent legibility. Properties of
directionality inherent in characters generally determine the correct display order of text.
The Unicode Bidirectional Algorithm specifies how these properties are used to resolve
directional interactions when characters of right-to-left and left-to-right directionality are
mixed. (See Unicode Standard Annex #9, “Unicode Bidirectional Algorithm.”) However,
when characters of different directionality are mixed, inherent directionality alone is occa-
sionally insufficient to render plain text legibly. The Unicode Standard therefore includes
characters to explicitly specify changes in direction when necessary. The Bidirectional
Algorithm uses these directional layout control characters together with the inherent direc-
tional properties of characters to exert exact control over the display ordering for legible
interchange. By requiring the use of this algorithm, the Unicode Standard ensures that
plain text used for simple items like file names or labels can always be correctly ordered for
display.
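The inherent directionality mentioned here is carried by character properties; the lines below (an informal illustration using Python’s unicodedata module, not an implementation of the Bidirectional Algorithm) show the property values that the algorithm consumes.

    import unicodedata

    for ch in ["A", "\u05D0", "\u0627", "1", "\u0661"]:
        print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch))
    # U+0041 L    Latin letter: strong left-to-right
    # U+05D0 R    Hebrew letter: strong right-to-left
    # U+0627 AL   Arabic letter: strong right-to-left
    # U+0031 EN   European number
    # U+0661 AN   Arabic number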
Besides mixing runs of differing overall text direction, there are many other cases where the
logical order does not correspond to a linear progression of characters. Combining charac-
ters (such as accents) are stored following the base character to which they apply, but are
positioned relative to that base character and thus do not follow a simple linear progres-
sion in the final rendered text. For example, a Latin letter with an accent below is stored as
the base letter “x” followed by the combining accent; the accent appears below, not to the
right of, the base. This position with respect to the base holds even where the overall text
progression is from top to bottom—for example, with the accented letter appearing upright
within a vertical Japanese line. Characters may also
combine into ligatures or conjuncts or otherwise change positions of their components
radically, as shown in Figure 2-3 and Figure 2-19.
There is one particular exception to the usual practice of logical order paralleling phonetic
order. With the Thai, Lao, Tai Viet, and New Tai Lue scripts, users traditionally type in
visual order rather than phonetic order, resulting in some vowel letters being stored ahead
of consonants, even though they are pronounced after them.
Unification
The Unicode Standard avoids duplicate encoding of characters by unifying them within
scripts across language. Common letters are given one code each, regardless of language, as
are common Chinese/Japanese/Korean (CJK) ideographs. (See Section 18.1, Han.)
Punctuation marks, symbols, and diacritics are handled in a similar manner as letters. If
they can be clearly identified with a particular script, they are encoded once for that script
and are unified across any languages that may use that script. See, for example, U+1362
ethiopic full stop, U+060F arabic sign misra, and U+0592 hebrew accent segol.
However, some punctuation or diacritical marks may be shared in common across a num-
ber of scripts—the obvious example being Western-style punctuation characters, which are
often recently added to the writing systems of scripts other than Latin. In such cases, char-
acters are encoded only once and are intended for use with multiple scripts. Common sym-
bols are also encoded only once and are not associated with any script in particular.
It is quite normal for many characters to have different usages, such as comma “ , ” for
either thousands-separator (English) or decimal-separator (French). The Unicode Stan-
dard avoids duplication of characters due to specific usage in different languages; rather, it
duplicates characters only to support compatibility with base standards. Avoidance of
duplicate encoding of characters is important to avoid visual ambiguity.
There are a few notable instances in the standard where visual ambiguity between different
characters is tolerated, however. For example, in most fonts there is little or no distinction
visible between Latin “o”, Cyrillic “o”, and Greek “o” (omicron). These are not unified
because they are characters from three different scripts, and many legacy character encod-
ings distinguish between them. As another example, there are three characters whose glyph
is the same uppercase barred D shape, but they correspond to three distinct lowercase
forms. Unifying these uppercase characters would have resulted in unnecessary complica-
tions for case mapping.
The Unicode Standard does not attempt to encode features such as language, font, size,
positioning, glyphs, and so forth. For example, it does not preserve language as a part of
character encoding: just as French i grec, German ypsilon, and English wye are all repre-
sented by the same character code, U+0059 “Y”, so too are Chinese zi, Japanese ji, and
Korean ja all represented as the same character code, U+5B57 字.
In determining whether to unify variant CJK ideograph forms across standards, the Uni-
code Standard follows the principles described in Section 18.1, Han. Where these principles
determine that two forms constitute a trivial difference, the Unicode Standard assigns a
single code. Just as for the Latin and other scripts, typeface distinctions or local preferences
in glyph shapes alone are not sufficient grounds for disunification of a character. Figure 2-6
illustrates the well-known example of the CJK ideograph for “bone,” which shows signifi-
cant shape differences from typeface to typeface, with some forms preferred in China and
some in Japan. All of these forms are considered to be the same character, encoded at
U+9AA8 in the Unicode Standard.
Many characters in the Unicode Standard could have been unified with existing visually
similar Unicode characters or could have been omitted in favor of some other Unicode
mechanism for maintaining the kinds of text distinctions for which they were intended.
However, considerations of interoperability with other standards and systems often
require that such compatibility characters be included in the Unicode Standard. See
Section 2.3, Compatibility Characters. In particular, whenever font style, size, positioning or
precise glyph shape carry a specific meaning and are used in distinction to the ordinary
character—for example, in phonetic or mathematical notation—the characters are not uni-
fied.
Dynamic Composition
The Unicode Standard allows for the dynamic composition of accented forms and Hangul
syllables. Combining characters used to create composite forms are productive. Because
the process of character composition is open-ended, new forms with modifying marks may
be created from a combination of base characters followed by combining characters. For
example, the diaeresis “¨” may be combined with all vowels and a number of consonants in
languages using the Latin script and several other scripts, as shown in Figure 2-7.
[Figure 2-7: U+0041 “A” followed by U+0308 combining diaeresis is rendered as “Ä”.]
Equivalent Sequences. Some text elements can be encoded either as static precomposed
forms or by dynamic composition. Common precomposed forms such as U+00DC “Ü”
latin capital letter u with diaeresis are included for compatibility with current stan-
dards. For static precomposed forms, the standard provides a mapping to an equivalent
dynamically composed sequence of characters. (See also Section 3.7, Decomposition.) Thus
different sequences of Unicode characters are considered equivalent. A precomposed char-
acter may be represented as an equivalent composed character sequence (see Section 2.12,
Equivalent Sequences).
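A minimal Python sketch of this equivalence, using the standard unicodedata module (the normative definitions are in Section 3.7, Decomposition, and Section 2.12, Equivalent Sequences):

    import unicodedata

    precomposed = "\u00DC"     # U+00DC LATIN CAPITAL LETTER U WITH DIAERESIS
    composed    = "U\u0308"    # U+0055 followed by U+0308 COMBINING DIAERESIS

    print(precomposed == composed)                                # False: different code points
    print(unicodedata.normalize("NFD", precomposed) == composed)  # True: canonically equivalent
    print(unicodedata.normalize("NFC", composed) == precomposed)  # True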
Stability
Certain aspects of the Unicode Standard must be absolutely stable between versions, so
that implementers and users can be guaranteed that text data, once encoded, retains the
same meaning. Most importantly, this means that once Unicode characters are assigned,
their code point assignments cannot be changed, nor can characters be removed.
Characters are retained in the standard, so that previously conforming data stay confor-
mant in future versions of the standard. Sometimes characters are deprecated—that is,
their use in new documents is strongly discouraged. While implementations should con-
tinue to recognize such characters when they are encountered, spell-checkers or editors
could warn users of their presence and suggest replacements. For more about deprecated
characters, see D13 in Section 3.4, Characters and Encoding.
Unicode character names are also never changed, so that they can be used as identifiers
that are valid across versions. See Section 4.8, Name.
Similar stability guarantees exist for certain important properties. For example, the decom-
positions are kept stable, so that it is possible to normalize a Unicode text once and have it
remain normalized in all future versions.
The most current versions of the character encoding stability policies for the Unicode
Standard are maintained online at:
http://www.unicode.org/policies/stability_policy.html
Convertibility
Character identity is preserved for interchange with a number of different base standards,
including national, international, and vendor standards. Where variant forms (or even the
same form) are given separate codes within one base standard, they are also kept separate
within the Unicode Standard. This choice guarantees the existence of a mapping between
the Unicode Standard and base standards.
Accurate convertibility is guaranteed between the Unicode Standard and other standards
in wide usage as of May 1993. Characters have also been added to allow convertibility to
several important East Asian character sets created after that date—for example, GB 18030.
In general, a single code point in another standard will correspond to a single code point in
the Unicode Standard. Sometimes, however, a single code point in another standard corre-
sponds to a sequence of code points in the Unicode Standard, or vice versa. Conversion
between Unicode text and text in other character codes must, in general, be done by
explicit table-mapping processes. (See also Section 5.1, Data Structures for Character Con-
version.)
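As a hedged illustration of such table-driven conversion, Python’s codec machinery maps between Unicode and legacy character sets (the codec names below are those shipped with CPython):

    text = "\u5B57"   # 字, CJK UNIFIED IDEOGRAPH-5B57

    # The same Unicode character maps to different byte sequences in different legacy standards.
    for codec in ("shift_jis", "gb18030", "utf-8"):
        encoded = text.encode(codec)
        assert encoded.decode(codec) == text   # the inverse table recovers the original character
        print(codec, encoded.hex())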
2.3 Compatibility Characters
of compatibility characters in the W3C specification, “Unicode in XML and Other Markup
Languages.”
Allocation. The Compatibility and Specials Area contains a large number of compatibility
characters, but the Unicode Standard also contains many compatibility characters that do
not appear in that area. These include examples such as U+2163 “IV” roman numeral
four, U+2007 figure space, U+00B2 “2” superscript two, U+2502 box drawings
light vertical, and U+32D0 circled katakana a.
There is no formal listing of all compatibility characters in the Unicode Standard. This fol-
lows from the nature of the definition of compatibility characters. It is a judgement call as
to whether any particular character would have been accepted for encoding if it had not
been required for interoperability with a particular standard. Different participants in
character encoding often disagree about the appropriateness of encoding particular char-
acters, and sometimes there are multiple justifications for encoding a given character.
Compatibility Variants
Compatibility variants are a subset of compatibility characters, and have the further charac-
teristic that they represent variants of existing, ordinary, Unicode characters.
For example, compatibility variants might represent various presentation or styled forms
of basic letters: superscript or subscript forms, variant glyph shapes, or vertical presenta-
tion forms. They also include halfwidth or fullwidth characters from East Asian character
encoding standards, Arabic contextual form glyphs from preexisting Arabic code pages,
Arabic ligatures and ligatures from other scripts, and so on. Compatibility variants also
include CJK compatibility ideographs, many of which are minor glyph variants of an
encoded unified CJK ideograph.
In contrast to compatibility variants there are the numerous compatibility characters, such
as U+2502 box drawings light vertical, U+263A white smiling face, or U+2701
upper blade scissors, which are not variants of ordinary Unicode characters. However, it
is not always possible to determine unequivocally whether a compatibility character is a
variant or not.
encoded; the list of compatibility decomposable characters for any version of the Unicode
Standard is thus also stable.
Compatibility decomposable characters have also been referred to in earlier versions of the
Unicode Standard as compatibility composite characters or compatibility composites for
short, but the full term, compatibility decomposable character is preferred.
Compatibility Character Vs. Compatibility Decomposable Character. In informal dis-
cussions of the Unicode Standard, compatibility decomposable characters have also often
been referred to simply as “compatibility characters.” This is understandable, in part
because the two sets of characters largely overlap, but the concepts are actually distinct.
There are compatibility characters which are not compatibility decomposable characters,
and there are compatibility decomposable characters which are not compatibility charac-
ters.
For example, the deprecated alternate format characters such as U+206C inhibit arabic
form shaping are considered compatibility characters, but they have no decomposition
mapping, and thus by definition cannot be compatibility decomposable characters. Like-
wise for such other compatibility characters as U+2502 box drawings light vertical or
U+263A white smiling face.
There are also instances of compatibility variants which clearly are variants of other Uni-
code characters, but which have no decomposition mapping. For example, U+2EAF cjk
radical silk is a compatibility variant of U+2F77 kangxi radical silk, as well as being a
compatibility variant of U+7CF9 cjk unified ideograph-7cf9, but has no compatibility
decomposition. The numerous compatibility variants like this in the CJK Radicals Supple-
ment block were encoded for compatibility with encodings that distinguished and sepa-
rately encoded various forms of CJK radicals as symbols.
A different case is illustrated by the CJK compatibility ideographs, such as U+FA0C cjk
compatibility ideograph-fa0c. Those compatibility characters have a decomposition
mapping, but for historical reasons it is always a canonical decomposition, so they are
canonical decomposable characters, but not compatibility decomposable characters.
By way of contrast, some compatibility decomposable characters, such as modifier letters
used in phonetic orthographies, for example, U+02B0 modifier letter small h, are not
considered to be compatibility characters. They would have been accepted for encoding in
the standard on their own merits, regardless of their need for mapping to IPA. A large
number of compatibility decomposable characters like this are actually distinct symbols
used in specialized notations, whether phonetic or mathematical. In such cases, their com-
patibility mappings express their historical derivation from styled forms of standard letters.
Other compatibility decomposable characters are widely used characters serving essential
functions. U+00A0 no-break space is one example. In these and similar cases, such as
fixed-width space characters, the compatibility decompositions define possible fallback
representations.
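A small, non-normative Python illustration of compatibility decomposition mappings and the fallback behavior described above:

    import unicodedata

    print(unicodedata.decomposition("\u00B2"))   # <super> 0032   -- superscript two, from "2"
    print(unicodedata.decomposition("\u02B0"))   # <super> 0068   -- modifier letter small h, from "h"
    print(unicodedata.decomposition("\u00A0"))   # <noBreak> 0020 -- no-break space falls back to space

    # Normalization Form KC applies these compatibility mappings.
    print(unicodedata.normalize("NFKC", "\u00B2\u00A0\u02B0"))    # "2 h"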
The Unicode Character Database supplies identification and mapping information only
for compatibility decomposable characters, while compatibility variants are not formally
identified or documented. Because the two sets substantially overlap, many specifications
are written in terms of compatibility decomposable characters first; if necessary, such spec-
ifications may be extended to handle other, non-decomposable compatibility variants as
required. (See also the discussion in Section 5.19, Mapping Compatibility Variants.)
2.4 Code Points and Characters
[Figure: an abstract character such as “Å” may be represented by the encoded character U+00C5, by the encoded character U+212B, or by the encoded character sequence U+0041 U+030A.]
When referring to code points in the Unicode Standard, the usual practice is to refer to
them by their numeric value expressed in hexadecimal, with a “U+” prefix. (See Appendix A,
Notational Conventions.) Encoded characters can also be referred to by their code points
only. To prevent ambiguity, the official Unicode name of the character is often added; this
clearly identifies the abstract character that is encoded. For example:
U+0061 latin small letter a
U+10330 gothic letter ahsa
U+201DF cjk unified ideograph-201df
Such citations refer only to the encoded character per se, associating the code point (as an
integral value) with the abstract character that is encoded.
Not all assigned code points represent abstract characters; only Graphic, Format, Control
and Private-use do. Surrogates and Noncharacters are assigned code points but are not
assigned to abstract characters. Reserved code points are assignable: any may be assigned
in a future version of the standard. The General Category provides a finer breakdown of
Graphic characters and also distinguishes between the other basic types (except between
Noncharacter and Reserved). Other properties defined in the Unicode Character Database
provide for different categorizations of Unicode code points.
Control Codes. Sixty-five code points (U+0000..U+001F and U+007F..U+009F) are
defined specifically as control codes, for compatibility with the C0 and C1 control codes of
the ISO/IEC 2022 framework. A few of these control codes are given specific interpreta-
tions by the Unicode Standard. (See Section 23.1, Control Codes.)
Noncharacters. Sixty-six code points are not used to encode characters. Noncharacters
consist of U+FDD0..U+FDEF and any code point ending in the value FFFE₁₆ or FFFF₁₆—
that is, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF. (See
Section 23.7, Noncharacters.)
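Because the noncharacter rule quoted above is purely numeric, it can be expressed directly; the helper below is an illustrative sketch, not an interface defined by this standard.

    def is_noncharacter(cp: int) -> bool:
        """True for the 66 noncharacters: U+FDD0..U+FDEF, plus ..FFFE and ..FFFF in every plane."""
        return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

    # 32 code points in the FDD0 block plus 2 in each of the 17 planes.
    print(sum(is_noncharacter(cp) for cp in range(0x110000)))   # 66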
Private Use. Three ranges of code points have been set aside for private use. Characters in
these areas will never be defined by the Unicode Standard. These code points can be freely
used for characters of any purpose, but successful interchange requires an agreement
between sender and receiver on their interpretation. (See Section 23.5, Private-Use Charac-
ters.)
Surrogates. Some 2,048 code points have been allocated as surrogate code points, which
are used in the UTF-16 encoding form. (See Section 23.6, Surrogates Area.)
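An informal Python illustration of how a supplementary-plane code point is expressed with surrogate code units in the UTF-16 encoding form:

    cp = 0x10330                       # U+10330 GOTHIC LETTER AHSA, outside the BMP

    units = chr(cp).encode("utf-16-be")                  # two 16-bit code units, big-endian
    lead  = int.from_bytes(units[:2], "big")
    trail = int.from_bytes(units[2:], "big")
    print(hex(lead), hex(trail))                         # 0xd800 0xdf30

    # The pair is derived by subtracting 0x10000 and splitting into two 10-bit halves.
    assert lead  == 0xD800 + ((cp - 0x10000) >> 10)
    assert trail == 0xDC00 + ((cp - 0x10000) & 0x3FF)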
Restricted Interchange. Code points that are not assigned to abstract characters are subject
to restrictions in interchange.
• Surrogate code points cannot be conformantly interchanged using Unicode
encoding forms. They do not correspond to Unicode scalar values and thus do
not have well-formed representations in any Unicode encoding form. (See
Section 3.8, Surrogates.)
• Noncharacter code points are reserved for internal use, such as for sentinel val-
ues. They have well-formed representations in Unicode encoding forms and
survive conversions between encoding forms. This allows sentinel values to be
preserved internally across Unicode encoding forms, even though they are not
designed to be used in open interchange.
• All implementations need to preserve reserved code points because they may
originate in implementations that use a future version of the Unicode Standard.
For example, suppose that one person is using a Unicode 12.0 system and a sec-
ond person is using a Unicode 11.0 system. The first person sends the second
person a document containing some code points newly assigned in Unicode
12.0; these code points were unassigned in Unicode 11.0. The second person
may edit the document, not changing the reserved codes, and send it on. In
that case the second person is interchanging what are, as far as the second per-
son knows, reserved code points.
Code Point Semantics. The semantics of most code points are established by this standard;
the exceptions are Controls, Private-use, and Noncharacters. Control codes generally have
semantics determined by other standards or protocols (such as ISO/IEC 6429), but there
are a small number of control codes for which the Unicode Standard specifies particular
semantics. See Table 23-1 in Section 23.1, Control Codes, for the exact list of those control
codes. The semantics of private-use characters are outside the scope of the Unicode Stan-
dard; their use is determined by private agreement, as, for example, between vendors. Non-
characters have semantics in internal use only.
[Figure 2-9 (overlap in a legacy mixed-width encoding): the byte pair <84 44> can be read either as the single two-byte character 0414 or as a trail byte 84 followed by the single byte 44 (“D”, 0044); likewise the pair <84 84> can be read either as the two-byte character 0442 or as a lead byte followed by a trail byte.]
The situation is made more complex by the fact that lead and trail bytes can also overlap, as
shown in the second part of Figure 2-9. This means that the backward scan has to repeat
until it hits the start of the text or hits a sequence that could not exist as a pair as shown in
Figure 2-10. This is not only inefficient, but also extremely error-prone: corruption of one
byte can cause entire lines of text to be corrupted.
[Figure 2-10: a run of ambiguous bytes (?? ... 84 84 84 84 84 84 44) whose grouping into characters (0442, 0414, 0044 “D”) cannot be determined by local inspection; the backward scan must continue until an unambiguous boundary is found.]
The Unicode encoding forms avoid this problem, because none of the ranges of values for
the lead, trail, or single code units in any of those encoding forms overlap.
Non-overlap makes all of the Unicode encoding forms well behaved for searching and
comparison. When searching for a particular character, there will never be a mismatch
against some code unit sequence that represents just part of another character. The fact
that all Unicode encoding forms observe this principle of non-overlap distinguishes them
from many legacy East Asian multibyte character encodings, for which overlap of code unit
sequences may be a significant problem for implementations.
Another aspect of non-overlap in the Unicode encoding forms is that all Unicode charac-
ters have determinate boundaries when expressed in any of the encoding forms. That is,
the edges of code unit sequences representing a character are easily determined by local
examination of code units; there is never any need to scan back indefinitely in Unicode text
to correctly determine a character boundary. This property of the encoding forms has
sometimes been referred to as self-synchronization. This property has another very import-
ant implication: corruption of a single code unit corrupts only a single character; none of
the surrounding characters are affected.
For example, when randomly accessing a string, a program can find the boundary of a
character with limited backup. In UTF-16, if a pointer points to a trailing surrogate, a sin-
gle backup is required. In UTF-8, if a pointer points to a byte starting with 10xxxxxx (in
binary), one to three backups are required to find the beginning of the character.
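A minimal sketch of that limited backup for UTF-8, written here in Python for illustration only (the function name is arbitrary and well-formed input is assumed):

    def char_start(buf: bytes, i: int) -> int:
        # Back up from an arbitrary offset i to the first byte of the UTF-8
        # code unit sequence containing it; at most three backups are needed.
        while i > 0 and (buf[i] & 0xC0) == 0x80:   # 10xxxxxx marks a trailing byte
            i -= 1
        return i

    text = "aΩ語".encode("utf-8")
    # Offset 2 falls on the second byte of Ω (CE A9); its sequence starts at offset 1.
    assert char_start(text, 2) == 1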
Conformance. The Unicode Consortium fully endorses the use of any of the three Unicode
encoding forms as a conformant way of implementing the Unicode Standard. It is import-
ant not to fall into the trap of trying to distinguish “UTF-8 versus Unicode,” for example.
UTF-8, UTF-16, and UTF-32 are all equally valid and conformant ways of implementing
the encoded characters of the Unicode Standard.
Examples. Figure 2-11 shows the three Unicode encoding forms, including how they are
related to Unicode code points.
Figure 2-11. Unicode Encoding Forms
              U+0041     U+03A9     U+8A9E      U+10384
    UTF-32    00000041   000003A9   00008A9E    00010384
    UTF-16    0041       03A9       8A9E        D800 DF84
    UTF-8     41         CE A9      E8 AA 9E    F0 90 8E 84
In Figure 2-11, the UTF-32 line shows that each example character can be expressed with one 32-bit code unit. Those code units have the same values as the code point for the character. For UTF-16, most characters can be expressed with one 16-bit code unit, whose value is the same as the code point for the character, but characters with high code point values require a pair of 16-bit surrogate code units instead. In UTF-8, a character may be expressed with one, two, three, or four bytes, and the relationship between those byte values and the code point value is more complex.
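The code unit values shown in Figure 2-11 can be reproduced with any conformant converter. The following Python sketch is illustrative only; it prints the big-endian byte serialization of each form, which for UTF-32 and UTF-16 displays the code units directly (the separator argument to bytes.hex requires Python 3.8 or later).

    for ch in ("\u0041", "\u03A9", "\u8A9E", "\U00010384"):
        print(f"U+{ord(ch):04X}:",
              ch.encode("utf-32-be").hex(" ").upper(), "|",
              ch.encode("utf-16-be").hex(" ").upper(), "|",
              ch.encode("utf-8").hex(" ").upper())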
UTF-8, UTF-16, and UTF-32 are further described in the subsections that follow. See each
subsection for a general overview of how each encoding form is structured and the general
benefits or drawbacks of each encoding form for particular purposes. For the detailed for-
mal definition of the encoding forms and conformance requirements, see Section 3.9, Uni-
code Encoding Forms.
UTF-32
UTF-32 is the simplest Unicode encoding form. Each Unicode code point is represented
directly by a single 32-bit code unit. Because of this, UTF-32 has a one-to-one relationship
between encoded character and code unit; it is a fixed-width character encoding form. This
makes UTF-32 an ideal form for APIs that pass single character values.
As for all of the Unicode encoding forms, UTF-32 is restricted to representation of code
points in the range 0..10FFFF₁₆—that is, the Unicode codespace. This guarantees interop-
erability with the UTF-16 and UTF-8 encoding forms.
Fixed Width. The value of each UTF-32 code unit corresponds exactly to the Unicode code
point value. This situation differs significantly from that for UTF-16 and especially UTF-8,
where the code unit values often change unrecognizably from the code point value. For
example, U+10000 is represented as <00010000> in UTF-32 and as <F0 90 80 80> in UTF-
8. For UTF-32, it is trivial to determine a Unicode character from its UTF-32 code unit rep-
resentation. In contrast, UTF-16 and UTF-8 representations often require doing a code
unit conversion before the character can be identified in the Unicode code charts.
Preferred Usage. UTF-32 may be a preferred encoding form where memory or disk stor-
age space for characters is not a particular concern, but where fixed-width, single code unit
access to characters is desired. UTF-32 is also a preferred encoding form for processing
characters on most Unix platforms.
UTF-16
In the UTF-16 encoding form, code points in the range U+0000..U+FFFF are represented
as a single 16-bit code unit; code points in the supplementary planes, in the range
U+10000..U+10FFFF, are represented as pairs of 16-bit code units. These pairs of special
code units are known as surrogate pairs. The values of the code units used for surrogate
pairs are completely disjunct from the code units used for the single code unit representa-
tions, thus maintaining non-overlap for all code point representations in UTF-16. For the
formal definition of surrogates, see Section 3.8, Surrogates.
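The mapping from a supplementary code point to its surrogate pair can be computed directly from the code point value. The following Python sketch is for illustration only (the function name is arbitrary); the formal definition is given in Section 3.9, Unicode Encoding Forms.

    def to_surrogate_pair(cp: int) -> tuple:
        # Split the 20-bit offset from U+10000 into two 10-bit halves.
        assert 0x10000 <= cp <= 0x10FFFF
        offset = cp - 0x10000
        lead = 0xD800 + (offset >> 10)      # high ten bits
        trail = 0xDC00 + (offset & 0x3FF)   # low ten bits
        return lead, trail

    assert to_surrogate_pair(0x10384) == (0xD800, 0xDF84)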
Optimized for BMP. UTF-16 optimizes the representation of characters in the Basic Mul-
tilingual Plane (BMP)—that is, the range U+0000..U+FFFF. For that range, which contains
the vast majority of common-use characters for all modern scripts of the world, each char-
acter requires only one 16-bit code unit, thus requiring just half the memory or storage of
the UTF-32 encoding form. For the BMP, UTF-16 can effectively be treated as if it were a
fixed-width encoding form.
Supplementary Characters and Surrogates. For supplementary characters, UTF-16
requires two 16-bit code units. The distinction between characters represented with one
versus two 16-bit code units means that formally UTF-16 is a variable-width encoding
form. That fact can create implementation difficulties if it is not carefully taken into
account; UTF-16 is somewhat more complicated to handle than UTF-32.
Preferred Usage. UTF-16 may be a preferred encoding form in many environments that
need to balance efficient access to characters with economical use of storage. It is reason-
ably compact, and all the common, heavily used characters fit into a single 16-bit code unit.
Origin. UTF-16 is the historical descendant of the earliest form of Unicode, which was
originally designed to use a fixed-width, 16-bit encoding form exclusively. The surrogates
were added to provide an encoding form for the supplementary characters at code points
past U+FFFF. The design of the surrogates made them a simple and efficient extension
mechanism that works well with older Unicode implementations and that avoids many of
the problems of other variable-width character encodings. See Section 5.4, Handling Surro-
gate Pairs in UTF-16, for more information about surrogates and their processing.
Collation. For the purpose of sorting text, binary order for data represented in the UTF-16
encoding form is not the same as code point order. This means that a slightly different
comparison implementation is needed for code point order. For more information, see
Section 5.17, Binary Order.
UTF-8
To meet the requirements of byte-oriented, ASCII-based systems, a third encoding form is
specified by the Unicode Standard: UTF-8. This variable-width encoding form preserves
ASCII transparency by making use of 8-bit code units.
Byte-Oriented. Much existing software and practice in information technology have long
depended on character data being represented as a sequence of bytes. Furthermore, many
of the protocols depend not only on ASCII values being invariant, but must make use of or
avoid special byte values that may have associated control functions. The easiest way to
adapt Unicode implementations to such a situation is to make use of an encoding form that
is already defined in terms of 8-bit code units and that represents all Unicode characters
while not disturbing or reusing any ASCII or C0 control code value. That is the function of
UTF-8.
Variable Width. UTF-8 is a variable-width encoding form, using 8-bit code units, in which
the high bits of each code unit indicate the part of the code unit sequence to which each
byte belongs. A range of 8-bit code unit values is reserved for the first, or leading, element of
a UTF-8 code unit sequence, and a completely disjunct range of 8-bit code unit values is
reserved for the subsequent, or trailing, elements of such sequences; this convention pre-
serves non-overlap for UTF-8. Table 3-6 on page 126 shows how the bits in a Unicode code
point are distributed among the bytes in the UTF-8 encoding form. See Section 3.9, Unicode
Encoding Forms, for the full, formal definition of UTF-8.
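For illustration, the bit distribution can be written out directly. The following Python sketch mirrors the general pattern of Table 3-6 for Unicode scalar values; it is a sketch only and not a substitute for the formal definition in Section 3.9.

    def utf8_bytes(cp: int) -> bytes:
        # Assumes cp is a Unicode scalar value (0..10FFFF, excluding surrogates).
        if cp <= 0x7F:                                             # 0xxxxxxx
            return bytes([cp])
        if cp <= 0x7FF:                                            # 110xxxxx 10xxxxxx
            return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
        if cp <= 0xFFFF:                                           # 1110xxxx 10xxxxxx 10xxxxxx
            return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
        return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,     # 11110xxx plus three
                      0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])    # trailing bytes

    assert utf8_bytes(0x10000) == bytes([0xF0, 0x90, 0x80, 0x80])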
ASCII Transparency. The UTF-8 encoding form maintains transparency for all of the
ASCII code points (0x00..0x7F). That means Unicode code points U+0000..U+007F are
converted to single bytes 0x00..0x7F in UTF-8 and are thus indistinguishable from ASCII
itself. Furthermore, the values 0x00..0x7F do not appear in any byte for the representation
of any other Unicode code point, so that there can be no ambiguity. Beyond the ASCII
range of Unicode, many of the non-ideographic scripts are represented by two bytes per
code point in UTF-8; all non-surrogate code points between U+0800 and U+FFFF are rep-
resented by three bytes; and supplementary code points above U+FFFF require four bytes.
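These byte counts are easy to confirm with any UTF-8 converter; a brief Python illustration:

    # One character from each range: ASCII, Latin-1 letter, Han ideograph, emoji.
    for ch in ("A", "\u00E9", "\u4E2D", "\U0001F600"):
        print(f"U+{ord(ch):04X}: {len(ch.encode('utf-8'))} byte(s)")   # 1, 2, 3, 4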
Preferred Usage. UTF-8 is typically the preferred encoding form for HTML and similar
protocols, particularly for the Internet. The ASCII transparency helps migration. UTF-8
also has the advantage that it is already inherently byte-serialized, as for most existing 8-bit
character sets; strings of UTF-8 work easily with C or other programming languages, and
many existing APIs that work for typical Asian multibyte character sets adapt to UTF-8 as
well with little or no change required.
Self-synchronizing. In environments where 8-bit character processing is required for one
reason or another, UTF-8 has the following attractive features as compared to other multi-
byte encodings:
• The first byte of a UTF-8 code unit sequence indicates the number of bytes to
follow in a multibyte sequence. This allows for very efficient forward parsing.
• It is efficient to find the start of a character when beginning from an arbitrary
location in a byte stream of UTF-8. Programs need to search at most four bytes
backward, and usually much less. It is a simple task to recognize an initial byte,
because initial bytes are constrained to a fixed range of values.
• As with the other encoding forms, there is no overlap of byte values.
variable-width nature of processing text elements. See Unicode Technical Standard #18, “Uni-
code Regular Expressions,” for an example where commonly implemented processes deal
with inherently variable-width text elements owing to user expectations of the identity of a
“character.”
UTF-8. UTF-8 is reasonably compact in terms of the number of bytes used. It is really only
at a significant size disadvantage when used for East Asian implementations such as Chi-
nese, Japanese, and Korean, which use Han ideographs or Hangul syllables requiring
three-byte code unit sequences in UTF-8. UTF-8 is also significantly less efficient in terms
of processing than the other encoding forms.
Binary Sorting. A binary sort of UTF-8 strings gives the same ordering as a binary sort of
Unicode code points. This is obviously the same order as for a binary sort of UTF-32
strings.
All three encoding forms give the same results for binary string comparisons or string sort-
ing when dealing only with BMP characters (in the range U+0000..U+FFFF). However,
when dealing with supplementary characters (in the range U+10000..U+10FFFF), UTF-16
binary order does not match Unicode code point order. This can lead to complications
when trying to interoperate with binary sorted lists—for example, between UTF-16 sys-
tems and UTF-8 or UTF-32 systems. However, for data that is sorted according to the con-
ventions of a specific language or locale rather than using binary order, data will be
ordered the same, regardless of the encoding form.
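The difference is easy to demonstrate. The following Python comparison (illustrative only) contrasts a high BMP code point with a supplementary character: the UTF-8 and UTF-32 byte sequences sort in code point order, while the UTF-16 byte sequences do not, because the lead surrogate D8 00 sorts below FF FD.

    a, b = "\uFFFD", "\U00010000"
    assert ord(a) < ord(b)                                # code point order
    assert a.encode("utf-32-be") < b.encode("utf-32-be")  # UTF-32 agrees
    assert a.encode("utf-8") < b.encode("utf-8")          # UTF-8 agrees
    assert a.encode("utf-16-be") > b.encode("utf-16-be")  # UTF-16 does not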
Encoding Scheme Versus Encoding Form. Note that some of the Unicode encoding
schemes have the same labels as the three Unicode encoding forms. This could cause con-
fusion, so it is important to keep the context clear when using these terms: character
encoding forms refer to integral data units in memory or in APIs, and byte order is irrele-
vant; character encoding schemes refer to byte-serialized data, as for streaming I/O or in
file storage, and byte order must be specified or determinable.
The Internet Assigned Numbers Authority (IANA) maintains a registry of charset names
used on the Internet. Those charset names are very close in meaning to the Unicode char-
acter encoding model’s concept of character encoding schemes, and all of the Unicode
character encoding schemes are, in fact, registered as charsets. While the two concepts are
quite close and the names used are identical, some important differences may arise in
terms of the requirements for each, particularly when it comes to handling of the byte
order mark. Exercise due caution when equating the two.
Examples. Figure 2-12 illustrates the Unicode character encoding schemes, showing how
each is derived from one of the encoding forms by serialization of bytes.
Figure 2-12. Unicode Encoding Schemes
                U+0041        U+03A9        U+8A9E        U+10384
    UTF-32BE    00 00 00 41   00 00 03 A9   00 00 8A 9E   00 01 03 84
    UTF-32LE    41 00 00 00   A9 03 00 00   9E 8A 00 00   84 03 01 00
    UTF-16BE    00 41         03 A9         8A 9E         D8 00 DF 84
    UTF-16LE    41 00         A9 03         9E 8A         00 D8 84 DF
    UTF-8       41            CE A9         E8 AA 9E      F0 90 8E 84
In Figure 2-12, the code units used to express each example character have been serialized
into sequences of bytes. This figure should be compared with Figure 2-11, which shows the
same characters before serialization into sequences of bytes. The “BE” lines show serializa-
tion in big-endian order, whereas the “LE” lines show the bytes reversed into little-endian
order. For UTF-8, the code unit is just an 8-bit byte, so that there is no distinction between
big-endian and little-endian order. UTF-32 and UTF-16 encoding schemes using the byte
order mark are not shown in Figure 2-12, to keep the basic picture regarding serialization
of bytes clearer.
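For illustration, the effect of byte order on serialized data can be observed with any converter that distinguishes the encoding schemes. In the following Python sketch, the "utf-16" codec, unlike the explicitly big-endian and little-endian codecs, prepends a byte order mark and uses the platform's byte order; the commented output assumes a little-endian platform.

    ch = "\U00010384"
    print(ch.encode("utf-16-be").hex(" "))   # d8 00 df 84
    print(ch.encode("utf-16-le").hex(" "))   # 00 d8 84 df
    print(ch.encode("utf-16").hex(" "))      # ff fe 00 d8 84 df (BOM first)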
For the detailed formal definition of the Unicode encoding schemes and conformance
requirements, see Section 3.10, Unicode Encoding Schemes. For further general discussion
about character encoding forms and character encoding schemes, both for the Unicode
Standard and as applied to other character encoding standards, see Unicode Technical
Report #17, “Unicode Character Encoding Model.” For information about charsets and
character conversion, see Unicode Technical Standard #22, “Character Mapping Markup
Language (CharMapML).”
Planes
The Unicode codespace consists of the single range of numeric values from 0 to 10FFFF₁₆,
but in practice it has proven convenient to think of the codespace as divided up into planes
of characters—each plane consisting of 64K code points. Because of these numeric conven-
tions, the Basic Multilingual Plane is occasionally referred to as Plane 0. The last four hexa-
decimal digits in each code point indicate a character’s position inside a plane. The
remaining digits indicate the plane. For example, U+23456 cjk unified ideograph-23456
is found at location 3456₁₆ in Plane 2.
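This split is simple arithmetic on the code point value; a minimal Python illustration (the helper name is arbitrary):

    def plane_and_offset(cp: int) -> tuple:
        # The plane is everything above the last four hexadecimal digits.
        return cp >> 16, cp & 0xFFFF

    assert plane_and_offset(0x23456) == (0x2, 0x3456)   # Plane 2, position 3456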
Basic Multilingual Plane. The Basic Multilingual Plane (BMP, or Plane 0) contains the
common-use characters for all the modern scripts of the world as well as many historical
and rare characters. By far the majority of all Unicode characters for almost all textual data
can be found in the BMP.
Supplementary Multilingual Plane. The Supplementary Multilingual Plane (SMP, or
Plane 1) is dedicated to the encoding of characters for scripts or symbols which either
could not be fit into the BMP or see very infrequent usage. This includes many historic
scripts, a number of lesser-used contemporary scripts, special-purpose invented scripts,
notational systems or large pictographic symbol sets, and occasionally historic extensions
of scripts whose core sets are encoded on the BMP.
Examples include Gothic (historic), Shavian (special-purpose invented), Musical Symbols
(notational system), Domino Tiles (pictographic), and Ancient Greek Numbers (historic
extension for Greek). A number of scripts, whether of historic or contemporary use, do
not yet have their characters encoded in the Unicode Standard. The majority of scripts cur-
rently identified for encoding will eventually be allocated in the SMP. As a result, some
areas of the SMP will experience common, frequent usage.
Supplementary Ideographic Plane. The Supplementary Ideographic Plane (SIP, or Plane
2) is intended as an additional allocation area for those CJK characters that could not be fit
in the blocks set aside for more common CJK characters in the BMP. While there are a
small number of common-use CJK characters in the SIP (for example, for Cantonese
usage), the vast majority of Plane 2 characters are extremely rare or of historical interest
only.
Supplementary Special-purpose Plane. The Supplementary Special-purpose Plane (SSP,
or Plane 14) is the spillover allocation area for format control characters that do not fit into
the small allocation areas for format control characters in the BMP.
Private Use Planes. The two Private Use Planes (Planes 15 and 16) are allocated, in their
entirety, for private use. Those two planes contain a total of 131,068 characters to supple-
ment the 6,400 private-use characters located in the BMP.
• Block definitions are not at all exclusive. For instance, many mathematical
operator characters are not encoded in the Mathematical Operators block—
and are not even in any block containing “Mathematical” in its name; many
currency symbols are not found in the Currency Symbols block, and so on.
For reliable specification of the properties of characters, one should instead turn to the
detailed, character-by-character property assignments available in the Unicode Character
Database. See also Chapter 4, Character Properties. For further discussion of the relation-
ship between the blocks in the Unicode standard and significant property assignments and
sets of characters, see Unicode Standard Annex #24, “Unicode Script Property,” and Uni-
code Technical Standard #18, “Unicode Regular Expressions.”
Allocation Order. The allocation order of various scripts and other groups of characters
reflects the historical evolution of the Unicode Standard. While there is a certain geo-
graphic sense to the ordering of the allocation areas for the scripts, this is only a very loose
correlation.
Roadmap for Future Allocation. The unassigned ranges in the Unicode codespace will be
filled with future script or symbol encodings on a space-available basis. The relevant char-
acter encoding committees follow an organized roadmap to help them decide where to
encode new scripts and other characters within the available space. Until the characters are
actually standardized, however, there are no absolute guarantees where future allocations
will occur. In general, implementations should not make assumptions about where future
scripts or other sets of symbols may be encoded based solely on the identity of neighboring
blocks of characters already encoded.
See Section B.3, Other Unicode Online Resources, for information about the roadmap and
about the pipeline of approved characters in process for future publication.
[Figure (allocation overview) legend: Graphic; Format or Control; Private Use; Reserved]
Plane 0 (BMP)
Figure 2-14 shows the Basic Multilingual Plane (BMP) in an expanded format to illustrate
the allocation substructure of that plane in more detail. This section describes each alloca-
tion area, in the order of their location on the BMP.
[Figure 2-14 (allocation on the BMP, partial): 2000–2BFF Punctuation and Symbols Area; A000–ABFF General Scripts Area (Asia & Africa); F900–FFFF Compatibility and Specials Area]
ASCII and Latin-1 Compatibility Area. For compatibility with the ASCII and ISO 8859-1 (Latin-1) standards, this area contains the same repertoire and ordering as Latin-1. Accord-
ingly, it contains the basic Latin alphabet, European digits, and then the same collection of
miscellaneous punctuation, symbols, and additional Latin letters as are found in Latin-1.
General Scripts Area. The General Scripts Area contains a large number of modern-use
scripts of the world, including Latin, Greek, Cyrillic, Arabic, and so on. Most of the charac-
ters encoded in this area are graphic characters. A subrange of the General Scripts Area is
set aside for right-to-left scripts, including Hebrew, Arabic, Thaana, and N’Ko.
Punctuation and Symbols Area. This area is devoted mostly to all kinds of symbols,
including many characters for use in mathematical notation. It also contains general punc-
tuation, as well as most of the important format control characters.
Supplementary General Scripts Area. This area contains scripts or extensions to scripts
that did not fit in the General Scripts Area itself. It contains the Glagolitic, Coptic, and Tifi-
nagh scripts, plus extensions for the Latin, Cyrillic, Georgian, and Ethiopic scripts.
CJK Miscellaneous Area. The CJK Miscellaneous Area contains some East Asian scripts,
such as Hiragana and Katakana for Japanese, punctuation typically used with East Asian
scripts, lists of CJK radical symbols, and a large number of East Asian compatibility charac-
ters.
CJKV Ideographs Area. This area contains almost all the unified Han ideographs in the
BMP. It is subdivided into a block for the Unified Repertoire and Ordering (the initial
block of 20,902 unified Han ideographs plus a small number of later additions) and
another block containing Extension A (an additional 6,582 unified Han ideographs).
General Scripts Area (Asia and Africa). This area contains numerous blocks for addi-
tional scripts of Asia and Africa, such as Yi, Cham, Vai, and Bamum. It also contains more
spillover blocks with additional characters for the Latin, Devanagari, Myanmar, and Han-
gul scripts.
Hangul Area. This area consists of one large block containing 11,172 precomposed Han-
gul syllables, and one small block with additional, historic Hangul jamo extensions.
Surrogates Area. The Surrogates Area contains only surrogate code points and no encoded
characters. See Section 23.6, Surrogates Area, for more details.
Private Use Area. The Private Use Area in the BMP contains 6,400 private-use characters.
Compatibility and Specials Area. This area contains many compatibility variants of char-
acters from widely used corporate and national standards that have other representations
in the Unicode Standard. For example, it contains Arabic presentation forms, whereas the
basic characters for the Arabic script are located in the General Scripts Area. The Compat-
ibility and Specials Area also contains twelve CJK unified ideographs, a few important for-
mat control characters, the basic variation selectors, and other special characters. See
Section 23.8, Specials, for more details.
Plane 1 (SMP)
Figure 2-15 shows Plane 1, the Supplementary Multilingual Plane (SMP), in expanded for-
mat to illustrate the allocation substructure of that plane in more detail.
[Figure 2-15 (allocation on Plane 1, partial): General Scripts Areas, Symbols Areas, and a General Scripts Area (RTL), with area boundaries at 1 6000, 1 7000, 1 BC00, 1 D000, 1 E800, and 1 F000, ending at 1 FFFF]
General Scripts Areas. These areas contain a large number of historic scripts, as well as a
few regional scripts which have some current use. The first of these areas also contains a
small number of symbols and numbers associated with ancient scripts.
General Scripts Areas (RTL). There are two subranges in the SMP which are set aside for
historic right-to-left scripts, such as Phoenician, Kharoshthi, and Avestan. The second of
these also defaults to Bidi_Class = R and is reserved for the encoding of other historic
right-to-left scripts or symbols.
Cuneiform and Hieroglyphic Area. This area contains three large, ancient scripts: Sum-
ero-Akkadian Cuneiform, Egyptian Hieroglyphs, and Anatolian Hieroglyphs. Other large
hieroglyphic and pictographic scripts will be allocated in this area in the future.
Ideographic Scripts Area. This area is set aside for large, historic siniform (but non-Han)
logosyllabic scripts such as Tangut, Jurchen, and Khitan, and other East Asian logosyllabic
scripts such as Naxi. As of Unicode 12.0, this area contains a large set of Tangut ideographs
and components, the Nüshu script, and a large set of hentaigana (historic, variant form
kana) characters.
Symbols Areas. The first of these SMP Symbols Areas contains sets of symbols for nota-
tional systems, such as musical symbols, shorthands, and mathematical alphanumeric
symbols. The second contains various game symbols, and large sets of miscellaneous sym-
bols and pictographs, mostly used in compatibility mapping of East Asian character sets.
Notable among these are emoji and emoticons.
Plane 2 (SIP)
Plane 2, the Supplementary Ideographic Plane (SIP), consists primarily of one big area,
starting from the first code point in the plane, that is dedicated to encoding additional uni-
fied CJK characters. A much smaller area, toward the end of the plane, is dedicated to addi-
tional CJK compatibility ideographic characters—which are basically just duplicated
character encodings required for round-trip conversion to various existing legacy East
Asian character sets. The CJK compatibility ideographic characters in Plane 2 are currently
all dedicated to round-trip conversion for the CNS standard and are intended to supple-
ment the CJK compatibility ideographic characters in the BMP, a smaller number of char-
acters dedicated to round-trip conversion for various Korean, Chinese, and Japanese
standards.
Other Planes
The first 4,096 code positions on Plane 14 form an area set aside for special characters that
have the Default_Ignorable_Code_Point property. A small number of tag characters, plus
some supplementary variation selection characters, have been allocated there. All remain-
ing code positions on Plane 14 are reserved for future allocation of other special-purpose
characters.
Plane 15 and Plane 16 are allocated, in their entirety, for private use. Those two planes con-
tain a total of 131,068 characters, to supplement the 6,400 private-use characters located in
the BMP.
All other planes are reserved; there are no characters assigned in them. The last two code
positions of all planes are permanently set aside as noncharacters. (See Section 2.13, Special
Characters).
Bidirectional. In most Semitic scripts such as Hebrew and Arabic, characters are arranged
from right to left into lines, although digits run the other way, making the scripts inher-
ently bidirectional, as shown in the second example in Figure 2-16. In addition, left-to-right
and right-to-left scripts are frequently used together. In all such cases, arranging characters
into lines becomes more complex. The Unicode Standard defines an algorithm to deter-
mine the layout of a line, based on the inherent directionality of each character, and sup-
plemented by a small set of directional controls. See Unicode Standard Annex #9, “Unicode
Bidirectional Algorithm,” for more information.
Vertical. East Asian scripts are frequently written in vertical lines in which characters are
arranged from top to bottom. Lines are arranged from right to left, as shown in the third
example in Figure 2-16. Such scripts may also be written horizontally, from left to right.
Most East Asian characters have the same shape and orientation when displayed horizon-
tally or vertically, but many punctuation characters change their shape when displayed ver-
tically. In a vertical context, letters and words from other scripts are generally rotated
through 90-degree angles so that they, too, read from top to bottom. Unicode Technical
Report #50, “Unicode Vertical Text Layout,” defines a character property which is useful in
determining the correct orientation of characters when laid out vertically in text.
In contrast to the bidirectional case, the choice to lay out text either vertically or horizon-
tally is treated as a formatting style. Therefore, the Unicode Standard does not provide
directionality controls to specify that choice.
Mongolian is usually written from top to bottom, with lines arranged from left to right, as
shown in the fourth example. When Mongolian is written horizontally, the characters are
rotated.
Boustrophedon. Early Greek used a system called boustrophedon (literally, “ox-turning”).
In boustrophedon writing, characters are arranged into horizontal lines, but the individual
lines alternate between right to left and left to right, the way an ox goes back and forth
when plowing a field, as shown in the fifth example. The letter images are mirrored in
accordance with the direction of each individual line.
Other Historical Directionalities. Other script directionalities are found in historical writ-
ing systems. For example, some ancient Numidian texts are written from bottom to top,
and Egyptian hieroglyphics can be written with varying directions for individual lines.
The historical directionalities are of interest almost exclusively to scholars intent on repro-
ducing the exact visual content of ancient texts. The Unicode Standard does not provide
direct support for them. Fixed texts can, however, be written in boustrophedon or in other
directional conventions by using hard line breaks and directionality overrides or the equiv-
alent markup.
[Figure (enclosing combining marks): U+2621 + U+20DF; U+2615 + U+20E0; U+062D + U+20DD — in each case the base character is rendered inside the enclosing mark]
Script-Specific Combining Characters. Some scripts, such as Hebrew, Arabic, and the
scripts of India and Southeast Asia, have both spacing and nonspacing combining charac-
ters specific to those scripts. Many of these combining characters encode vowel letters. As
such, they are not generally referred to as diacritics, but may have script-specific terminol-
ogy such as harakat (Arabic) or matra (Devanagari). See Section 7.9, Combining Marks.
[Figure (ordering of combining marks): a + ¨ + u (0061 0308 0075) renders as äu, not aü; 092B + 093F renders with the vowel sign displayed to the left of the consonant]
Properties. A sequence of a base character plus one or more combining characters gener-
ally has the same properties as the base character. For example, “A” followed by “ˆ” has the
same properties as “Â”. For this reason, most Unicode algorithms ensure that such
sequences behave the same way as the corresponding base character. However, when the
combining character is an enclosing combining mark—in other words, when its General_-
Category value is Me—the resulting sequence has the appearance of a symbol. In
Figure 2-20, enclosing the exclamation mark with U+20E4 combining enclosing
upward pointing triangle produces a sequence that looks like U+26A0 warning sign.
[Figure 2-20: U+0021 + U+20E4 → a sequence whose appearance resembles U+26A0 warning sign]
Because the properties of U+0021 exclamation mark are that of a punctuation character,
they are different from those of U+26A0 warning sign. For example, the two will behave
differently for line breaking. To avoid unexpected results, it is best to limit the use of com-
bining enclosing marks to characters that encode symbols. For that reason, the warning
sign is separately encoded as a miscellaneous symbol in the Unicode Standard and does not
have a decomposition.
[Figure 2-21 (stacking sequences): a + multiple nonspacing marks (0061 0308 0303 0323 032D); Thai consonant + vowel + tone mark (0E02 0E36 0E49)]
Another example of multiple combining characters above the base character can be found
in Thai, where a consonant letter can have above it one of the vowels U+0E34 through
U+0E37 and, above that, one of four tone marks U+0E48 through U+0E4B. The order of
character codes that produces this graphic display is base consonant character + vowel char-
acter + tone mark character, as shown in Figure 2-21.
Many combining characters have specific typographical traditions that provide detailed
rules for the expected rendering. These rules override the default stacking behavior. For
example, certain combinations of combining marks are sometimes positioned horizontally
rather than stacking or by ligature with an adjacent nonspacing mark (see Table 2-6).
When positioned horizontally, the order of codes is reflected by positioning in the predom-
inant direction of the script with which the codes are used. For example, in a left-to-right
script, horizontal accents would be coded from left to right. In Table 2-6, the top example is
correct and the bottom example is incorrect.
Such override behavior is associated with specific scripts or alphabets. For example, when
used with the Greek script, the “breathing marks” U+0313 combining comma above
(psili) and U+0314 combining reversed comma above (dasia) require that, when used
together with a following acute or grave accent, they be rendered side-by-side rather than
the accent marks being stacked above the breathing marks. The order of codes here is base
character code + breathing mark code + accent mark code. This example demonstrates the
script-dependent or writing-system-dependent nature of rendering combining diacritical
marks.
[Figure (ligated base characters with combining marks): f + tilde + i + dot below (0066 0303 0069 0323) rendered as an f-i ligature with a tilde above and a dot below]
Ligated base characters with multiple combining marks do not commonly occur in most
scripts. However, in some scripts, such as Arabic, this situation occurs quite often when
vowel marks are used. It arises because of the large number of ligatures in Arabic, where
each element of a ligature is a consonant, which in turn can have a vowel mark attached to
it. Ligatures can even occur with three or more characters merging; vowel marks may be
attached to each part.
pendent way. This core concept is known as a grapheme cluster, and it consists of any com-
bining character sequence that contains only nonspacing combining marks or any
sequence of characters that constitutes a Hangul syllable (possibly followed by one or more
nonspacing marks). An implementation operating on such a cluster would almost never
want to break between its elements for rendering, editing, or other such text processes; the
grapheme cluster is treated as a single unit. Unicode Standard Annex #29, “Unicode Text
Segmentation,” provides a complete formal definition of a grapheme cluster and discusses
its application in the context of editing and other text processes. Implementations also may
tailor the definition of a grapheme cluster, so that under limited circumstances, particular
to one written language or another, the grapheme cluster may more closely pertain to what
end users think of as “characters” for that language.
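Grapheme cluster segmentation is available in many libraries. As an illustration only, the following Python sketch assumes the widely used third-party regex module, whose \X pattern matches an extended grapheme cluster as defined in Unicode Standard Annex #29.

    import regex   # third-party module; not part of the Python standard library

    s = "A\u0308u"                         # A + combining diaeresis + u
    clusters = regex.findall(r"\X", s)
    assert clusters == ["A\u0308", "u"]    # two user-perceived characters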
[Equivalent sequences (≡ canonical, ≈ compatibility):
    B + Ä   ≡   B + A + ¨        0042 00C4    ≡    0042 0041 0308
    Ǉ + A   ≈   L + J + A        01C7 0041    ≈    004C 004A 0041
    2 + ¼   ≈   2 + 1 + ⁄ + 4    0032 00BC    ≈    0032 0031 2044 0034]
Normalization
Where a unique representation is required, a normalized form of Unicode text can be used
to eliminate unwanted distinctions. The Unicode Standard defines four normalization
forms: Normalization Form D (NFD), Normalization Form KD (NFKD), Normalization
Form C (NFC), and Normalization Form KC (NFKC). Roughly speaking, NFD and
NFKD decompose characters where possible, while NFC and NFKC compose characters
where possible. For more information, see Unicode Standard Annex #15, “Unicode Nor-
malization Forms,” and Section 3.11, Normalization Forms.
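The four normalization forms are exposed by many libraries; for illustration only, the following Python sketch uses the standard library's unicodedata module.

    import unicodedata

    s = "\u00C5"                                   # Å, precomposed
    nfd = unicodedata.normalize("NFD", s)
    assert nfd == "\u0041\u030A"                   # decomposed: A + combining ring above
    assert unicodedata.normalize("NFC", nfd) == s  # recomposed
    # The K forms additionally apply compatibility mappings:
    assert unicodedata.normalize("NFKD", "\u2460") == "1"   # circled digit one → 1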
A key part of normalization is to provide a unique canonical order for visually nondistinct
sequences of combining characters. Figure 2-24 shows the effect of canonical ordering for
multiple combining marks applied to the same base character.
non-interacting
    A + ´ + ˛   ≡   A + ˛ + ´
    0041 0301 0328            0041 0328 0301
    ccc=0 ccc=230 ccc=202     ccc=0 ccc=202 ccc=230
interacting
    A + ´ + ¨   ≠   A + ¨ + ´
    0041 0301 0308            0041 0308 0301
    ccc=0 ccc=230 ccc=230     ccc=0 ccc=230 ccc=230
In the first row of Figure 2-24, the two sequences are visually nondistinct and, therefore,
equivalent. The sequence on the right has been put into canonical order by reordering in
ascending order of the Canonical_Combining_Class (ccc) values. The ccc values are
shown below each character. The second row of Figure 2-24 shows an example where com-
bining marks interact typographically—the two sequences have different stacking order,
and the order of combining marks is significant. Because the two combining marks have
been given the same combining class, their ordering is retained under canonical reorder-
ing. Thus the two sequences in the second row are not equivalent.
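The reordering in the first row can be observed directly. For illustration only, the following Python sketch uses the standard library's unicodedata module to show the Canonical_Combining_Class values and the effect of putting the sequence into Normalization Form D.

    import unicodedata

    s = "\u0041\u0301\u0328"          # A + acute (ccc=230) + ogonek (ccc=202)
    assert [unicodedata.combining(c) for c in s] == [0, 230, 202]
    # Canonical ordering places the lower combining class first.
    assert unicodedata.normalize("NFD", s) == "\u0041\u0328\u0301"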
Decompositions
Precomposed characters are formally known as decomposables, because they have decom-
positions to one or more other characters. There are two types of decompositions:
• Canonical. The character and its decomposition should be treated as essentially
equivalent.
• Compatibility. The decomposition may remove some information (typically
formatting information) that is important to preserve in particular contexts.
Types of Decomposables. Conceptually, a decomposition implies reducing a character to
an equivalent sequence of constituent parts, such as mapping an accented character to a
base character followed by a combining accent. The vast majority of nontrivial decomposi-
tions are indeed a mapping from a character code to a character sequence. However, in a
small number of exceptional cases, there is a mapping from one character to another char-
acter, such as the mapping from ohm to capital omega. Finally, there are the “trivial”
decompositions, which are simply a mapping of a character to itself. They are really an
indication that a character cannot be decomposed, but are defined so that all characters
formally have a decomposition. The definition of decomposable is written to encompass
only the nontrivial types of decompositions; therefore these characters are considered non-
decomposable.
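The decomposition types described here can be inspected through the Decomposition_Mapping property; a brief illustration using Python's standard unicodedata module (compatibility mappings are reported with a tag in angle brackets):

    import unicodedata

    assert unicodedata.decomposition("\u2126") == "03A9"                # singleton (ohm → omega)
    assert unicodedata.decomposition("\u00C1") == "0041 0301"           # canonical
    assert unicodedata.decomposition("\u3384") == "<square> 006B 0041"  # compatibility
    assert unicodedata.decomposition("a") == ""                         # nondecomposable (trivial)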
[Figure (types of decomposables):
    Nondecomposables: a (0061), ø (00F8), Đ (0110), 0681
    Singleton decompositions: 2126 → 03A9; FF76 → 30AB
    Canonical decomposition: Á (00C1) → 0041 0301
    Compatibility decomposition: 3384 → 006B 0041 (“kA”)
    03D3 → 03D2 0301 (canonical), which in turn → 03A5 0301 (compatibility)]
Conformance to the Unicode Standard does not require the use of the BOM as such a sig-
nature. See Section 23.8, Specials, for more information on the byte order mark and its use
as an encoding signature.
Control Codes
In addition to the special characters defined in the Unicode Standard for a number of pur-
poses, the standard incorporates the legacy control codes for compatibility with the ISO/
IEC 2022 framework, ASCII, and the various protocols that make use of control codes.
Rather than simply being defined as byte values, however, the legacy control codes are
assigned to Unicode code points: U+0000..U+001F, U+007F..U+009F. Those code points
for control codes must be represented consistently with the various Unicode encoding
forms when they are used with other Unicode characters. For more information on control
codes, see Section 23.1, Control Codes.
shaped according to the relevant block descriptions. (More sophisticated shaping can be
used if available.)
Unacceptable Behavior
It is unacceptable for a conforming implementation:
To use unassigned codes.
• U+2073 is unassigned and not usable for ‘3’ (superscript 3) or any other charac-
ter.
To corrupt unsupported characters.
• U+03A1 “P” greek capital letter rho should not be changed to U+00A1
(first byte dropped), U+0050 (mapped to Latin letter P), U+A103 (bytes
reversed), or anything other than U+03A1.
To remove or alter uninterpreted code points in text that purports to be unmodified.
• U+2029 is paragraph separator and should not be dropped by applications
that do not support it.
Acceptable Behavior
It is acceptable for a conforming implementation:
To support only a subset of the Unicode characters.
• An application might not provide mathematical symbols or the Thai script, for
example.
To transform data knowingly.
• Uppercase conversion: ‘a’ transformed to ‘A’
• Romaji to kana: ‘kyo’ transformed to きょ
• Decomposition: U+247D ‘(10)’ decomposed to <U+0028, U+0031, U+0030,
U+0029>
To build higher-level protocols on the character set.
• Examples are defining a file format for compression of characters or for use
with rich text.
To define private-use characters.
• Examples of characters that might be defined for private use include additional
ideographic characters (gaiji) or existing corporate logo characters.
To not support the Bidirectional Algorithm or character shaping in implementations that
do not support complex scripts, such as Arabic and Devanagari.
Supported Subsets
The Unicode Standard does not require that an application be capable of interpreting and
rendering all Unicode characters so as to be conformant. Many systems will have fonts for
only some scripts, but not for others; sorting and other text-processing rules may be imple-
mented for only a limited set of languages. As a result, an implementation is able to inter-
pret a subset of characters.
The Unicode Standard provides no formalized method for identifying an implemented
subset. Furthermore, such a subset is typically different for different aspects of an imple-
mentation. For example, an application may be able to read, write, and store any Unicode
character and to sort one subset according to the rules of one or more languages (and the
rest arbitrarily), but have access to fonts for only a single script. The same implementation
may be able to render additional scripts as soon as additional fonts are installed in its envi-
ronment. Therefore, the subset of interpretable characters is typically not a static concept.
Chapter 3
Conformance
This chapter defines conformance to the Unicode Standard in terms of the principles and
encoding architecture it embodies. The first section defines the format for referencing the
Unicode Standard and Unicode properties. The second section consists of the confor-
mance clauses, followed by sections that define more precisely the technical terms used in
those clauses. The remaining sections contain the formal algorithms that are part of con-
formance and referenced by the conformance clause. Additional definitions and algo-
rithms that are part of this standard can be found in the Unicode Standard Annexes listed
at the end of Section 3.2, Conformance Requirements.
In this chapter, conformance clauses are identified with the letter C. Definitions are identi-
fied with the letter D. Bulleted items are explanatory comments regarding definitions or
subclauses.
For information on implementing best practices, see Chapter 5, Implementation Guide-
lines.
Stability
Each version of the Unicode Standard, once published, is absolutely stable and will never
change. Implementations or specifications that refer to a specific version of the Unicode
Standard can rely upon this stability. When implementations or specifications are
upgraded to a future version of the Unicode Standard, then changes to them may be neces-
sary. Note that even errata and corrigenda do not formally change the text of a published
version; see “Errata and Corrigenda” later in this section.
Some features of the Unicode Standard are guaranteed to be stable across versions. These
include the names and code positions of characters, their decompositions, and several
other character properties for which stability is important to implementations. See also
“Stability of Properties” in Section 3.5, Properties. The formal statement of such stability
guarantees is contained in the policies on character encoding stability found on the Uni-
code website. See the subsection “Policies” in Section B.3, Other Unicode Online Resources.
See the discussion of backward compatibility in Section 2.5 of Unicode Standard Annex
#31, “Unicode Identifier and Pattern Syntax,” and the subsection “Interacting with Down-
level Systems” in Section 5.3, Unknown and Missing Characters.
Version Numbering
Version numbers for the Unicode Standard consist of three fields, denoting the major ver-
sion, the minor version, and the update version, respectively. For example, “Unicode 5.2.0”
indicates major version 5 of the Unicode Standard, minor version 2 of Unicode 5, and
update version 0 of minor version Unicode 5.2.
To simplify implementations of Unicode version numbering, the version fields are limited
to values which can be stored in a single byte. The major version is a positive integer con-
strained to the range 1..255. The minor and update versions are non-negative integers con-
strained to the range 0..255.
Additional information on the current and past versions of the Unicode Standard can be
found on the Unicode website. See the subsection “Versions” in Section B.3, Other Unicode
Online Resources. The online document contains the precise list of contributing files from
the Unicode Character Database and the Unicode Standard Annexes, which are formally
part of each version of the Unicode Standard.
Major and Minor Versions. Major and minor versions have significant additions to the
standard, including, but not limited to, additions to the repertoire of encoded characters.
Both are published as an updated core specification, together with associated updates to
the code charts, the Unicode Standard Annexes and the Unicode Character Database. Such
versions consolidate all errata and corrigenda and supersede any prior documentation for
major, minor, or update versions.
A major version typically is of more importance to implementations; however, even update
versions may be important to particular companies or other organizations. Major and
minor versions are often synchronization points with related standards, such as with ISO/
IEC 10646.
Prior to Version 5.2, minor versions of the standard were published as online amendments
expressed as textual changes to the previous version, rather than as fully consolidated new
editions of the core specification.
Update Version. An update version represents relatively small changes to the standard, typ-
ically updates to the data files of the Unicode Character Database. An update version never
involves any additions to the character repertoire. These versions are published as modifi-
cations to the data files, and, on occasion, include documentation of small updates for
selected errata or corrigenda.
Formally, each new version of the Unicode Standard supersedes all earlier versions. How-
ever, update versions generally do not obsolete the documentation of the immediately
prior version of the standard.
Scheduling of Versions. Prior to Version 7.0.0, major, minor, and update versions of the
Unicode Standard were published whenever the work on each new set of repertoire, prop-
erties, and documentation was finished. The emphasis was on ensuring synchronization of
the major releases with corresponding major publication milestones for ISO/IEC 10646,
but that practice resulted in an irregular publication schedule.
The Unicode Technical Committee changed its process as of Version 7.0.0 of the Unicode
Standard, to make the publication time predictable. Major releases of the standard are now
scheduled for annual publication. Further minor and update releases are not anticipated,
but might occur under exceptional circumstances. This predictable, regular publication
makes planning for new releases easier for most users of the standard. The detailed state-
ments of synchronization between versions of the Unicode Standard and ISO/IEC 10646
have become somewhat more complex as a result, but in practice this has not been a prob-
lem for implementers.
of contributory files, Unicode Standard Annexes, and Unicode Character Database files
can be found at Enumerated Version 3.1.1.
The reference for this version of the Unicode Standard, Version 12.0.0, is:
    The Unicode Consortium. The Unicode Standard, Version 12.0.0, defined by: The Unicode Standard, Version 12.0 (Mountain View, CA: The Unicode Consortium, 2019. ISBN 978-1-936213-22-1)
References to an update (or minor version prior to Version 5.2.0) include a reference to
both the major version and the documents modifying it. For the standard citation format
for other versions of the Unicode Standard, see “Versions” in Section B.3, Other Unicode
Online Resources.
When referencing a Unicode character property, it is customary to prepend the word “Uni-
code” to the name of the property, unless it is clear from context that the Unicode Standard
is the source of the specification.
Interpretation
Interpretation of characters is the key conformance requirement for the Unicode Standard,
as it is for any coded character set standard. In legacy character set standards, the single
conformance requirement is generally stated in terms of the interpretation of bit patterns
used as characters. Conforming to a particular standard requires interpreting bit patterns
used as characters according to the list of character names and the glyphs shown in the
associated code table that form the bulk of that standard.
Interpretation of characters is a more complex issue for the Unicode Standard. It includes
the core issue of interpreting code points used as characters according to the names and
representative glyphs shown in the code charts, of course. However, the Unicode Standard
also specifies character properties, behavior, and interactions between characters. Such
information about characters is considered an integral part of the “character semantics
established by this standard.”
Information about the properties, behavior, and interactions between Unicode characters
is provided in the Unicode Character Database and in the Unicode Standard Annexes.
Additional information can be found throughout the other chapters of this core specifica-
tion for the Unicode Standard. However, because of the need to keep extended discussions
of scripts, sets of symbols, and other characters readable, material in other chapters is not
always labeled as to its normative or informative status. In general, supplementary seman-
tic information about a character is considered normative when it contributes directly to
the identification of the character or its behavior. Additional information provided about
the history of scripts, the languages which use particular characters, and so forth, is merely
informative. Thus, for example, the rules about Devanagari rendering specified in
Section 12.1, Devanagari, or the rules about Arabic character shaping specified in
Section 9.2, Arabic, are normative: they spell out important details about how those charac-
ters behave in conjunction with each other that is necessary for proper and complete inter-
pretation of the respective Unicode characters covered in each section.
C4 A process shall interpret a coded character sequence according to the character seman-
tics established by this standard, if that process does interpret that coded character
sequence.
• This restriction does not preclude internal transformations that are never visi-
ble external to the process.
C5 A process shall not assume that it is required to interpret any particular coded charac-
ter sequence.
• Processes that interpret only a subset of Unicode characters are allowed; there
is no blanket requirement to interpret all Unicode characters.
• Any means for specifying a subset of characters that a process can interpret is
outside the scope of this standard.
• The semantics of a private-use code point is outside the scope of this standard.
Modification
C7 When a process purports not to modify the interpretation of a valid coded character
sequence, it shall make no change to that coded character sequence other than the pos-
sible replacement of character sequences by their canonical-equivalent sequences.
• Replacement of a character sequence by a compatibility-equivalent sequence
does modify the interpretation of the text.
• Replacement or deletion of a character sequence that the process cannot or
does not interpret does modify the interpretation of the text.
• Changing the bit or byte ordering of a character sequence when transforming it
between different machine architectures does not modify the interpretation of
the text.
• Changing a valid coded character sequence from one Unicode character
encoding form to another does not modify the interpretation of the text.
• Changing the byte serialization of a code unit sequence from one Unicode
character encoding scheme to another does not modify the interpretation of
the text.
• If a noncharacter that does not have a specific internal use is unexpectedly
encountered in processing, an implementation may signal an error or replace
the noncharacter with U+FFFD replacement character. If the implementa-
tion chooses to replace, delete or ignore a noncharacter, such an action consti-
tutes a modification in the interpretation of the text. In general, a noncharacter
should be treated as an unassigned code point. For example, an API that
returned a character property value for a noncharacter would return the same
value as the default value for an unassigned code point.
• Note that security problems can result if noncharacter code points are removed
from text received from external sources. For more information, see
Section 23.7, Noncharacters, and Unicode Technical Report #36, “Unicode
Security Considerations.”
• All processes and higher-level protocols are required to abide by conformance
clause C7 at a minimum. However, higher-level protocols may define addi-
tional equivalences that do not constitute modifications under that protocol.
For example, a higher-level protocol may allow a sequence of spaces to be
replaced by a single space.
• There are important security issues associated with the correct interpretation
and display of text. For more information, see Unicode Technical Report #36,
“Unicode Security Considerations.”
C10 When a process interprets a code unit sequence which purports to be in a Unicode
character encoding form, it shall treat ill-formed code unit sequences as an error con-
dition and shall not interpret such sequences as characters.
• For example, in UTF-8 every code unit of the form 110xxxxx₂ must be followed by a code unit of the form 10xxxxxx₂. A sequence such as 110xxxxx₂ 0xxxxxxx₂ is ill-formed and must never be generated. When faced with this ill-formed code unit sequence while transforming or interpreting text, a conformant process must treat the first code unit 110xxxxx₂ as an illegally terminated code unit sequence—for example, by signaling an error, filtering the code unit out, or representing the code unit with a marker such as U+FFFD replacement character. (A sketch of these three options appears after this list of bullet items.)
• Conformant processes cannot interpret ill-formed code unit sequences. How-
ever, the conformance clauses do not prevent processes from operating on
code unit sequences that do not purport to be in a Unicode character encoding
form. For example, for performance reasons a low-level string operation may
simply operate directly on code units, without interpreting them as characters.
See, especially, the discussion under D89.
• Utility programs are not prevented from operating on “mangled” text. For
example, a UTF-8 file could have had CRLF sequences introduced at every 80
bytes by a bad mailer program. This could result in some UTF-8 byte
sequences being interrupted by CRLFs, producing illegal byte sequences. This
mangled text is no longer UTF-8. It is permissible for a conformant program to
repair such text, recognizing that the mangled text was originally well-formed
UTF-8 byte sequences. However, such repair of mangled data is a special case,
and it must not be used in circumstances where it would cause security prob-
lems. There are important security issues associated with encoding conversion,
especially with the conversion of malformed text. For more information, see
Unicode Technical Report #36, “Unicode Security Considerations.”
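The following informative sketch illustrates the handling required by clause C10, using Python's built-in UTF-8 codec; the byte values are chosen only for illustration. Signaling an error, substituting U+FFFD, and filtering the offending code unit out correspond to the three options listed above.

# 0xC2 is a leading byte of the form 110xxxxx that is not followed by a
# trailing byte of the form 10xxxxxx, so the sequence is ill-formed.
data = b"\x41\xC2\x41"

try:
    data.decode("utf-8")                           # option 1: signal an error
except UnicodeDecodeError as err:
    print("ill-formed code unit at byte offset", err.start)   # offset 1

print(data.decode("utf-8", errors="replace"))      # option 2: 'A\ufffdA' (U+FFFD marker)
print(data.decode("utf-8", errors="ignore"))       # option 3: 'AA' (code unit filtered out)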
Bidirectional Text
C12 A process that displays text containing supported right-to-left characters or embedding
codes shall display all visible representations of characters (excluding format charac-
ters) in the same order as if the Bidirectional Algorithm had been applied to the text,
unless tailored by a higher-level protocol as permitted by the specification.
• The Bidirectional Algorithm is specified in Unicode Standard Annex #9, “Uni-
code Bidirectional Algorithm.”
Normalization Forms
C13 A process that produces Unicode text that purports to be in a Normalization Form
shall do so in accordance with the specifications in Section 3.11, Normalization Forms.
C14 A process that tests Unicode text to determine whether it is in a Normalization Form
shall do so in accordance with the specifications in Section 3.11, Normalization Forms.
C15 A process that purports to transform text into a Normalization Form must be able to
produce the results of the conformance test specified in Unicode Standard Annex #15,
“Unicode Normalization Forms.”
• This means that when a process uses the input specified in the conformance
test, its output must match the expected output of the test.
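The following informative sketch shows one way to produce and to test a Normalization Form, using Python's unicodedata module (Python 3.8 or later for is_normalized). It does not replace the conformance test of Unicode Standard Annex #15, which is driven by the NormalizationTest.txt data file.

import unicodedata

s = "A\u030A"                                # <A, combining ring above>

nfc = unicodedata.normalize("NFC", s)        # produce NFC
print(nfc == "\u00C5")                       # True: composed to U+00C5

print(unicodedata.is_normalized("NFC", s))   # False: not yet in NFC
print(unicodedata.is_normalized("NFC", nfc)) # True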
Normative References
C16 Normative references to the Unicode Standard itself, to property aliases, to property
value aliases, or to Unicode algorithms shall follow the formats specified in Section 3.1,
Versions of the Unicode Standard.
C17 Higher-level protocols shall not make normative references to provisional properties.
• Higher-level protocols may make normative references to informative proper-
ties.
Unicode Algorithms
C18 If a process purports to implement a Unicode algorithm, it shall conform to the specifi-
cation of that algorithm in the standard, including any tailoring by a higher-level pro-
tocol as permitted by the specification.
• The term Unicode algorithm is defined at D17.
• An implementation claiming conformance to a Unicode algorithm need only
guarantee that it produces the same results as those specified in the logical
description of the process; it is not required to follow the actual described pro-
cedure in detail. This allows room for alternative strategies and optimizations
in implementation.
C19 The specification of an algorithm may prohibit or limit tailoring by a higher-level pro-
tocol. If a process that purports to implement a Unicode algorithm applies a tailoring,
that fact must be disclosed.
• For example, the algorithms for normalization and canonical ordering are not
tailorable. The Bidirectional Algorithm allows some tailoring by higher-level
protocols. The Unicode Default Case algorithms may be tailored without lim-
itation.
3.3 Semantics
Definitions
This and the following sections more precisely define the terms that are used in the confor-
mance clauses.
charts. See Section 24.1, Character Names List, for the notational conventions
used to distinguish the two.
D6 Namespace: A set of names together with name matching rules, so that all names are
distinct under the matching rules.
• Within a given namespace all names must be unique, although the same name
may be used with a different meaning in a different namespace.
• Character names, character name aliases, and named character sequences
share a single namespace in the Unicode Standard.
• The range for each defined block is specified by Field 0 in Blocks.txt; for exam-
ple, “0000..007F”.
• The ranges for blocks are non-overlapping. In other words, no code point can
be contained in the range for one block and also in the range for a second dis-
tinct block.
• The range for each block is defined as a contiguous sequence. In other words, a
block cannot consist of two (or more) discontiguous sequences of code points.
• Each range for a defined block starts with a value for which code point MOD
16 = 0 and terminates with a larger value for which code point MOD 16 = 15.
This specification results in block ranges which always include full code point
columns for code chart display. A block never starts or terminates in mid-col-
umn.
• All assigned characters are contained within ranges for defined blocks.
• Blocks may contain reserved code points, but no block contains only reserved
code points. The majority of reserved code points are outside the ranges of
defined blocks.
• A few designated code points are not contained within the ranges for defined
blocks. This applies to the noncharacter code points at the last two code points
of supplementary planes 1 through 14.
• The name for each defined block is specified by Field 1 in Blocks.txt; for exam-
ple, “Basic Latin”.
• The names for defined blocks constitute a unique namespace.
• The uniqueness rule for the block namespace is LM3, as defined in Unicode
Standard Annex #44, “Unicode Character Database.” In other words, casing,
whitespace, hyphens, and underscores are ignored when matching strings for
block names. The string “BASIC LATIN” or “Basic_Latin” would be consid-
ered as matching the name for the block named “Basic Latin”.
• There is also a normative Block property. See Table 3-2. The Block property is a
catalog property whose value is a string that identifies a block.
• Property value aliases for the Block property are defined in PropertyVal-
ueAliases.txt in the Unicode Character Database. The long alias defined for the
Block property is always a loose match for the name of the block defined in
Blocks.txt. Additional short aliases and other aliases are provided for conve-
nience of use in regular expression syntax.
• The default value for the Block property is “No_Block”. This default applies to
any code point which is not contained in the range of a defined block.
For a general discussion of blocks and their relation to allocation in the Unicode Standard,
see “Allocation Areas and Blocks” in Section 2.8, Unicode Allocation. For a general discus-
sion of the use of blocks in the presentation of the Unicode code charts, see Chapter 24,
About the Code Charts.
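The following informative sketch shows how the Blocks.txt format and the loose matching rule described above might be used in an implementation; it assumes a local copy of Blocks.txt from the Unicode Character Database, and the function names are illustrative only.

import re

def load_blocks(path="Blocks.txt"):
    """Parse Blocks.txt lines of the form '0000..007F; Basic Latin'."""
    blocks = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.split("#", 1)[0].strip()    # drop comments
            if not line:
                continue
            rng, name = line.split(";", 1)
            start, end = (int(x, 16) for x in rng.split(".."))
            blocks.append((start, end, name.strip()))
    return blocks

def loose_match(name):
    """LM3-style matching: ignore case, whitespace, hyphens, and underscores."""
    return re.sub(r"[\s\-_]", "", name).casefold()

def block_of(cp, blocks):
    """Return the Block property value for a code point, or the default 'No_Block'."""
    for start, end, name in blocks:
        if start <= cp <= end:
            return name
    return "No_Block"

# Example use (assuming Blocks.txt is present):
# blocks = load_blocks()
# block_of(0x0041, blocks)                                   # 'Basic Latin'
# loose_match("BASIC LATIN") == loose_match("Basic_Latin")   # True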
D11 Encoded character: An association (or mapping) between an abstract character and
a code point.
• An encoded character is also referred to as a coded character.
• While an encoded character is formally defined in terms of the mapping
between an abstract character and a code point, informally it can be thought of
as an abstract character taken together with its assigned code point.
• Occasionally, for compatibility with other standards, a single abstract character
may correspond to more than one code point—for example, “Å” corresponds
both to U+00C5 Å latin capital letter a with ring above and to U+212B
Å angstrom sign.
• A single abstract character may also be represented by a sequence of code
points—for example, latin capital letter g with acute may be represented by the
sequence <U+0047 latin capital letter g, U+0301 combining acute
accent>, rather than being mapped to a single code point.
D12 Coded character sequence: An ordered sequence of one or more code points.
• A coded character sequence is also known as a coded character representation.
• Normally a coded character sequence consists of a sequence of encoded char-
acters, but it may also include noncharacters or reserved code points.
• Internally, a process may choose to make use of noncharacter code points in its
coded character sequences. However, such noncharacter code points may not
be interpreted as abstract characters (see conformance clause C2). Their
removal by a conformant process constitutes modification of interpretation of
the coded character sequence (see conformance clause C7).
• Reserved code points are included in coded character sequences, so that the
conformance requirements regarding interpretation and modification are
properly defined when a Unicode-conformant implementation encounters
coded character sequences produced under a future version of the standard.
Unless specified otherwise for clarity, in the text of the Unicode Standard the term charac-
ter alone designates an encoded character. Similarly, the term character sequence alone
designates a coded character sequence.
D13 Deprecated character: A coded character whose use is strongly discouraged.
• Deprecated characters are retained in the standard indefinitely, but should not
be used. They are retained in the standard so that previously conforming data
stay conformant in future versions of the standard.
• Deprecated characters typically consist of characters with significant architec-
tural problems, or ones which cause implementation problems. Some examples
D18 Named Unicode algorithm: A Unicode algorithm that is specified in the Unicode
Standard or in other standards published by the Unicode Consortium and that is
given an explicit name for ease of reference.
• Named Unicode algorithms are cited in titlecase in the Unicode Standard.
Table 3-1 lists the named Unicode algorithms and indicates the locations of their specifica-
tions. Details regarding conformance to these algorithms and any restrictions they place on
the scope of allowable tailoring by higher-level protocols can be found in the specifications.
In some cases, a named Unicode algorithm is provided for information only. When exter-
nally referenced, a named Unicode algorithm may be prefixed with the qualifier “Unicode”
to make the connection of the algorithm to the Unicode Standard and other Unicode spec-
ifications clear. Thus, for example, the Bidirectional Algorithm is generally referred to by
its full name, “Unicode Bidirectional Algorithm.” As much as is practical, the titles of Uni-
code Standard Annexes which define Unicode algorithms consist of the name of the Uni-
code algorithm they specify. In a few cases, named Unicode algorithms are also widely
known by their acronyms, and those acronyms are also listed in Table 3-1.
3.5 Properties
The Unicode Standard specifies many different types of character properties. This section
provides the basic definitions related to character properties.
The actual values of Unicode character properties are specified in the Unicode Character
Database. See Section 4.1, Unicode Character Database, for an overview of those data files.
Chapter 4, Character Properties, contains more detailed descriptions of some particular,
important character properties. Additional properties that are specific to particular charac-
ters (such as the definition and use of the right-to-left override character or zero width
space) are discussed in the relevant sections of this standard.
The interpretation of some properties (such as the case of a character) is independent of
context, whereas the interpretation of other properties (such as directionality) is applicable
to a character sequence as a whole, rather than to the individual characters that compose
the sequence.
Types of Properties
D19 Property: A named attribute of an entity in the Unicode Standard, associated with a
defined set of values.
• The lists of code point and encoded character properties for the Unicode Stan-
dard are documented in Unicode Standard Annex #44, “Unicode Character
Database,” and in Unicode Standard Annex #38, “Unicode Han Database (Uni-
han).”
• The file PropertyAliases.txt in the Unicode Character Database provides a
machine-readable list of the non-Unihan properties and their names.
D20 Code point property: A property of code points.
• Code point properties refer to attributes of code points per se, based on archi-
tectural considerations of this standard, irrespective of any particular encoded
character.
• Thus the Surrogate property and the Noncharacter property are code point
properties.
D21 Abstract character property: A property of abstract characters.
• Abstract character properties refer to attributes of abstract characters per se,
based on their independent existence as elements of writing systems or other
notational systems, irrespective of their encoding in the Unicode Standard.
• Thus the Alphabetic property, the Punctuation property, the Hex_Digit prop-
erty, the Numeric_Value property, and so on are properties of abstract charac-
ters and are associated with those characters whether encoded in the Unicode
Standard or in any other character encoding—or even prior to their being
encoded in any character encoding standard.
D22 Encoded character property: A property of encoded characters in the Unicode Stan-
dard.
• For each encoded character property there is a mapping from every code point
to some value in the set of values associated with that property.
Encoded character properties are defined this way to facilitate the implementation of char-
acter property APIs based on the Unicode Character Database. Typically, an API will take
a property and a code point as input, and will return a value for that property as output,
interpreting it as the “character property” for the “character” encoded at that code point.
However, to be useful, such APIs must return meaningful values for unassigned code
points, as well as for encoded characters.
In some instances an encoded character property in the Unicode Standard is exactly equiv-
alent to a code point property. For example, the Pattern_Syntax property simply defines a
range of code points that are reserved for pattern syntax. (See Unicode Standard Annex
#31, “Unicode Identifier and Pattern Syntax.”)
In other instances, an encoded character property directly reflects an abstract character
property, but extends the domain of the property to include all code points, including
unassigned code points. For Boolean properties, such as the Hex_Digit property, typically
an encoded character property will be true for the encoded characters with that abstract
character property and will be false for all other code points, including unassigned code
points, noncharacters, private-use characters, and encoded characters for which the
abstract character property is inapplicable or irrelevant.
However, in many instances, an encoded character property is semantically complex and
may telescope together values associated with a number of abstract character properties
and/or code point properties. The General_Category property is an example—it contains
values associated with several abstract character properties (such as Letter, Punctuation,
and Symbol) as well as code point properties (such as \p{gc=Cs} for the Surrogate code
point property).
In the text of this standard the terms “Unicode character property,” “character property,”
and “property” without qualifier generally refer to an encoded character property, unless
otherwise indicated.
A list of the encoded character properties formally considered to be a part of the Unicode
Standard can be found in PropertyAliases.txt in the Unicode Character Database. See also
“Property Aliases” later in this section.
Property Values
D23 Property value: One of the set of values associated with an encoded character prop-
erty.
• For example, the East_Asian_Width [EAW] property has the possible values
“Narrow”, “Neutral”, “Wide”, “Ambiguous”, and “Unassigned”.
A list of the values associated with encoded character properties in the Unicode Standard
can be found in PropertyValueAliases.txt in the Unicode Character Database. See also
“Property Aliases” later in this section.
D24 Explicit property value: A value for an encoded character property that is explicitly
associated with a code point in one of the data files of the Unicode Character Data-
base.
D25 Implicit property value: A value for an encoded character property that is given by a
generic rule or by an “otherwise” clause in one of the data files of the Unicode Char-
acter Database.
• Implicit property values are used to avoid having to explicitly list values for
more than 1 million code points (most of them unassigned) for every property.
• Examples are the Age, Block, and Script properties. Additional new values for
the set of enumerated values for these properties may be added each time the
standard is revised. A new value for Age is added for each new Unicode version,
a new value for Block is added for each new block added to the standard, and a
new value for Script is added for each new script added to the standard.
Most properties have a single value associated with each code point. However, some prop-
erties may instead associate a set of multiple different values with each code point. See Sec-
tion 5.7.6, Properties Whose Values Are Sets of Values, in Unicode Standard Annex #44,
“Unicode Character Database.”
Property Status
Each Unicode character property has one of several different statuses: normative, informa-
tive, contributory, or provisional. Each of these statuses is formally defined below, with
some explanation and examples. In addition, normative properties can be subclassified,
based on whether or not they can be overridden by conformant higher-level protocols.
The full list of currently defined Unicode character properties is provided in Unicode Stan-
dard Annex #44, “Unicode Character Database” and in Unicode Standard Annex #38,
“Unicode Han Database (Unihan).” The tables of properties in those documents specify
the status of each property explicitly. The data file PropertyAliases.txt provides a machine-
readable listing of the character properties, except for those associated with the Unicode
Han Database. The long alias for each property in PropertyAliases.txt also serves as the for-
mal name of that property. In case of any discrepancy between the listing in Proper-
tyAliases.txt and the listing in Unicode Standard Annex #44 or any other text of the
Unicode Standard, the listing in PropertyAliases.txt should be taken as definitive. The tag
for each Unihan-related character property documented in Unicode Standard Annex #38
serves as the formal name of that property.
D33 Normative property: A Unicode character property used in the specification of the
standard.
Specification that a character property is normative means that implementations which
claim conformance to a particular version of the Unicode Standard and which make use of
that particular property must follow the specifications of the standard for that property for
the implementation to be conformant. For example, the Bidi_Class property is required for
conformance whenever rendering text that requires bidirectional layout, such as Arabic or
Hebrew.
Whenever a normative process depends on a property in a specified way, that property is
designated as normative.
The fact that a given Unicode character property is normative does not mean that the val-
ues of the property will never change for particular characters. Corrections and extensions
to the standard in the future may require minor changes to normative values, even though
the Unicode Technical Committee strives to minimize such changes. See also “Stability of
Properties” later in this section.
Some of the normative Unicode algorithms depend critically on particular property values
for their behavior. Normalization, for example, defines an aspect of textual interoperability
that many applications rely on to be absolutely stable. As a result, some of the normative
properties disallow any kind of overriding by higher-level protocols. Thus the decomposi-
tion of Unicode characters is both normative and not overridable; no higher-level protocol
may override these values, because to do so would result in non-interoperable results for
the normalization of Unicode text. Other normative properties, such as case mapping, are
overridable by higher-level protocols, because their intent is to provide a common basis for
behavior. Nevertheless, they may require tailoring for particular local cultural conventions
or particular implementations.
D34 Overridable property: A normative property whose values may be overridden by
conformant higher-level protocols.
• For example, the Canonical_Decomposition property is not overridable. The
Uppercase property can be overridden.
Some important normative character properties of the Unicode Standard are listed in
Table 3-2, with an indication of which sections in the standard provide a general descrip-
tion of the properties and their use. Other normative properties are documented in the
Unicode Character Database. In all cases, the Unicode Character Database provides the
definitive list of character properties and the exact list of property value assignments for
each version of the standard.
D35 Informative property: A Unicode character property whose values are provided for
information only.
A conformant implementation of the Unicode Standard is free to use or change informa-
tive property values as it may require, while remaining conformant to the standard. An
implementer always has the option of establishing a protocol to convey the fact that infor-
mative properties are being used in distinct ways.
Informative properties capture expert implementation experience. When an informative
property is explicitly specified in the Unicode Character Database, its use is strongly rec-
ommended for implementations to encourage comparable behavior between implementa-
tions. Note that it is possible for an informative property in one version of the Unicode
Standard to become a normative property in a subsequent version of the standard if its use
starts to acquire conformance implications in some part of the standard.
Table 3-3 provides a partial list of the more important informative character properties.
For a complete listing, see the Unicode Character Database.
D35a Contributory property: A simple property defined merely to make the statement of a
rule defining a derived property more compact or general.
Contributory properties typically consist of short lists of exceptional characters which are
used as part of the definition of a more generic normative or informative property. In most
cases, such properties are given names starting with “Other”, as Other_Alphabetic or Oth-
er_Default_Ignorable_Code_Point.
Contributory properties are not themselves subject to stability guarantees, but they are
sometimes specified in order to make it easier to state the definition of a derived property
which itself is subject to a stability guarantee, such as the derived, normative identifier-
related properties, XID_Start and XID_Continue. The complete list of contributory prop-
erties is documented in Unicode Standard Annex #44, “Unicode Character Database.”
D36 Provisional property: A Unicode character property whose values are unapproved
and tentative, and which may be incomplete or otherwise not in a usable state.
• Provisional properties may be removed from future versions of the standard,
without prior notice.
Some of the information provided about characters in the Unicode Character Database
constitutes provisional data. This data may capture partial or preliminary information. It
may contain errors or omissions, or otherwise not be ready for systematic use; however, it
is included in the data files for distribution partly to encourage review and improvement of
the information. For example, a number of the tags in the Unihan database file (Uni-
han.zip) provide provisional property values of various sorts about Han characters.
The data files of the Unicode Character Database may also contain various annotations
and comments about characters, and those annotations and comments should be consid-
ered provisional. Implementations should not attempt to parse annotations and comments
out of the data files and treat them as informative character properties per se.
Section 4.12, Characters with Unusual Properties, provides additional lists of Unicode char-
acters with unusual behavior, including many format controls discussed in detail elsewhere
in the standard. Although in many instances those characters and their behavior have nor-
mative implications, the particular subclassification provided in Table 4-10 does not
directly correspond to any formal definition of Unicode character properties. Therefore
that subclassification itself should also be considered provisional and potentially subject to
change.
Context Dependence
D37 Context-dependent property: A property that applies to a code point in the context of
a longer code point sequence.
• For example, the lowercase mapping of a Greek sigma depends on the context
of the surrounding characters.
D38 Context-independent property: A property that is not context dependent; it applies to
a code point in isolation.
Stability of Properties
D39 Stable transformation: A transformation T on a property P is stable with respect to
an algorithm A if the result of the algorithm on the transformed property A(T(P)) is
the same as the original result A(P) for all code points.
D40 Stable property: A property is stable with respect to a particular algorithm or process
as long as possible changes in the assignment of property values are restricted in
such a manner that the result of the algorithm on the property continues to be the
same as the original result for all previously assigned code points.
• As new characters are assigned to previously unassigned code points, the
replacement of any default values for these code points with actual property
values must maintain stability.
D41 Fixed property: A property whose values (other than a default value), once associ-
ated with a specific code point, are fixed and will not be changed, except to correct
obvious or clerical errors.
• For a fixed property, any default values can be replaced without restriction by
actual property values as new characters are assigned to previously unassigned
code points. Examples of fixed properties include Age and Hangul_Syllable_-
Type.
• Designating a property as fixed does not imply stability or immutability (see
“Stability” in Section 3.1, Versions of the Unicode Standard). While the age of a
character, for example, is established by the version of the Unicode Standard to
which it was added, errors in the published listing of the property value could
be corrected. For some other properties, even the correction of such errors is
prohibited by explicit guarantees of property stability.
D42 Immutable property: A fixed property that is also subject to a stability guarantee pre-
venting any change in the published listing of property values other than assign-
ment of new values to formerly unassigned code points.
• An immutable property is trivially stable with respect to all algorithms.
• An example of an immutable property is the Unicode character name itself.
Because character names are values of an immutable property, misspellings
and incorrect names will never be corrected clerically. Any errata will be noted
in a comment in the character names list and, where needed, an informative
character name alias will be provided.
• When an encoded character property representing a code point property is
immutable, none of its values can ever change. This follows from the fact that
the code points themselves do not change, and the status of the property is
unaffected by whether a particular abstract character is encoded at a code point
later. An example of such a property is the Pattern_Syntax property; all values
of that property are unchangeable for all code points, forever.
• In the more typical case of an immutable property, the values for existing
encoded characters cannot change, but when a new character is encoded, the
formerly unassigned code point changes from having a default value for the
property to having one of its nondefault values. Once that nondefault value is
published, it can no longer be changed.
D43 Stabilized property: A property that is neither extended to new characters nor main-
tained in any other manner, but that is retained in the Unicode Character Database.
• A stabilized property is also a fixed property.
D44 Deprecated property: A property whose use by implementations is discouraged.
• One of the reasons a property may be deprecated is because a different combi-
nation of properties better expresses the intended semantics.
• Where sufficiently widespread legacy support exists for the deprecated prop-
erty, not all implementations may be able to discontinue the use of the depre-
Conformance 104 3.5 Properties
Property Aliases
To enable normative references to Unicode character properties, formal aliases for proper-
ties and for property values are defined as part of the Unicode Character Database.
D47 Property alias: A unique identifier for a particular Unicode character property.
• The identifiers used for property aliases contain only ASCII alphanumeric
characters or the underscore character.
• Short and long forms for each property alias are defined. The short forms are
typically just two or three characters long to facilitate their use as attributes for
tags in markup languages. For example, “General_Category” is the long form
and “gc” is the short form of the property alias for the General Category prop-
erty. The long form serves as the formal name for the character property.
• Property aliases are defined in the file PropertyAliases.txt, which lists all of the non-Unihan properties that are part of each version of the standard. The Unihan properties are listed in Unicode Standard Annex #38, “Unicode Han Database (Unihan).”
• Property aliases of normative properties are themselves normative.
D48 Property value alias: A unique identifier for a particular enumerated value for a par-
ticular Unicode character property.
• The identifiers used for property value aliases contain only ASCII alphanu-
meric characters or the underscore character, or have the special value “n/a”.
• Short and long forms for property value aliases are defined. For example, “Cur-
rency_Symbol” is the long form and “Sc” is the short form of the property value
alias for the currency symbol value of the General Category property.
• Property value aliases are defined in the file PropertyValueAliases.txt in the
Unicode Character Database.
• Property value aliases are unique identifiers only in the context of the particular
property with which they are associated. The same identifier string might be
associated with an entirely different value for a different property. The combi-
nation of a property alias and a property value alias is, however, guaranteed to
be unique.
• Property value aliases referring to values of normative properties are them-
selves normative.
The property aliases and property value aliases can be used, for example, in XML formats
of property data, for regular-expression property tests, and in other programmatic textual
descriptions of Unicode property data. Thus “gc = Lu” is a formal way of specifying that the
General Category of a character (using the property alias “gc”) has the value of being an
uppercase letter (using the property value alias “Lu”).
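As an informative sketch of such programmatic property tests, the first form below uses Python's unicodedata module; the second assumes the third-party regex module, which accepts \p{...} property syntax (that module is an assumption of this sketch, not something referenced by the standard).

import unicodedata
print(unicodedata.category("A"))    # 'Lu', the short alias for Uppercase_Letter

import regex                        # third-party module supporting \p{...} property tests
print(bool(regex.match(r"\p{Lu}", "A")))   # True:  gc = Lu
print(bool(regex.match(r"\p{Lu}", "a")))   # False: gc = Ll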
Private Use
D49 Private-use code point: Code points in the ranges U+E000..U+F8FF, U+F0000..
U+FFFFD, and U+100000..U+10FFFD.
• Private-use code points are considered to be assigned characters, but the
abstract characters associated with them have no interpretation specified by
this standard. They can be given any interpretation by conformant processes.
• Private-use code points are given default property values, but these default val-
ues are overridable by higher-level protocols that give those private-use code
points a specific interpretation. See Section 23.5, Private-Use Characters.
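The following informative sketch expresses the three ranges of D49 as a predicate; the function name is illustrative only.

def is_private_use(cp):
    """True if cp is a private-use code point as defined in D49."""
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)

print(is_private_use(0xE000))     # True
print(is_private_use(0x10FFFE))   # False: a noncharacter, not private use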
3.6 Combination
Combining Character Sequences
D50 Graphic character: A character with the General Category of Letter (L), Combining
Mark (M), Number (N), Punctuation (P), Symbol (S), or Space Separator (Zs).
• Graphic characters specifically exclude the line and paragraph separators (Zl,
Zp), as well as the characters with the General Category of Other (Cn, Cs, Cc,
Cf).
• The interpretation of private-use characters (Co) as graphic characters or not is
determined by the implementation.
• For more information, see Chapter 2, General Structure, especially Section 2.4,
Code Points and Characters, and Table 2-3.
D51 Base character: Any graphic character except for those with the General Category of
Combining Mark (M).
• Most Unicode characters are base characters. In terms of General Category val-
ues, a base character is any code point that has one of the following categories:
Letter (L), Number (N), Punctuation (P), Symbol (S), or Space Separator (Zs).
• Base characters do not include control characters or format controls.
• Base characters are independent graphic characters, but this does not preclude
the presentation of base characters from adopting different contextual forms or
participating in ligatures.
• The interpretation of private-use characters (Co) as base characters or not is
determined by the implementation. However, the default interpretation of pri-
vate-use characters should be as base characters, in the absence of other infor-
mation.
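The following informative sketch expresses D50 and D51 as General_Category tests using Python's unicodedata module; the function names are illustrative, and the sketch makes the implementation-specific choice of excluding private-use characters (Co).

import unicodedata

def is_graphic(ch):
    """D50: General_Category is L*, M*, N*, P*, S*, or Zs."""
    gc = unicodedata.category(ch)
    return gc[0] in "LMNPS" or gc == "Zs"

def is_base(ch):
    """D51: any graphic character that is not a combining mark (M*)."""
    return is_graphic(ch) and unicodedata.category(ch)[0] != "M"

print(is_graphic("a"), is_base("a"))              # True True
print(is_graphic("\u0301"), is_base("\u0301"))    # True False (combining acute accent)
print(is_graphic("\n"), is_base("\n"))            # False False (control character)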
D51a Extended base: Any base character, or any standard Korean syllable block.
• This term is defined to take into account the fact that sequences of Korean con-
joining jamo characters behave as if they were a single Hangul syllable charac-
ter, so that the entire sequence of jamos constitutes a base.
• For the definition of standard Korean syllable block, see D134 in Section 3.12,
Conjoining Jamo Behavior.
D52 Combining character: A character with the General Category of Combining Mark
(M).
• Combining characters consist of all characters with the General Category val-
ues of Spacing Combining Mark (Mc), Nonspacing Mark (Mn), and Enclosing
Mark (Me).
• All characters with non-zero canonical combining class are combining charac-
ters, but the reverse is not the case: there are combining characters with a zero
canonical combining class.
• The interpretation of private-use characters (Co) as combining characters or
not is determined by the implementation.
• These characters are not normally used in isolation unless they are being
described. They include such characters as accents, diacritics, Hebrew points,
Arabic vowel signs, and Indic matras.
• The graphic positioning of a combining character depends on the last preced-
ing base character, unless they are separated by a character that is neither a
combining character nor either zero width joiner or zero width non-
joiner. The combining character is said to apply to that base character.
• There may be no such base character, such as when a combining character is at
the start of text or follows a control or format character—for example, a car-
riage return, tab, or right-left mark. In such cases, the combining characters
are called isolated combining characters.
• With isolated combining characters or when a process is unable to perform
graphical combination, a process may present a combining character without
graphical combination; that is, it may present it as if it were a base character.
• The representative images of combining characters are depicted with a dotted
circle in the code charts. When presented in graphical combination with a pre-
ceding base character, that base character is intended to appear in the position
occupied by the dotted circle.
D53 Nonspacing mark: A combining character with the General Category of Nonspacing
Mark (Mn) or Enclosing Mark (Me).
• The position of a nonspacing mark in presentation depends on its base charac-
ter. It generally does not consume space along the visual baseline in and of
itself.
• Such characters may be large enough to affect the placement of their base char-
acter relative to preceding and succeeding base characters. For example, a cir-
cumflex applied to an “i” may affect spacing (“î”), as might the character
U+20DD combining enclosing circle.
D54 Enclosing mark: A nonspacing mark with the General Category of Enclosing Mark
(Me).
• Enclosing marks are a subclass of nonspacing marks that surround a base char-
acter, rather than merely being placed over, under, or through it.
Grapheme Clusters
D58 Grapheme base: A character with the property Grapheme_Base, or any standard
Korean syllable block.
• Characters with the property Grapheme_Base include all base characters (with
the exception of U+FF9E..U+FF9F) plus most spacing marks.
• The concept of a grapheme base is introduced to simplify discussion of the
graphical application of nonspacing marks to other elements of text. A graph-
eme base may consist of a spacing (combining) mark, which distinguishes it
from a base character per se. A grapheme base may also itself consist of a
sequence of characters, in the case of the standard Korean syllable block.
• For the definition of standard Korean syllable block, see D134 in Section 3.12,
Conjoining Jamo Behavior.
D59 Grapheme extender: A character with the property Grapheme_Extend.
• Grapheme extender characters consist of all nonspacing marks, zero width
joiner, zero width non-joiner, U+FF9E halfwidth katakana voiced
sound mark, U+FF9F halfwidth katakana semi-voiced sound mark, and
a small number of spacing marks.
• A grapheme extender can be conceived of primarily as the kind of nonspacing
graphical mark that is applied above or below another spacing character.
• zero width joiner and zero width non-joiner are formally defined to be
grapheme extenders so that their presence does not break up a sequence of
other grapheme extenders.
• The small number of spacing marks that have the property Grapheme_Extend
are all the second parts of a two-part combining mark.
• The set of characters with the Grapheme_Extend property and the set of char-
acters with the Grapheme_Base property are disjoint, by definition.
• The Grapheme_Extend property is used in the derivation of the set of charac-
ters with the value Grapheme_Cluster_Break = Extend, but is not identical to
it. See Section 3, “Grapheme Cluster Boundaries” in UAX #29 for details.
D60 Grapheme cluster: The text between grapheme cluster boundaries as specified by
Unicode Standard Annex #29, “Unicode Text Segmentation.”
• This definition of “grapheme cluster” is generic. The specification of grapheme
cluster boundary segmentation in UAX #29 includes two alternatives, for
“extended grapheme clusters” and for “legacy grapheme clusters.” Further-
more, the segmentation algorithm in UAX #29 is tailorable.
• The grapheme cluster represents a horizontally segmentable unit of text, con-
sisting of some grapheme base (which may consist of a Korean syllable)
together with any number of nonspacing marks applied to it.
• A grapheme cluster is similar, but not identical to a combining character
sequence. A combining character sequence starts with a base character and
extends across any subsequent sequence of combining marks, nonspacing or
spacing. A combining character sequence is most directly relevant to processing
issues related to normalization, comparison, and searching.
• A grapheme cluster typically starts with a grapheme base and then extends
across any subsequent sequence of nonspacing marks. A grapheme cluster is
most directly relevant to text rendering and processes such as cursor placement
and text selection in editing, but may also be relevant to comparison and
searching.
• For many processes, a grapheme cluster behaves as if it were a single character
with the same properties as its grapheme base. Effectively, nonspacing marks
apply graphically to the base, but do not change its properties. For example, <x,
macron> behaves in line breaking or bidirectional layout as if it were the char-
acter x.
D61 Extended grapheme cluster: The text between extended grapheme cluster boundaries
as specified by Unicode Standard Annex #29, “Unicode Text Segmentation.”
• Extended grapheme clusters are defined in a parallel manner to legacy graph-
eme clusters, but also include sequences of spacing marks.
• Grapheme clusters and extended grapheme clusters may not have any particu-
lar linguistic significance, but are used to break up a string of text into units for
processing.
• Grapheme clusters and extended grapheme clusters may be adjusted for partic-
ular processing requirements, by tailoring the rules for grapheme cluster seg-
mentation specified in Unicode Standard Annex #29, “Unicode Text
Segmentation.”
• Dependence concerns all combining marks, including spacing marks and com-
bining marks that have no visible display.
D61b Graphical application: A nonspacing mark is said to apply to its associated graph-
eme base.
• The associated grapheme base is the grapheme base in the grapheme cluster
that a nonspacing mark is part of.
• A nonspacing mark in a defective combining character sequence is not part of a
grapheme cluster and is subject to the same kinds of fallback processing as for
any defective combining character sequence.
• Graphic application concerns visual rendering issues and thus is an issue for
nonspacing marks that have visible glyphs. Those glyphs interact, in rendering,
with their grapheme base.
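As an informative sketch of the grapheme cluster segmentation defined in D60 and D61, the third-party Python regex module provides a \X pattern intended to match extended grapheme clusters in the sense of UAX #29; that module, and the exactness of its segmentation, are assumptions of this sketch rather than part of this standard.

import regex   # third-party module; \X matches an extended grapheme cluster

# <x, combining macron> followed by a standard Korean syllable block <G, A, K>
text = "x\u0304" + "\u1100\u1161\u11A8"

clusters = regex.findall(r"\X", text)
print(len(text))       # 5 code points
print(len(clusters))   # 2 clusters: 'x' plus its macron, and the conjoining jamo syllable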
Throughout the text of the standard, whenever the situation is clear, discussion of combin-
ing marks often simply talks about combining marks “applying” to their base. In the proto-
typical case of a nonspacing accent mark applying to a single base character letter, this
simplification is not problematical, because the nonspacing mark both depends (notion-
ally) on its base character and simultaneously applies (graphically) to its grapheme base,
affecting its display. The finer distinctions are needed when dealing with the edge cases,
such as combining marks that have no display glyph, graphical application of nonspacing
marks to Korean syllables, and the behavior of spacing combining marks.
The distinction made here between notional dependence and graphical application does
not preclude spacing marks or even sequences of base characters from having effects on
neighboring characters in rendering. Thus spacing forms of dependent vowels (matras) in
Indic scripts may trigger particular kinds of conjunct formation or may be repositioned in
ways that influence the rendering of other characters. (See Chapter 12, South and Central
Asia-I, for many examples.) Similarly, sequences of base characters may form ligatures in
rendering. (See “Cursive Connection and Ligatures” in Section 23.2, Layout Controls.)
The following listing specifies the principles regarding application of combining marks.
Many of these principles are illustrated in Section 2.11, Combining Characters, and
Section 7.9, Combining Marks.
P1 [Normative] Combining character order: Combining characters follow the base
character on which they depend.
• This principle follows from the definition of a combining character sequence.
• Thus the character sequence <U+0061 “a” latin small letter a, U+0308 “◌̈” combining diaeresis, U+0075 “u” latin small letter u> is unambiguously interpreted (and displayed) as “äu”, not “aü”. See Figure 2-18.
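The following informative sketch shows the effect of this ordering with Python's unicodedata module: because the combining diaeresis follows the “a” on which it depends, canonical composition produces “äu” rather than “aü”.

import unicodedata

seq = "a\u0308u"   # <a, combining diaeresis, u>
print(unicodedata.normalize("NFC", seq))                 # 'äu'
print(unicodedata.normalize("NFC", seq) == "\u00E4u")    # True: the mark composed with 'a'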
P2 [Guideline] Inside-out application. Nonspacing marks with the same combining
class are generally positioned graphically outward from the grapheme base to
which they apply.
• The most numerous and important instances of this principle involve nonspac-
ing marks applied either directly above or below a grapheme base. See
Figure 2-21.
• In a sequence of two nonspacing marks above a grapheme base, the first nons-
pacing mark is placed directly above the grapheme base, and the second is then
placed above the first nonspacing mark.
• In a sequence of two nonspacing marks below a grapheme base, the first nons-
pacing mark is placed directly below the grapheme base, and the second is then
placed below the first nonspacing mark.
• This rendering behavior for nonspacing marks can be generalized to sequences
of any length, although practical considerations usually limit such sequences to
no more than two or three marks above and/or below a grapheme base.
• The principle of inside-out application is also referred to as default stacking
behavior for nonspacing marks.
P3 [Guideline] Side-by-side application. Notwithstanding the principle of inside-out
application, some specific nonspacing marks may override the default stacking
behavior and are positioned side-by-side over (or under) a grapheme base, rather
than stacking vertically.
• Such side-by-side positioning may reflect language-specific orthographic rules,
such as for Vietnamese diacritics and tone marks or for polytonic Greek
breathing and accent marks. See Table 2-6.
• Side-by-side positioning may also reflect certain writing conventions, such as
for titlo letters in the Old Church Slavonic manuscript tradition.
• When positioned side-by-side, the visual rendering order of a sequence of non-
spacing marks reflects the dominant order of the script with which they are
used. Thus, in Greek, the first nonspacing mark in such a sequence will be posi-
tioned to the left side above a grapheme base, and the second to the right side
above the grapheme base. In Hebrew, the opposite positioning is used for side-
by-side placement.
• The combining parentheses diacritical marks U+1ABB..U+1ABD are also posi-
tioned in a side-by-side manner, surrounding other diacritics, as described in
the subsection “Combining Diacritical Marks Extended: U+1AB0–U+1AFF” in
Section 7.9, Combining Marks.
P4 [Guideline] Traditional typographical behavior will sometimes override the
default placement or rendering of nonspacing marks.
• Because of typographical conflict with the descender of a base character, a
combining comma below placed on a lowercase “g” is traditionally rendered as
if it were an inverted comma above. See Figure 7-1.
(Figure example: a sequence of enclosing and nonspacing combining marks applied to a single base character, <U+09A4, U+20DE combining enclosing square, U+0308 combining diaeresis, U+20DD combining enclosing circle>.)
This treatment of the application of combining marks with respect to Korean syllables fol-
lows from the implications of canonical equivalence. It should be noted, however, that
older implementations may have supported the application of an enclosing combining
mark to an entire Indic consonant conjunct or to a sequence of grapheme clusters linked
together by combining grapheme joiners. Such an approach has a number of technical
problems and leads to interoperability defects, so it is strongly recommended that imple-
mentations do not follow it.
For more information on the recommended use of the combining grapheme joiner, see the
subsection “Combining Grapheme Joiner” in Section 23.2, Layout Controls. For more dis-
cussion regarding the application of combining marks in general, see Section 7.9, Combin-
ing Marks.
3.7 Decomposition
D62 Decomposition mapping: A mapping from a character to a sequence of one or more
characters that is a canonical or compatibility equivalent, and that is listed in the
character names list or described in Section 3.12, Conjoining Jamo Behavior.
• Each character has at most one decomposition mapping. The mappings in
Section 3.12, Conjoining Jamo Behavior, are canonical mappings. The mappings
in the character names list are identified as either canonical or compatibility
mappings (see Section 24.1, Character Names List).
D63 Decomposable character: A character that is equivalent to a sequence of one or more
other characters, according to the decomposition mappings found in the Unicode
Character Database, and those described in Section 3.12, Conjoining Jamo Behavior.
• A decomposable character is also referred to as a precomposed character or
composite character.
• The decomposition mappings from the Unicode Character Database are also
given in Section 24.1, Character Names List.
D64 Decomposition: A sequence of one or more characters that is equivalent to a decom-
posable character. A full decomposition of a character sequence results from decom-
posing each of the characters in the sequence until no characters can be further
decomposed.
Compatibility Decomposition
D65 Compatibility decomposition: The decomposition of a character or character
sequence that results from recursively applying both the compatibility mappings
and the canonical mappings found in the Unicode Character Database, and those
described in Section 3.12, Conjoining Jamo Behavior, until no characters can be fur-
ther decomposed, and then reordering nonspacing marks according to Section 3.11,
Normalization Forms.
• The decomposition mappings from the Unicode Character Database are also
given in Section 24.1, Character Names List.
• Some compatibility decompositions remove formatting information.
D66 Compatibility decomposable character: A character whose compatibility decomposi-
tion is not identical to its canonical decomposition. It may also be known as a com-
patibility precomposed character or a compatibility composite character.
• For example, U+00B5 micro sign has no canonical decomposition mapping,
so its canonical decomposition is the same as the character itself. It has a com-
patibility decomposition to U+03BC greek small letter mu. Because micro
sign has a compatibility decomposition that is not equal to its canonical
decomposition, it is a compatibility decomposable character.
• For example, U+03D3 greek upsilon with acute and hook symbol canon-
ically decomposes to the sequence <U+03D2 greek upsilon with hook sym-
bol, U+0301 combining acute accent>. That sequence has a compatibility
decomposition of <U+03A5 greek capital letter upsilon, U+0301 com-
bining acute accent>. Because greek upsilon with acute and hook sym-
bol has a compatibility decomposition that is not equal to its canonical
decomposition, it is a compatibility decomposable character.
• This term should not be confused with the term “compatibility character,”
which is discussed in Section 2.3, Compatibility Characters.
• Many compatibility decomposable characters are included in the Unicode
Standard solely to represent distinctions in other base standards. They support
transmission and processing of legacy data. Their use is discouraged other than
for legacy data or other special circumstances.
• Some widely used and indispensable characters, such as NBSP, are compatibil-
ity decomposable characters for historical reasons. Their use is not discour-
aged.
• A large number of compatibility decomposable characters are used in phonetic
and mathematical notation, where their use is not discouraged.
• For historical reasons, some characters that might have been given a compati-
bility decomposition were not, in fact, decomposed. The Normalization Stabil-
ity Policy prohibits adding decompositions for such cases in the future, so that
normalization forms will stay stable. See the subsection “Policies” in
Section B.3, Other Unicode Online Resources.
• Replacing a compatibility decomposable character by its compatibility decom-
position may lose round-trip convertibility with a base standard.
D67 Compatibility equivalent: Two character sequences are said to be compatibility
equivalents if their full compatibility decompositions are identical.
Canonical Decomposition
D68 Canonical decomposition: The decomposition of a character or character sequence
that results from recursively applying the canonical mappings found in the Unicode
Character Database and those described in Section 3.12, Conjoining Jamo Behavior,
until no characters can be further decomposed, and then reordering nonspacing
marks according to Section 3.11, Normalization Forms.
• The decomposition mappings from the Unicode Character Database are also
printed in Section 24.1, Character Names List.
• A canonical decomposition does not remove formatting information.
D69 Canonical decomposable character: A character that is not identical to its canonical
decomposition. It may also be known as a canonical precomposed character or a
canonical composite character.
• For example, U+00E0 latin small letter a with grave is a canonical
decomposable character because its canonical decomposition is to the
sequence <U+0061 latin small letter a, U+0300 combining grave
accent>. U+212A kelvin sign is a canonical decomposable character because
its canonical decomposition is to U+004B latin capital letter k.
D70 Canonical equivalent: Two character sequences are said to be canonical equivalents
if their full canonical decompositions are identical.
• For example, the sequences <o, combining-diaeresis> and <ö> are canonical
equivalents. Canonical equivalence is a Unicode property. It should not be con-
fused with language-specific collation or matching, which may add other
equivalencies. For example, in Swedish, ö is treated as a completely different
letter from o and is collated after z. In German, ö is weakly equivalent to oe and
is collated with oe. In English, ö is just an o with a diacritic that indicates that it
is pronounced separately from the previous letter (as in coöperate) and is col-
lated with o.
• By definition, all canonical-equivalent sequences are also compatibility-equiva-
lent sequences.
For information on the use of decomposition in normalization, see Section 3.11, Normal-
ization Forms.
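The following informative sketch illustrates canonical and compatibility decomposition using Python's unicodedata module, whose NFD and NFKD transforms correspond to full canonical and full compatibility decomposition respectively.

import unicodedata

# Canonical decomposition (D68): U+00E0 decomposes to <U+0061, U+0300>.
print(unicodedata.normalize("NFD", "\u00E0") == "a\u0300")    # True

# U+00B5 micro sign has only a compatibility decomposition (D66).
print(unicodedata.normalize("NFD", "\u00B5"))                 # unchanged: micro sign
print(unicodedata.normalize("NFKD", "\u00B5") == "\u03BC")    # True: greek small letter mu

# Canonical equivalence (D70): <o, combining diaeresis> and <ö> are equivalent.
print(unicodedata.normalize("NFD", "\u00F6") == "o\u0308")    # True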
3.8 Surrogates
D71 High-surrogate code point: A Unicode code point in the range U+D800 to U+DBFF.
D72 High-surrogate code unit: A 16-bit code unit in the range D800₁₆ to DBFF₁₆, used in
UTF-16 as the leading code unit of a surrogate pair.
D73 Low-surrogate code point: A Unicode code point in the range U+DC00 to U+DFFF.
D74 Low-surrogate code unit: A 16-bit code unit in the range DC00₁₆ to DFFF₁₆, used in
UTF-16 as the trailing code unit of a surrogate pair.
• High-surrogate and low-surrogate code points are designated only for that use.
• High-surrogate and low-surrogate code units are used only in the context of the
UTF-16 character encoding form.
D75 Surrogate pair: A representation for a single abstract character that consists of a
sequence of two 16-bit code units, where the first value of the pair is a high-surro-
gate code unit and the second value is a low-surrogate code unit.
• Surrogate pairs are used only in UTF-16. (See Section 3.9, Unicode Encoding
Forms.)
• Isolated surrogate code units have no interpretation on their own. Certain
other isolated code units in other encoding forms also have no interpretation
on their own. For example, the isolated byte 80₁₆ has no interpretation in UTF-8; it can be used only as part of a multibyte sequence. (See Table 3-7.)
• Sometimes high-surrogate code units are referred to as leading surrogates. Low-
surrogate code units are then referred to as trailing surrogates. This is analo-
gous to usage in UTF-8, which has leading bytes and trailing bytes.
• For more information, see Section 23.6, Surrogates Area, and Section 5.4, Han-
dling Surrogate Pairs in UTF-16.
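The following informative sketch shows the arithmetic behind surrogate pairs: a supplementary code point is offset by 10000₁₆ and split into a 10-bit high part and a 10-bit low part. The result is checked against Python's big-endian UTF-16 codec; the function name is illustrative only.

def surrogate_pair(cp):
    """Map a supplementary code point (U+10000..U+10FFFF) to its UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # leading (high) surrogate code unit
    low = 0xDC00 + (offset & 0x3FF)   # trailing (low) surrogate code unit
    return high, low

high, low = surrogate_pair(0x10302)
print(hex(high), hex(low))                       # 0xd800 0xdf02
print("\U00010302".encode("utf-16-be").hex())    # 'd800df02'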
3.9 Unicode Encoding Forms
D83 Unicode 32-bit string: A Unicode string containing only UTF-32 code units.
D84 Ill-formed: A Unicode code unit sequence that purports to be in a Unicode encoding
form is called ill-formed if and only if it does not follow the specification of that Uni-
code encoding form.
• Any code unit sequence that would correspond to a code point outside the
defined range of Unicode scalar values would, for example, be ill-formed.
• UTF-8 has some strong constraints on the possible byte ranges for leading and
trailing bytes. A violation of those constraints would produce a code unit
sequence that could not be mapped to a Unicode scalar value, resulting in an
ill-formed code unit sequence.
D84a Ill-formed code unit subsequence: A non-empty subsequence of a Unicode code unit
sequence X which does not contain any code units which also belong to any mini-
mal well-formed subsequence of X.
• In other words, an ill-formed code unit subsequence cannot overlap with a
minimal well-formed subsequence.
D85 Well-formed: A Unicode code unit sequence that purports to be in a Unicode encod-
ing form is called well-formed if and only if it does follow the specification of that
Unicode encoding form.
D85a Minimal well-formed code unit subsequence: A well-formed Unicode code unit
sequence that maps to a single Unicode scalar value.
• For UTF-8, see the specification in D92 and Table 3-7.
• For UTF-16, see the specification in D91.
• For UTF-32, see the specification in D90.
A well-formed Unicode code unit sequence can be partitioned into one or more minimal
well-formed code unit sequences for the given Unicode encoding form. Any Unicode code
unit sequence can be partitioned into subsequences that are either well-formed or ill-
formed. The sequence as a whole is well-formed if and only if it contains no ill-formed sub-
sequence. The sequence as a whole is ill-formed if and only if it contains at least one ill-
formed subsequence.
D86 Well-formed UTF-8 code unit sequence: A well-formed Unicode code unit sequence
of UTF-8 code units.
• The UTF-8 code unit sequence <41 C3 B1 42> is well-formed, because it can be
partitioned into subsequences, all of which match the specification for UTF-8
in Table 3-7. It consists of the following minimal well-formed code unit subse-
quences: <41>, <C3 B1>, and <42>.
• The UTF-8 code unit sequence <41 C2 C3 B1 42> is ill-formed, because it con-
tains one ill-formed subsequence. There is no subsequence for the C2 byte
which matches the specification for UTF-8 in Table 3-7. The code unit sequence as a
whole can be partitioned into the minimal well-formed code unit subsequences <41>,
<C3 B1>, and <42>, and the ill-formed code unit subsequence <C2>.
If a Unicode string purports to be in a Unicode encoding form, then it must not contain any
ill-formed code unit subsequence.
If a process which verifies that a Unicode string is in a Unicode encoding form encounters
an ill-formed code unit subsequence in that string, then it must not identify that string as
being in that Unicode encoding form.
A process which interprets a Unicode string must not interpret any ill-formed code unit
subsequences in the string as characters. (See conformance clause C10.) Furthermore, such
a process must not treat any adjacent well-formed code unit sequences as being part of
those ill-formed code unit sequences.
Table 3-4 gives examples that summarize the three Unicode encoding forms.
UTF-32
D90 UTF-32 encoding form: The Unicode encoding form that assigns each Unicode sca-
lar value to a single unsigned 32-bit code unit with the same numeric value as the
Unicode scalar value.
• In UTF-32, the code point sequence <004D, 0430, 4E8C, 10302> is represented
as <0000004D 00000430 00004E8C 00010302>.
• Because surrogate code points are not included in the set of Unicode scalar val-
ues, UTF-32 code units in the range 0000D800₁₆..0000DFFF₁₆ are ill-formed.
• Any UTF-32 code unit greater than 0010FFFF₁₆ is ill-formed.
For a discussion of the relationship between UTF-32 and UCS-4 encoding form defined in
ISO/IEC 10646, see Section C.2, Encoding Forms in ISO/IEC 10646.
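The two constraints in the preceding bullets can be checked directly. The following sketch defines a hypothetical helper (not part of any standard API) that tests whether a 32-bit code unit is well-formed in UTF-32, that is, whether it is a Unicode scalar value:
public final class Utf32 {
    // True if the 32-bit code unit is a Unicode scalar value, and
    // therefore well-formed as a UTF-32 code unit.
    static boolean isWellFormedCodeUnit(int codeUnit) {
        boolean inRange = codeUnit >= 0x0000 && codeUnit <= 0x10FFFF;
        boolean isSurrogate = codeUnit >= 0xD800 && codeUnit <= 0xDFFF;
        return inRange && !isSurrogate;
    }

    public static void main(String[] args) {
        System.out.println(isWellFormedCodeUnit(0x10302));  // true
        System.out.println(isWellFormedCodeUnit(0xD800));   // false (surrogate code point)
        System.out.println(isWellFormedCodeUnit(0x110000)); // false (greater than 10FFFF)
    }
}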
UTF-16
D91 UTF-16 encoding form: The Unicode encoding form that assigns each Unicode sca-
lar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single unsigned
16-bit code unit with the same numeric value as the Unicode scalar value, and that
assigns each Unicode scalar value in the range U+10000..U+10FFFF to a surrogate
pair, according to Table 3-5.
• In UTF-16, the code point sequence <004D, 0430, 4E8C, 10302> is represented
as <004D 0430 4E8C D800 DF02>, where <D800 DF02> corresponds to
U+10302.
• Because surrogate code points are not Unicode scalar values, isolated UTF-16
code units in the range D800₁₆..DFFF₁₆ are ill-formed.
Table 3-5 specifies the bit distribution for the UTF-16 encoding form. Note that for Uni-
code scalar values equal to or greater than U+10000, UTF-16 uses surrogate pairs. Calcula-
tion of the surrogate pair values involves subtraction of 10000₁₆, to account for the starting
offset to the scalar value. ISO/IEC 10646 specifies an equivalent UTF-16 encoding form.
For details, see Section C.3, UTF-8 and UTF-16.
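The surrogate-pair arithmetic of Table 3-5 can be written out explicitly. The following minimal sketch encodes one Unicode scalar value as UTF-16 code units; it assumes the input is already a valid scalar value and performs no error checking:
public final class Utf16Encoder {
    // Encode one Unicode scalar value as one or two UTF-16 code units.
    static char[] encode(int scalarValue) {
        if (scalarValue < 0x10000) {
            // U+0000..U+D7FF and U+E000..U+FFFF: a single code unit.
            return new char[] { (char) scalarValue };
        }
        // U+10000..U+10FFFF: subtract the 10000 (hex) offset, then split
        // the remaining 20 bits across a surrogate pair.
        int offset = scalarValue - 0x10000;
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] units = encode(0x10302);
        System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]); // D800 DF02
    }
}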
UTF-8
D92 UTF-8 encoding form: The Unicode encoding form that assigns each Unicode scalar
value to an unsigned byte sequence of one to four bytes in length, as specified in
Table 3-6 and Table 3-7.
• In UTF-8, the code point sequence <004D, 0430, 4E8C, 10302> is represented
as <4D D0 B0 E4 BA 8C F0 90 8C 82>, where <4D> corresponds to U+004D,
<D0 B0> corresponds to U+0430, <E4 BA 8C> corresponds to U+4E8C, and
<F0 90 8C 82> corresponds to U+10302.
• Any UTF-8 byte sequence that does not match the patterns listed in Table 3-7 is
ill-formed.
• Before the Unicode Standard, Version 3.1, the problematic “non-shortest form”
byte sequences in UTF-8 were those where BMP characters could be repre-
sented in more than one way. These sequences are ill-formed, because they are
not allowed by Table 3-7.
• Because surrogate code points are not Unicode scalar values, any UTF-8 byte
sequence that would otherwise map to code points U+D800..U+DFFF is ill-
formed.
Table 3-6 specifies the bit distribution for the UTF-8 encoding form, showing the ranges of
Unicode scalar values corresponding to one-, two-, three-, and four-byte sequences. For a
discussion of the difference in the formulation of UTF-8 in ISO/IEC 10646, see Section C.3,
UTF-8 and UTF-16.
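The bit distribution of Table 3-6 can likewise be written out directly. The following minimal sketch encodes one Unicode scalar value as one to four UTF-8 bytes; it assumes the input is a valid scalar value and performs no surrogate or range checking:
public final class Utf8Encoder {
    // Encode one Unicode scalar value as one to four UTF-8 bytes.
    static byte[] encode(int sv) {
        if (sv < 0x80) {                       // 1 byte: 0xxxxxxx
            return new byte[] { (byte) sv };
        } else if (sv < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (sv >> 6)),
                (byte) (0x80 | (sv & 0x3F)) };
        } else if (sv < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (sv >> 12)),
                (byte) (0x80 | ((sv >> 6) & 0x3F)),
                (byte) (0x80 | (sv & 0x3F)) };
        } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (sv >> 18)),
                (byte) (0x80 | ((sv >> 12) & 0x3F)),
                (byte) (0x80 | ((sv >> 6) & 0x3F)),
                (byte) (0x80 | (sv & 0x3F)) };
        }
    }

    public static void main(String[] args) {
        // U+10302 encodes as <F0 90 8C 82>, as in the example above.
        for (byte b : encode(0x10302)) System.out.printf("%02X ", b & 0xFF);
        System.out.println();
    }
}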
Table 3-7 lists all of the byte sequences that are well-formed in UTF-8. A range of byte val-
ues such as A0..BF indicates that any byte from A0 to BF (inclusive) is well-formed in that
position. Any byte value outside of the ranges listed is ill-formed. For example:
• The byte sequence <C0 AF> is ill-formed, because C0 is not well-formed in the
“First Byte” column.
• The byte sequence <E0 9F 80> is ill-formed, because in the row where E0 is
well-formed as a first byte, 9F is not well-formed as a second byte.
• The byte sequence <F4 80 83 92> is well-formed, because every byte in that
sequence matches a byte range in a row of the table (the last row).
In Table 3-7, cases where a trailing byte range is not 80..BF are shown in bold italic to draw
attention to them. These exceptions to the general pattern occur only in the second byte of
a sequence.
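A validator can be built directly from Table 3-7: each row becomes a constraint on the first byte and on the permitted range of the second byte, while any third and fourth bytes are always restricted to 80..BF. The following sketch is an illustration of the table's structure, not a production validator:
public final class Utf8Validator {
    // Range check helper for unsigned byte values.
    private static boolean in(int b, int lo, int hi) { return b >= lo && b <= hi; }

    // Returns true if the entire byte sequence is well-formed UTF-8 per Table 3-7.
    static boolean isWellFormed(byte[] bytes) {
        int i = 0;
        while (i < bytes.length) {
            int b0 = bytes[i] & 0xFF;
            int len;        // length of the minimal well-formed subsequence
            int lo2, hi2;   // allowed range for the second byte
            if (in(b0, 0x00, 0x7F))      { len = 1; lo2 = 0;    hi2 = 0;    }
            else if (in(b0, 0xC2, 0xDF)) { len = 2; lo2 = 0x80; hi2 = 0xBF; }
            else if (b0 == 0xE0)         { len = 3; lo2 = 0xA0; hi2 = 0xBF; }
            else if (in(b0, 0xE1, 0xEC)) { len = 3; lo2 = 0x80; hi2 = 0xBF; }
            else if (b0 == 0xED)         { len = 3; lo2 = 0x80; hi2 = 0x9F; }
            else if (in(b0, 0xEE, 0xEF)) { len = 3; lo2 = 0x80; hi2 = 0xBF; }
            else if (b0 == 0xF0)         { len = 4; lo2 = 0x90; hi2 = 0xBF; }
            else if (in(b0, 0xF1, 0xF3)) { len = 4; lo2 = 0x80; hi2 = 0xBF; }
            else if (b0 == 0xF4)         { len = 4; lo2 = 0x80; hi2 = 0x8F; }
            else return false;                        // C0, C1, F5..FF, or a stray trailing byte
            if (i + len > bytes.length) return false; // truncated sequence
            if (len >= 2 && !in(bytes[i + 1] & 0xFF, lo2, hi2)) return false;
            for (int k = 2; k < len; k++) {
                if (!in(bytes[i + k] & 0xFF, 0x80, 0xBF)) return false;
            }
            i += len;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed(new byte[] {0x41, (byte) 0xC3, (byte) 0xB1, 0x42}));  // true
        System.out.println(isWellFormed(new byte[] {(byte) 0xC0, (byte) 0xAF}));              // false
        System.out.println(isWellFormed(new byte[] {(byte) 0xE0, (byte) 0x9F, (byte) 0x80})); // false
    }
}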
For a UTF-8 conversion process to consume valid successor bytes is not only non-confor-
mant, but also leaves the converter open to security exploits. See Unicode Technical Report
#36, “Unicode Security Considerations.”
Although a UTF-8 conversion process is required to never consume well-formed subse-
quences as part of its error handling for ill-formed subsequences, such a process is not oth-
erwise constrained in how it deals with any ill-formed subsequence itself. An ill-formed
subsequence consisting of more than one code unit could be treated as a single error or as
multiple errors.
For example, in processing the UTF-8 code unit sequence <F0 80 80 41>, the only formal
requirement mandated by Unicode conformance for a converter is that the <41> be pro-
cessed and correctly interpreted as <U+0041>. The converter could return <U+FFFD,
U+0041>, handling <F0 80 80> as a single error, or <U+FFFD, U+FFFD, U+FFFD,
U+0041>, handling each byte of <F0 80 80> as a separate error, or could take other
approaches to signalling <F0 80 80> as an ill-formed code unit subsequence.
This definition can be trivially applied to the UTF-32 or UTF-16 encoding forms, but is
primarily of interest when converting UTF-8 strings.
This practice replaces almost every byte of an ill-formed UTF-8 sequence with one
U+FFFD. For example, every byte of a “non-shortest form” sequence (see Definition D92),
or of a truncated version thereof, is replaced, as shown in Table 3-8. (The interpretation of
“non-shortest form” sequences has been forbidden since the publication of Corrigendum
#1.)
Also, every byte of a sequence that would correspond to a surrogate code point, or of a
truncated version thereof, is replaced with one U+FFFD, as shown in Table 3-9. (The inter-
pretation of such byte sequences has been forbidden since Unicode 3.2.)
Finally, every byte of a sequence that would correspond to a code point beyond U+10FFFF,
and any other byte that does not contribute to a valid sequence, is also replaced with one
U+FFFD, as shown in Table 3-10.
Only when a sequence of two or three bytes is a truncated version of a sequence which is
otherwise well-formed to that point, is more than one byte replaced with a single U+FFFD,
as shown in Table 3-11.
For a discussion of the generalization of this approach for conversion of other character
sets to Unicode, see Section 5.22, U+FFFD Substitution in Conversion.
3.10 Unicode Encoding Schemes
D99 UTF-32BE encoding scheme: The Unicode encoding scheme that serializes a UTF-
32 code unit sequence as a byte sequence in big-endian format.
• In UTF-32BE, the UTF-32 code unit sequence <0000004D 00000430
00004E8C 00010302> is serialized as <00 00 00 4D 00 00 04 30 00 00 4E 8C 00
01 03 02>.
• In UTF-32BE, an initial byte sequence <00 00 FE FF> is interpreted as
U+FEFF zero width no-break space.
D100 UTF-32LE encoding scheme: The Unicode encoding scheme that serializes a UTF-
32 code unit sequence as a byte sequence in little-endian format.
• In UTF-32LE, the UTF-32 code unit sequence <0000004D 00000430
00004E8C 00010302> is serialized as <4D 00 00 00 30 04 00 00 8C 4E 00 00 02
03 01 00>.
• In UTF-32LE, an initial byte sequence <FF FE 00 00> is interpreted as
U+FEFF zero width no-break space.
D101 UTF-32 encoding scheme: The Unicode encoding scheme that serializes a UTF-32
code unit sequence as a byte sequence in either big-endian or little-endian format.
• In the UTF-32 encoding scheme, the UTF-32 code unit sequence <0000004D
00000430 00004E8C 00010302> is serialized as <00 00 FE FF 00 00 00 4D 00 00
04 30 00 00 4E 8C 00 01 03 02> or <FF FE 00 00 4D 00 00 00 30 04 00 00 8C 4E
00 00 02 03 01 00> or <00 00 00 4D 00 00 04 30 00 00 4E 8C 00 01 03 02>.
• In the UTF-32 encoding scheme, an initial byte sequence corresponding to
U+FEFF is interpreted as a byte order mark; it is used to distinguish between
the two byte orders. An initial byte sequence <00 00 FE FF> indicates big-
endian order, and an initial byte sequence <FF FE 00 00> indicates little-
endian order. The BOM is not considered part of the content of the text.
• The UTF-32 encoding scheme may or may not begin with a BOM. However,
when there is no BOM, and in the absence of a higher-level protocol, the byte
order of the UTF-32 encoding scheme is big-endian.
Table 3-13 gives examples that summarize the three Unicode encoding schemes for the
UTF-32 encoding form.
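The three UTF-32 encoding schemes differ only in byte order and in whether an initial BOM is present. The following minimal sketch serializes UTF-32 code units with java.nio.ByteBuffer; whether to emit a BOM for the unmarked UTF-32 encoding scheme is left to the caller, as described above:
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public final class Utf32Serializer {
    // Serialize UTF-32 code units as bytes, optionally preceded by a BOM.
    static byte[] serialize(int[] codeUnits, ByteOrder order, boolean withBom) {
        ByteBuffer buf = ByteBuffer.allocate(4 * (codeUnits.length + (withBom ? 1 : 0)));
        buf.order(order);
        if (withBom) buf.putInt(0xFEFF);       // U+FEFF serves as the byte order mark
        for (int unit : codeUnits) buf.putInt(unit);
        return buf.array();
    }

    public static void main(String[] args) {
        int[] units = {0x0000004D, 0x00000430, 0x00004E8C, 0x00010302};
        byte[] be = serialize(units, ByteOrder.BIG_ENDIAN, false);    // UTF-32BE
        byte[] le = serialize(units, ByteOrder.LITTLE_ENDIAN, false); // UTF-32LE
        byte[] bom = serialize(units, ByteOrder.BIG_ENDIAN, true);    // UTF-32 scheme with BOM
        System.out.println(be.length + " " + le.length + " " + bom.length); // 16 16 20
    }
}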
The terms UTF-8, UTF-16, and UTF-32, when used unqualified, are ambiguous between
their sense as Unicode encoding forms or Unicode encoding schemes. For UTF-8, this
ambiguity is usually innocuous, because the UTF-8 encoding scheme is trivially derived
from the byte sequences defined for the UTF-8 encoding form. However, for UTF-16 and
UTF-32, the ambiguity is more problematical. As encoding forms, UTF-16 and UTF-32
refer to code units in memory; there is no associated byte orientation, and a BOM is never
used. As encoding schemes, UTF-16 and UTF-32 refer to serialized bytes, as for streaming
data or in files; they may have either byte orientation, and a BOM may be present.
When the usage of the short terms “UTF-16” or “UTF-32” might be misinterpreted, and
where a distinction between their use as referring to Unicode encoding forms or to Uni-
code encoding schemes is important, the full terms, as defined in this chapter of the Uni-
code Standard, should be used. For example, use UTF-16 encoding form or UTF-16
encoding scheme. These terms may also be abbreviated to UTF-16 CEF or UTF-16 CES,
respectively.
When converting between different encoding schemes, extreme care must be taken in han-
dling any initial byte order marks. For example, if one converted a UTF-16 byte serializa-
tion with an initial byte order mark to a UTF-8 byte serialization, thereby converting the
byte order mark to <EF BB BF> in the UTF-8 form, the <EF BB BF> would now be ambig-
uous as to its status as a byte order mark (from its source) or as an initial zero width no-
break space. If the UTF-8 byte serialization were then converted to UTF-16BE and the ini-
tial <EF BB BF> were converted to <FE FF>, the interpretation of the U+FEFF character
would have been modified by the conversion. This would be nonconformant behavior
according to conformance clause C7, because the change between byte serializations
would have resulted in modification of the interpretation of the text. This is one reason
why the use of the initial byte sequence <EF BB BF> as a signature on UTF-8 byte
sequences is not recommended by the Unicode Standard.
3.11 Normalization Forms
Normalization Stability
A very important attribute of the Unicode Normalization Forms is that they must remain
stable between versions of the Unicode Standard. A Unicode string normalized to a partic-
ular Unicode Normalization Form in one version of the standard is guaranteed to remain
in that Normalization Form for implementations of future versions of the standard. In
order to ensure this stability, there are strong constraints on changes of any character
properties that are involved in the specification of normalization—in particular, the com-
bining class and the decomposition of characters. The details of those constraints are
spelled out in the Normalization Stability Policy. See the subsection “Policies” in
Section B.3, Other Unicode Online Resources. The requirement for stability of normalization
also constrains what kinds of characters can be encoded in future versions of the standard.
For an extended discussion of this topic, see Section 3, Versioning and Stability, in Unicode
Standard Annex #15, “Unicode Normalization Forms.”
Combining Classes
Each character in the Unicode Standard has a combining class associated with it. The com-
bining class is a numerical value used by the Canonical Ordering Algorithm to determine
which sequences of combining marks are to be considered canonically equivalent and
which are not. Canonical equivalence is the criterion used to determine whether two char-
acter sequences are considered identical for interpretation.
D104 Combining class: A numeric value in the range 0..254 given to each Unicode code
point, formally defined as the property Canonical_Combining_Class.
• The combining class for each encoded character in the standard is specified in
the file UnicodeData.txt in the Unicode Character Database. Any code point
not listed in that data file defaults to \p{Canonical_Combining_Class=0} (or
\p{ccc=0} for short).
• An extracted listing of combining classes, sorted by numeric value, is provided
in the file DerivedCombiningClass.txt in the Unicode Character Database.
• Only combining marks have a combining class other than zero. Almost all
combining marks with a class other than zero are also nonspacing marks, with
a few exceptions. Also, not all nonspacing marks have a non-zero combining
class. Thus, while the correlation between ^\p{ccc=0} and \p{gc=Mn} is close,
it is not exact, and implementations should not depend on the two concepts
being identical.
D105 Fixed position class: A subset of the range of numeric values for combining classes—
specifically, any value in the range 10..199.
• Fixed position classes are assigned to a small number of Hebrew, Arabic, Syr-
iac, Telugu, Thai, Lao, and Tibetan combining marks whose positions were
conceived of as occurring in a fixed position with respect to their grapheme
base, regardless of any other combining mark that might also apply to the
grapheme base.
• Not all Arabic vowel points or Indic matras are given fixed position classes. The
existence of fixed position classes in the standard is an historical artifact of an
earlier stage in its development, prior to the formal standardization of the Uni-
code Normalization Forms.
D106 Typographic interaction: Graphical application of one nonspacing mark in a posi-
tion relative to a grapheme base that is already occupied by another nonspacing
mark, so that some rendering adjustment must be done (such as default stacking or
side-by-side placement) to avoid illegible overprinting or crashing of glyphs.
The assignment of combining class values for Unicode characters was originally done with
the goal in mind of defining distinct numeric values for each group of nonspacing marks
that would typographically interact. Thus all generic nonspacing marks placed above the
base character are given the same value, \p{ccc=230}, while all generic nonspacing marks
placed below are given the value \p{ccc=220}. Nonspacing marks that tend to sit on one
“shoulder” or another of a grapheme base, or that may actually be attached to the graph-
eme base itself when applied, have their own combining classes.
The design of canonical ordering generally assures that:
• When two combining characters C1 and C2 do typographically interact, the
sequence C1 + C2 is not canonically equivalent to C2 + C1.
• When two combining characters C1 and C2 do not typographically interact,
the sequence C1 + C2 is canonically equivalent to C2 + C1.
This is roughly correct for the normal cases of detached, generic nonspacing marks placed
above and below base letters. However, the ramifications of complex rendering for many
scripts ensure that there are always some edge cases involving typographic interaction
between combining marks of distinct combining classes. This has turned out to be particu-
larly true for some of the fixed position classes for Hebrew and Arabic, for which a distinct
combining class is no guarantee that there will be no typographic interaction for rendering.
Because of these considerations, particular combining class values should be taken only as
a guideline regarding issues of typographic interaction of combining marks.
The only normative use of combining class values is as input to the Canonical Ordering
Algorithm, where they are used to normatively distinguish between sequences of combin-
ing marks that are canonically equivalent and those that are not.
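The effect of the Canonical Ordering Algorithm is a stable reordering of each sequence of non-starters into non-decreasing order of combining class, with starters acting as barriers. The following sketch illustrates that effect as a simple exchange sort; the combining class lookup is passed in as a function, since the values come from UnicodeData.txt (an implementation might obtain them from a character property library such as ICU):
import java.util.function.IntUnaryOperator;

public final class CanonicalOrdering {
    // Reorder a sequence of code points so that any two adjacent characters
    // with non-zero combining classes appear in non-decreasing ccc order.
    // Starters (ccc = 0) are never moved.
    static void canonicalOrder(int[] codePoints, IntUnaryOperator ccc) {
        boolean swapped = true;
        while (swapped) {
            swapped = false;
            for (int i = 1; i < codePoints.length; i++) {
                int cccA = ccc.applyAsInt(codePoints[i - 1]);
                int cccB = ccc.applyAsInt(codePoints[i]);
                // Exchange only when both are non-starters and out of order.
                if (cccB != 0 && cccA > cccB) {
                    int tmp = codePoints[i - 1];
                    codePoints[i - 1] = codePoints[i];
                    codePoints[i] = tmp;
                    swapped = true;
                }
            }
        }
    }

    public static void main(String[] args) {
        // Hard-coded ccc lookup for the demo only: U+0301 (ccc=230), U+0316 (ccc=220).
        IntUnaryOperator demoCcc = cp -> cp == 0x0301 ? 230 : cp == 0x0316 ? 220 : 0;
        int[] seq = { 0x0061, 0x0301, 0x0316 };  // a, acute above, grave below
        canonicalOrder(seq, demoCcc);
        for (int cp : seq) System.out.printf("U+%04X ", cp); // U+0061 U+0316 U+0301
        System.out.println();
    }
}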
Starters
D107 Starter: Any code point (assigned or not) with combining class of zero (ccc = 0).
• Note that ccc = 0 is the default value for the Canonical_Combining_Class
property, so that all reserved code points are Starters by definition. Noncharac-
ters are also Starters by definition. All control characters, format characters,
and private-use characters are also Starters.
The term Starter refers, in concept, to the starting character of a combining character
sequence (D56), because all combining character sequences except defective combining
character sequences (D57) commence with a ccc = 0 character—in other words, they start
with a Starter. However, because the specification of Unicode Normalization Forms must
apply to all possible coded character sequences, and not just to typical combining character
sequences, the behavior of a code point for Unicode Normalization Forms is specified
entirely in terms of its status as a Starter or a non-starter, together with its Decomposi-
tion_Mapping value.
Table 3-15 gives some examples of sequences of characters, showing which of them consti-
tute a Reorderable Pair and the reasons for that determination. Except for the base charac-
ter “a”, the other characters in the example table are combining marks; character names are
abbreviated in the Sequence column to make the examples clearer.
Definitions
The following definitions use the Hangul_Syllable_Type property, which is defined in the
UCD file HangulSyllableType.txt.
D122 Leading consonant: A character with the Hangul_Syllable_Type property value
Leading_Jamo. Abbreviated as L.
• When not occurring in clusters, the term leading consonant is equivalent to syl-
lable-initial character.
D123 Choseong: A sequence of one or more leading consonants.
• In Modern Korean, a choseong consists of a single jamo. In Old Korean, a
sequence of more than one leading consonant may occur.
• Equivalent to syllable-initial cluster.
D124 Choseong filler: U+115F hangul choseong filler. Abbreviated as Lf.
• A choseong filler stands in for a missing choseong to make a well-formed
Korean syllable.
D125 Vowel: A character with the Hangul_Syllable_Type property value Vowel_Jamo.
Abbreviated as V.
• When not occurring in clusters, the term vowel is equivalent to syllable-peak
character.
D126 Jungseong: A sequence of one or more vowels.
• In Modern Korean, a jungseong consists of a single jamo. In Old Korean, a
sequence of more than one vowel may occur.
• Equivalent to syllable-peak cluster.
• This definition is used in Unicode Standard Annex #29, “Unicode Text Seg-
mentation,” as part of the algorithm for determining syllable boundaries in a
sequence of conjoining jamo characters.
Arithmetic Decomposition Mapping. If the precomposed Hangul syllable s with the index
SIndex (defined above) has the Hangul_Syllable_Type value LV, then it has a canonical
decomposition mapping into a sequence of an L jamo and a V jamo, <LPart, VPart>:
LIndex = SIndex div NCount
VIndex = (SIndex mod NCount) div TCount
LPart = LBase + LIndex
VPart = VBase + VIndex
If the precomposed Hangul syllable s with the index SIndex (defined above) has the Han-
gul_Syllable_Type value LVT, then it has a canonical decomposition mapping into a
sequence of an LV_Syllable and a T jamo, <LVPart, TPart>:
LVIndex = (SIndex div TCount) * TCount
TIndex = SIndex mod TCount
LVPart = SBase + LVIndex
TPart = TBase + TIndex
In this specification, the “div” operator refers to integer division (rounded down). The
“mod” operator refers to the modulo operation, equivalent to the integer remainder for
positive numbers.
The canonical decomposition mappings calculated this way are equivalent to the values of
the Unicode character property Decomposition_Mapping (dm), for each precomposed
Hangul syllable.
Full Canonical Decomposition. The full canonical decomposition for a Unicode character
is defined as the recursive application of canonical decomposition mappings. The canoni-
cal decomposition mapping of an LVT_Syllable contains an LVPart which itself is a pre-
composed Hangul syllable and thus must be further decomposed. However, it is simplest
to unwind the recursion and directly calculate the resulting <LPart, VPart, TPart>
sequence instead. For full canonical decomposition of a precomposed Hangul syllable,
compute the indices and components as follows:
LIndex = SIndex div NCount
VIndex = (SIndex mod NCount) div TCount
TIndex = SIndex mod TCount
LPart = LBase + LIndex
VPart = VBase + VIndex
TPart = TBase + TIndex if TIndex > 0
If TIndex = 0, then there is no trailing consonant, so map the precomposed Hangul syllable
s to its full decomposition d = <LPart, VPart>. Otherwise, there is a trailing consonant, so
map s to its full decomposition d = <LPart, VPart, TPart>.
Example. For the precomposed Hangul syllable U+D4DB, compute the indices and com-
ponents:
SIndex = 10459
LIndex = 17
VIndex = 16
TIndex = 15
LPart = LBase + 17 = 1111₁₆
VPart = VBase + 16 = 1171₁₆
TPart = TBase + 15 = 11B6₁₆
Then map the precomposed syllable to the calculated sequence of components, which con-
stitute its full canonical decomposition:
U+D4DB → <U+1111, U+1171, U+11B6>
Note that the canonical decomposition mapping for U+D4DB would be <U+D4CC,
U+11B6>, but in computing the full canonical decomposition, that sequence would only
be an intermediate step.
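The arithmetic just described translates directly into code. The following minimal sketch computes the full canonical decomposition of a precomposed Hangul syllable; the constant values match those used elsewhere in this section, and the code assumes its argument actually is a precomposed Hangul syllable:
public final class HangulDecomposition {
    static final int SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7;
    static final int VCount = 21, TCount = 28;
    static final int NCount = VCount * TCount;  // 588
    static final int SCount = 19 * NCount;      // 11172

    // Full canonical decomposition of a precomposed Hangul syllable.
    static int[] decompose(int s) {
        int SIndex = s - SBase;
        if (SIndex < 0 || SIndex >= SCount) {
            throw new IllegalArgumentException("Not a precomposed Hangul syllable");
        }
        int LIndex = SIndex / NCount;
        int VIndex = (SIndex % NCount) / TCount;
        int TIndex = SIndex % TCount;
        int LPart = LBase + LIndex;
        int VPart = VBase + VIndex;
        if (TIndex == 0) {
            return new int[] { LPart, VPart };             // LV syllable
        }
        return new int[] { LPart, VPart, TBase + TIndex }; // LVT syllable
    }

    public static void main(String[] args) {
        // U+D4DB decomposes to <U+1111, U+1171, U+11B6>, as in the example above.
        for (int cp : decompose(0xD4DB)) System.out.printf("U+%04X ", cp);
        System.out.println();
    }
}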
cases in which Hangul data is not canonically decomposed. Given a sequence <LVPart,
TPart>, where the LVPart is a precomposed Hangul syllable of Hangul_Syllable_Type LV,
and where the TPart is in the range U+11A8..U+11C2, compute the index and syllable
mapping:
TIndex = TPart - TBase
s = LVPart + TIndex
Example. For the canonically decomposed Hangul jamo sequence <U+1111, U+1171,
U+11B6>, compute the indices and syllable mapping:
LIndex = 17
VIndex = 16
TIndex = 15
LVIndex = 17 * 588 + 16 * 28 = 9996 + 448 = 10444
s = AC00₁₆ + 10444 + 15 = D4DB₁₆
Then map the Hangul jamo sequence to this precomposed Hangul syllable as its Primary
Composite:
<U+1111, U+1171, U+11B6> → U+D4DB
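The composition direction can be sketched the same way. The following illustration maps a jamo sequence <L, V> or <L, V, T> to its Primary Composite using the same constants; it shows only the index arithmetic and performs no validation of its inputs:
public final class HangulComposition {
    static final int SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7;
    static final int VCount = 21, TCount = 28;

    // Compose <L, V> or <L, V, T> into a precomposed Hangul syllable.
    // Pass T = 0 when there is no trailing consonant.
    static int compose(int L, int V, int T) {
        int LIndex = L - LBase;
        int VIndex = V - VBase;
        int TIndex = (T == 0) ? 0 : T - TBase;
        return SBase + (LIndex * VCount + VIndex) * TCount + TIndex;
    }

    public static void main(String[] args) {
        // <U+1111, U+1171, U+11B6> composes to U+D4DB, as in the example above.
        System.out.printf("U+%04X%n", compose(0x1111, 0x1171, 0x11B6));
    }
}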
Example. For the precomposed Hangul syllable U+D4DB, construct the full canonical
decomposition:
U+D4DB → <U+1111, U+1171, U+11B6>
Look up the Jamo_Short_Name values for each of the Hangul jamo in the canonical
decomposition:
JSNL = Jamo_Short_Name(U+1111) = "P"
JSNV = Jamo_Short_Name(U+1171) = "WI"
JSNT = Jamo_Short_Name(U+11B6) = "LH"
                last += TIndex;
                result.setCharAt(result.length()-1, last); // reset last
                continue; // discard ch
            }
        }
        // if neither case was true, just add the character
        last = ch;
        result.append(ch);
    }
    return result.toString();
}
Hangul Character Name Generation. Hangul decomposition is also used when generat-
ing the names for precomposed Hangul syllables. This is apparent in the following sample
method for constructing a Hangul syllable name. The content of the three tables used in
this method can be derived from the data file Jamo.txt in the Unicode Character Database.
public static String getHangulName(char s) {
    int SIndex = s - SBase;
    if (0 > SIndex || SIndex >= SCount) {
        throw new IllegalArgumentException("Not a Hangul Syllable: " + s);
    }
    int LIndex = SIndex / NCount;
    int VIndex = (SIndex % NCount) / TCount;
    int TIndex = SIndex % TCount;
    return "HANGUL SYLLABLE " + JAMO_L_TABLE[LIndex]
        + JAMO_V_TABLE[VIndex] + JAMO_T_TABLE[TIndex];
}
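As a usage illustration, assuming JAMO_L_TABLE, JAMO_V_TABLE, and JAMO_T_TABLE have been populated with the Jamo_Short_Name values from Jamo.txt, getHangulName('\uD4DB') concatenates the values "P", "WI", and "LH" shown in the example above and returns "HANGUL SYLLABLE PWILH".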
Definitions
The full case mappings for Unicode characters are obtained by using the mappings from
SpecialCasing.txt plus the mappings from UnicodeData.txt, excluding any of the latter
mappings that would conflict. Any character that does not have a mapping in these files is
considered to map to itself. The full case mappings of a character C are referred to as Low-
ercase_Mapping(C), Titlecase_Mapping(C), and Uppercase_Mapping(C). The full case
folding of a character C is referred to as Case_Folding(C).
Detection of case and case mapping requires more than just the General_Category values
(Lu, Lt, Ll). The following definitions are used:
D135 A character C is defined to be cased if and only if C has the Lowercase or Uppercase
property or has a General_Category value of Titlecase_Letter.
• The Uppercase and Lowercase property values are specified in the data file
DerivedCoreProperties.txt in the Unicode Character Database. The derived
property Cased is also listed in DerivedCoreProperties.txt.
D136 A character C is defined to be case-ignorable if C has the value MidLetter (ML),
MidNumLet (MB), or Single_Quote (SQ) for the Word_Break property or its Gen-
eral_Category is one of Nonspacing_Mark (Mn), Enclosing_Mark (Me), Format
(Cf ), Modifier_Letter (Lm), or Modifier_Symbol (Sk).
• The Word_Break property is defined in the data file WordBreakProperty.txt in
the Unicode Character Database.
• The derived property Case_Ignorable is listed in the data file DerivedCore-
Properties.txt in the Unicode Character Database.
• The Case_Ignorable property is defined for use in the context specifications of
Table 3-17. It is a narrow-use property, and is not intended for use in other
contexts. The more broadly applicable string casing function, isCased(X), is
defined in D143.
D137 Case-ignorable sequence: A sequence of zero or more case-ignorable characters.
D138 A character C is in a particular casing context for context-dependent matching if and
only if it matches the corresponding specification in Table 3-17.
In Table 3-17, a description of each context is followed by the equivalent regular expres-
sion(s) describing the context before C, the context after C, or both. The regular expres-
sions use the syntax of Unicode Technical Standard #18, “Unicode Regular Expressions,”
with one addition: “!” means that the expression does not match. All of the regular expres-
sions are case-sensitive.
The regular-expression operator * in Table 3-17 is “possessive,” consuming as many char-
acters as possible, with no backup. This is significant in the case of Final_Sigma, because
the sets of case-ignorable and cased characters are not disjoint: for example, they both con-
tain U+0345 ypogegrammeni. Thus, the Before condition is not satisfied if C is preceded
by only U+0345, but would be satisfied by the sequence <capital-alpha, ypogegrammeni>.
Similarly, the After condition is satisfied if C is only followed by ypogegrammeni, but would
not be satisfied by the sequence <ypogegrammeni, capital-alpha>.
The default case conversion operations may be tailored for specific requirements. A com-
mon variant, for example, is to make use of simple case conversion, rather than full case
conversion. Language- or locale-specific tailorings of these rules may also be used.
For more information on the use of NFKC_Casefold and caseless matching for identifiers,
see Unicode Standard Annex #31, “Unicode Identifier and Pattern Syntax.”
Only when a string, such as “123”, contains no cased letters will all three conditions
(isLowercase, isUppercase, and isTitlecase) evaluate as true. This combination of condi-
tions can be used to check for the presence of cased letters, using the following definition:
D143 isCased(X): isCased(X) is true when isLowercase(X) is false, or isUppercase(X) is
false, or isTitlecase(X) is false.
• Any string X for which isCased(X) is true contains at least one character that
has a case mapping other than to itself.
• For example, isCased(“123”) is false because all the characters in “123” have
case mappings to themselves, while isCased(“abc”) and isCased(“A12”) are
both true.
• The derived binary property Changes_When_Casemapped is listed in the data
file DerivedCoreProperties.txt in the Unicode Character Database.
To find out whether a string contains only lowercase letters, implementations need to test
for (isLowercase(X) and isCased(X)).
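A minimal sketch of that test follows. It approximates the string functions with Java's built-in case conversion, using Locale.ROOT to avoid language-specific tailorings and ignoring the titlecase mapping, which the platform does not expose as a string operation; it is an approximation of the definitions above, not a conformant implementation:
import java.util.Locale;

public final class CaseStatus {
    // isLowercase(X): toLowercase(X) == X (approximated with String.toLowerCase).
    static boolean isLowercase(String x) {
        return x.equals(x.toLowerCase(Locale.ROOT));
    }

    // isCased(X): true if at least one character changes under some case mapping,
    // approximated here by comparing against the lowercase and uppercase mappings.
    static boolean isCased(String x) {
        return !(x.equals(x.toLowerCase(Locale.ROOT)) && x.equals(x.toUpperCase(Locale.ROOT)));
    }

    public static void main(String[] args) {
        System.out.println(isLowercase("abc") && isCased("abc")); // true: only lowercase letters
        System.out.println(isLowercase("123") && isCased("123")); // false: no cased letters at all
        System.out.println(isLowercase("A12") && isCased("A12")); // false: contains an uppercase letter
    }
}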
When comparing strings for case-insensitive equality, the strings should also be normal-
ized for most correct results. For example, the case folding of U+00C5 Å latin capital
letter a with ring above is U+00E5 å latin small letter a with ring above,
whereas the case folding of the sequence <U+0041 “A” latin capital letter a, U+030A
combining ring above> is the sequence <U+0061 “a” latin small letter a, U+030A
combining ring above>. Simply doing a binary comparison of the results of case folding
both strings will not catch the fact that the resulting case-folded strings are canonical-
equivalent sequences. In principle, normalization needs to be done after case folding,
because case folding does not preserve the normalized form of strings in all instances. This
requirement for normalization is covered in the following definition for canonical caseless
matching:
D145 A string X is a canonical caseless match for a string Y if and only if:
NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y)))
The invocations of canonical decomposition (NFD normalization) before case folding in
D145 are to catch very infrequent edge cases. Normalization is not required before case
folding, except for the character U+0345 combining greek ypogegrammeni and any
characters that have it as part of their canonical decomposition, such as U+1FC3 ῃ greek
small letter eta with ypogegrammeni. In practice, optimized versions of canonical
caseless matching can catch these special cases, thereby avoiding an extra normalization
step for each comparison.
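D145 can be implemented directly once a Default Case Folding operation is available. The Java platform does not expose toCasefold, so the following sketch assumes that ICU4J is available and uses its UCharacter.foldCase method, with java.text.Normalizer supplying NFD; it is an illustration of the definition, not a vetted implementation:
import java.text.Normalizer;
import com.ibm.icu.lang.UCharacter;  // assumed: ICU4J provides Default Case Folding

public final class CanonicalCaselessMatch {
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    static String toCasefold(String s) {
        return UCharacter.foldCase(s, true);  // true selects the default case folding mappings
    }

    // D145: X is a canonical caseless match for Y if and only if
    //       NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y)))
    static boolean canonicalCaselessMatch(String x, String y) {
        return nfd(toCasefold(nfd(x))).equals(nfd(toCasefold(nfd(y))));
    }

    public static void main(String[] args) {
        // Å (precomposed) and A + combining ring above match canonically and caselessly.
        System.out.println(canonicalCaselessMatch("\u00C5", "A\u030A")); // true
    }
}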
In some instances, implementers may wish to ignore compatibility differences between
characters when comparing strings for case-insensitive equality. The correct way to do this
makes use of the following definition for compatibility caseless matching:
D146 A string X is a compatibility caseless match for a string Y if and only if:
NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) =
NFKD(toCasefold(NFKD(toCasefold(NFD(Y)))))
Compatibility caseless matching requires an extra cycle of case folding and normalization
for each string compared, because the NFKD normalization of a compatibility character
such as U+3392 square mhz may result in a sequence of alphabetic characters which must
again be case folded (and normalized) to be compared correctly.
Caseless matching for identifiers can be simplified and optimized by using the NFKC_-
Casefold mapping. That mapping incorporates internally the derived results of iterated
case folding and NFKD normalization. It also maps away characters with the property
value Default_Ignorable_Code_Point = True, which should not make a difference when
comparing identifiers.
The following defines identifier caseless matching:
D147 A string X is an identifier caseless match for a string Y if and only if:
toNFKC_Casefold(NFD(X)) = toNFKC_Casefold(NFD(Y))
Chapter 4
Character Properties
Disclaimer
The content of all character property tables has been verified as far as possible by
the Unicode Consortium. However, in case of conflict, the most authoritative
version of the information for this version of the Unicode Standard is that sup-
plied in the Unicode Character Database on the Unicode website. The contents of
all the tables in this chapter may be superseded or augmented by information in
future versions of the Unicode Standard.
The Unicode Standard associates a rich set of semantics with characters and, in some
instances, with code points. The support of character semantics is required for confor-
mance; see Section 3.2, Conformance Requirements. Where character semantics can be
expressed formally, they are provided as machine-readable lists of character properties in
the Unicode Character Database (UCD). This chapter gives an overview of character prop-
erties, their status and attributes, followed by an overview of the UCD and more detailed
notes on some important character properties. For a further discussion of character prop-
erties, see Unicode Technical Report #23, “Unicode Character Property Model.”
Status and Attributes. Character properties may be normative, informative, contributory,
or provisional. Normative properties are those required for conformance. Many Unicode
character properties can be overridden by implementations as needed. Section 3.2, Confor-
mance Requirements, specifies when such overrides must be documented. A few properties,
such as Noncharacter_Code_Point, may not be overridden. See Section 3.5, Properties, for
the formal discussion of the status and attributes of properties.
Consistency of Properties. The Unicode Standard is the product of many compromises. It
has to strike a balance between uniformity of treatment for similar characters and compat-
ibility with existing practice for characters inherited from legacy encodings. Because of this
balancing act, one can expect a certain number of anomalies in character properties. For
example, some pairs of characters might have been treated as canonical equivalents but are
left unequivalent for compatibility with legacy differences. This situation pertains to
U+00B5 μ micro sign and U+03BC μ greek small letter mu, as well as to certain
Korean jamo.
In addition, some characters might have had properties differing in some ways from those
assigned in this standard, but those properties are left as is for compatibility with existing
practice. This situation can be seen with the halfwidth voicing marks for Japanese
Data file changes are associated with specific, numbered versions of the standard; charac-
ter properties are never silently corrected between official versions.
Each version of the Unicode Character Database, once published, is absolutely stable and
will never change. Implementations or specifications that refer to a specific version of the
UCD can rely upon this stability. Detailed policies on character encoding stability as they
relate to properties are found on the Unicode website. See the subsection “Policies” in
Section B.3, Other Unicode Online Resources. See also the discussion of versioning and sta-
bility in Section 3.1, Versions of the Unicode Standard.
Aliases. Character properties and their values are given formal aliases to make it easier to
refer to them consistently in specifications and in implementations, such as regular expres-
sions, which may use them. These aliases are listed exhaustively in the Unicode Character
Database, in the data files PropertyAliases.txt and PropertyValueAliases.txt.
Many of the aliases have both a long form and a short form. For example, the General Cat-
egory has a long alias “General_Category” and a short alias “gc”. The long alias is more
comprehensible and is usually used in the text of the standard when referring to a particu-
lar character property. The short alias is more appropriate for use in regular expressions
and other algorithmic contexts.
In comparing aliases programmatically, loose matching is appropriate. That entails ignor-
ing case differences and any whitespace, underscore, and hyphen characters. For example,
“GeneralCategory”, “general_category”, and “GENERAL-CATEGORY” would all be con-
sidered equivalent property aliases. See Unicode Standard Annex #44, “Unicode Character
Database,” for further discussion of property and property value matching.
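Loose matching can be implemented by reducing each alias to a key before comparison. The following sketch defines a hypothetical helper that ignores case and drops whitespace, underscore, and hyphen characters; see Unicode Standard Annex #44 for the complete loose-matching rules:
public final class AliasMatching {
    // Normalize a property or property value alias for loose matching:
    // ignore case and drop whitespace, underscore, and hyphen characters.
    static String looseKey(String alias) {
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < alias.length(); i++) {
            char c = alias.charAt(i);
            if (c == '_' || c == '-' || Character.isWhitespace(c)) continue;
            key.append(Character.toLowerCase(c));
        }
        return key.toString();
    }

    public static void main(String[] args) {
        // All three forms reduce to the same key, "generalcategory".
        System.out.println(looseKey("GeneralCategory"));
        System.out.println(looseKey("general_category"));
        System.out.println(looseKey("GENERAL-CATEGORY"));
    }
}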
For each character property whose values are not purely numeric, the Unicode Character
Database provides a list of value aliases. For example, one of the values of the Line_Break
property is given the long alias “Open_Punctuation” and the short alias “OP”.
Property aliases and property value aliases can be combined in regular expressions that
pick out a particular value of a particular property. For example, “\p{lb=OP}” means the
Open_Punctuation value of the Line_Break property, and “\p{gc=Lu}” means the Upper-
case_Letter value of the General_Category property.
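As a small illustration, java.util.regex accepts the gc= form of the General_Category alias shown above; support for other properties, such as Line_Break, varies by regular-expression engine and is not assumed here:
import java.util.regex.Pattern;

public final class PropertyRegexDemo {
    public static void main(String[] args) {
        // \p{gc=Lu} matches characters whose General_Category is Uppercase_Letter.
        Pattern upper = Pattern.compile("\\p{gc=Lu}+");
        System.out.println(upper.matcher("ABC").matches()); // true
        System.out.println(upper.matcher("abc").matches()); // false
    }
}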
Property aliases define a namespace. No two character properties have the same alias. For
each property, the set of corresponding property value aliases constitutes its own name-
space. No constraint prevents property value aliases for different properties from having
the same property value alias. Thus “B” is the short alias for the Paragraph_Separator value
of the Bidi_Class property; “B” is also the short alias for the Below value of the Canonical_-
Combining_Class property. However, because of the namespace restrictions, any combi-
nation of a property alias plus an appropriate property value alias is guaranteed to
constitute a unique string, as in “\p{bc=B}” versus “\p{ccc=B}”.
For a recommended use of property and property value aliases, see Unicode Technical
Standard #18, “Unicode Regular Expressions.” Aliases are also used for normatively refer-
encing properties, as described in Section 3.1, Versions of the Unicode Standard.
UCD in XML. Starting with Unicode Version 5.1.0, the complete Unicode Character Data-
base is also available formatted in XML. This includes both the non-Han part of the Uni-
code Character Database and all of the content of the Unihan Database. For details
regarding the XML schema, file names, grouping conventions, and other considerations,
see Unicode Standard Annex #42, “Unicode Character Database in XML.”
Online Availability. All versions of the UCD are available online on the Unicode website.
See the subsections “Online Unicode Character Database” and “Online Unihan Database”
in Section B.3, Other Unicode Online Resources.
4.2 Case
Case is a normative property of characters in certain alphabets whereby characters are con-
sidered to be variants of a single letter. These variants, which may differ markedly in shape
and size, are called the uppercase letter (also known as capital or majuscule) and the lower-
case letter (also known as small or minuscule). The uppercase letter is generally larger than
the lowercase letter.
Because of the inclusion of certain composite characters for compatibility, such as U+01F1
latin capital letter dz, a third case, called titlecase, is used where the first character of a
word must be capitalized. An example of such a character is U+01F2 latin capital letter
d with small letter z. The three case forms are UPPERCASE, Titlecase, and lowercase.
For those scripts that have case (Latin, Greek, Coptic, Cyrillic, Glagolitic, Armenian,
archaic Georgian, Deseret, and Warang Citi), uppercase characters typically contain the
word capital in their names. Lowercase characters typically contain the word small. How-
ever, this is not a reliable guide. The word small in the names of characters from scripts
other than those just listed has nothing to do with case. There are other exceptions as well,
such as small capital letters that are not formally uppercase. Some Greek characters with
capital in their names are actually titlecase. (Note that while the archaic Georgian script
contained upper- and lowercase pairs, they are not used in modern Georgian. See
Section 7.7, Georgian.)
primarily on their letterforms. The additional characters are included in the derivations by
means of the contributory properties, Other_Lowercase and Other_Uppercase, defined in
PropList.txt. For example, Other_Lowercase adds the various modifier letters that are let-
terlike in shape, the circled lowercase letter symbols, and the compatibility lowercase
Roman numerals. Other_Uppercase adds the circled uppercase letter symbols, and the
compatibility uppercase Roman numerals.
A third set of definitions for case is fundamentally different in kind, and does not consist of
character properties at all. The functions isLowercase and isUppercase are string functions
returning a binary True/False value. These functions are defined in Section 3.13, Default
Case Algorithms, and depend on case mapping relations, rather than being based on letter-
forms per se. Basically, isLowercase is True for a string if the result of applying the toLow-
ercase mapping operation to the string is the same as the string itself.
Table 4-1 illustrates the various possibilities for how these definitions interact, as applied to
exemplary single characters or single character strings.
Note that for “caseless” characters, such as U+02B0, U+1D34, and U+02BD, isLowerCase
and isUpperCase are both True, because the inclusion of a caseless letter in a string is not
criterial for determining the casing of the string—a caseless letter always case maps to itself.
On the other hand, all modifier letters derived from letter shapes are also notionally lower-
case, whether the letterform itself is a minuscule or a majuscule in shape. Thus U+1D34
modifier letter capital h is actually Lowercase = True. Other modifier letters not
derived from letter shapes, such as U+02BD, are neither Lowercase nor Uppercase.
The string functions isLowerCase and isUpperCase also apply to strings longer than one
character, of course, for which the character properties General_Category, Lowercase, and
Uppercase are not relevant. In Table 4-2, the string function isTitleCase is also illustrated,
to show its applicability for the same strings.
Programmers concerned with manipulating Unicode strings should generally be dealing
with the string functions such as isLowerCase (and its functional cousin, toLowerCase),
unless they are working directly with single character properties. Care is always advised,
however, when dealing with case in the Unicode Standard, as expectations based simply on
the behavior of the basic Latin alphabet (A..Z, a..z) do not generalize easily across the entire
repertoire of Unicode characters, and because case for modifier letters, in particular, can
result in unexpected behavior.
Case Mapping
The default case mapping tables defined in the Unicode Standard are normative, but may
be overridden to match user or implementation requirements. The Unicode Character
Database contains four files with case mapping information, as shown in Table 4-3. Full
case mappings for Unicode characters are obtained by using the basic mappings from
UnicodeData.txt and extending or overriding them where necessary with the mappings
from SpecialCasing.txt. Full case mappings may depend on the context surrounding the
character in the original string.
Some characters have a “best” single-character mapping in UnicodeData.txt as well as a full
mapping in SpecialCasing.txt. Any character that does not have a mapping in these files is
considered to map to itself. For more information on case mappings, see Section 5.18, Case
Mappings.
A set of charts that show the latest case mappings is also available on the Unicode website.
See “Charts” in Section B.3, Other Unicode Online Resources.
4.3 Combining Classes
tional placement with regard to a base letter, as described earlier. However, in the case of
the combining marks representing vowels (and sometimes consonants) in the Brahmi-
derived scripts and other abugidas, all of the combining marks are given the normative
combining class of zero, regardless of their positional placement within an aksara. The
placement and rendering of a class zero combining mark cannot be derived from its com-
bining class alone, but rather depends on having more information about the particulars of
the script involved. In some instances, the position may migrate in different historical peri-
ods for a script or may even differ depending on font style.
The identification of matras in Indic scripts is provided in the data file IndicSyllabicCate-
gory.txt in the Unicode Character Database. Information about their positional placement
can be found in the data file IndicPositionalCategory.txt. The following text in this section
subcategorizes some of the class zero combining marks for Brahmi-derived scripts, point-
ing out significant types that need to be handled consistently, and relating their positional
placement to the particular values documented in IndicPositionalCategory.txt.
Reordrant Class Zero Combining Marks. In many instances in Indic scripts, a vowel is
represented in logical order after the consonant of a syllable, but is displayed before (to the
left of ) the consonant when rendered. Such combining marks are termed reordrant to
reflect their visual reordering to the left of a consonant (or, in some instances, a consonant
cluster). Special handling is required for selection and editing of these marks. In particular,
the possibility that the combining mark may be reordered to the left side past a cluster, and
not simply past the immediate preceding character in the backing store, requires attention
to the details for each script involved.
The visual reordering of these reordrant class zero combining marks has nothing to do
with the reordering of combining character sequences in the Canonical Ordering Algo-
rithm. All of these marks are class zero and thus are never reordered by the Canonical
Ordering Algorithm for normalization. The reordering is purely a presentational issue for
glyphs during rendering of text.
Reordrant class zero combining marks correspond to the list of characters with Indic_Po-
sitional_Category = Left.
In addition, there are historically related vowel characters in the Thai, Lao, New Tai Lue,
and Tai Viet scripts that are not treated as combining marks. Instead, for these scripts, such
vowels are represented in the backing store in visual order and require no reordering for
rendering. The trade-off is that they have to be rearranged for correct sorting. Because of
that processing requirement, these characters are given a formal character property assign-
ment, the Logical_Order_Exception property. See PropList.txt in the Unicode Character
Database. The list of characters with the Logical_Order_Exception property is the same as
those documented with the value Indic_Positional_Category = Visual_Order_Left in
IndicPositionalCategory.txt.
Split Class Zero Combining Marks. In addition to the reordrant class zero combining
marks, there are a number of class zero combining marks whose representative glyph typi-
cally consists of two parts, which are split into different positions with respect to the conso-
nant (or consonant cluster) in an aksara. Sometimes these glyphic pieces are rendered
both to the left and the right of a consonant. Sometimes one piece is rendered above or
below the consonant and the other piece is rendered to the left or the right. Particularly in
the instances where some piece of the glyph is rendered to the left of the consonant, these
split class zero combining marks pose similar implementation problems as for the reor-
drant marks.
The split class zero combining marks have various Indic_Positional_Category values such
as Left_And_Right, Top_And_Bottom, Top_And_Right, Top_And_Left, and so forth. See
IndicPositionalCategory.txt for the full listing.
One should pay very careful attention to all split class zero combining marks in implemen-
tations. Not only do they pose issues for rendering and editing, but they also often have
canonical equivalences defined involving the separate pieces, when those pieces are also
encoded as characters. As a consequence, the split combining marks may constitute excep-
tional cases under normalization. Some of the Tibetan split combining marks are depre-
cated.
The split vowels also pose difficult problems for understanding the standard, as the phono-
logical status of the vowel phonemes, the encoding status of the characters (including any
canonical equivalences), and the graphical status of the glyphs are easily confused, both for
native users of the script and for engineers working on implementations of the standard.
Subjoined Class Zero Combining Marks. Brahmi-derived scripts that are not represented
in the Unicode Standard with a virama may have class zero combining marks to represent
subjoined forms of consonants. These correspond graphologically to what would be repre-
sented by a sequence of virama plus consonant in other related scripts. The subjoined con-
sonants do not pose particular rendering problems, at least not in comparison to other
combining marks, but they should be noted as constituting an exception to the normal pat-
tern in Brahmi-derived scripts of consonants being represented with base letters. This
exception needs to be taken into account when doing linguistic processing or searching
and sorting.
Subjoined class zero combining marks are listed with the value Indic_Syllabic_Category =
Consonant_Subjoined in IndicSyllabicCategory.txt.
Strikethrough Class Zero Combining Marks. The Kharoshthi script is unique in having
some class zero combining marks for vowels that are struck through a consonant, rather
than being placed in a position around the consonant. These strikethrough combining
marks may involve particular problems for implementations. In addition to the Kharoshthi
vowels, there are a number of combining svarita marks for Vedic texts which are also ren-
dered as overstruck forms. These Kharoshthi vowels and Vedic svarita marks have the
property value Indic_Positional_Category = Overstruck in IndicPositionalCategory.txt.
4.4 Directionality
Directional behavior is interpreted according to the Unicode Bidirectional Algorithm (see
Unicode Standard Annex #9, “Unicode Bidirectional Algorithm”). For this purpose, all
characters of the Unicode Standard possess a normative directional type, defined by the
Bidi_Class (bc) property in the Unicode Character Database. The directional types left-to-
right and right-to-left are called strong types, and characters of these types are called strong
directional characters. Left-to-right types include most alphabetic and syllabic characters
as well as all Han ideographic characters. Right-to-left types include the letters of predom-
inantly right-to-left scripts, such as Arabic, Hebrew, and Syriac, as well as most punctua-
tion specific to those scripts. In addition, the Unicode Bidirectional Algorithm uses weak
types and neutrals. Interpretation of directional properties according to the Unicode Bidi-
rectional Algorithm is needed for layout of right-to-left scripts such as Arabic and Hebrew.
4.5 General Category
Identifier and Pattern Syntax,” and in regular expression languages such as Perl. For more
information, see Unicode Technical Standard #18, “Unicode Regular Expressions.”
This property is also used to support common APIs such as isDigit(). Common func-
tions such as isLetter() and isUppercase() do not extend well to the larger and more
complex repertoire of Unicode. While it is possible to naively extend these functions to
Unicode using the General_Category and other properties, they will not work for the
entire range of Unicode characters and the kinds of tasks for which people intend them.
For more appropriate approaches, see Unicode Standard Annex #31, “Unicode Identifier
and Pattern Syntax”; Unicode Standard Annex #29, “Unicode Text Segmentation”;
Section 5.18, Case Mappings; and Section 4.10, Letters, Alphabetic, and Ideographic.
Although the General_Category property is normative, and its values are used in the deri-
vation of many other properties referred to by Unicode algorithms, it does not follow that
the General_Category always provides the most appropriate classification of a character
for any given purpose. Implementations are not required to treat characters solely accord-
ing to their General_Category values when classifying them in various contexts. The fol-
lowing examples illustrate some typical cases in which an implementation might
reasonably diverge from General_Category values for a character when grouping charac-
ters as “punctuation,” “symbols,” and so forth.
• A character picker application might classify U+0023 # number sign among
symbols, or perhaps under both symbols and punctuation.
• An “Ignore Punctuation” option for a search might choose not to ignore
U+0040 @ commercial at.
• A layout engine might treat U+0021 ! exclamation mark as a mathematical
operator in the context of a mathematical equation, and lay it out differently
than if the same character were used as terminal punctuation in text.
• A regular expression syntax could provide an operator to match all punctua-
tion, but include characters other than those limited to gc = P (for example,
U+00A7 § section sign).
The general rule is that if an implementation purports to be using the Unicode General_-
Category property, then it must use the exact values specified in the Unicode Character
Database for that claim to be conformant. Thus, if a regular expression syntax explicitly
supports the Unicode General_Category property and matches gc = P, then that match
must be based on the precise UCD values.
4.6 Numeric Value
Ideographic accounting numbers are commonly used on checks and other financial instru-
ments to minimize the possibilities of misinterpretation or fraud in the representation of
numerical values. The set of accounting numbers varies somewhat between Japanese, Chi-
nese, and Korean usage. Table 4-6 gives a fairly complete listing of the known accounting
characters. Some of these characters are ideographs with other meanings pressed into ser-
vice as accounting numbers; others are used only as accounting numbers.
In Japan, U+67D2 is also pronounced urusi, meaning “lacquer,” and is treated as a variant
of the standard character for “lacquer,” U+6F06.
The Unihan Database gives the most up-to-date and complete listing of primary numeric
ideographs and ideographs used as accounting numbers, including those for CJK reper-
toire extensions beyond the Unified Repertoire and Ordering. See Unicode Standard
Annex #38, “Unicode Han Database (Unihan),” for more details.
mat controls to override text layout direction, add mirrored glyphs to a font used for paleo-
graphic display, and make the display choice depend on resolved direction for a directional
run. HL3 “Emulate explicit directional formatting characters” in the UBA also allows a
higher-level protocol to use other techniques such as style sheets or markup to override
text directionality in structured text. In combination, such techniques can provide for the
layout requirements of paleographic scripts which may mirror letters or signs depending
on text layout direction. See the discussions of directionality and text layout in the respec-
tive sections regarding each script.
Related Properties. The Bidi Mirrored property is not to be confused with the related,
informative Bidi Mirroring Glyph property, which lists pairs of characters whose represen-
tative glyphs are mirror images of each other. The Unicode Bidirectional Algorithm also
requires two related, normative properties, Bidi Paired Bracket and Bidi Paired Bracket
Type, which are used for matching specific bracket pairs and to assign the same text direc-
tion to both members of each pair in bidirectional processing for text layout. These proper-
ties do not affect mirroring. For more information, see BidiMirroring.txt and
BidiBrackets.txt in the Unicode Character Database.
4.8 Name
Unicode characters have names that serve as unique identifiers for each character. The
character names in the Unicode Standard are identical to those of the English-language
edition of ISO/IEC 10646.
Where possible, character names are derived from existing conventional names of a char-
acter or symbol in English, but in many cases the character names nevertheless differ from
traditional names widely used by relevant user communities. The character names of sym-
bols and punctuation characters often describe their shape, rather than their function,
because these characters are used in many different contexts. See also “Color Words in
Unicode Character Names” in Section 22.9, Miscellaneous Symbols.
Character names are listed in the code charts. Currently, the character with the longest
name is U+FBF9 arabic ligature uighur kirghiz yeh with hamza above with alef
maksura isolated form (Version 1.1) with 83 letters and spaces in its name, and the one
with the shortest name is U+1F402 ox (Version 6.0) with only two letters in its name.
Stability. Once assigned, a character name is immutable. It will never be changed in subse-
quent versions of the Unicode Standard. Implementers and users can rely on the fact that a
character name uniquely represents a given character.
Character Name Syntax. Unicode character names, as listed in the code charts, contain
only uppercase Latin letters A through Z, digits, space, and hyphen-minus. In more detail,
character names reflect the following rules:
R1 Only Latin capital letters A to Z (U+0041..U+005A), ASCII digits (U+0030..
U+0039), U+0020 space, and U+002D hyphen-minus occur in character names.
R2 Digits do not occur as the first character of a character name, nor immediately fol-
lowing a space character.
R3 U+002D hyphen-minus does not occur as the first or last character of a character
name, nor immediately between two spaces, nor immediately preceding or follow-
ing another hyphen-minus character. (In other words, multiple occurrences of
U+002D in sequence are not allowed.)
R4 A space does not occur as the first or last character of a character name, nor imme-
diately preceding or following another space character. (In other words, multiple
spaces in sequence are not allowed.)
See Appendix A, Notational Conventions, for the typographical conventions used when
printing character names in the text of the standard.
Names as Identifiers. Character names are constructed so that they can easily be trans-
posed into formal identifiers in another context, such as a computer language. Because
Unicode character names do not contain any underscore (“_”) characters, a common strat-
egy is to replace any hyphen-minus or space in a character name by a single “_” when con-
structing a formal identifier from a character name. This strategy automatically results in a
syntactically correct identifier in most formal languages. Furthermore, such identifiers are
guaranteed to be unique, because of the special rules for character name matching.
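A minimal sketch of this strategy (the method name nameToIdentifier is hypothetical):

    static String nameToIdentifier(String characterName) {
        // Replace each space or hyphen-minus with a single underscore.
        return characterName.replace(' ', '_').replace('-', '_');
    }
    // nameToIdentifier("ZERO WIDTH NO-BREAK SPACE") returns "ZERO_WIDTH_NO_BREAK_SPACE"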
Character Name Matching. When matching identifiers transposed from character names,
it is possible to ignore case, whitespace, and all medial hyphen-minus characters (or any “_”
replacing a hyphen-minus), except for the hyphen-minus in U+1180 hangul jungseong
o-e, and still result in a unique match. For example, “ZERO WIDTH SPACE” is equivalent
to “zero-width-space” or “ZERO_WIDTH_SPACE” or “ZeroWidthSpace”. However,
“TIBETAN LETTER A” should not match “TIBETAN LETTER -A”, because in that
instance the hyphen-minus is not medial between two letters, but is instead preceded by a
space. For more information on character name matching, see Section 5.9, “Matching
Rules” in Unicode Standard Annex #44, “Unicode Character Database.”
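The following sketch derives a loose-matching key along these lines. The method name looseMatchKey is hypothetical, and U+1180 hangul jungseong o-e is handled only as a simple special case; see UAX #44 for the authoritative matching rule.

    static String looseMatchKey(String name) {
        if (name.replace('_', ' ').equalsIgnoreCase("HANGUL JUNGSEONG O-E")) {
            return "hangul jungseong o-e";                // keep this one hyphen
        }
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c == ' ' || c == '_') continue;           // ignore whitespace and "_"
            if (c == '-' && i > 0 && i + 1 < name.length()
                    && name.charAt(i - 1) != ' ' && name.charAt(i + 1) != ' ') {
                continue;                                  // ignore medial hyphens
            }
            key.append(Character.toLowerCase(c));
        }
        return key.toString();
    }
    // looseMatchKey("ZeroWidthSpace") and looseMatchKey("ZERO WIDTH SPACE") are equal;
    // looseMatchKey("TIBETAN LETTER -A") and looseMatchKey("TIBETAN LETTER A") are not.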
Named Character Sequences. Occasionally, character sequences are also given a norma-
tive name in the Unicode Standard. The names for such sequences are taken from the same
namespace as character names, and are also unique. For details, see Unicode Standard
Annex #34, “Unicode Named Character Sequences.” Named character sequences are not
listed in the code charts; instead, they are listed in the file NamedSequences.txt in the Uni-
code Character Database.
The names for named character sequences are also immutable. Once assigned, they will
never be changed in subsequent versions of the Unicode Standard.
Character Name Aliases. The Unicode Standard has a mechanism for the publication of
additional, normative formal aliases for characters. These formal aliases are known as char-
acter name aliases. (See Definition D5 in Section 3.3, Semantics.) They function essentially
as auxiliary names for a character. The original reason for defining character name aliases
was to provide corrections for known mistakes in character names, but they have also
proven useful for other purposes, as documented here.
Character name aliases are listed in the file NameAliases.txt in the Unicode Character
Database. That file also documents the type field which distinguishes among different
kinds of character name aliases, as shown in Table 4-7.
Character name aliases are immutable, once published. (See Definition D42 in Section 3.5,
Properties.) They follow the same syntax rules as character names and are also guaranteed
to be unique in the Unicode namespace for character names. This attribute makes charac-
ter name aliases useful as identifiers. A character may, in principle, have more than one
normative character name alias, but each distinct character name alias uniquely identifies
only a single code point.
The first type of character name alias consists of corrections for known mistakes in charac-
ter names. Sometimes errors in a character name are only discovered after publication of a
version of the Unicode Standard. Because character names are immutable, such errors are
not corrected by changing the names after publication. However, in some limited instances
(as for obvious typos in the name), a character name alias is defined instead.
For example, the following Unicode character name has a well-known spelling error in it:
U+FE18 presentation form for vertical right white lenticular brakcet
Because the spelling error could not be corrected after publication of the data files which
first contained it, a character name alias with the corrected spelling was defined:
U+FE18 presentation form for vertical right white lenticular bracket
Character name aliases are provided for additional reasons besides corrections of errors in
the character names. For example, there are character name aliases which give definitive
labels to control codes, which have no actual Unicode character names:
U+0009 horizontal tabulation
Character name aliases of type alternate are for widely used alternate names of Unicode
format characters. Currently only one such alternate is normatively defined, but it is for an
important character:
U+FEFF byte order mark
Among the control codes there are a few which have had names propagate through the
computer implementation “lore,” despite the fact that they refer to ISO/IEC 10646 control
functions that were never formally adopted. These names are defined as character name
aliases of type figment, and are included in NameAliases.txt, because they occur in some
widely distributed implementations, such as the regex engine for Perl. Examples include:
U+0081 high octet preset
Additional character name aliases match existing and widely used abbreviations (or acro-
nyms) for control codes and for Unicode format characters:
U+0009 tab
U+200B zwsp
Specifying these additional, normative character name aliases serves two major functions.
First, it provides a set of well-defined aliases for use in regular expression matching and
searching, where users might expect to be able to use established names or abbreviations
for control codes and the like, but where those names or abbreviations are not part of the
actual Unicode Name property. Second, because character name aliases are guaranteed to
be unique in the Unicode character name namespace, having them defined for control
codes and abbreviations prevents the potential for accidental collisions between de facto
current use and names which might be chosen in the future for newly encoded Unicode
characters.
It is acceptable and expected for external specifications to make normative references to
Unicode characters using one (or more) of their normative character name aliases, where
such references make sense. For example, when discussing Unicode encoding schemes and
the role of U+FEFF as a signature for byte order, it would not make much sense to insist on
referring to U+FEFF by its name zero width no-break space, when use of the character
name alias byte order mark or the widely used abbreviation bom would communicate
with less confusion.
A subset of character name aliases is listed in the code charts, using special typographical
conventions explained in Section 24.1, Character Names List.
A normative character name alias is distinct from the informative aliases listed in the code
charts. Informative aliases merely point out other common names in use for a given char-
acter. Informative aliases are not immutable and are not guaranteed to be unique; they
therefore cannot serve as an identifier for a character. Their main purposes are to help
readers of the standard to locate and to identify particular characters.
For example, the name of U+4E00 is cjk unified ideograph-4e00, constructed by con-
catenation of “cjk unified ideograph-” and the code point. Similarly, the character name
of U+17000 is tangut ideograph-17000.
NR3 For all other Graphic characters and for all Format characters, the Name prop-
erty value is as explicitly listed in Field 1 of UnicodeData.txt.
For example, U+0A15 gurmukhi letter ka or U+200D zero width joiner.
NR4 For all other Unicode code points of all other types (Control, Private-Use, Surro-
gate, Noncharacter, and Reserved), the value of the Name property is the null
string. In other words, na = “”.
The ranges of Hangul syllables and most ideographic characters subject to the name deri-
vation rules NR1 and NR2 are identified by a special convention in Field 1 of Unicode-
Data.txt. The start and end of each range are indicated by a pair of entries in the data file in
the general format:
NNNN;<RANGENAME, First>;Lo;0;L;;;;;N;;;;;
NNNN;<RANGENAME, Last>;Lo;0;L;;;;;N;;;;;
This convention originated as a compression technique for UnicodeData.txt, as all of the
UnicodeData.txt properties of these ranges were uniform, and the names for the characters
in the ranges could be specified by rule. Note that the same convention is used in Unicode-
Data.txt to specify properties for code point types which have a null string as their Name
property value, such as private use characters.
CJK compatibility ideographs are an exception. They have names derived by rule NR2, but
are explicitly listed in UnicodeData.txt with their names, because they typically have non-
uniform character properties, including most notably a nontrivial canonical decomposi-
tion value.
The exact ranges subject to name derivation rules NR1 and NR2, and the specified prefix
strings are summarized in Table 4-8.
Twelve of the CJK ideographs in the starred range in Table 4-8, in the CJK Compatibility
Ideographs block, are actually CJK unified ideographs. Nonetheless, their names are con-
structed with the “cjk compatibility ideograph-” prefix shared by all other code points
in that block. The status of a CJK ideograph as a unified ideograph cannot be deduced
from the Name property value for that ideograph; instead, the dedicated binary property
Unified_Ideograph should be used to determine that status. See “CJK Compatibility Ideo-
graphs” in Section 18.1, Han, and Section 4.4, “Listing of Characters Covered by the Uni-
han Database” in Unicode Standard Annex #38, “Unihan Database,” for more details about
these exceptional twelve CJK ideographs.
The generic term “character name” refers to the Name property value for an encoded Uni-
code character. An expression such as, “The reserved code point U+30000 has no name,” is
shorthand for the more precise statement that the reserved code point U+30000 (as for all
code points of type Reserved) has a property value of na = “” for the Name property.
Name Uniqueness. The Unicode Name property values are unique for all non-null values,
but not every Unicode code point has a unique Unicode Name property value. Further-
more, because Unicode character names, character name aliases, and named character
sequences constitute a single, unique namespace, the Name property value uniqueness
requirement applies to all three kinds of names.
Interpretation of Field 1 of UnicodeData.txt. Where Field 1 of UnicodeData.txt contains a
string enclosed in angle brackets, “<” and “>”, such a string is not a character name, but a
meta-label indicating some other information—for example, the start or end of a character
range. In these cases, the Name property value for that code point is either empty (na = “”)
or is given by one of the rules described above. In all other cases, the value of Field 1 (that
is, the string of characters between the first and second semicolon separators on each line)
corresponds to the normative value of the Name property for that code point.
Control Codes. The Unicode Standard does not define character names for control codes
(characters with General_Category = Cc). In other words, all control codes have a property
value of na = “” for the Name property. Control codes are instead listed in UnicodeData.txt
with a special label “<control>” in Field 1. This value is not a character name, but instead
indicates the code point type (see Definition D10a in Section 3.4, Characters and Encoding).
For control characters, the values of the informative Unicode 1.0 name property (Uni-
code_1_Name) in Field 10 match the names of the associated control functions from ISO/
IEC 6429. (See Section 4.9, Unicode 1.0 Names.)
For each code point type without character names, code point labels are constructed by
using a lowercase prefix derived from the code point type, followed by a hyphen-minus and
then a 4- to 6-digit hexadecimal representation of the code point. The label construction
for the five affected code point types is illustrated in Table 4-9.
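A sketch of the label construction (codePointLabel is a hypothetical helper; the prefix is one of the lowercase type prefixes from Table 4-9):

    static String codePointLabel(String typePrefix, int codePoint) {
        // typePrefix is "control", "private-use", "surrogate", "noncharacter", or "reserved",
        // followed by a hyphen-minus and at least four hexadecimal digits.
        return String.format("%s-%04X", typePrefix, codePoint);
    }
    // codePointLabel("noncharacter", 0xFFFF) returns "noncharacter-FFFF"
    // codePointLabel("reserved", 0x30000)    returns "reserved-30000"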
To avoid any possible confusion with actual, non-null Name property values, constructed
Unicode code point labels are often displayed between angle brackets: <control-0009>,
<noncharacter-FFFF>, and so on. This convention is used consistently in the data files for
the Unicode Character Database.
A constructed code point label is distinguished from the designation of the code point itself
(for example, “U+0009” or “U+FFFF”), which is also a unique identifier, as described in
Appendix A, Notational Conventions.
Chapter 5
Implementation Guidelines 5
It is possible to implement a substantial subset of the Unicode Standard as “wide ASCII”
with little change to existing programming practice. However, the Unicode Standard also
provides for languages and writing systems that have more complex behavior than English
does. Whether one is implementing a new operating system from the ground up or
enhancing existing programming environments or applications, it is necessary to examine
many aspects of current programming practice and conventions to deal with this more
complex behavior.
This chapter covers a series of short, self-contained topics that are useful for implementers.
The information and examples presented here are meant to help implementers understand
and apply the design and features of the Unicode Standard. That is, they are meant to pro-
mote good practice in implementations conforming to the Unicode Standard.
These recommended guidelines are not normative and are not binding on the imple-
menter, but are intended to represent best practice. When implementing the Unicode
Standard, it is important to look not only at the letter of the conformance rules, but also at
their spirit. Many of the following guidelines have been created specifically to assist people
who run into issues with conformant implementations, while reflecting the requirements
of actual usage.
5.1 Data Structures for Character Conversion
Issues
Conversion of characters between standards is not always a straightforward proposition.
Many characters have mixed semantics in one standard and may correspond to more than
one character in another. Sometimes standards give duplicate encodings for the same char-
acter; at other times the interpretation of a whole set of characters may depend on the appli-
cation. Finally, there are subtle differences in what a standard may consider a character.
For these reasons, mapping tables are usually required to map between the Unicode Stan-
dard and another standard. Mapping tables need to be used consistently for text data
exchange to avoid modification and loss of text data. For details, see Unicode Technical
Standard #22, “Character Mapping Markup Language (CharMapML).” By contrast, con-
versions between different Unicode encoding forms are fast, lossless permutations.
There are important security issues associated with encoding conversion. For more infor-
mation, see Unicode Technical Report #36, “Unicode Security Considerations.”
The Unicode Standard can be used as a pivot to transcode among n different standards.
This process, which is sometimes called triangulation, reduces the number of mapping
tables that an implementation needs from O(n²) to O(n).
Multistage Tables
Tables require space. Even small character sets often map to characters from several differ-
ent blocks in the Unicode Standard and thus may contain up to 64K entries (for the BMP)
or 1,088K entries (for the entire codespace) in at least one direction. Several techniques
exist to reduce the memory space requirements for mapping tables. These techniques apply
not only to transcoding tables, but also to many other tables needed to implement the Uni-
code Standard, including character property data, case mapping, collation tables, and
glyph selection tables.
Flat Tables. If diskspace is not at issue, virtual memory architectures yield acceptable
working set sizes even for flat tables because the frequency of usage among characters dif-
fers widely. Even small character sets contain many infrequently used characters. In addi-
tion, data intended to be mapped into a given character set generally does not contain
characters from all blocks of the Unicode Standard (usually, only a few blocks at a time
need to be transcoded to a given character set). This situation leaves certain sections of the
mapping tables unused—and therefore paged to disk. The effect is most pronounced for
large tables mapping from the Unicode Standard to other character sets, which have large
sections simply containing mappings to the default character, or the “unmappable charac-
ter” entry.
Ranges. It may be tempting to “optimize” these tables for space by providing elaborate pro-
visions for nested ranges or similar devices. This practice leads to unnecessary perfor-
mance costs on modern, highly pipelined processor architectures because of branch
penalties. A faster solution is to use an optimized two-stage table, which can be coded with-
out any test or branch instructions. Hash tables can also be used for space optimization,
although they are not as fast as multistage tables.
Two-Stage Tables. Two-stage tables are a commonly employed mechanism to reduce table
size (see Figure 5-1). They use an array of pointers and a default value. If a pointer is NULL,
the value returned by a lookup operation in the table is the default value. Otherwise, the
pointer references a block of values used for the second stage of the lookup. For BMP char-
acters, it is quite efficient to organize such two-stage tables in terms of high byte and low
byte values. The first stage is an array of 256 pointers, and each of the secondary blocks
contains 256 values indexed by the low byte in the code point. For supplementary charac-
ters, it is often advisable to structure the pointers and second-stage arrays somewhat differ-
ently, so as to take best advantage of the very sparse distribution of supplementary
characters in the remaining codespace.
Optimized Two-Stage Table. Wherever any blocks are identical, the pointers just point to
the same block. For transcoding tables, this case occurs generally for a block containing
only mappings to the default or “unmappable” character. Instead of using NULL pointers
and a default value, one “shared” block of default entries is created. This block is pointed to
by all first-stage table entries, for which no character value can be mapped. By avoiding
tests and branches, this strategy provides access time that approaches the simple array
access, but at a great savings in storage.
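The following sketch shows the core of such an optimized two-stage lookup for BMP code points. The names and table contents (for example, UNMAPPABLE) are illustrative only; building the real blocks, and extending the scheme to supplementary code points, is not shown.

    static final char UNMAPPABLE = 0xFFFD;            // illustrative default value
    static final char[] DEFAULT_BLOCK = new char[256];
    static final char[][] STAGE1 = new char[256][];   // first stage, indexed by the high byte
    static {
        java.util.Arrays.fill(DEFAULT_BLOCK, UNMAPPABLE);
        java.util.Arrays.fill(STAGE1, DEFAULT_BLOCK); // every block starts as the shared default
        // ...table building would then overwrite selected STAGE1 entries with real blocks...
    }
    static char lookup(char bmpCodePoint) {
        // No tests or branches: unmapped blocks all point at DEFAULT_BLOCK.
        return STAGE1[bmpCodePoint >>> 8][bmpCodePoint & 0xFF];
    }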
Multistage Table Tuning. Given a table of arbitrary size and content, it is a relatively simple
matter to write a small utility that can calculate the optimal number of stages and their
width for a multistage table. Tuning the number of stages and the width of their arrays of
index pointers can result in various trade-offs of table size versus average access time.
5.2 Programming Languages and Data Types
ANSI/ISO C wchar_t. With the wchar_t wide character type, ANSI/ISO C provides for
inclusion of fixed-width, wide characters. ANSI/ISO C leaves the semantics of the wide
character set to the specific implementation but requires that the characters from the por-
table C execution set correspond to their wide character equivalents by zero extension. The
Unicode characters in the ASCII range U+0020 to U+007E satisfy these conditions. Thus,
if an implementation uses ASCII to code the portable C execution set, the use of the Uni-
code character set for the wchar_t type, in either UTF-16 or UTF-32 form, fulfills the
requirement.
The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently,
programs that need to be portable across any C or C++ compiler should not use wchar_t
for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide
characters, which may be Unicode characters in some compilers. However, programmers
who want a UTF-16 implementation can use a macro or typedef (for example, UNICHAR)
that can be compiled as unsigned short or wchar_t depending on the target compiler
and platform. Other programmers who want a UTF-32 implementation can use a macro or
typedef that might be compiled as unsigned int or wchar_t, depending on the target
compiler and platform. This choice enables correct compilation on different platforms and
compilers. Where a 16-bit implementation of wchar_t is guaranteed, such macros or
typedefs may be predefined (for example, TCHAR on the Win32 API).
On systems where the native character type or wchar_t is implemented as a 32-bit quan-
tity, an implementation may use the UTF-32 form to represent Unicode characters.
A limitation of the ISO/ANSI C model is its assumption that characters can always be pro-
cessed in isolation. Implementations that choose to go beyond the ISO/ANSI C model may
find it useful to mix widths within their APIs. For example, an implementation may have a
32-bit wchar_t and process strings in any of the UTF-8, UTF-16, or UTF-32 forms.
Another implementation may have a 16-bit wchar_t and process strings as UTF-8 or
UTF-16, but have additional APIs that process individual characters as UTF-32 or deal
with pairs of UTF-16 code units.
5.3 Unknown and Missing Characters
tracks the version of the standard at which a particular character was added to the stan-
dard. This information can be particularly helpful in some interactions with downlevel sys-
tems. If the protocol used for communication between the systems provides for an
announcement of the Unicode version on each one, an uplevel system can predict which
recently added characters will appear as unassigned characters to the downlevel system.
5.6 Normalization
Alternative Spellings. The Unicode Standard contains explicit codes for the most fre-
quently used accented characters. These characters can also be composed; in the case of
accented letters, characters can be composed from a base character and nonspacing
mark(s).
The Unicode Standard provides decompositions for characters that can be composed
using a base character plus one or more nonspacing marks. The decomposition mappings
are specific to a particular version of the Unicode Standard. Further decomposition map-
pings may be added to the standard for new characters encoded in the future; however, no
existing decomposition mapping for a currently encoded character will ever be removed or
changed, nor will a decomposition mapping be added for a currently encoded character.
These constraints on changes for decomposition are enforced by the Normalization Stabil-
ity Policy. See the subsection “Policies” in Section B.3, Other Unicode Online Resources.
Normalization. Systems may normalize Unicode-encoded text to one particular sequence,
such as normalizing composite character sequences into precomposed characters, or vice
versa (see Figure 5-2).
[Figure 5-2: unnormalized text normalized either to a precomposed or to a decomposed representation.]
Compared to the number of possible combinations, only a relatively small number of pre-
composed base character plus nonspacing marks have independent Unicode character val-
ues.
Systems that cannot handle nonspacing marks can normalize to precomposed characters;
this option can accommodate most modern Latin-based languages. Such systems can use
fallback rendering techniques to at least visually indicate combinations that they cannot
handle (see the “Fallback Rendering” subsection of Section 5.13, Rendering Nonspacing
Marks).
In systems that can handle nonspacing marks, it may be useful to normalize so as to elimi-
nate precomposed characters. This approach allows such systems to have a homogeneous
representation of composed characters and maintain a consistent treatment of such char-
acters. However, in most cases, it does not require too much extra work to support mixed
forms, which is the simpler route.
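With a normalization API such as java.text.Normalizer (assumed in this sketch), the choice amounts to selecting the target normalization form:

    import java.text.Normalizer;

    static void demo() {
        System.out.println(Normalizer.normalize("a\u0308", Normalizer.Form.NFC).equals("\u00E4")); // true: a + U+0308 composes to ä
        System.out.println(Normalizer.normalize("\u00E4", Normalizer.Form.NFD).equals("a\u0308")); // true: ä decomposes to a + U+0308
    }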
The Unicode Normalization Forms are defined in Section 3.11, Normalization Forms. For
further information about implementation of normalization, see also Unicode Standard
Annex #15, “Unicode Normalization Forms.” For a general discussion of issues related to
normalization, see “Equivalent Sequences” in Section 2.2, Unicode Design Principles; and
Section 2.11, Combining Characters.
5.7 Compression
Using the Unicode character encoding may increase the amount of storage or memory
space dedicated to the text portion of files. Compressing Unicode-encoded files or strings
can therefore be an attractive option if the text portion is a large part of the volume of data
compared to binary and numeric data, and if the processing overhead of the compression
and decompression is acceptable.
Compression always constitutes a higher-level protocol and makes interchange dependent
on knowledge of the compression method employed. For a detailed discussion of compres-
sion and a standard compression scheme for Unicode, see Unicode Technical Standard #6,
“A Standard Compression Scheme for Unicode.”
Encoding forms defined in Section 2.5, Encoding Forms, have different storage characteris-
tics. For example, as long as text contains only characters from the Basic Latin (ASCII)
block, it occupies the same amount of space whether it is encoded with the UTF-8 or ASCII
codes. Conversely, text consisting of CJK ideographs encoded with UTF-8 will require
more space than equivalent text encoded with UTF-16.
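A small illustration of these storage characteristics (a Java fragment; the byte counts follow directly from the definitions of the encoding forms):

    import java.nio.charset.StandardCharsets;

    static void demo() {
        System.out.println("hello".getBytes(StandardCharsets.UTF_8).length);                  // 5, same as ASCII
        System.out.println("hello".getBytes(StandardCharsets.UTF_16BE).length);               // 10
        System.out.println("\u65E5\u672C\u8A9E".getBytes(StandardCharsets.UTF_8).length);     // 9 (three CJK ideographs)
        System.out.println("\u65E5\u672C\u8A9E".getBytes(StandardCharsets.UTF_16BE).length);  // 6
    }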
For processing rather than storage, the Unicode encoding form is usually selected for easy
interoperability with existing APIs. Where there is a choice, the trade-off between decoding
complexity (high for UTF-8, low for UTF-16, trivial for UTF-32) and memory and cache
bandwidth (high for UTF-32, low for UTF-8 or UTF-16) should be considered.
5.8 Newline Guidelines
Definitions
Table 5-1 provides hexadecimal values for the acronyms used in these guidelines. The acro-
nyms shown in Table 5-1 correspond to characters or sequences of characters. The name
column shows the usual names used to refer to the characters in question, whereas the
other columns show the Unicode, ASCII, and EBCDIC encoded values for the characters.
Encoding. Except for LS and PS, the newline characters discussed here are encoded as con-
trol codes. Many control codes were originally designed for device control but, together
with TAB, the newline characters are commonly used as part of plain text. For more infor-
mation on how Unicode encodes control codes, see Section 23.1, Control Codes.
Notation. This discussion of newline guidelines uses lowercase when referring to functions
having to do with line determination, but uses the acronyms when referring to the actual
characters involved. Keys on keyboards are indicated in all caps. For example:
The line separator may be expressed by LS in Unicode text or CR on
some platforms. It may be entered into text with the SHIFT-RETURN
key.
EBCDIC. Table 5-1 shows the two mappings of LF and NEL used by EBCDIC systems.
The first EBCDIC column shows the default control code mapping of these characters,
which is used in most EBCDIC environments. The second column shows the z/OS Unix
System Services mapping of LF and NEL. That mapping arises from the use of the LF char-
acter for the newline function in C programs and in Unix environments, while text files on
z/OS traditionally use NEL for the newline function.
NEL (next line) is not actually defined in 7-bit ASCII. It is defined in the ISO control func-
tion standard, ISO 6429, as a C1 control function. However, the 0x85 mapping shown in
the ASCII column in Table 5-1 is the usual way that this C1 control function is mapped in
ASCII-based character encodings.
Newline Function. The acronym NLF (newline function) stands for the generic control
function for indication of a new line break. It may be represented by different characters,
depending on the platform, as shown in Table 5-2.
A record separator is used to separate records. For example, when exchanging tabular data,
a common format is to tab-separate the cells and use a CRLF at the end of a line of cells.
This function is not precisely the same as line separation, but the same characters are often
used.
Traditionally, NLF started out as a line separator (and sometimes record separator). It is
still used as a line separator in simple text editors such as program editors. As platforms
and programs started to handle word processing with automatic line-wrap, these charac-
ters were reinterpreted to stand for paragraph separators. For example, even such simple
programs as the Windows Notepad program and the Mac SimpleText program interpret
their platform’s NLF as a paragraph separator, not a line separator.
Once NLF was reinterpreted to stand for a paragraph separator, in some cases another
control character was pressed into service as a line separator. For example, vertical tabula-
tion VT is used in Microsoft Word. However, the choice of character for line separator is
even less standardized than the choice of character for NLF.
Many Internet protocols and a lot of existing text treat NLF as a line separator, so an imple-
menter cannot simply treat NLF as a paragraph separator in all circumstances.
Recommendations
The Unicode Standard defines two unambiguous separator characters: U+2029 para-
graph separator (PS) and U+2028 line separator (LS). In Unicode text, the PS and LS
characters should be used wherever the desired function is unambiguous. Otherwise, the
following recommendations specify how to cope with an NLF when converting from other
character sets to Unicode, when interpreting characters in text, and when converting from
Unicode to other character sets.
Note that even if an implementer knows which characters represent NLF on a particular
platform, CR, LF, CRLF, and NEL should be treated the same on input and in interpreta-
tion. Only on output is it necessary to distinguish between them.
Converting from Other Character Code Sets
R1 If the exact usage of any NLF is known, convert it to LS or PS.
R1a If the exact usage of any NLF is unknown, remap it to the platform NLF.
Recommendation R1a does not really help in interpreting Unicode text unless the imple-
menter is the only source of that text, because another implementer may have left in LF,
CR, CRLF, or NEL.
Interpreting Characters in Text
R2 Always interpret PS as paragraph separator and LS as line separator.
R2a In word processing, interpret any NLF the same as PS.
R2b In simple text editors, interpret any NLF the same as LS.
In line breaking, both PS and LS terminate a line; therefore, the Unicode Line Breaking
Algorithm in Unicode Standard Annex #14, “Unicode Line Breaking Algorithm,” is defined
such that any NLF causes a line break.
R2c In parsing, choose the safest interpretation.
For example, in recommendation R2c an implementer dealing with sentence break heuris-
tics would reason in the following way that it is safer to interpret any NLF as LS:
• Suppose an NLF were interpreted as LS, when it was meant to be PS. Because
most paragraphs are terminated with punctuation anyway, this would cause
misidentification of sentence boundaries in only a few cases.
• Suppose an NLF were interpreted as PS, when it was meant to be LS. In this
case, line breaks would cause sentence breaks, which would result in significant
problems with the sentence break heuristics.
Converting to Other Character Code Sets
R3 If the intended target is known, map NLF, LS, and PS depending on the target con-
ventions.
For example, when mapping to Microsoft Word’s internal conventions for documents, LS
would be mapped to VT, and PS and any NLF would be mapped to CRLF.
R3a If the intended target is unknown, map NLF, LS, and PS to the platform newline
convention (CR, LF, CRLF, or NEL).
In Java, for example, this is done by mapping to a string nlf, defined as follows:
String nlf = System.getProperty("line.separator");
Input and Output
R4 A readline function should stop at NLF, LS, FF, or PS. In the typical implemen-
tation, it does not include the NLF, LS, PS, or FF that caused it to stop.
Because the separator is lost, the use of such a readline function is limited to text pro-
cessing, where there is no difference among the types of separators.
R4a A writeline (or newline) function should convert NLF, LS, and PS according
to the recommendations R3 and R3a.
In C, gets is defined to terminate at a newline and replaces the newline with '\0', while
fgets is defined to terminate at a newline and includes the newline in the array into which
it copies the data. C implementations interpret '\n' either as LF or as the underlying plat-
form newline NLF, depending on where it occurs. EBCDIC C compilers substitute the rel-
evant codes, based on the EBCDIC execution set.
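A sketch of such a readline in Java (readLine is a hypothetical helper; it stops at CR, LF, CRLF, NEL, FF, LS, or PS and, as in the typical implementation described above, discards the terminator; end-of-file handling is simplified):

    static String readLine(java.io.PushbackReader in) throws java.io.IOException {
        StringBuilder line = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (c == '\n' || c == 0x000C || c == 0x0085
                    || c == 0x2028 || c == 0x2029) {
                break;                                   // LF, FF, NEL, LS, or PS
            }
            if (c == '\r') {                             // CR, possibly the start of CRLF
                int next = in.read();
                if (next != -1 && next != '\n') in.unread(next);
                break;
            }
            line.append((char) c);
        }
        return line.toString();
    }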
Page Separator
FF is commonly used as a page separator, and it should be interpreted that way in text.
When displaying on the screen, it causes the text after the separator to be forced to the next
page. It is interpreted in the same way as the LS for line breaking, in parsing, or in input
(See Unicode Standard Annex #29, "Unicode Text Segmentation," for the definition of default
grapheme clusters and for a discussion of how grapheme clusters can be tailored to meet
the needs of defining arbitrary cluster boundaries.)
Atomic Character Boundaries. The use of atomic character boundaries is closest to selec-
tion of individual Unicode characters. However, most modern systems indicate selection
with some sort of rectangular highlighting. This approach places restrictions on the consis-
tency of editing because some sequences of characters do not linearly progress from the
start of the line. When characters stack, two mechanisms are used to visually indicate par-
tial selection: linear and nonlinear boundaries.
Linear Boundaries. Use of linear boundaries treats the entire width of the resultant glyph
as belonging to the first character of the sequence, and the remaining characters in the
backing-store representation as having no width and being visually afterward.
This option is the simplest mechanism. The advantage of this system is that it requires very
little additional implementation work. The disadvantage is that it is never easy to select
narrow characters, let alone a zero-width character. Mechanically, it requires the user to
select just to the right of the nonspacing mark and drag just to the left. It also does not
allow the selection of individual nonspacing marks if more than one is present.
Nonlinear Boundaries. Use of nonlinear boundaries divides any stacked element into
parts. For example, picking a point halfway across a lam + meem ligature can represent the
division between the characters. One can either allow highlighting with multiple rectangles
or use another method such as coloring the individual characters.
With more work, a precomposed character can behave in deletion as if it were a composed
character sequence with atomic character boundaries. This procedure involves deriving
the character’s decomposition on the fly to get the components to be used in simulation.
For example, deletion occurs by decomposing, removing the last character, then recom-
posing (if more than one character remains). However, this technique does not work in
general editing and selection.
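A sketch of that deletion step, using a normalization API such as java.text.Normalizer (deleteLastComponent is a hypothetical helper):

    static String deleteLastComponent(String precomposed) {
        String nfd = java.text.Normalizer.normalize(precomposed, java.text.Normalizer.Form.NFD);
        int cut = nfd.offsetByCodePoints(nfd.length(), -1);   // drop the last code point
        return java.text.Normalizer.normalize(nfd.substring(0, cut), java.text.Normalizer.Form.NFC);
    }
    // deleteLastComponent("\u00E9") ("é") returns "e"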
In most editing systems, the code point is the smallest addressable item, so the selection
and assignment of properties (such as font, color, letterspacing, and so on) cannot be done
on any finer basis than the code point. Thus the accent on an “e” could not be colored dif-
ferently than the base in a precomposed character, although it could be colored differently
if the text were stored internally in a decomposed form.
Just as there is no single notion of text element, so there is no single notion of editing char-
acter boundaries. At different times, users may want different degrees of granularity in the
editing process. Two methods suggest themselves. First, the user may set a global prefer-
ence for the character boundaries. Second, the user may have alternative command mech-
anisms, such as Shift-Delete, which give more (or less) fine control than the default mode.
5.12 Strategies for Handling Nonspacing Marks
(See Unicode Standard Annex #14, "Unicode Line Breaking Algorithm"; and Unicode Standard Annex
#29, "Unicode Text Segmentation.")
Keyboard Input
A common implementation for the input of combining character sequences is the use of
dead keys. These keys match the mechanics used by typewriters to generate such sequences
through overtyping the base character after the nonspacing mark. In computer implemen-
tations, keyboards enter a special state when a dead key is pressed for the accent and emit a
precomposed character only when one of a limited number of “legal” base characters is
entered. It is straightforward to adapt such a system to emit combining character
sequences or precomposed characters as needed.
Typists, especially in the Latin script, are trained on systems that work using dead keys.
However, many scripts in the Unicode Standard (including the Latin script) may be imple-
mented according to the handwriting sequence, in which users type the base character first,
followed by the accents or other nonspacing marks (see Figure 5-4).
[Figure 5-4: entering "Zürich" with a dead-key sequence (accent, then base) versus a handwriting sequence (base, then accent).]
In the case of handwriting sequence, each keystroke produces a distinct, natural change on
the screen; there are no hidden states. To add an accent to any existing character, the user
positions the insertion point (caret) after the character and types the accent.
Truncation
There are two types of truncation: truncation by character count and truncation by dis-
played width. Truncation by character count can entail loss (be lossy) or be lossless.
Truncation by character count is used where, due to storage restrictions, a limited number
of characters can be entered into a field; it is also used where text is broken into buffers for
transmission and other purposes. The latter case can be lossless if buffers are recombined
seamlessly before processing or if lookahead is performed for possible combining charac-
ter sequences straddling buffers.
When fitting data into a field of limited storage length, some information will be lost. The
preferred position for truncating text in that situation is on a grapheme cluster boundary.
As Figure 5-5 shows, such truncation can mean truncating at an earlier point than the last
character that would have fit within the physical storage limitation. (See Unicode Standard
Annex #29, “Unicode Text Segmentation.”)
[Figure 5-5: truncating "José" on a grapheme cluster boundary, by clipping, or with an ellipsis ("Jo...").]
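A sketch of truncation on a cluster boundary, using a character BreakIterator as an approximation of grapheme cluster segmentation (truncateAtClusterBoundary is a hypothetical helper):

    static String truncateAtClusterBoundary(String s, int maxUnits) {
        if (s.length() <= maxUnits) return s;
        java.text.BreakIterator bi = java.text.BreakIterator.getCharacterInstance();
        bi.setText(s);
        int cut = bi.isBoundary(maxUnits) ? maxUnits : bi.preceding(maxUnits);
        return s.substring(0, cut);
    }
    // truncateAtClusterBoundary("Jose\u0301", 4) returns "Jos", not "Jose" with a dangling accent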
Truncation by displayed width is used for visual display in a narrow field. In this case, trun-
cation occurs on the basis of the width of the resulting string rather than on the basis of a
character count. In simple systems, it is easiest to truncate by width, starting from the end
and working backward by subtracting character widths as one goes. Because a trailing non-
spacing mark does not contribute to the measurement of the string, the result will not sep-
arate nonspacing marks from their base characters.
If the textual environment is more sophisticated, the widths of characters may depend on
their context, due to effects such as kerning, ligatures, or contextual formation. For such
systems, the width of a precomposed character, such as an “ï”, may be different than the
width of a narrow base character alone. To handle these cases, a final check should be
made on any truncation result derived from successive subtractions.
A different option is simply to clip the characters graphically. Unfortunately, this may
result in clipping off part of a character, which can be visually confusing. Also, if the clip-
ping occurs between characters, it may not give any visual feedback that characters are
being omitted. A graphic or ellipsis can be used to give this visual feedback.
5.13 Rendering Nonspacing Marks
[Figure: character-to-glyph mapping of combining sequences such as <U+0061, U+0308, U+0303, U+0323, U+032D> and the Thai sequence <U+0E02, U+0E36, U+0E49>.]
does not collide with capital letters. This will mean that this mark is placed too high above
many lowercase letters. For example, the default positioning of a circumflex can be above
the ascent, which will place it above capital letters. Even though the result will not be par-
ticularly attractive for letters such as g-circumflex, the result should generally be recogniz-
able in the case of single nonspacing marks.
In a degenerate case, a nonspacing mark occurs as the first character in the text or is sepa-
rated from its base character by a line separator, paragraph separator, or other format char-
acter that causes a positional separation. This result is called a defective combining
character sequence (see Section 3.6, Combination). Defective combining character
sequences should be rendered as if they had a no-break space as a base character. (See
Section 7.9, Combining Marks.)
Bidirectional Positioning. In bidirectional text, the nonspacing marks are reordered with
their base characters; that is, they visually apply to the same base character after the algo-
rithm is used (see Figure 5-8). There are a few ways to accomplish this positioning.
The simplest method is similar to the Simple Overlap fallback method. In the Bidirectional
Algorithm, combining marks take the level of their base character. In that case, Arabic and
Hebrew nonspacing marks would come to the left of their base characters. The font is
designed so that instead of overlapping to the left, the Arabic and Hebrew nonspacing
marks overlap to the right. In Figure 5-8, the “glyph metrics” line shows the pen start and
end for each glyph with such a design. After aligning the start and end points, the final
result shows each nonspacing mark attached to the corresponding base letter. More
sophisticated rendering could then apply the positioning methods outlined in the next sec-
tion.
Some rendering software may require keeping the nonspacing mark glyphs consistently
ordered to the right of the base character glyphs. In that case, a second pass can be done
after producing the “screen order” to put the odd-level nonspacing marks on the right of
their base characters. As the levels of nonspacing marks will be the same as their base char-
acters, this pass can swap the order of nonspacing mark glyphs and base character glyphs
in right-to-left (odd) levels. (See Unicode Standard Annex #9, “Unicode Bidirectional Algo-
rithm.”)
[Figure 5-8: bidirectional positioning of nonspacing marks, showing the screen order, the glyph metrics (pen start and end points), and the aligned glyphs.]
Justification. Typically, full justification of text adds extra space at space characters so as to
widen a line; however, if there are too few (or no) space characters, some systems add extra
letterspacing between characters (see Figure 5-9). This process needs to be modified if
zero-width nonspacing marks are present in the text. Otherwise, if extra justifying space is
added after the base character, it can have the effect of visually separating the nonspacing
mark from its base.
[Figure 5-9: justifying "Zürich" with 66 points of extra space: 6 positions of 11 points each would separate the umlaut from its base, whereas 5 positions of 13.2 points each keep the ü intact.]
Because nonspacing marks always follow their base character, proper justification adds let-
terspacing between characters only if the second character is a base character.
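A sketch of counting letterspacing positions this way (letterspacingPositions is a hypothetical helper; combining marks are detected by their General_Category):

    static int letterspacingPositions(String s) {
        int positions = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            int gc = Character.getType(cp);
            boolean isMark = gc == Character.NON_SPACING_MARK
                    || gc == Character.COMBINING_SPACING_MARK
                    || gc == Character.ENCLOSING_MARK;
            if (i > 0 && !isMark) positions++;        // add space only before base characters
            i += Character.charCount(cp);
        }
        return positions;
    }
    // letterspacingPositions("Zu\u0308rich") returns 5, as in Figure 5-9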
Canonical Equivalence
Canonical equivalence must be taken into account in rendering multiple accents, so that
any two canonically equivalent sequences display as the same. This is particularly import-
ant when the canonical order is not the customary keyboarding order, which happens in
Arabic with vowel signs or in Hebrew with points. In those cases, a rendering system may
be presented with either the typical typing order or the canonical order resulting from nor-
malization, as shown in Table 5-3.
With a restricted repertoire of nonspacing mark sequences, such as those required for Ara-
bic, a ligature mechanism can be used to get the right appearance, as described earlier.
When a fallback mechanism for placing accents based on their combining class is
employed, the system should logically reorder the marks before applying the mechanism.
Rendering systems should handle any of the canonically equivalent orders of combining
marks. This is not a performance issue: the amount of time necessary to reorder combining
marks is insignificant compared to the time necessary to carry out other work required for
rendering.
A rendering system can reorder the marks internally if necessary, as long as the resulting
sequence is canonically equivalent. In particular, any permutation of the non-zero combin-
ing class values can be used for a canonical-equivalent internal ordering. For example, a
rendering system could internally permute weights to have U+0651 arabic shadda pre-
cede all vowel signs. This would use the remapping shown in Table 5-4.
Only non-zero combining class values can be changed, and they can be permuted only, not
combined or split. This can be restated as follows:
• Two characters that have the same combining class values cannot be given dis-
tinct internal weights.
• Two characters that have distinct combining class values cannot be given the
same internal weight.
• Characters with a combining class of zero must be given an internal weight of
zero.
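These constraints can be checked mechanically. The following sketch validates a hypothetical remapping table indexed by combining class value (isValidInternalWeighting and weightForClass are illustrative names):

    static boolean isValidInternalWeighting(int[] weightForClass) {  // index = combining class 0..255
        if (weightForClass[0] != 0) return false;                    // class 0 must keep weight 0
        java.util.Set<Integer> seen = new java.util.HashSet<>();
        for (int ccc = 1; ccc < weightForClass.length; ccc++) {
            if (weightForClass[ccc] == 0) return false;              // non-zero classes stay non-zero
            if (!seen.add(weightForClass[ccc])) return false;        // distinct classes stay distinct
        }
        return true;
    }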
Positioning Methods
A number of methods are available to position nonspacing marks so that they are in the
correct location relative to the base character and previous nonspacing marks.
Positioning with Ligatures. A fixed set of combining character sequences can be rendered
effectively by means of fairly simple substitution, as shown in Figure 5-10.
[Figure 5-10: positioning with ligatures, for example a + combining diaeresis → ä, A + combining diaeresis → Ä, and f + i → fi.]
Wherever the glyphs representing a sequence of <base character, nonspacing mark> occur,
a glyph representing the combined form is substituted. Because the nonspacing mark has a
zero advance width, the composed character sequence will automatically have the same
width as the base character. More sophisticated text rendering systems may take additional
measures to account for those cases where the composed character sequence kerns differ-
ently or has a slightly different advance width than the base character.
Positioning with ligatures is perhaps the simplest method of supporting nonspacing marks.
Whenever there is a small, fixed set, such as those corresponding to the precomposed char-
acters of ISO/IEC 8859-1 (Latin-1), this method is straightforward to apply. Because the
composed character sequence almost always has the same width as the base character, ren-
dering, measurement, and editing of these characters are much easier than for the general
case of ligatures.
If a combining character sequence does not form a ligature, then either positioning with
contextual forms or positioning with enhanced kerning can be applied. If they are not
available, then a fallback method can be used.
Positioning with Contextual Forms. A more general method of dealing with positioning of
nonspacing marks is to use contextual formation (see Figure 5-11). In this case for Devana-
gari, a consonant RA is rendered with a nonspacing glyph (reph) positioned above a base
consonant. (See “Rendering Devanagari” in Section 12.1, Devanagari.) Depending on the
position of the stem for the corresponding base consonant glyph, a contextual choice is
made between reph glyphs with different side bearings, so that the tip of the reph will be
placed correctly with respect to the base consonant’s stem. Base glyphs generally fall into a
fairly small number of classes, depending on their general shape and width, so a corre-
sponding number of contextually distinct glyphs for the nonspacing mark suffice to pro-
duce correct rendering.
In general cases, a number of different heights of glyphs can be chosen to allow stacking of
glyphs, at least for a few deep. (When these bounds are exceeded, then the fallback methods
can be used.) This method can be combined with the ligature method so that in specific
cases ligatures can be used to produce fine variations in position and shape.
Positioning with Enhanced Kerning. A third technique for positioning diacritics is an
extension of the normal process of kerning to be both horizontal and vertical (see
Figure 5-12). Typically, kerning maps from pairs of glyphs to a positioning offset. For exam-
ple, in the word “To” the “o” should nest slightly under the “T”. An extension of this system
maps to both a vertical and a horizontal offset, allowing glyphs to be positioned arbitrarily.
[Figure 5-12: enhanced kerning, for example nesting the "o" under the "T" in "To" and positioning an acute accent over "w" to form "ẃ".]
For effective use in the general case, the kerning process must be extended to handle more
than simple kerning pairs, as multiple diacritics may occur after a base letter.
Positioning with enhanced kerning can be combined with the ligature method so that in
specific cases ligatures can be used to produce fine variations in position and shape.
5.15 Identifiers
A common task facing an implementer of the Unicode Standard is the provision of a pars-
ing and/or lexing engine for identifiers. To assist in the standard treatment of identifiers in
Unicode character-based parsers, a set of guidelines is provided in Unicode Standard
Annex #31, “Unicode Identifier and Pattern Syntax,” as a recommended default for the
definition of identifier syntax. That document provides details regarding the syntax and
conformance considerations. Associated data files defining the character properties
referred to by the identifier syntax can be found in the Unicode Character Database.
5.16 Sorting and Searching
Language-Insensitive Sorting
In some circumstances, an application may need to do language-insensitive sorting—that
is, sorting of textual data without consideration of language-specific cultural expectations
about how strings should be ordered. For example, a temporary index may need only to be
in some well-defined order, but the exact details of the order may not matter or be visible to users.
Searching
Searching is subject to many of the same issues as comparison. Other features are often
added, such as only matching words (that is, where a word boundary appears on each side
of the match). One technique is to code a fast search for a weak match. When a candidate is
found, additional tests can be made for other criteria (such as matching diacriticals, word
match, case match, and so on).
When searching strings, it is necessary to check for trailing nonspacing marks in the target
string that may affect the interpretation of the last matching character. That is, a search for
“San Jose” may find a match in the string “Visiting San José, Costa Rica, is a...”. If an exact
(diacritic) match is desired, then this match should be rejected. If a weak match is sought,
then the match should be accepted, but any trailing nonspacing marks should be included
when returning the location and length of the target substring. The mechanisms discussed
in Unicode Standard Annex #29, “Unicode Text Segmentation,” can be used for this pur-
pose.
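A sketch of extending a weak match over trailing nonspacing marks (extendOverTrailingMarks is a hypothetical helper applied to the match end index returned by the search):

    static int extendOverTrailingMarks(String text, int matchEnd) {
        int end = matchEnd;
        while (end < text.length()) {
            int cp = text.codePointAt(end);
            if (Character.getType(cp) != Character.NON_SPACING_MARK) break;
            end += Character.charCount(cp);              // include trailing marks such as U+0301
        }
        return end;
    }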
One important application of weak equivalence is case-insensitive searching. Many tradi-
tional implementations map both the search string and the target text to uppercase. How-
ever, case mappings are language-dependent and not unambiguous. The preferred method
of implementing case insensitivity is described in Section 5.18, Case Mappings.
A related issue can arise because of inaccurate mappings from external character sets. To
deal with this problem, characters that are easily confused by users can be kept in a weak
equivalency class (đ d-bar, ð eth, Đ capital d-bar, Ð capital eth). This approach tends to do
a better job of meeting users’ expectations when searching for named files or other objects.
Sublinear Searching
International searching is clearly possible using the information in the collation, just by
using brute force. However, this tactic requires an O(m*n) algorithm in the worst case and
an O(m) algorithm in common cases, where n is the number of characters in the pattern
that is being searched for and m is the number of characters in the target to be searched.
A number of algorithms allow for fast searching of simple text, using sublinear algorithms.
These algorithms have only O(m/n) complexity in common cases by skipping over charac-
ters in the target. Several implementers have adapted one of these algorithms to search text
pre-transformed according to a collation algorithm, which allows for fast searching with
native-language matching (see Figure 5-13).
The main problems with adapting a language-aware collation algorithm for sublinear
searching relate to multiple mappings and ignorables. Additionally, sublinear algorithms
precompute tables of information. Mechanisms like the two-stage tables shown in
Figure 5-1 are efficient tools in reducing memory requirements.
5.17 Binary Order
For comparing UTF-16 strings in code point order, the 16-bit code unit space can be thought of as divided up into thirty-two 2K chunks. The 28th chunk corresponds to the values
0xD800..0xDFFF—that is, the surrogate code units. The 29th through 32nd chunks corre-
spond to the values 0xE000..0xFFFF. The addition of 0x2000 to the surrogate code units
rotates them up to the range 0xF800..0xFFFF. Adding 0xF800 to the values 0xE000..0xFFFF
and ignoring the unsigned integer overflow rotates them down to the range
0xD800..0xF7FF. Calculating the final difference for the return from the rotated values pro-
duces the same result as basing the comparison on code points, rather than the UTF-16
code units. The use of the hack of unsigned integer overflow on addition avoids the need
for a conditional test to accomplish the rotation of values.
Note that this mechanism works correctly only on well-formed UTF-16 text. A modified
algorithm must be used to operate on 16-bit Unicode strings that could contain isolated
surrogates.
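A minimal sketch of this rotation technique (in Python, operating on lists of 16-bit code units and, as noted above, assuming well-formed UTF-16) might look as follows:

def rotate(u):
    """Map a UTF-16 code unit so that comparing the results as unsigned
    integers yields code point order (well-formed UTF-16 only)."""
    if u >= 0xE000:                     # BMP code points above the surrogate block
        return (u + 0xF800) & 0xFFFF    # wraps down to 0xD800..0xF7FF
    if u >= 0xD800:                     # surrogate code units
        return u + 0x2000               # moves up to 0xF800..0xFFFF
    return u                            # 0x0000..0xD7FF unchanged

def compare_code_point_order(a, b):
    """Compare two sequences of UTF-16 code units in code point order."""
    for u1, u2 in zip(a, b):
        if u1 != u2:
            return rotate(u1) - rotate(u2)
    return len(a) - len(b)

bmp  = [0xFFFD]            # U+FFFD
supp = [0xD800, 0xDC00]    # U+10000 encoded as a surrogate pair
assert compare_code_point_order(bmp, supp) < 0    # U+FFFD sorts before U+10000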
5.18 Case Mappings
Titlecasing
Titlecasing refers to a casing practice wherein the first letter of a word is an uppercase letter
and the rest of the letters are lowercase. This typically applies, for example, to initial words
of sentences and to proper nouns. Depending on the language and orthographic practice,
this convention may apply to other words as well, as for common nouns in German.
Titlecasing also applies to entire strings, as in instances of headings or titles of documents,
for which multiple words are titlecased. The choice of which words to titlecase in headings
and titles is dependent on language and local conventions. For example, “The Merry Wives
of Windsor” is the appropriate titlecasing of that play’s name in English, with the word “of”
not titlecased. In German, however, the title is “Die lustigen Weiber von Windsor,” and
both “lustigen” and “von” are not titlecased. In French even fewer words are titlecased:
“Les joyeuses commères de Windsor.”
Moreover, the determination of what actually constitutes a word is language dependent,
and this can influence which letter or letters of a “word” are uppercased when titlecasing
strings. For example l’arbre is considered two words in French, whereas can’t is considered
one word in English.
The need for a normative Titlecase_Mapping property in the Unicode Standard derives
from the fact that the standard contains certain digraph characters for compatibility. These
digraph compatibility characters, such as U+01F3 “dz” latin small letter dz, require
one form when being uppercased, U+01F1 “DZ” latin capital letter dz, and another
form when being titlecased, U+01F2 “Dz” latin capital letter d with small letter z.
The latter form is informally referred to as a titlecase character, because it is mixed case,
with the first letter uppercase. Most characters in the standard have identical values for
their Titlecase_Mapping and Uppercase_Mapping; however, the two values are distin-
guished for these few digraph compatibility characters.
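For illustration, Python's str.upper() and str.title() expose the distinct Uppercase_Mapping and Titlecase_Mapping of such a digraph:

import unicodedata

dz = "\u01F3"                                # latin small letter dz
print(unicodedata.name(dz.upper()))          # LATIN CAPITAL LETTER DZ
print(unicodedata.name(dz.title()))          # LATIN CAPITAL LETTER D WITH SMALL LETTER Z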
[Figures illustrating the Turkish and standard case mappings among U+0049 “I”, U+0069 “i”, U+0130 “İ”, U+0131 “ı”, and sequences involving U+0307 combining dot above.]
In both of the Turkish case mapping figures, a mapping with a double-sided arrow round-
trips—that is, the opposite case mapping results in the original sequence. A mapping with
a single-sided arrow does not round-trip.
Caseless Characters. Because many characters are really caseless (most of the IPA block,
for example) and have no matching uppercase, the process of uppercasing a string does not
mean that it will no longer contain any lowercase letters.
German sharp s. The German sharp s character has several complications in case map-
ping. Not only does its uppercase mapping expand in length, but its default case-pairings
are asymmetrical. The default case mapping operations follow standard German orthogra-
phy, which uses the string “SS” as the regular uppercase mapping for U+00DF ß latin
small letter sharp s. In contrast, the alternate, single character uppercase form,
U+1E9E latin capital letter sharp s, is intended for typographical representations of
signage and uppercase titles, and in other environments where users require the sharp s to
be preserved in uppercase. Overall, such usage is uncommon. Thus, when using the default
Unicode casing operations, capital sharp s will lowercase to small sharp s, but not vice
versa: small sharp s uppercases to “SS”, as shown in Figure 5-16. A tailored casing operation
is needed in circumstances requiring small sharp s to uppercase to capital sharp s.
[Figure 5-16: with default casing, “ß” uppercases to “SS” and “ẞ” lowercases to “ß”; with tailored casing, “ß” and “ẞ” form a case pair.]
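These default mappings can be observed, for example, with Python's built-in case operations:

sharp_s, capital_sharp_s = "\u00DF", "\u1E9E"
assert sharp_s.upper() == "SS"                  # default uppercasing expands to "SS"
assert capital_sharp_s.lower() == sharp_s       # capital sharp s lowercases to ß
assert sharp_s.upper().lower() == "ss"          # so the default round trip yields "ss", not ß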
Reversibility
No casing operations are reversible. For example:
toUpperCase(toLowerCase(“John Brown”)) → “JOHN BROWN”
There are even single words like vederLa in Italian or the name McGowan in English,
which are neither upper-, lower-, nor titlecase. This format is sometimes called inner-
caps—or more informally camelcase—and it is often used in programming and in Web
names. Once the string “McGowan” has been uppercased, lowercased, or titlecased, the
original cannot be recovered by applying another uppercase, lowercase, or titlecase opera-
tion. There are also single characters that do not have reversible mappings, such as the
Greek sigmas.
For word processors that use a single command-key sequence to toggle the selection
through different casings, it is recommended to save the original string and return to it via
the sequence of keys. The user interface would produce the following results in response to
a series of command keys. In the following example, notice that the original string is
restored every fourth time.
1. The quick brown
2. THE QUICK BROWN
3. the quick brown
4. The Quick Brown
5. The quick brown (repeating from here on)
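The toggling behavior shown above can be sketched as follows (Python; the transform order is the one used in the numbered example, with the saved original restored every fourth step):

import itertools

def case_cycle(original):
    # Cycle UPPER -> lower -> Title -> original, starting from the saved string.
    transforms = [str.upper, str.lower, str.title, lambda s: s]
    for f in itertools.cycle(transforms):
        yield f(original)

cycle = case_cycle("The quick brown")
for _ in range(5):
    print(next(cycle))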
Uppercase, titlecase, and lowercase can be represented in a word processor by using a char-
acter style. Removing the character style restores the text to its original state. However, if
this approach is taken, any spell-checking software needs to be aware of the case style so
that it can check the spelling against the actual appearance.
Caseless Matching
Caseless matching is implemented using case folding, which is the process of mapping
characters of different case to a single form, so that case differences in strings are erased.
Case folding allows for fast caseless matches in lookups because only binary comparison is
required. It is more than just conversion to lowercase. For example, it correctly handles
cases such as the Greek sigma, so that “ὈΔΥΣΣΕΎΣ” and “ὀδυσσεύς” will match.
Normally, the original source string is not replaced by the folded string because that substi-
tution may erase important information. For example, the name “Marco di Silva” would be
folded to “marco di silva,” losing the information regarding which letters are capitalized.
Typically, the original string is stored along with a case-folded version for fast compari-
sons.
The CaseFolding.txt file in the Unicode Character Database is used to perform locale-inde-
pendent case folding. This file is generated from the case mappings in the Unicode Charac-
ter Database, using both the single-character mappings and the multicharacter mappings.
It folds all characters having different case forms together into a common form. To com-
pare two strings for caseless matching, one can fold each string using this data and then use
a binary comparison.
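For example, in Python the str.casefold() method applies the default full case folding, so a caseless match reduces to a binary comparison of the folded strings:

def caseless_equal(a, b):
    # Default, locale-independent full case folding followed by binary comparison.
    return a.casefold() == b.casefold()

assert caseless_equal("Maße", "MASSE")            # ß folds to "ss"
name = "Ὀδυσσεύς"
assert caseless_equal(name, name.upper())         # final and non-final sigma both fold to σ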
Case folding logically involves a set of equivalence classes constructed from the Unicode
Character Database case mappings as follows.
For each character X in Unicode, apply the following rules in order:
R1 If X is already in an equivalence class, continue to the next character. Otherwise,
form a new equivalence class and add X.
R2 Add any other character that uppercases, lowercases, or titlecases to anything in
the equivalence class.
R3 Add any other characters to which anything in the equivalence class uppercases,
lowercases, or titlecases.
R4 Repeat R2 and R3 until nothing further is added.
R5 From each class, one representative element (a single lowercase letter where possi-
ble) is chosen to be the common form.
For rule R5, it is preferable to choose a single lowercase letter for the common form, but
this is not possible in all instances. For case folding of Cherokee letters, for example, a sin-
gle uppercase letter must be chosen instead, because the uppercase letters for Cherokee
were encoded in an earlier version of the Unicode Standard, and the lowercase letters were
encoded in a later version. This choice is required to keep case folding stable across Uni-
code versions.
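The construction in rules R1–R4 can be sketched as follows (Python, over a small sample repertoire and using the built-in case mappings in place of the Unicode Character Database files; multiple-character mappings are ignored for simplicity):

def case_equivalence_classes(repertoire):
    """Rules R1-R4 over a small sample repertoire, using Python's case
    mappings in place of the UCD data; multiple-character mappings
    (such as the "SS" uppercase of ß) are ignored for simplicity."""
    def maps_to(c):
        return {m for m in (c.upper(), c.lower(), c.title()) if len(m) == 1}
    classes = []
    for x in repertoire:
        if any(x in cls for cls in classes):          # R1
            continue
        cls = {x}
        changed = True
        while changed:                                # R4
            changed = False
            for c in repertoire:
                if c in cls:
                    continue
                # R2: c case-maps into the class; R3: the class case-maps to c.
                if maps_to(c) & cls or c in {m for e in cls for m in maps_to(e)}:
                    cls.add(c)
                    changed = True
        classes.append(cls)
    return classes

print(case_equivalence_classes(["K", "k", "\u212A", "Σ", "σ", "ς", "A"]))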
Each equivalence class is completely disjoint from all the others, and every Unicode char-
acter is in one equivalence class. CaseFolding.txt thus contains the mappings from other
characters in the equivalence classes to their common forms. As an exception, the case fold-
ings for dotless i and dotted I do not follow the derivation algorithm for all other case fold-
ings. Instead, their case foldings are hard-coded in the derivation for best default matching
behavior. There are alternate case foldings for these characters, which can be used for case
folding for Turkic languages. However, the use of those alternate case foldings does not
maintain canonical equivalence. Furthermore, it is often undesirable to have differing
behavior for caseless matching. Because language information is often not available when
caseless matching is applied to strings, it also may not be clear which alternate to choose.
The Unicode case folding algorithm is defined to be simpler and more efficient than case
mappings. It is context-insensitive and language-independent (except for the optional, alter-
nate Turkic case foldings). As a result, there are a few rare cases where a caseless match does
not match pairs of strings as expected; the most notable instance of this is for Lithuanian. In
Lithuanian typography for dictionary use, an “i” retains its dot when a grave, acute, or tilde
accent is placed above it. This convention is represented in Unicode by using an explicit
combining dot above, occurring in sequence between the “i” and the respective accent. (See
Figure 7-2.) When case folded using the default case folding algorithm, strings containing
these sequences will still contain the combining dot above. In the unusual situation where
case folding needs to be tailored to provide for these special Lithuanian dictionary require-
ments, strings can be preprocessed to remove any combining dot above characters occurring
between an “i” and a subsequent accent, so that the folded strings will match correctly.
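Such preprocessing might be sketched as follows (Python; the function name is illustrative, and only the grave, acute, and tilde accents are handled):

import re
import unicodedata

def lithuanian_prefold(s):
    # Remove U+0307 combining dot above when it occurs between "i" and a
    # following grave, acute, or tilde accent, after canonical decomposition.
    s = unicodedata.normalize("NFD", s)
    return re.sub("(?<=i)\u0307(?=[\u0300\u0301\u0303])", "", s)

# "i" + combining dot above + combining acute then folds the same as "i" + acute.
assert lithuanian_prefold("i\u0307\u0301").casefold() == "i\u0301".casefold()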
Where case distinctions are not important, other distinctions between Unicode characters
(in particular, compatibility distinctions) are generally ignored as well. In such circum-
stances, text can be normalized to Normalization Form NFKC or NFKD after case folding,
thereby producing a normalized form that erases both compatibility distinctions and case
distinctions. However, such normalization should generally be done only on a restricted
repertoire, such as identifiers (alphanumerics). See Unicode Standard Annex #15, “Uni-
code Normalization Forms,” and Unicode Standard Annex #31, “Unicode Identifier and
Pattern Syntax,” for more information. For a summary, see “Equivalent Sequences” in
Section 2.2, Unicode Design Principles.
Caseless matching is only an approximation of the language-specific rules governing the
strength of comparisons. Language-specific case matching can be derived from the colla-
tion data for the language, where only the first- and second-level differences are used. For
more information, see Unicode Technical Standard #10, “Unicode Collation Algorithm.”
In most environments, such as in file systems, text is not and cannot be tagged with lan-
guage information. In such cases, the language-specific mappings must not be used. Other-
wise, data structures such as B-trees might be built based on one set of case foldings and
used based on a different set of case foldings. This discrepancy would cause those data
structures to become corrupt. For such environments, a constant, language-independent,
default case folding is required.
Stability. The definition of case folding is guaranteed to be stable, in that any string of
characters case folded according to these rules will remain case folded in Version 5.0 or
later of the Unicode Standard. To achieve this stability, there are constraints on additions
of case pairs for existing encoded characters. Typically, no new lowercase character will be
added to the Unicode Standard as a casing pair of an existing upper- or titlecase character
that does not already have a lowercase pair. In exceptional circumstances, where lowercase
characters must be added to the standard in a later version than the version in which the
corresponding uppercase characters were encoded, such lowercase characters can only be
defined as new case pairs with a corresponding change to case folding to ensure that they
case fold to the old uppercase letters. See the subsection “Policies” in Section B.3, Other Uni-
code Online Resources.
The original string is in Normalization Form NFC format. When uppercased, the small j
with caron turns into an uppercase J with a separate caron. If followed by a combining mark
below, that sequence is not in a normalized form. The combining marks have to be put in
canonical order for the sequence to be normalized.
If text in a particular system is to be consistently normalized to a particular form such as
NFC, then the casing operators should be modified to normalize after performing their
core function. The actual process can be optimized; there are only a few instances where a
casing operation causes a string to become denormalized. If a system specifically checks for
those instances, then normalization can be avoided where not needed.
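A simple, unoptimized way to combine the two operations might look like this (Python, using small j with caron followed by a combining dot below, the kind of sequence discussed above):

import unicodedata

def uppercase_nfc(s):
    # Perform the casing operation, then renormalize to NFC.
    return unicodedata.normalize("NFC", s.upper())

s = "\u01F0\u0323"    # small j with caron, followed by combining dot below
assert unicodedata.is_normalized("NFC", s)
assert not unicodedata.is_normalized("NFC", s.upper())     # J + caron + dot below
assert unicodedata.is_normalized("NFC", uppercase_nfc(s))  # marks reordered canonically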
Normalization also interacts with case folding. For any string X, let Q(X) =
NFC(toCasefold(NFD(X))). In other words, Q(X) is the result of normalizing X, then
case folding the result, then putting the result into Normalization Form NFC format.
Because of the way normalization and case folding are defined, Q(Q(X)) = Q(X). Repeat-
edly applying Q does not change the result; case folding is closed under canonical normal-
ization for either Normalization Form NFC or NFD.
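For instance (Python; the test string is arbitrary):

import unicodedata

def Q(x):
    # NFC(toCasefold(NFD(X))), as defined above.
    return unicodedata.normalize("NFC", unicodedata.normalize("NFD", x).casefold())

x = "Straße İstanbul"        # arbitrary test string
assert Q(Q(x)) == Q(x)       # repeated application does not change the result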
Case folding is not, however, closed under compatibility normalization for either Normal-
ization Form NFKD or NFKC. That is, given R(X) = NFKC(toCasefold(NFD(X))),
there are some strings such that R(R(X)) ≠ R(X). NFKC_Casefold, a derived property, is
closed under both case folding and NFKC normalization. The property values for NFKC_-
Casefold are found in DerivedNormalizationProps.txt in the Unicode Character Database.
5.19 Mapping Compatibility Variants
This general implementation approach to the problems associated with visual similarities
among compatibility variants, by focusing first on the remapping of compatibility decom-
posable characters, is useful for two reasons. First, the large majority of compatibility vari-
ants are in fact also compatibility decomposable characters, so this approach deals with the
biggest portion of the problem. Second, it is simply and reproducibly implementable in
terms of a well-defined Unicode Normalization Form.
Extending restrictions on usage to other compatibility variants is more problematical,
because there is no exact specification of which characters are compatibility variants. Fur-
thermore, there may be valid reasons to restrict usage of certain characters which may be
visually confusable or otherwise problematical for some process, even though they are not
generally considered to be compatibility variants. Best practice in such cases is to depend
on carefully constructed and justified lists of confusable characters.
For more information on security implications and a discussion of confusables, see Uni-
code Technical Report #36, “Unicode Security Considerations” and Unicode Technical
Standard #39, “Unicode Security Mechanisms.”
5.20 Unicode Security
A typical spoofing attack tries to get a user to interact with a hostile website as if it were a trusted site (or user). In this case, the confu-
sion is not at the level of the software process handling the code points, but rather in the
human end users, who see one character but mistake it for another, and who then can be
fooled into doing something that will breach security or otherwise result in unintended
results.
To be effective, spoofing does not require an exact visual match—for example, using the
digit “1” instead of the letter “l”. The Unicode Standard contains many confusables—that is,
characters whose glyphs, due to historical derivation or sheer coincidence, resemble each
other more or less closely. Certain security-sensitive applications or systems may be vul-
nerable due to possible misinterpretation of these confusables by their users.
Many legacy character sets, including ISO/IEC 8859-1 or even ASCII, also contain confus-
ables, albeit usually far fewer of them than in the Unicode Standard simply because of the
sheer scale of Unicode. The legacy character sets all carry the same type of risks when it
comes to spoofing, so there is nothing unique or inadequate about Unicode in this regard.
Similar steps will be needed in system design to assure integrity and to lessen the potential
for security risks, no matter which character encoding is used.
The Unicode Standard encodes characters, not glyphs, and it is impractical for many rea-
sons to try to avoid spoofing by simply assigning a single character code for every possible
confusable glyph among all the world’s writing systems. By unifying an encoding based
strictly on appearance, many common text-processing tasks would become convoluted or
impossible. For example, Latin B and Greek capital beta (Β) look the same in most fonts, but
lowercase to two different letters, Latin b and Greek small beta (β), which have very distinct appear-
ances. A simplistic fix to the confusability of Latin B and Greek Beta would result in great
difficulties in processing Latin and Greek data, and in many cases in data corruptions as
well.
Because all character encodings inherently have instances of characters that might be con-
fused with one another under some conditions, and because the use of different fonts to
display characters might even introduce confusions between characters that the designers
of character encodings could not prevent, character spoofing must be addressed by other
means. Systems or applications that are security-conscious can test explicitly for known
spoofings, such as “MICROS0FT,” “A0L,” or the like (substituting the digit “0” for the letter
“O”). Unicode-based systems can provide visual clues so that users can ensure that labels,
such as domain names, are within a single script to prevent cross-script spoofing. However,
provision of such clues is clearly the responsibility of the system or application, rather than
being a security condition that could be met by somehow choosing a “secure” character
encoding that was not subject to spoofing. No such character encoding exists.
Unicode Standard Annex #24, “Unicode Script Property,” presents a classification of Uni-
code characters by script. By using such a classification, a program can check that labels
consist only of characters from a given script or characters that are expected to be used
with more than one script (such as the “Common” or “Inherited” script names defined in
Unicode Standard Annex #24, “Unicode Script Property”). Because cross-script names may
be legitimate, the best method of alerting a user might be to highlight any unexpected
boundaries between scripts and let the user determine the legitimacy of such a string
explicitly.
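One possible sketch of such a check (Python, assuming the third-party regex module, which exposes the Script property of Unicode Standard Annex #24; the sample labels and the script list are illustrative only):

import regex   # third-party module; exposes Unicode script properties

def scripts_used(label):
    """Return the set of (sampled) scripts occurring in the label."""
    found = set()
    for ch in label:
        for script in ("Latin", "Greek", "Cyrillic", "Han"):
            if regex.match(rf"\p{{Script={script}}}", ch):
                found.add(script)
    return found

print(scripts_used("paypal"))          # Latin only
print(scripts_used("p\u0430ypal"))     # Cyrillic U+0430 mixed in: reports Latin and Cyrillic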
For further discussion of security issues, see Unicode Technical Report #36, “Unicode
Security Considerations,” and Unicode Technical Standard #39, “Unicode Security Mech-
anisms.”
5.21 Ignoring Characters in Processing
Fonts and text rendering systems may contain complex rendering logic which contributes to the text rendering process. This discus-
sion is not meant to preclude any particular approach to the design of a full text rendering
process. A phrase such as, “a font displays a glyph for the character,” or “a font displays no
glyph for the character,” is simply a general way of describing the intended display out-
come for rendering that character.
Normal Rendering. Many characters, including format characters and variation selectors,
have no visible glyph or advance width directly associated with them. Such characters with-
out glyphs are typically shown in the code charts with special display glyphs using a dotted
box and a mnemonic label. (See Section 24.1, Character Names List, for code chart display
conventions.) Outside of the particular context of code chart display, a font will typically
display no glyph for such characters. However, it is not unusual for format characters and
variation selectors to have a visible effect on other characters in their vicinity. For example,
ZWJ and ZWNJ may affect cursive joining or the appearance of ligatures. A variation selec-
tor may change the choice of glyph for display of the base character it follows. In such
cases, even though the format character or variation selector has no visible glyph of its own,
it would be inappropriate to say that it is ignored for display, because the intent of its use is
to change the display in some visible way. Additional cases where a format character has no
glyph, but may otherwise affect display include:
• Bidirectional format characters do not affect the glyph forms of displayed char-
acters, but may cause significant rearrangements of spans of text in a line.
• U+00AD soft hyphen has a null default appearance in the middle of a
line: the appearance of “ther” + soft hyphen + “apist” is simply “therapist”—no visible glyph. In
line break processing, it indicates a possible intraword break. At any intraword
break that is used for a line break—whether resulting from this character or by
some automatic process—a hyphen glyph (perhaps with spelling changes) or
some other indication can be shown, depending on language and context.
In other contexts, a format character may have no visible effect on display at all. For exam-
ple, a ZWJ might occur in text between two characters which are not subject to cursive
joining and for which no ligature is available or appropriate: <x, ZWJ, x>. In such a case,
the ZWJ simply has no visible effect, and one can meaningfully say that it is ignored for dis-
play. Another example is a variation selector following a base character for which no stan-
dardized or registered variation sequence exists. In that case, the variation selector has no
effect on the display of the text.
Finally, there are some format characters whose function is not intended to affect display.
U+200B zero width space affects word segmentation, but has no visible display. U+034F
combining grapheme joiner is likewise always ignored for display. Additional examples
include:
• U+2060 word joiner does not produce a visible change in the appearance
of surrounding characters; instead, its only effect is to indicate that there
should be no line break at that point.
The Default_Ignorable_Code_Point property covers most format characters and variation selectors, along with a few other exceptional characters, such as Hangul fillers. The exact list is defined in
DerivedCoreProperties.txt in the Unicode Character Database.
The Default_Ignorable_Code_Point property is also given to certain ranges of code points:
U+2060..U+206F, U+FFF0..U+FFF8, and U+E0000..U+E0FFF, including any unassigned
code points in those ranges. These ranges are designed and reserved for future encoding of
format characters and similar special-use characters, to allow a certain degree of forward
compatibility. Implementations which encounter unassigned code points in these ranges
should ignore them for display in fallback rendering.
Surrogate code points, private-use characters, and control characters are not given the
Default_Ignorable_Code_Point property. To avoid security problems, such characters or
code points, when not interpreted and not displayable by normal rendering, should be dis-
played in fallback rendering with a fallback glyph, so that there is a visible indication of
their presence in the text. For more information, see Unicode Technical Report #36, “Uni-
code Security Considerations.”
A small number of format characters (General_Category = Cf ) are also not given the
Default_Ignorable_Code_Point property. This may surprise implementers, who often
assume that all format characters are generally ignored in fallback display. The exact list of
these exceptional format characters can be found in the Unicode Character Database.
There are, however, three important sets of such format characters to note:
• prepended concatenation marks
• interlinear annotation characters
• Egyptian hieroglyph format controls
The prepended concatenation marks always have a visible display. These are visible format
characters which span groups of numbers, particularly for the Arabic script—for example,
U+0601 arabic sign sanah, the Arabic year sign. See “Signs Spanning Numbers” in
Section 9.2, Arabic for more discussion of the use and display of these signs.
The other two notable sets of format characters that exceptionally are not ignored in fall-
back display consist of the interlinear annotation characters, U+FFF9 interlinear anno-
tation anchor through U+FFFB interlinear annotation terminator, and the
Egyptian hieroglyph format controls, U+13430 egyptian hieroglyph vertical joiner
through U+13438 egyptian hieroglyph end segment. These characters should have a
visible glyph display for fallback rendering, because if they are not displayed, it is too easy
to misread the resulting displayed text. See “Annotation Characters” in Section 23.8, Spe-
cials, as well as Section 11.4, Egyptian Hieroglyphs for more discussion of the use and dis-
play of these characters.
Chapter 6
Writing Systems and Punctuation
This chapter begins the portion of the Unicode Standard devoted to the detailed descrip-
tion of each script or other related group of Unicode characters. Each of the subsequent
chapters presents a historically or geographically related group of scripts. This chapter
presents a general introduction to writing systems, explains how they can be used to clas-
sify scripts, and then presents a detailed discussion of punctuation characters that are
shared across scripts.
Scripts and Blocks. The codespace of the Unicode Standard is divided into subparts called
blocks (see D10b in Section 3.4, Characters and Encoding). Character blocks generally con-
tain characters from a single script, and in many cases, a script is fully represented in its
block; however, some scripts are encoded using several blocks, which are not always adja-
cent. Discussion of scripts and other groups of characters are structured by blocks. Corre-
sponding subsection headers identify each block and its associated range of Unicode code
points. The Unicode code charts are also organized by blocks.
Scripts and Writing Systems. There are many different kinds of writing systems in the
world. Their variety poses some significant issues for character encoding in the Unicode
Standard as well as for implementers of the standard. Those who first approach the Uni-
code Standard without a background in writing systems may find the huge list of scripts
bewilderingly complex. Therefore, before considering the script descriptions in detail, this
chapter first presents a brief introduction to the types of writing systems. That introduction
explains basic terminology about scripts and character types that will be used again and
again when discussing particular scripts.
Punctuation. The rest of this chapter deals with a special case: punctuation marks, which
tend to be scattered about in different blocks and which may be used in common by many
scripts. Punctuation characters occur in several widely separated places in the blocks,
including Basic Latin, Latin-1 Supplement, General Punctuation, Supplemental Punctua-
tion, and CJK Symbols and Punctuation. There are also occasional punctuation characters
in blocks for specific scripts.
Most punctuation characters are intended for common usage with any script, although
some of them are script-specific. Some scripts use both common and script-specific punc-
tuation characters, usually as the result of recent adoption of standard Western punctua-
tion marks. While punctuation characters vary in details of appearance and function
between different languages and scripts, their overall purpose is shared: they serve to sepa-
rate or otherwise organize units of text, such as sentences and phrases, thereby helping to
clarify the meaning of the text. Certain punctuation characters also occur in mathematical
and scientific formulae.
6.1 Writing Systems
Some abjads allow consonant letters to mark long vowels, as with the use of waw and yeh in
Arabic for /u:/ or /i:/.
Hebrew and Arabic are typically written without any vowel marking at all. The vowels,
when they do occur in writing, are referred to as points or harakat, and are indicated by the
use of diacritic dots and other marks placed above and below the consonantal letters.
Syllabaries. In a syllabary, each symbol of the system typically represents both a consonant
and a vowel, or in some instances more than one consonant and a vowel. One of the best-
known examples of a syllabary is Hiragana, used for Japanese, in which the units of the sys-
tem represent the syllables ka, ki, ku, ke, ko, sa, si, su, se, so, and so on. In general parlance,
the elements of a syllabary are not called letters, but rather syllables. This can lead to some
confusion, however, because letters of alphabets and units of other writing systems are also
used, singly or in combinations, to write syllables of languages. So in a broad sense, the
term “letter” can be used to refer to the syllables of a syllabary.
In syllabaries such as Cherokee, Hiragana, Katakana, and Yi, each symbol has a unique
shape, with no particular shape relation to any of the consonant(s) or vowels of the sylla-
bles. In other cases, however, the syllabic symbols of a syllabary are not atomic; they can be
built up out of parts that have a consistent relationship to the phonological parts of the syl-
lable. The best example of this is the Hangul writing system for Korean. Each Hangul sylla-
ble is made up of a part for the initial consonant (or consonant cluster), a part for the vowel
(or diphthong), and an optional part for the final consonant (or consonant cluster). The
relationship between the sounds and the graphic parts to represent them is systematic
enough for Korean that the graphic parts collectively are known as jamos and constitute a
kind of alphabet on their own.
The jamos of the Hangul writing system have another characteristic: their shapes are not
completely arbitrary, but were devised with intentionally iconic shapes relating them to
articulatory features of the sounds they represent in Korean. The Hangul writing system
has thus also been classified as a featural syllabary.
Abugidas. Abugidas represent a kind of blend of syllabic and alphabetic characteristics in a
writing system. The Ethiopic script is an abugida. The term “abugida” is derived from the
first four letters of the Ethiopic script in the Semitic order: alf, bet, gaml, dant. The order of
vowels (-ä -u -i -a) is that of the traditional vowel order in the first four columns of the Ethi-
opic syllable chart. Historically, abugidas spread across South Asia and were adapted by
many languages, often of phonologically very different types.
This process has also resulted in many extensions, innovations, and/or simplifications of
the original patterns. The best-known example of an abugida is the Devanagari script, used
in modern times to write Hindi and many other Indian languages, and used classically to
write Sanskrit. See Section 12.1, Devanagari, for a detailed description of how Devanagari
works and is rendered.
In an abugida, each consonant letter carries an inherent vowel, usually /a/. There are also
vowel letters, often distinguished between a set of independent vowel letters, which occur
on their own, and dependent vowel letters, or matras, which are subordinate to consonant
letters. When a dependent vowel letter follows a consonant letter, the vowel overrides the
inherent vowel of the consonant. This is shown schematically in Figure 6-1.
terms, but rather follows the customary usage associated with a given script or writing sys-
tem. For the Han script, the term CJK ideograph (or Han ideograph) is used.
There are a number of other historical examples of logosyllabaries, such as Tangut. They
vary in the degree to which they combine logographic writing principles, where the sym-
bols stand for morphemes or entire words, and syllabic writing principles, where the sym-
bols come to represent syllables per se, divorced from their meaning as morphemes or
words. In some notable instances, as for Sumero-Akkadian cuneiform, a logosyllabary may
evolve through time into a syllabary or alphabet by shedding its use of logographs. In other
instances, as for the Han script, the use of logographic characters is very well entrenched
and persistent. However, even for the Han script a small number of characters are used
purely to represent syllabic sounds, so as to be able to represent such things as foreign per-
sonal names and place names.
Egyptian hieroglyphs constitute another mixed example. The majority of the hieroglyphs
are logographs, but Egyptian hieroglyphs also contain a well-defined subset that functions
as an alphabet, in addition to other signs that represent sequences of consonants. And
some hieroglyphs serve as semantic determinatives, rather than logographs in their own
right—a function which bears some comparison to the way radicals work in CJK ideo-
graphs. To simplify the overall typology of Unicode scripts, Egyptian hieroglyphs and
other hieroglyphic systems are lumped together with true logosyllabaries such as Han, but
there are many differences in detail. For more about Egyptian hieroglyphs, in particular,
see Section 11.4, Egyptian Hieroglyphs.
The classification of a writing system is often rendered somewhat ambiguous by complica-
tions in the exact ways in which it matches up written elements to the phonemes or sylla-
bles of a language. For example, although Hiragana is classified as a syllabary, it does not
always have an exact match between syllables and written elements. Syllables with long
vowels are not written with a single element, but rather with a sequence of elements. Thus
the syllable with a long vowel kū is written with two separate Hiragana symbols, {ku}+{u}.
There may also be complications when a writing system deviates from the historical model
from which it derives. For example, Mahajani and Multani are both based on the Brahmi
model, but are structurally simpler than an abugida. These writing systems do not contain
a virama. They also do not have matras and consonant conjunct formation characteristic to
abugidas. Instead, Mahajani and Multani behave respectively as an alphabet and an abjad,
and are encoded and classified accordingly in the Unicode Standard.
Because of these kinds of complications, one must always be careful not to assume too
much about the structure of a writing system from its nominal classification.
Typology of Scripts in the Unicode Standard. Table 6-1 lists all of the scripts currently
encoded in the Unicode Standard, showing the writing system type for each. The list is an
approximate guide, rather than a definitive classification, because of the mix of features
seen in many scripts. The writing systems for some languages may be quite complex, mix-
ing more than one type of script together in a composite system. Japanese is the best exam-
ple; it mixes a logosyllabary (Han), two syllabaries (Hiragana and Katakana), and one
alphabet (Latin, for romaji).
Notational Systems. In addition to scripts for written natural languages, there are nota-
tional systems for other kinds of information. Some of these more closely resemble text
than others. The Unicode Standard encodes symbols for use with mathematical notation,
Western and Byzantine musical notation, Duployan shorthand, Sutton SignWriting nota-
tion for sign languages, and Braille, as well as symbols for use in divination, such as the
Yijing hexagrams. Notational systems can be classified by how closely they resemble text.
Even notational systems that do not fully resemble text may have symbols used in text. In
the case of musical notation, for example, while the full notation is two-dimensional, many
of the encoded symbols are frequently referenced in texts about music and musical nota-
tion.
6.2 General Punctuation
In vertical or bidirectional layout, some punctuation characters take alternate glyph forms when their image is not bilaterally symmetric, such as the slash or the curly quotes. See also
“Paired Punctuation” later in this section.
In vertical writing, many punctuation characters have special vertical glyphs. Normally,
fonts contain both the horizontal and vertical glyphs, and the selection of the appropriate
glyph is based on the text orientation in effect at rendering time. However, see “CJK Com-
patibility Forms: Vertical Forms” later in this section.
Figure 6-2 shows a set of three common shapes used for ideographic comma and ideo-
graphic full stop. The first shape in each row is that used for horizontal text, the last shape is
that for vertical text. The centered form may be used with both horizontal and vertical text.
See also Figure 6-4 for an example of vertical and horizontal forms for quotation marks.
The General Punctuation block (U+2000..U+206F) contains the most common punctua-
tion characters widely used in Latin typography, as well as a few specialized punctuation
marks and a large number of format control characters. All of these punctuation characters
are intended for generic use, and in principle they could be used with any script.
The Supplemental Punctuation block (U+2E00..U+2E7F) is devoted to less commonly
encountered punctuation marks, including those used in specialized notational systems or
occurring primarily in ancient manuscript traditions.
The CJK Symbols and Punctuation block (U+3000..U+303F) has the most commonly
occurring punctuation specific to East Asian typography—that is, typography involving the
rendering of text with CJK ideographs.
The Vertical Forms block (U+FE10..U+FE1F), the CJK Compatibility Forms block
(U+FE30..U+FE4F), the Small Form Variants block (U+FE50..U+FE6F), and the Half-
width and Fullwidth Forms block (U+FF00..U+FFEF) contain many compatibility charac-
ters for punctuation marks, encoded for compatibility with a number of East Asian
character encoding standards. Their primary use is for round-trip mapping with those leg-
acy standards. For vertical text, the regular punctuation characters are used instead, with
alternate glyphs for vertical layout supplied by the font.
The punctuation characters in these various blocks are discussed below in terms of their
general types.
Space Characters
Space characters are found in several blocks in the Unicode Standard. The list of space
characters appears in Table 6-2.
The space characters in the Unicode Standard can be identified by their General Category,
(gc = Zs), in the Unicode Character Database. One exceptional “space” character is
U+200B zero width space. This character, although called a “space” in its name, does not
actually have any width or visible glyph in display. It functions primarily to indicate word
boundaries in writing systems that do not actually use orthographic spaces to separate
words in text. It is given the General Category (gc = Cf ) and is treated as a format control
character, rather than as a space character, in implementations. Further discussion of
U+200B zero width space, as well as other zero-width characters with special properties,
can be found in Section 23.2, Layout Controls.
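For reference, the set of characters with General Category Zs can be enumerated directly from character properties, as in this Python sketch:

import unicodedata

spaces = [cp for cp in range(0x110000)
          if unicodedata.category(chr(cp)) == "Zs"]
for cp in spaces:
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")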
The most commonly used space character is U+0020 space. In ideographic text, U+3000
ideographic space is commonly used because its width matches that of the ideographs.
The main difference among other space characters is their width. U+2000..U+2006 are
standard quad widths used in typography. U+2007 figure space has a fixed width, known
as tabular width, which is the same width as digits used in tables. U+2008 punctuation
space is a space defined to be the same width as a period. U+2009 thin space and U+200A
hair space are successively smaller-width spaces used for narrow word gaps and for justi-
fication of type. The fixed-width space characters (U+2000..U+200A) are derived from
conventional (hot lead) typography. Algorithmic kerning and justification in computerized
typography do not use these characters. However, where they are used (for example, in
typesetting mathematical formulae), their width is generally font-specified, and they typi-
cally do not expand during justification. The exception is U+2009 thin space, which
sometimes gets adjusted.
In addition to the various fixed-width space characters, there are a few script-specific space
characters in the Unicode Standard. U+1680 ogham space mark is unusual in that it is
generally rendered with a visible horizontal line, rather than being blank.
No-Break Space. U+00A0 no-break space (NBSP) is the nonbreaking counterpart of
U+0020 space. It has the same width, but behaves differently for line breaking. For more
information, see Unicode Standard Annex #14, “Unicode Line Breaking Algorithm.”
Unlike U+0020, U+00A0 no-break space behaves as a numeric separator for the purposes
of bidirectional layout. See Unicode Standard Annex #9, “Unicode Bidirectional Algo-
rithm,” for a detailed discussion of the Unicode Bidirectional Algorithm.
U+00A0 no-break space has an additional, important function in the Unicode Standard.
It may serve as the base character for displaying a nonspacing combining mark in apparent
isolation. Versions of the standard prior to Version 4.1 indicated that U+0020 space could
also be used for this function, but space is no longer recommended, because of potential
interactions with the handling of space in XML and other markup languages. See
Section 2.11, Combining Characters, for further discussion.
Narrow No-Break Space. U+202F narrow no-break space (NNBSP) is a narrow version
of U+00A0 no-break space. The NNBSP can be used to represent the narrow space occur-
ring around punctuation characters in French typography, which is called an “espace fine
insécable.” It is used especially in Mongolian text, before certain grammatical suffixes, to
provide a small gap that not only prevents word breaking and line breaking, but also trig-
gers special shaping for those suffixes. See “Narrow No-Break Space” in Section 13.5, Mon-
golian, for more information.
U+2212 minus sign should each be taken as indicating a minus sign, as in “x = a - b”, unless
a higher-level protocol precisely defines which of these characters serves that function.
U+2014 em dash is used to make a break—like this—in the flow of a sentence. (Some
typographers prefer to use U+2013 en dash set off with spaces – like this – to make the
same kind of break.) Like many other conventions for punctuation characters, such usage
may depend on language. This kind of dash is commonly represented with a typewriter as
a double hyphen. In older mathematical typography, U+2014 em dash may also be used to
indicate a binary minus sign. U+2015 horizontal bar is used to introduce quoted text in
some typographic styles.
Dashes and hyphen characters may also be found in other blocks in the Unicode Standard.
A list of dash and hyphen characters appears in Table 6-3. For a description of the line
breaking behavior of dashes and hyphens, see Unicode Standard Annex #14, “Unicode
Line Breaking Algorithm.”
Soft Hyphen. Despite its name, U+00AD soft hyphen is not a hyphen, but rather an invis-
ible format character used to indicate optional intraword breaks. As described in
Section 23.2, Layout Controls, its effect on the appearance of the text depends on the lan-
guage and script used.
Tilde. Although several shapes are commonly used to render U+007E “~” tilde, modern
fonts generally render it with a center line glyph, as shown here and in the code charts.
However, it may also appear as a raised, spacing tilde, serving as a spacing clone of U+0303
combining tilde (see “Spacing Clones of Diacritical Marks” in Section 7.9, Combining
Marks). This is a form common in older implementations, particularly for terminal emula-
tion and typewriter-style fonts.
Some of the common uses of a tilde include indication of alternation, an approximate
value, or, in some notational systems, indication of a logical negation. In the latter context,
it is really being used as a shape-based substitute character for the more precise U+00AC
“¬” not sign. A tilde is also used in dictionaries to repeat the defined term in examples. In
that usage, as well as when used as punctuation to indicate alternation, it is more appropri-
ately represented by a wider form, encoded as U+2053 “⁓” swung dash. U+02DC “˜”
small tilde is a modifier letter encoded explicitly as the spacing form of the combining
tilde as a diacritic. For mathematical usage, U+223C “∼” tilde operator should be used
to unambiguously encode the operator.
Dictionary Abbreviation Symbols. In addition to the widespread use of tilde in dictionar-
ies, more specialized dictionaries may make use of symbols consisting of hyphens or tildes
with dots or circles above or below them to abbreviate the representation of inflected or
derived forms (plurals, case forms, and so on) in lexical entries. U+2E1A hyphen with
diaeresis, for example, is typically used in German dictionaries as a short way of indicat-
ing that the addition of a plural suffix also causes placement of an umlaut on the main stem
vowel. U+2E1B tilde with ring above indicates a change in capitalization for a derived
form, and so on. Such conventions are particularly widespread in German dictionaries, but
may also appear in other dictionaries influenced by German lexicography.
Paired Punctuation
Mirroring of Paired Punctuation. Paired punctuation marks such as parentheses
(U+0028, U+0029), square brackets (U+005B, U+005D), and braces (U+007B, U+007D)
are interpreted semantically rather than graphically in the context of bidirectional or verti-
cal texts; that is, the orientation of these characters toward the enclosed text is maintained
by the software, independent of the writing direction. In a bidirectional context, the glyphs
are adjusted as described in Unicode Standard Annex #9, “Unicode Bidirectional Algo-
rithm.” (See also Section 4.7, Bidi Mirrored.) During display, the software must ensure that
the rendered glyph is the correct one in the context of bidirectional or vertical texts.
Paired punctuation marks containing the qualifier “left” in their name are taken to denote
opening; characters whose name contains the qualifier “right” are taken to denote closing.
For example, U+0028 left parenthesis and U+0029 right parenthesis are interpreted
as opening and closing parentheses, respectively. In a right-to-left directional run, U+0028
is rendered as “)”. In a left-to-right run, the same character is rendered as “(”. In some
mathematical usage, brackets may not be paired, or may be deliberately used in the
reversed sense, such as ]a,b[. Mirroring assures that in a right-to-left environment, such
specialized mathematical text continues to read ]b,a[ and not [b, a]. See also “Language-
Based Usage of Quotation Marks” later in this section.
Quotation Marks and Brackets. Like brackets, quotation marks occur in pairs, with some
overlap in usage and semantics between these two types of punctuation marks. For exam-
ple, some of the CJK quotation marks resemble brackets in appearance, and they are often
used when brackets would be used in non-CJK text. Similarly, both single and double guil-
lemets may be treated more like brackets than quotation marks. Unlike brackets, quotation
marks are not mirrored in a bidirectional context.
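This difference is visible in the Bidi_Mirrored property, as the following Python sketch shows:

import unicodedata

# Paired brackets are mirrored in a right-to-left context; curly quotation marks are not.
for ch in "()[]{}\u2018\u2019\u201C\u201D":
    print(f"U+{ord(ch):04X} Bidi_Mirrored={bool(unicodedata.mirrored(ch))}")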
Some of the editing marks used in annotated editions of scholarly texts exhibit features of
both quotation marks and brackets. The particular convention employed by the editors
determines whether editing marks are used in pairs, which editing marks form a pair, and
which is the opening character.
Horizontal brackets—for example, those used in annotating mathematical expressions—
are not paired punctuation, even though the set includes both top and bottom brackets. See
“Horizontal Brackets” in Section 22.7, Technical Symbols, for more information.
automation purposes and for books. Swedish books sometimes use the guillemet, U+00BB
right-pointing double angle quotation mark, for both opening and closing.
Hungarian and Polish usage of quotation marks is similar to the Scandinavian usage,
except that they use low double quotes for opening quotations. Presumably, these lan-
guages avoid the low single quote so as to prevent confusion with the comma.
French, Greek, Russian, and Slovenian, among others, use the guillemets, but Slovenian
usage is the same as German usage in their direction. Of these languages, at least French
inserts space between text and quotation marks. In the French case, U+00A0 no-break
space can be used for the space that is enclosed between quotation mark and text; this
choice helps line breaking algorithms.
“English” « French »
„German“ »Slovenian«
”Swedish” »Swedish books»
Glyph Variation in Curly Quotes. The glyphs for the quotation marks in the range
U+2018..U+201F may vary significantly across fonts. The two most typical styles use curly
or wedge-shaped glyphs. See Table 6-4.
[Table 6-4 illustrates three glyph models for these quotation marks: the rotated model with curly-style glyphs, the rotated model with wedge-style glyphs, and the mirrored model (for example, Tahoma and Verdana).]
The Unicode code charts use a curly style in a serifed, Times-like font. Because quotation
marks are used in pairs, glyphs within a single style are expected to be in a certain visual
relationship, and that relationship stands regardless of glyph style. The visual relationship
follows either a rotated or a mirrored model. The rotated model is predominant in both
curly and wedge glyph style fonts. These two models are illustrated in Table 6-4 using sam-
ple fonts with different glyph styles. The glyphs are enlarged for clarity.
In the rotated model, turning the ink of the glyph for U+201D right double quotation
mark 180 degrees results in the glyph for U+201C left double quotation mark; flip-
ping it horizontally results in the glyph for U+201F double high-reversed-9 quotation
mark. The same symmetries apply to the raised single quotation marks. Similarly, the
glyphs for the low double quotation marks, U+201E double low-9 quotation mark and
U+2E42 double low-reversed-9 quotation mark, are horizontally flipped images of
each other.
Some fonts in widespread use instead follow the mirrored model, in which the glyph for
U+201C looks like a mirrored image of the glyph for U+201D instead of a rotated image of
it. Most fonts that follow the mirrored model use wedge style glyphs for quotation marks.
In particular, in fonts such as Tahoma and Verdana, the glyph for U+201F is a rotated
image of the glyph for U+201D, which makes the glyphs for U+201C and U+201F appear
swapped compared to the typical design of wedge style quote glyphs using the rotated
model. The sets of glyphs which show these swapped appearances are highlighted by a light
background in Table 6-4.
East Asian Usage. The glyph for each quotation mark character for an Asian character set
occupies predominantly a single quadrant of the character cell. The quadrant used
depends on whether the character is opening or closing and whether the glyph is for use
with horizontal or vertical text.
The pairs of quotation characters are listed in Table 6-5.
Glyph Variation in East Asian Usage. In East Asian usage, the glyphs for “double-prime”
quotation marks U+301D reversed double prime quotation mark and U+301F low
double prime quotation mark consist of a pair of wedges, slanted either forward or
backward, with the tips of the wedges pointing either up or down. In a pair of double-
prime quotes, the closing and the opening character of the pair slant in opposite directions.
Two common variations exist, as shown in Figure 6-4. To confuse matters more, another
form of double-prime quotation marks is used with Western-style horizontal text, in addi-
tion to the curly single or double quotes.
[Figure: font style-based glyph alternates for double-prime quotation marks, shown around the word “Text”.]
Three pairs of quotation marks are used with Western-style horizontal text, as shown in
Table 6-6.
Overloaded Character Codes. The character codes for standard quotes can refer to regular
narrow quotes from a Latin font used with Latin text as well as to wide quotes from an
Asian font used with other wide characters. This situation can be handled with some suc-
cess where the text is marked up with language tags. For more information on narrow and
wide characters, see Unicode Standard Annex #11, “East Asian Width.”
Consequences for Semantics. The semantics of U+00AB left-pointing double angle
quotation mark, U+00BB right-pointing double angle quotation mark, and
U+201D right double quotation mark are context dependent. By contrast, the seman-
tics of U+201A single low-9 quotation mark and U+201B single high-reversed-9
quotation mark are always opening. That usage is distinct from that of U+301F low
double prime quotation mark, which is unambiguously closing. All other quotation
marks may represent opening or closing quotation marks depending on the usage.
Apostrophes
U+0027 apostrophe is the most commonly used character for apostrophe. For historical
reasons, U+0027 is a particularly overloaded character. In ASCII, it is used to represent a
punctuation mark (such as right single quotation mark, left single quotation mark, apos-
trophe punctuation, vertical line, or prime) or a modifier letter (such as apostrophe modi-
fier or acute accent). Punctuation marks generally break words; modifier letters generally
are considered part of a word.
When text is set, U+2019 right single quotation mark is preferred as apostrophe, but
only U+0027 is present on most keyboards. Software commonly offers a facility for auto-
matically converting the U+0027 apostrophe to a contextually selected curly quotation
glyph. In these systems, a U+0027 in the data stream is always represented as a straight ver-
tical line and can never represent a curly apostrophe or a right quotation mark.
Letter Apostrophe. U+02BC modifier letter apostrophe is preferred where the apos-
trophe is to represent a modifier letter (for example, in transliterations to indicate a glottal
stop). In the latter case, it is also referred to as a letter apostrophe.
Punctuation Apostrophe. U+2019 right single quotation mark is preferred where the
character is to represent a punctuation mark, as for contractions: “We’ve been here before.”
In this latter case, U+2019 is also referred to as a punctuation apostrophe.
An implementation cannot assume that users’ text always adheres to the distinction
between these characters. The text may come from different sources, including mapping
from other character sets that do not make this distinction between the letter apostrophe
and the punctuation apostrophe/right single quotation mark. In that case, all of them will
generally be represented by U+2019.
The semantics of U+2019 are therefore context dependent. For example, if surrounded by
letters or digits on both sides, it behaves as an in-text punctuation character and does not
separate words or lines.
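The context-dependent behavior of U+2019 described above can be approximated in software. The following Python sketch is an illustrative heuristic, not a normative rule (the function name is invented here); it treats U+2019 as word-internal when letters or digits appear on both sides:

    import unicodedata

    # Illustrative heuristic: U+2019 behaves as an in-text punctuation
    # apostrophe when letters or digits surround it on both sides.
    def is_word_internal_apostrophe(text, i):
        if text[i] != '\u2019':
            return False
        def alnum(ch):
            return unicodedata.category(ch)[0] in ('L', 'N')
        return 0 < i < len(text) - 1 and alnum(text[i - 1]) and alnum(text[i + 1])

    print(is_word_internal_apostrophe('We\u2019ve been here before.', 2))  # True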
Other Punctuation
Hyphenation Point. U+2027 hyphenation point is a raised dot used to indicate correct
word breaking, as in dic·tion·ar·ies. It is a punctuation mark, to be distinguished from
U+00B7 middle dot, which has multiple semantics.
Word Separator Middle Dot. Historic texts in many scripts, especially those that are hand-
written (manuscripts), sometimes use a raised dot to separate words. Such word-separating
punctuation is comparable in function to the use of space to separate words in modern
typography.
U+2E31 word separator middle dot is a middle dot punctuation mark which is analo-
gous in function to the script-specific character U+16EB runic single punctuation, but
is for use with any script that needs a raised dot for separating words. For example, it can be
used for the word-separating dot seen in Avestan or Samaritan texts.
Fraction Slash. U+2044 fraction slash is used between digits to form numeric fractions,
such as 2/3 and 3/9. The standard form of a fraction built using the fraction slash is defined
as follows: any sequence of one or more decimal digits (General Category = Nd), followed
by the fraction slash, followed by any sequence of one or more decimal digits. Such a frac-
tion should be displayed as a unit, such as ¾ or an equivalent stacked form. The precise choice of display can depend
on additional formatting information.
If the displaying software is incapable of mapping the fraction to a unit, then it can also be
displayed as a simple linear sequence as a fallback (for example, 3/4). If the fraction is to be
separated from a previous number, then a space can be used, choosing the appropriate
width (normal, thin, zero width, and so on). For example, 1 + thin space + 3 + fraction
slash + 4 is displayed as 1¾.
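A simple recognizer for the standard fraction form defined above can be written directly from that definition. The following Python sketch is illustrative only (it ignores the optional separating space):

    import unicodedata

    # Minimal recognizer for the "standard form" of a fraction:
    # decimal digits (General_Category=Nd), U+2044 FRACTION SLASH, digits.
    def is_standard_fraction(s):
        num, slash, den = s.partition('\u2044')
        if not slash:
            return False
        def digits(t):
            return bool(t) and all(unicodedata.category(c) == 'Nd' for c in t)
        return digits(num) and digits(den)

    print(is_standard_fraction('3\u20444'))  # True  (3⁄4)
    print(is_standard_fraction('3/4'))       # False (U+002F SOLIDUS, not U+2044)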
Spacing Overscores and Underscores. U+203E overline is the above-the-line counter-
part to U+005F low line. It is a spacing character, not to be confused with U+0305 com-
bining overline. As with all overscores and underscores, a sequence of these characters
should connect in an unbroken line. The overscoring characters also must be distinguished
from U+0304 combining macron, which does not connect horizontally in this way.
Doubled Punctuation. Several doubled punctuation characters that have compatibility
decompositions into a sequence of two punctuation marks are also encoded as single char-
acters: U+203C double exclamation mark, U+2048 question exclamation mark, and
U+2049 exclamation question mark. These doubled punctuation marks are included as
an implementation convenience for East Asian and Mongolian text, when rendered verti-
cally.
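The compatibility decompositions mentioned above can be inspected with Python's unicodedata module (shown here only as an illustration):

    import unicodedata

    # Compatibility (NFKD) decompositions of the doubled punctuation marks.
    for ch in '\u203C\u2048\u2049':
        print('U+%04X ->' % ord(ch), unicodedata.normalize('NFKD', ch))
    # U+203C -> !!
    # U+2048 -> ?!
    # U+2049 -> !?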
Period or Full Stop. The period, or U+002E full stop, can be circular or square in appear-
ance, depending on the font or script. The hollow circle period used in East Asian texts is
separately encoded as U+3002 ideographic full stop. Likewise, Armenian, Arabic, Ethi-
opic, and several other script-specific periods are coded separately because of their signifi-
cantly different appearance.
In contrast, the various functions of the period, such as its use as sentence-ending punctu-
ation, an abbreviation mark, or a decimal point, are not separately encoded. The specific
semantic therefore depends on context.
In old-style numerals, where numbers vary in placement above and below the baseline, a
decimal or thousands separator may be displayed with a dot that is raised above the base-
line. Because it would be inadvisable to have a stylistic variation between old-style and
new-style numerals that actually changes the underlying representation of text, the Uni-
code Standard considers this raised dot to be merely a glyphic variant of U+002E “.” full
stop.
Ellipsis. The omission of text is often indicated by a sequence of three dots “...”, a punctua-
tion convention called ellipsis. Typographic traditions vary in how they lay out these dots.
In some cases the dots are closely spaced; in other cases the dots are spaced farther apart.
U+2026 horizontal ellipsis is the ordinary Unicode character intended for the repre-
sentation of an ellipsis in text and typically shows the dots separated with a moderate
degree of spacing. A sequence of three U+002E full stop characters can also be used to
indicate an ellipsis, in which case the space between the dots will depend on the font used
for rendering. For example, in a monowidth font, a sequence of three full stops will be wider
than the horizontal ellipsis, but in a typical proportional font, a full stop is very narrow and
a sequence of three of them will be more tightly spaced than the dots in horizontal ellipsis.
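Note that U+2026 has a compatibility decomposition to three full stops, so compatibility normalization conflates the two representations just discussed; the spacing difference is a matter of font, not of encoding. For example, in Python:

    import unicodedata

    # NFKC replaces HORIZONTAL ELLIPSIS with three FULL STOP characters.
    print(unicodedata.normalize('NFKC', '\u2026'))           # ...
    print(unicodedata.normalize('NFKC', '\u2026') == '...')  # True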
Conventions that use four dots for an ellipsis in certain grammatical contexts should repre-
sent them either as a sequence of <full stop, horizontal ellipsis> or <horizontal ellipsis, full
stop> or simply as a sequence of four full stop characters, depending on the requirements
of those conventions.
In East Asian typographic traditions, particularly in Japan, an ellipsis is raised to the center
line of text. When an ellipsis is represented by U+2026 horizontal ellipsis or by
sequences of full stops, this effect requires specialized rendering support. In practice, it is
relatively common for authors of East Asian text to substitute U+22EF midline horizon-
tal ellipsis for this. Because the midline ellipsis is a mathematical symbol, intended to
represent column elision in matrix notation, it is typically used with layout on a mathemat-
ical center line. With appropriate font design to harmonize with East Asian typography,
this midline ellipsis can produce the desired appearance without having to support contex-
tual shifting of the baseline for U+2026 horizontal ellipsis.
Vertical Ellipsis. When text is laid out vertically, the ellipsis is normally oriented so that the
dots run from top to bottom. Most commonly, an East Asian font will contain a vertically
oriented glyph variant of U+2026 for use in vertical text layout. U+FE19 presentation
form for vertical horizontal ellipsis is a compatibility character for use in mapping
to the GB 18030 standard; it would not usually be used for an ellipsis except in systems that
cannot handle the contextual choice of glyph variants for vertical rendering.
U+22EE vertical ellipsis and U+22EF midline horizontal ellipsis are part of a set of
special ellipsis characters used for row or column elision in matrix notation. Although their
primary use is for a mathematical context, U+22EF midline horizontal ellipsis has also
become popular for the midline ellipsis in East Asian typography. When U+22EF is used
this way, an East Asian font will typically contain a rotated glyph variant for use in vertical
text layout. If an appropriate mechanism for glyph variant substitution (such as the “vert”
GSUB feature in the Open Font Format) in vertically rendered text is not available,
U+FE19 presentation form for vertical horizontal ellipsis is the preferred charac-
ter substitution to represent a vertical ellipsis, instead of the mathematical U+22EE verti-
cal ellipsis.
U+205D tricolon has a superficial resemblance to a vertical ellipsis, but is part of a set of
dot delimiter punctuation marks for various manuscript traditions. As for the colon, the
dots in the tricolon are always oriented vertically.
Leader Dots. Leader dots are typically seen in contexts such as a table of contents or in
indices, where they represent a kind of style line, guiding the eye from an entry in the table
to its associated page number. Usually leader dots are generated automatically by page for-
matting software and do not require the use of encoded characters. However, there are
occasional plain text contexts in which a string of leader dots is represented as a sequence
of characters. U+2024 one dot leader and U+2025 two dot leader are intended for
such usage. U+2026 horizontal ellipsis can also serve as a three-dot version of leader
dots. These leader dot characters can be used to control, to a certain extent, the spacing of
leader dots based on font design, in contexts where a simple sequence of full stops will not
suffice.
U+2024 one dot leader also serves as a “semicolon” punctuation in Armenian, where it is
distinguished from U+002E full stop. See Section 7.6, Armenian.
Other Basic Latin Punctuation Marks. The interword punctuation marks encoded in the
Basic Latin block are used for a variety of other purposes. This can complicate the tasks of
parsers trying to determine sentence boundaries. As noted later in this section, some can be
used as numeric separators. Both period and U+003A “:” colon can be used to mark abbre-
viations as in “etc.” or as in the Swedish abbreviation “S:ta” for “Sankta”. U+0021 “!”
exclamation mark is used as a mathematical operator (factorial). U+003F “?” question
mark is often used as a substitution character when mapping Unicode characters to other
character sets where they do not have a representation. This practice can lead to unex-
pected results when the converted data are file names from a file system that supports “?” as
a wildcard character.
Several punctuation marks, such as colon, middle dot and solidus closely resemble mathe-
matical operators, such as U+2236 ratio, U+22C5 dot operator and U+2215 division
slash. The latter are the preferred characters, but the former are often substituted because
they are more easily typed.
Canonical Equivalence Issues for Greek Punctuation. Some commonly used Greek punc-
tuation marks are encoded in the Greek and Coptic block, but are canonical equivalents to
generic punctuation marks encoded in the C0 Controls and Basic Latin block, because they
are indistinguishable in shape. Thus, U+037E “;” greek question mark is canonically
equivalent to U+003B “;” semicolon, and U+0387 “·” greek ano teleia is canonically
equivalent to U+00B7 “·” middle dot. In these cases, as for other canonical singletons, the
preferred form is the character that the canonical singletons are mapped to, namely
U+003B and U+00B7 respectively. Those are the characters that will appear in any normal-
ized form of Unicode text, even when used in Greek text as Greek punctuation. Text seg-
mentation algorithms need to be aware of this issue, as the kinds of text units delimited by
a semicolon or a middle dot in Greek text will typically differ from those in Latin text.
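The effect of these canonical singleton mappings can be observed in any conformant normalizer; for example, with Python's unicodedata module:

    import unicodedata

    # U+037E and U+0387 are canonical singletons; they do not survive normalization.
    print(unicodedata.normalize('NFC', '\u037E') == ';')       # True (U+003B)
    print(unicodedata.normalize('NFC', '\u0387') == '\u00B7')  # True (U+00B7)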
The character properties for U+00B7 middle dot are particularly problematical, in part
because of identifier issues for that character. There is no guarantee that all of its properties
align exactly with U+0387 greek ano teleia, because the latter’s properties are based on
the limited function of the middle dot in Greek as a delimiting punctuation mark.
Bullets. U+2022 bullet is the typical character for a bullet. Within the general punctua-
tion, several alternative forms for bullets are separately encoded: U+2023 triangular
bullet, U+204C black leftwards bullet, and so on. U+00B7 middle dot also often
functions as a small bullet. Bullets mark the head of specially formatted paragraphs, often
occurring in lists, and may use arbitrary graphics or dingbat forms as well as more conven-
tional bullet forms. U+261E white right pointing index, for example, is often used to
highlight a note in text, as a kind of gaudy bullet.
Paragraph Marks. U+00A7 section sign and U+00B6 pilcrow sign are often used as
visible indications of sections or paragraphs of text, in editorial markup, to show format
modes, and so on. Which character indicates sections and which character indicates
paragraphs may vary by convention. U+204B reversed pilcrow sign is a fairly common
alternate representation of the paragraph mark.
Numeric Separators. Any of the characters U+002C comma, U+002E full stop, and the
Arabic characters U+060C, U+066B, or U+066C (and possibly others) can be used as
numeric separator characters, depending on the locale and user customizations.
Obelus. Originally a punctuation mark to denote questionable passages in manuscripts,
U+00F7 ÷ division sign is now most commonly used as a symbol indicating division.
However, modern use is not limited to that meaning. The character is sometimes used to
indicate a range (similar to the en-dash) or as a form of minus sign. The former use is
attested for Russian, Polish, and Italian, and the latter use is still widespread in Scandinavian
countries in some contexts, but may occur elsewhere as well. (See also the following text on
“Commercial Minus.”)
Commercial Minus. U+2052 ⁒ commercial minus sign is used in commercial or tax-
related forms or publications in several European countries, including Germany and Scan-
dinavia. The string “./.” is used as a fallback representation for this character.
The symbol may also appear as a marginal note in letters, denoting enclosures. One varia-
tion replaces the top dot with a digit indicating the number of enclosures.
An additional usage of the sign appears in the Uralic Phonetic Alphabet (UPA), where it
marks a structurally related borrowed element of different pronunciation. In Finland and a
number of other European countries, the dingbats ⁒ and ✓ are always used for “correct”
and “incorrect,” respectively, in marking a student’s paper. This contrasts with American
practice, for example, where ✓ and ✗ might be used for “correct” and “incorrect,” respec-
tively, in the same context.
At Sign. U+0040 commercial at has acquired a prominent modern use as part of the syn-
tax for e-mail addresses. As a result, users in practically every language community sud-
denly needed to use and refer to this character. Consequently, many colorful names have
been invented for this character. Some of these contain references to animals or even pas-
tries. Table 6-7 gives a sample.
Two other bracket characters, U+2E1C left low paraphrase bracket and U+2E1D
right low paraphrase bracket, have particular usage in the N’Ko script, but also may
be used for general editorial punctuation.
Ancient Greek Editorial Marks. Ancient Greek scribes generally wrote in continuous
uppercase letters without separating letters into words. On occasion, the scribe added
punctuation to indicate the end of a sentence or a change of speaker or to separate words.
Editorial and punctuation characters appear abundantly in surviving papyri and have been
rendered in modern typography when possible, often exhibiting considerable glyphic vari-
ation. A number of these editorial marks are encoded in the range U+2E0E..U+2E16.
The punctuation used in Greek manuscripts can be divided into two categories: marginal
or semi-marginal characters that mark the end of a section of text (for example, coronis,
paragraphos), and characters that are mixed in with the text to mark pauses, end of sense,
or separation between words (for example, stigme, hypodiastole). The hypodiastole is used
in contrast with comma and is not a glyph variant of it.
A number of editorial characters are attributed to and named after Aristarchos of Samo-
thrace (circa 216–144 bce), fifth head of the Library at Alexandria. Aristarchos provided a
major edition of the works of Homer, which forms the basis for modern editions.
A variety of Ancient Greek editorial marks are shown in the text of Figure 6-5, including the
editorial coronis and upwards ancora on the left. On the right are illustrated the dotted obe-
los, capital dotted lunate sigma symbol, capital reversed lunate sigma symbol, and a glyph
variant of the downwards ancora. The numbers on the left indicate text lines. A paragraphos
appears below the start of line 12. The opening brackets “[” indicate fragments, where text
is illegible or missing in the original. These examples are slightly adapted and embellished
from editions of the Oxyrhynchus Papyri and Homer’s Iliad.
U+2E0F paragraphos is placed at the beginning of the line but may refer to a break in the
text at any point in the line. The paragraphos should be a horizontal line, generally stretch-
ing under the first few letters of the line it refers to, and possibly extending into the margin.
It should be given a no-space line of its own and does not itself constitute a line or para-
graph break point for the rest of the text. Examples of the paragraphos, forked paragraphos,
and reversed forked paragraphos are illustrated in Figure 6-6.
[Figure 6-6. Examples of the paragraphos, forked paragraphos, and reversed forked paragraphos with Greek text]
Indic Punctuation
Dandas. Dandas are phrase-ending punctuation common to the scripts of South and
South East Asia. The Devanagari danda and double danda characters are intended for
generic use across the scripts of India. They are also occasionally used in Latin translitera-
tion of traditional texts from Indic scripts.
There are minor visual differences in the appearance of the dandas, which may require
script-specific fonts or a font that can provide glyph alternates based on script environ-
ment. For the four Philippine scripts, the analogues to the dandas are encoded once in
Hanunóo and shared across all four scripts. The other Brahmi-derived scripts have sepa-
rately encoded equivalents for the danda and double danda. In some scripts, as for Tibetan,
multiple, differently ornamented versions of dandas may occur. The dandas encoded in the
Unicode Standard are listed in Table 6-8.
The Bidirectional Class of the dandas matches that for the scripts they are intended for.
Kharoshthi, which is written from right to left, has Bidirectional Class R for U+10A56
kharoshthi punctuation danda. For more on bidirectional classes, see Unicode Stan-
dard Annex #9, “Unicode Bidirectional Algorithm.”
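As a quick illustration (not part of the standard text), the Bidirectional_Class values can be read from the Unicode Character Database, for example via Python:

    import unicodedata

    # The Devanagari danda is left-to-right; the Kharoshthi danda is right-to-left.
    print(unicodedata.bidirectional('\u0964'))      # 'L'  DEVANAGARI DANDA
    print(unicodedata.bidirectional('\U00010A56'))  # 'R'  KHAROSHTHI PUNCTUATION DANDA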
Note that the name of the danda in Hindi is viram, while the different Unicode character
named virama is called halant in Hindi. If this distinction is not kept in mind, it can lead to
confusion as to which character is meant.
CJK Punctuation
CJK Punctuation comprises punctuation marks and symbols used by writing systems that
employ Han ideographs. Most of these characters are found in East Asian standards. Typi-
cal for many of these wide punctuation characters is that the actual image occupies only the
left or the right half of the normal square character cell. The extra whitespace is frequently
removed in a kerning step during layout, as shown in Figure 6-7. Unlike ordinary kerning,
which uses tables supplied by the font, the character space adjustment of wide punctuation
characters is based on their character code.
[Figure 6-7. Space adjustment of wide punctuation: U+FF08 + U+FF08 shown before and after kerning]
U+3000 ideographic space is provided for compatibility with legacy character sets. It is a
fixed-width wide space appropriate for use with an ideographic font. For more information
about wide characters, see Unicode Standard Annex #11, “East Asian Width.”
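Because the space adjustment of wide punctuation is driven by character code, implementations typically consult the East_Asian_Width property to identify such characters. An illustrative check in Python (not a layout algorithm):

    import unicodedata

    # Fullwidth punctuation reports East_Asian_Width "F"; its ASCII counterpart is narrow.
    print(unicodedata.east_asian_width('\uFF08'))  # 'F'   FULLWIDTH LEFT PARENTHESIS
    print(unicodedata.east_asian_width('('))       # 'Na'  LEFT PARENTHESIS
    print(unicodedata.east_asian_width('\u3000'))  # 'F'   IDEOGRAPHIC SPACE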
U+3030 wavy dash is a special form of a dash found in East Asian character standards.
(For a list of other space and dash characters in the Unicode Standard, see Table 6-2 and
Table 6-3.)
U+3037 ideographic telegraph line feed separator symbol is a visible indicator of
the line feed separator symbol used in the Chinese telegraphic code. It is comparable to the
pictures of control codes found in the Control Pictures block.
U+3005 ideographic iteration mark is used to stand for the second of a pair of identical
ideographs occurring in adjacent positions within a document.
U+3006 ideographic closing mark is used frequently on signs to indicate that a store or
booth is closed for business. The Japanese pronunciation is shime, most often encountered
in the compound shime-kiri.
The U+3008 and U+3009 angle brackets are unambiguously wide, as are other bracket
characters in this block, such as double angle brackets, tortoise shell brackets, and white
square brackets. Where mathematical and other non-CJK contexts use brackets of similar
shape, the Unicode Standard encodes them separately.
U+3012 postal mark is used in Japanese addresses immediately preceding the numerical
postal code. It is also used on forms and applications to indicate the blank space in which a
postal code is to be entered. U+3020 postal mark face and U+3036 circled postal
mark are properly glyphic variants of U+3012 and are included for compatibility.
U+3031 vertical kana repeat mark and U+3032 vertical kana repeat with voiced
sound mark are used only in vertically written Japanese to repeat pairs of kana characters
occurring immediately prior in a document. The voiced variety U+3032 is used in cases
where the repeated kana are to be voiced. For instance, a repetitive phrase like toki-doki
could be expressed as <U+3068, U+304D, U+3032> in vertical writing. Both of these char-
acters are intended to be represented by “double-height” glyphs requiring two ideographic
“cells” to print; this intention also explains the existence in source standards of the charac-
ters representing the top and bottom halves of these characters (that is, the characters
U+3033, U+3034, and U+3035). In horizontal writing, similar characters are used, and
they are separately encoded. In Hiragana, the equivalent repeat marks are encoded at
U+309D and U+309E; in Katakana, they are U+30FD and U+30FE.
Wave Dash. U+301C wave dash is a compatibility character that was originally encoded
to represent the character in the JIS C 6226-1978 standard and all subsequent revisions and
extensions with the kuten code: 1-33 (0x8160 in Shift-JIS encoding). The mapping of this
character has been problematical. Some major implementations originally mapped, and
continue to map for compatibility purposes, that JIS character to U+FF5E fullwidth
tilde, instead. The mapping issue has been documented in the Unicode Standard since
Version 3.0.
From Version 2.0 through Version 7.0 of the Unicode Standard, U+301C was shown in the
code charts with a representative glyph that had a wide reversed tilde shape. Starting with
Version 8.0, however, the representative glyph has been corrected to a wide tilde shape, to
reflect predominant practice in commercial fonts. For most purposes, U+301C wave dash
should be treated simply as a duplicate representation of U+FF5E fullwidth tilde.
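The mapping divergence described above is visible in existing converters. For example, Python's bundled codecs (an observation about those codecs, not a statement of this standard) decode the same Shift-JIS byte sequence differently:

    # JIS kuten 1-33 is 0x8160 in Shift-JIS encoding.
    print(ascii(b'\x81\x60'.decode('shift_jis')))  # '\u301c'  WAVE DASH
    print(ascii(b'\x81\x60'.decode('cp932')))      # '\uff5e'  FULLWIDTH TILDE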
Sesame Dots. U+FE45 sesame dot and U+FE46 white sesame dot are used in vertical
text, where a series of sesame dots may appear beside the main text, as a sidelining to pro-
vide visual emphasis. In this respect, their usage is similar to such characters as U+FE34
presentation form for vertical wavy low line, which are also used for sidelining ver-
tical text for emphasis. Despite being encoded in the block for CJK compatibility forms, the
sesame dots are not compatibility characters. They are in general typographic use and are
found in the Japanese standard, JIS X 0213.
U+FE45 sesame dot is historically related to U+3001 ideographic comma, but is not
simply a vertical form variant of it. The function of an ideographic comma in connected text
is distinct from that of a sesame dot.
CNS 11643 contains a number of spacing overscores or underscores. They were intended, in the Chinese standard, for the represen-
tation of various types of overlining or underlining, for emphasis of text when laid out hor-
izontally. Except for round-trip mapping with legacy character encodings, the use of these
characters is discouraged; use of styles is the preferred way to handle such effects in mod-
ern text rendering.
Small Form Variants. CNS 11643 also contains a number of small variants of ASCII punc-
tuation characters. The Unicode Standard encodes those variants as compatibility charac-
ters in the Small Form Variants block, U+FE50..U+FE6F. Those characters, while
construed as fullwidth characters, are nevertheless depicted using small forms that are set
in a fullwidth display cell. (See the discussion in Section 18.4, Hiragana and Katakana.)
These characters are provided for compatibility with legacy implementations.
Two small form variants from CNS 11643/plane 1 were unified with other characters out-
side the ASCII block: 0x2131 was unified with U+00B7 middle dot, and 0x2261 was uni-
fied with U+2215 division slash.
Fullwidth and Halfwidth Variants. For compatibility with East Asian legacy character
sets, the Unicode Standard encodes fullwidth variants of ASCII punctuation and halfwidth
variants of CJK punctuation. See Section 18.5, Halfwidth and Fullwidth Forms, for more
information.
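Fullwidth and halfwidth compatibility variants fold to their nominal counterparts under compatibility normalization; for example, in Python:

    import unicodedata

    # NFKC maps fullwidth ASCII variants and halfwidth CJK variants to the
    # corresponding nominal characters.
    print(unicodedata.normalize('NFKC', '\uFF01\uFF08'))        # !(
    print(unicodedata.normalize('NFKC', '\uFF61') == '\u3002')  # True (ideographic full stop)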
Chapter 7
Europe-I
Modern and Liturgical Scripts
Modern European alphabetic scripts are derived from or influenced by the Greek script,
which itself was an adaptation of the Phoenician alphabet. A Greek innovation was writing
the letters from left to right, which is the writing direction for all the scripts derived from or
inspired by Greek.
The alphabetic scripts and additional characters described in this chapter are:
Some scripts whose geographic area of primary usage is outside Europe are included in this
chapter because of their relationship with Greek script. Coptic is used primarily by the
Coptic church in Egypt and elsewhere; Armenian and Georgian are primarily associated
with countries in the Caucasus (which is often not included as part of Europe), although
Armenian in particular is used by a large diaspora.
These scripts are all written from left to right. Many have separate lowercase and uppercase
forms of the alphabet. Spaces are used to separate words. Accents and diacritical marks are
used to indicate phonetic features and to extend the use of base scripts to additional lan-
guages. Some of these modification marks have evolved into small free-standing signs that
can be treated as characters in their own right.
The Latin script is used to write or transliterate texts in a wide variety of languages. The
International Phonetic Alphabet (IPA) is an extension of the Latin alphabet, enabling it to
represent the phonetics of all languages. Other Latin phonetic extensions are used for the
Uralic Phonetic Alphabet and the Teuthonista transcription system.
The Latin alphabet is derived from the alphabet used by the Etruscans, who had adopted a
Western variant of the classical Greek alphabet (Section 8.5, Old Italic). Originally it con-
tained only 24 capital letters. The modern Latin alphabet as it is found in the Basic Latin
block owes its appearance to innovations of scribes during the Middle Ages and practices
of the early Renaissance printers.
The Cyrillic script was developed in the ninth century and is also based on Greek. Like
Latin, Cyrillic is used to write or transliterate texts in many languages. The Georgian and
Armenian scripts were devised in the fifth century and are influenced by Greek.
The Coptic script was the last stage in the development of Egyptian writing. It represented
the adaptation of the Greek alphabet to writing Egyptian, with the retention of forms from
Demotic for sounds not adequately represented by Greek letters. Although primarily used
in Egypt from the fourth to the tenth century, it is described in this chapter because of its
close relationship to the Greek script.
Glagolitic is an early Slavic script related in some ways to both the Greek and the Cyrillic
scripts. It was widely used in the Balkans but gradually died out, surviving the longest in
Croatia. Like Coptic, however, it still has some modern use in liturgical contexts.
This chapter also describes modifier letters and combining marks used with the Latin
script and other scripts.
The block descriptions for other archaic European alphabetic scripts, such as Gothic,
Ogham, Old Italic, and Runic, can be found in Chapter 8, Europe-II.
7.1 Latin
The Latin script was derived from the Greek script. Today it is used to write a wide variety
of languages all over the world. In the process of adapting it to other languages, numerous
extensions have been devised. The most common is the addition of diacritical marks. Fur-
thermore, the creation of digraphs, inverse or reverse forms, and outright new characters
have all been used to extend the Latin script.
The Latin script is written in linear sequence from left to right. Spaces are used to separate
words and provide the primary line breaking opportunities. Hyphens are used where lines
are broken in the middle of a word. (For more information, see Unicode Standard Annex
#14, “Unicode Line Breaking Algorithm.”) Latin letters come in uppercase and lowercase
pairs.
Languages. Some indication of language or other usage is given for many characters within
the names lists accompanying the character charts.
Diacritical Marks. Speakers of different languages treat the addition of a diacritical mark
to a base letter differently. In some languages, the combination is treated as a letter in the
alphabet for the language. In others, such as English, the same words can often be spelled
with and without the diacritical mark without implying any difference. Most languages
that use the Latin script treat letters with diacritical marks as variations of the base letter,
but do not accord the combination the full status of an independent letter in the alphabet.
Widely used accented character combinations are provided as single characters to accom-
modate interoperation with pervasive practice in legacy encodings. Combining diacritical
marks can express these and all other accented letters as combining character sequences.
In the Unicode Standard, all diacritical marks are encoded in sequence after the base char-
acters to which they apply. For more details, see the subsection “Combining Diacritical
Marks” in Section 7.9, Combining Marks, and also Section 2.11, Combining Characters.
Alternative Glyphs. Some characters have alternative representations, although they have
a common semantic. In such cases, a preferred glyph is chosen to represent the character in
the code charts, even though it may not be the form used under all circumstances. Some
Latin examples to illustrate this point are provided in Figure 7-1 and discussed in the text
that follows.
[Figure 7-1. Alternative glyphs: open and closed forms of “a” and “g”; caron versus apostrophe forms of d, t, l; Latvian cedilla forms of g]
Common typographical variations of basic Latin letters include the open- and closed-loop
forms of the lowercase letters “a” and “g”, as shown in the first example in Figure 7-1. In
ordinary Latin text, such distinctions are merely glyphic alternates for the same characters;
however, phonetic transcription systems, such as IPA, often make systematic distinctions
between these forms.
Variations in Diacritical Marks. The shape and placement of diacritical marks can be
subject to considerable variation that might surprise a reader unfamiliar with such distinc-
tions. For example, when Czech is typeset, U+010F latin small letter d with caron
and U+0165 latin small letter t with caron are often rendered by glyphs with an
apostrophe instead of with a caron, commonly known as a háček. See the second example
in Figure 7-1. In Slovak, this use also applies to U+013E latin small letter l with
caron and U+013D latin capital letter l with caron. The use of an apostrophe can
avoid some line crashes over the ascenders of those letters and so result in better typogra-
phy. In typewritten or handwritten documents, or in didactic and pedagogical material,
glyphs with háčeks are preferred.
Characters with cedillas, commas or ogoneks below often are subject to variable typo-
graphical usage, depending on the availability and quality of fonts used, the technology, the
era and the geographic area. Various hooks, cedillas, commas, and squiggles may be substi-
tuted for the nominal forms of these diacritics below, and even the directions of the hooks
may be reversed.
The character U+0327 combining cedilla can be displayed by a wide variety of forms,
including cedillas and commas below. This variability also occurs for the precomposed
characters whose decomposition includes U+0327. For text in some languages, a specific
form is typically preferred. In particular, Latvian and Romanian prefer a comma below,
while a cedilla is preferred in Turkish and Marshallese. These language-specific prefer-
ences are discussed in more detail in the text that follows.
Also, as a result of legacy encodings and practices, and the mapping of those legacy encod-
ings to Unicode, some particular shapes for U+0327 combining cedilla are preferred in
the absence of language or locale context. A rendering as cedilla is preferred for the letters
listed in the first column, while rendering as comma below is preferred for those listed in
the second column of Table 7-1.
Latvian Cedilla. There is specific variation involved in the placement and shapes of cedil-
las on Latvian characters. This is illustrated by the Latvian letter U+0123 latin small let-
ter g with cedilla, as shown in example 3 in Figure 7-1. In good Latvian typography, this
character is always shown with a rotated comma over the g, rather than a cedilla below the
g, because of the typographical design and layout issues resulting from trying to place a
cedilla below the descender loop of the g. Poor Latvian fonts may substitute an acute accent
for the rotated comma, and handwritten or other printed forms may actually show the
cedilla below the g. The uppercase form of the letter is always shown with a cedilla, as the
rounded bottom of the G poses no problems for attachment of the cedilla.
Other Latvian letters with a cedilla below (U+0137 latin small letter k with cedilla,
U+0146 latin small letter n with cedilla, and U+0157 latin small letter r with
cedilla) always prefer a glyph with a floating comma below, as there is no proper attach-
ment point for a cedilla at the bottom of the base form.
Cedilla and Comma Below in Turkish and Romanian. The Latin letters s and t with
comma below or with cedilla diacritics pose particular interpretation issues for Turkish
and Romanian data, both in legacy character sets and in the Unicode Standard. Legacy
character sets generally include a single form for these characters. While the formal inter-
pretation of legacy character sets is that they contain only one of the forms, in practice this
single character has been used to represent any of the forms. For example, 0xBA in ISO
8859-2 is formally defined as a lowercase s with cedilla, but has been used to represent a
lowercase s with comma below for Romanian.
The Unicode Standard provides unambiguous representations for all of the forms, for
example, U+0219 ș latin small letter s with comma below versus U+015F ş latin
small letter s with cedilla. In modern usage, the preferred representation of Roma-
nian text is with U+0219 ș latin small letter s with comma below, while Turkish data
is represented with U+015F ş latin small letter s with cedilla.
However, due to the prevalence of legacy implementations, a large amount of Romanian
data will contain U+015F ş latin small letter s with cedilla or the corresponding
code point 0xBA in ISO 8859-2. When converting data represented using ISO 8859-2,
0xBA should be mapped to the appropriate form. When processing Romanian Unicode
data, implementations should treat U+0219 ș latin small letter s with comma below
and U+015F ş latin small letter s with cedilla as equivalent.
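One minimal way to implement the equivalence recommended above is to fold the cedilla forms to the comma-below forms before comparison. The following Python sketch is illustrative only (the names RO_FOLD and fold_romanian are invented here), and it also covers the analogous t forms:

    # Fold Romanian cedilla forms to comma-below forms before comparison.
    RO_FOLD = str.maketrans({
        '\u015F': '\u0219', '\u015E': '\u0218',  # ş / Ş  ->  ș / Ș
        '\u0163': '\u021B', '\u0162': '\u021A',  # ţ / Ţ  ->  ț / Ț
    })

    def fold_romanian(s):
        return s.translate(RO_FOLD)

    print(fold_romanian('Bra\u015Fov') == 'Bra\u0219ov')  # True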
Exceptional Case Pairs. The characters U+0130 latin capital letter i with dot above
and U+0131 latin small letter dotless i (used primarily in Turkish) are assumed to
take ASCII “i” and “I”, respectively, as their case alternates. This mapping makes the corre-
sponding reverse mapping language-specific; mapping in both directions requires special
attention from the implementer (see Section 5.18, Case Mappings).
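The language-specific nature of this mapping is easy to demonstrate. The default (language-neutral) full case mappings, as implemented for instance by Python's string methods, do not produce the Turkish result; a Turkish-aware lowercasing must special-case I and İ. The helper below is a hypothetical sketch, not a complete tailoring:

    # Default (language-neutral) Unicode case mapping, as implemented by Python:
    print(ascii('\u0130'.lower()))  # 'i\u0307'  (İ lowercases to i + COMBINING DOT ABOVE)
    print('I'.lower())              # i          (not the Turkish dotless ı)

    # A hypothetical Turkish-specific lowercasing, illustrative only:
    def lower_tr(s):
        return s.replace('I', '\u0131').replace('\u0130', 'i').lower()

    print(lower_tr('DİYARBAKIR'))   # diyarbakır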
Diacritics on i and j. A dotted (normal) i or j followed by some common nonspacing
marks above loses the dot in rendering. Thus, in the word naïve, the ï could be spelled with
i + diaeresis. A dotted-i is not equivalent to a Turkish dotless-i + overdot, nor are other cases
of accented dotted-i equivalent to accented dotless-i (for example, i + ¨ ≠ ı + ¨). The same
pattern is used for j. Dotless-j is used in the Landsmålsalfabet, where it does not have a case
pair.
To express the forms sometimes used in the Baltic (where the dot is retained under a top
accent in dictionaries), use i + overdot + accent (see Figure 7-2).
All characters that use their dot in this manner have the Soft_Dotted property in Unicode.
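For example, the spelling of naïve with i + combining diaeresis is canonically equivalent to the precomposed form, as a normalizer will confirm:

    import unicodedata

    # "naïve" spelled with i followed by U+0308 composes to U+00EF under NFC.
    decomposed = 'nai\u0308ve'
    print(unicodedata.normalize('NFC', decomposed) == 'na\u00EFve')  # True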
Vietnamese. The Vietnamese vowels and other letters are found in the Basic Latin, Latin-1 Supplement,
and Latin Extended-A blocks. Additional precomposed vowels and tone marks are found
in the Latin Extended Additional block.
The characters U+0300 combining grave accent, U+0309 combining hook above,
U+0303 combining tilde, U+0301 combining acute accent, and U+0323 combining
dot below should be used in representing the Vietnamese tone marks. The characters
U+0340 combining grave tone mark and U+0341 combining acute tone mark have
canonical equivalences to U+0300 combining grave accent and U+0301 combining
acute accent, respectively; they are not recommended for use in representing Vietnam-
ese tones, despite the presence of tone mark in their character names.
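Because U+0340 and U+0341 are canonical singletons, normalization replaces them with the recommended characters in any case; for example, in Python:

    import unicodedata

    # U+0340/U+0341 decompose to U+0300/U+0301 and do not survive normalization.
    print(ascii(unicodedata.normalize('NFD', '\u0341')))  # '\u0301'
    print(unicodedata.normalize('NFC', 'a\u0340') == unicodedata.normalize('NFC', 'a\u0300'))  # True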
Standards. Unicode follows ISO/IEC 8859-1 in the layout of Latin letters up to U+00FF.
ISO/IEC 8859-1, in turn, is based on older standards—among others, ASCII (ANSI X3.4),
which is identical to ISO/IEC 646:1991-IRV. Like ASCII, ISO/IEC 8859-1 contains Latin
letters, punctuation signs, and mathematical symbols. These additional characters are
widely used with scripts other than Latin. The descriptions of these characters are found in
Chapter 6, Writing Systems and Punctuation, and Chapter 22, Symbols.
The Latin Extended-A block includes characters contained in ISO/IEC 8859—Part 2. Latin
alphabet No. 2, Part 3. Latin alphabet No. 3, Part 4. Latin alphabet No. 4, and Part 9. Latin
alphabet No. 5. Many of the other graphic characters contained in these standards, such as
punctuation, signs, symbols, and diacritical marks, are already encoded in the Latin-1 Sup-
plement block. Other characters from these parts of ISO/IEC 8859 are encoded in other
blocks, primarily in the Spacing Modifier Letters block (U+02B0..U+02FF) and in the
character blocks starting at and following the General Punctuation block. The Latin
Extended-A block also covers additional characters from ISO/IEC 6937.
The Latin Extended-B block covers, among others, characters in ISO 6438
Documentation—African coded character set for bibliographic information interchange,
Pinyin Latin transcription characters from the People’s Republic of China national stan-
dard GB 2312 and from the Japanese national standard JIS X 0212, and Sami characters
from ISO/IEC 8859 Part 10. Latin alphabet No. 6.
The characters in the IPA block are taken from the 1989 revision of the International Pho-
netic Alphabet, published by the International Phonetic Association. Extensions from later
IPA sources have also been added.
Related Characters. For other Latin-derived characters, see Letterlike Symbols
(U+2100..U+214F), Currency Symbols (U+20A0..U+20CF), Number Forms
(U+2150..U+218F), Enclosed Alphanumerics (U+2460..U+24FF), CJK Compatibility
(U+3300..U+33FF), Fullwidth Forms (U+FF21..U+FF5A), and Mathematical Alphanu-
meric Symbols (U+1D400..U+1D7FF).
Latin Extended-A: U+0100–U+017F
The Latin Extended-A block contains a collection of letters that, when added to the letters
in the Basic Latin and Latin-1 Supplement blocks, allow for the representation
of most European languages that employ the Latin script. Many other languages can also
be written with the characters in this block. Most of these characters are equivalent to pre-
composed combinations of base character forms and combining diacritical marks. These
combinations may also be represented by means of combining character sequences. See
Section 2.11, Combining Characters, and Section 7.9, Combining Marks.
Compatibility Digraphs. The Latin Extended-A block contains five compatibility
digraphs, encoded for compatibility with ISO/IEC 6937:1984. Two of these characters,
U+0140 latin small letter l with middle dot and its uppercase version, were origi-
nally encoded in ISO/IEC 6937 for support of Catalan. In current conventions, the repre-
sentation of this digraphic sequence in Catalan simply uses a sequence of an ordinary “l”
and U+00B7 middle dot.
Another pair of characters, U+0133 latin small ligature ij and its uppercase version,
was provided to support the digraph “ij” in Dutch, often termed a “ligature” in discussions
of Dutch orthography. When adding intercharacter spacing for line justification, the “ij” is
kept as a unit, and the space between the i and j does not increase. In titlecasing, both the i
and the j are uppercased, as in the word “IJsselmeer.” Using a single code point might sim-
plify software support for such features; however, because a vast amount of Dutch data is
encoded without this digraph character, under most circumstances one will encounter an
<i, j> sequence.
Finally, U+0149 latin small letter n preceded by apostrophe was encoded for use in
Afrikaans. The character is deprecated, and its use is strongly discouraged. In nearly all
cases it is better represented by a sequence of an apostrophe followed by “n”.
Languages. Most languages supported by this block also require the concurrent use of
characters contained in the Basic Latin and Latin-1 Supplement blocks. When combined
with these two blocks, the Latin Extended-A block supports Afrikaans, Basque, Breton,
Croatian, Czech, Esperanto, Estonian, French, Frisian, Greenlandic, Hungarian, Latin,
Latvian, Lithuanian, Maltese, Polish, Provençal, Rhaeto-Romanic, Romanian, Romany,
Sámi, Slovak, Slovenian, Sorbian, Turkish, Welsh, and many others.
Croatian Digraphs Matching Serbian Cyrillic Letters. Serbo-Croatian is a single language
with paired alphabets: a Latin script (Croatian) and a Cyrillic script (Serbian). A set
of compatibility digraph codes is provided for one-to-one transliteration. There are two
potential uppercase forms for each digraph, depending on whether only the initial letter is
to be capitalized (titlecase) or both (all uppercase). The Unicode Standard offers both
forms so that software can convert one form to the other without changing font sets. The
appropriate cross references are given for the lowercase letters.
Pinyin Diacritic–Vowel Combinations. The Chinese standard GB 2312, the Japanese
standard JIS X 0212, and some other standards include codes for Pinyin, which is used for
Latin transcription of Mandarin Chinese. Most of the letters used in Pinyin romanization
are already covered in the preceding Latin blocks. The group of 16 characters provided
here completes the Pinyin character set specified in GB 2312 and JIS X 0212.
Case Pairs. A number of characters in this block are uppercase forms of characters whose
lowercase forms are part of some other grouping. Many of these characters came from the
International Phonetic Alphabet; they acquired uppercase forms when they were adopted
into Latin script-based writing systems. Occasionally, however, alternative uppercase
forms arose in this process. In some instances, research has shown that alternative upper-
case forms are merely variants of the same character. If so, such variants are assigned a sin-
gle Unicode code point, as is the case of U+01B7 latin capital letter ezh. But when
research has shown that two uppercase forms are actually used in different ways, then they
are given different codes; such is the case for U+018E latin capital letter reversed e
and U+018F latin capital letter schwa. In this instance, the shared lowercase form is
copied to enable unique case-pair mappings: U+01DD latin small letter turned e is a
copy of U+0259 latin small letter schwa.
For historical reasons, the names of some case pairs differ. For example, U+018E latin
capital letter reversed e is the uppercase of U+01DD latin small letter
turned e—not of U+0258 latin small letter reversed e. For default case mappings of
Unicode characters, see Section 4.2, Case.
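These case pairings can be checked against the default case mappings; for example, using Python's built-in string casing (which follows the Unicode Character Database):

    # Distinct uppercase forms for the look-alike lowercase letters:
    print('\u01DD'.upper())              # 'Ǝ' U+018E  (turned e)
    print('\u0259'.upper())              # 'Ə' U+018F  (schwa)
    print('\u018E'.lower() == '\u01DD')  # True: capital reversed e pairs with turned e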
Caseless Letters. A number of letters used with the Latin script are caseless—for example,
the caseless glottal stop at U+0294 and U+01BB latin letter two with stroke, and the
various letters denoting click sounds. Caseless letters retain their shape when uppercased.
When titlecasing words, they may also act transparently; that is, if they occur in the leading
position, the next following cased letter may be uppercased instead.
Over the last several centuries, the trend in typographical development for the Latin script
has tended to favor the eventual introduction of case pairs. See the following discussion of
the glottal stop. The Unicode Standard may encode additional uppercase characters in
such instances. However, for reasons of stability, the standard will never add a new lower-
case form for an existing uppercase character. See also “Caseless Matching” in Section 5.18,
Case Mappings.
Glottal Stop. There are two patterns of usage for the glottal stop in the Unicode Standard.
U+0294 ʔ latin letter glottal stop is a caseless letter used in IPA. It is also widely seen
in language orthographies based on IPA or Americanist phonetic usage, in those instances
where no casing is apparent for glottal stop. Such orthographies may avoid casing for glottal
stop to the extent that when titlecasing strings, a word with an initial glottal stop may have
its second letter uppercased instead of the first letter.
In a small number of orthographies for languages of northwestern Canada, and in particu-
lar, for Chipewyan, Dogrib, and Slavey, case pairs have been introduced for glottal stop. For
these orthographies, the cased glottal stop characters should be used: U+0241 Ɂ latin cap-
ital letter glottal stop and U+0242 ɂ latin small letter glottal stop.
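The two patterns differ in their default case behavior, as a quick check with the default case mappings shows (here via Python's string methods):

    # Caseless U+0294 is unchanged by uppercasing; U+0241/U+0242 form a case pair.
    print('\u0294'.upper() == '\u0294')  # True
    print('\u0242'.upper() == '\u0241')  # True
    print('\u0241'.lower() == '\u0242')  # True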
The glyphs for the glottal stop are somewhat variable and overlap to a certain extent. The
glyph shown in the code charts for U+0294 ʔ latin letter glottal stop is a cap-height
form as specified in IPA, but the same character is often shown with a glyph that resembles
the top half of a question mark and that may or may not be cap height. U+0241 Ɂ latin
capital letter glottal stop, while shown with a larger glyph in the code charts, often
appears identical to U+0294. U+0242 ɂ latin small letter glottal stop is a small form
of U+0241.
Various small, raised hook- or comma-shaped characters are often substituted for a glottal
stop—for instance, U+02BC ʼ modifier letter apostrophe, U+02BB ʻ modifier letter
turned comma, U+02C0 ˀ modifier letter glottal stop, or U+02BE ʾ modifier let-
ter right half ring. U+02BB, in particular, is used in Hawaiian orthography as the
ʻokina.
IPA Extensions: U+0250–U+02AF
The International Phonetic Alphabet (IPA) constitutes an extended alphabet and phonetic transcriptional standard, rather than a character
encoding standard.
Unifications. The IPA characters are unified as much as possible with other letters, albeit
not with nonletter symbols such as U+222B ∫ integral. The IPA characters have also
been adopted into the Latin-based alphabets of many written languages, such as some used
in Africa. It is futile to attempt to distinguish a transcription from an actual alphabet in
such cases. Therefore, many IPA characters are found outside the IPA Extensions block.
IPA characters that are not found in the IPA Extensions block are listed as cross references
at the beginning of the character names list for this block.
IPA Alternates. In a few cases IPA practice has, over time, produced alternate forms, such
as U+0269 “ɩ” latin small letter iota versus U+026A “ɪ” latin letter small capital i.
The Unicode Standard provides separate encodings for the two forms because they are
used in a meaningfully distinct fashion.
Case Pairs. IPA does not sanction case distinctions; in effect, its phonetic symbols are all
lowercase. When IPA symbols are adopted into a particular alphabet and used by a given
written language (as has occurred, for example, in Africa), they acquire uppercase forms.
Because these uppercase forms are not themselves IPA symbols, they are generally encoded
in the Latin Extended-B block (or other Latin extension blocks) and are cross-referenced
with the IPA names list.
Typographic Variants. IPA includes typographic variants of certain Latin and Greek let-
ters that would ordinarily be considered variations of font style rather than of character
identity, such as small capital letterforms. Examples include a typographic variant of the
Greek letter phi φ and the borrowed letter Greek iota ι, which has a unique Latin uppercase
form. These forms are encoded as separate characters in the Unicode Standard because
they have distinct semantics in plain text.
Affricate Digraph Ligatures. IPA officially sanctions six digraph ligatures used in tran-
scription of coronal affricates. These are encoded at U+02A3..U+02A8. The IPA digraph
ligatures are explicitly defined in IPA and have possible semantic values that make them
not simply rendering forms. For example, while U+02A6 latin small letter ts digraph
is a transcription for the sounds that could also be transcribed in IPA as “ts” <U+0074,
U+0073>, the choice of the digraph ligature may be the result of a deliberate distinction
made by the transcriber regarding the systematic phonetic status of the affricate. The
choice of whether to ligate cannot be left to rendering software based on the font available.
This ligature also differs in typographical design from the “ts” ligature found in some old-
style fonts.
Arrangement. The IPA Extensions block is arranged in approximate alphabetical order
according to the Latin letter that is graphically most similar to each symbol. This order has
nothing to do with a phonetic arrangement of the IPA letters.
The Phonetic Extensions Supplement block also contains 37 superscript modifier letters.
These complement the much more commonly used superscript modifier letters found in
the Spacing Modifier Letters block.
U+1D77 latin small letter turned g and U+1D78 modifier letter cyrillic en are
used in Caucasian linguistics. U+1D79 latin small letter insular g is used in older
Irish phonetic notation. It is to be distinguished from a Gaelic style glyph for U+0067
latin small letter g.
Digraph for th. U+1D7A latin small letter th with strikethrough is a digraphic
notation commonly found in some English-language dictionaries, representing the voice-
less (inter)dental fricative, as in thin. While this character is clearly a digraph, the obliga-
tory strikethrough across two letters distinguishes it from a “th” digraph per se, and there is
no mechanism involving combining marks that can easily be used to represent it. A com-
mon alternative glyphic form for U+1D7A uses a horizontal bar to strike through the two
letters, instead of a diagonal stroke.
Uyghur. The Latin orthography for the Uyghur language was influenced by widespread
conventions for extension of the Cyrillic script for representing Central Asian languages. In
particular, a number of Latin characters were extended with a Cyrillic-style descender dia-
critic to create new letters for use with Uyghur.
Claudian Letters. The Roman emperor Claudius invented three additional letters for use
with the Latin script. Those letters saw limited usage during his reign, but were abandoned
soon afterward. The half h letter is encoded in this block. The other two letters are encoded
in other blocks: U+2132 turned capital f and U+2183 roman numeral reversed one
hundred (unified with the Claudian letter reversed c). Claudian letters in inscriptions are
uppercase only, but may be transcribed by scholars in lowercase.
Insular Letters. Insular letterforms were used by Edward Lhuyd in his 1707 work Archæologia Britannica, which described the Late Cornish lan-
guage in a phonetic alphabet using these Insular characters. Other specialists may make
use of these letters contrastively in Old English or Irish manuscript contexts or in second-
ary material discussing such manuscripts.
Orthographic Letter Additions. The letters and modifier letters in the range
U+A788..U+A78C occur in modern orthographies of a few small languages of Africa, Mex-
ico, and New Guinea. Several of these characters were based on punctuation characters
originally, so their shapes are confusingly similar to ordinary ASCII punctuation. Because
of this potential confusion, their use is not generally recommended outside the specific
context of the few orthographies already incorporating them.
Sinological Dot. U+A78F latin letter sinological dot is a middle dot used in the sino-
logical tradition to represent a glottal stop. This convention of representing a glottal stop
with a middle dot was introduced by Bernhard Karlgren in the early 20th century for Mid-
dle Chinese reconstructions, and was adopted by other influential sinologists and Tangu-
tologists. This dot is also used in Latin transliterations of Phags-pa text.
The representative glyph for U+A78F is larger than a typical middle dot used as punctua-
tion, to avoid visual confusion with U+00B7 middle dot. Use of the sinological dot should
be limited to the appropriate scholarly contexts; it is not intended as a letter substitution
for other functions of U+00B7 middle dot.
Early Pinyin Letters. Early, experimental drafts of Pinyin included a number of Latin let-
ters with retroflex or palatal hooks, for example, U+0282 latin small letter s with
hook. These letters were not adopted in standard Pinyin, but are attested in the early doc-
uments and in discussions about the history of Pinyin. Because Pinyin allows for Latin cap-
italization conventions, those letters with hooks also occurred in uppercase forms. The
uppercase letters in the range U+A7C4..U+A7C6 are encoded to represent those uppercase
forms of early Pinyin letters with hooks.
Latvian Letters. The letters with strokes in the range U+A7A0..U+A7A9 are for use in the
pre-1921 orthography of Latvian. During the 19th century and early 20th century, Latvian
was usually typeset in a Fraktur typeface. Because Fraktur typefaces do not work well with
detached diacritical marks, the extra letters required for Latvian were formed instead with
overstruck bars. The new orthography introduced in 1921 replaced these letters with the
current Latvian letters with cedilla diacritics. The barred s letters were also used in Fraktur
representation of Lower Sorbian until about 1950.
Ancient Roman Epigraphic Letters. There are a small number of additional Latin epi-
graphic letters known from Ancient Roman inscriptions. These letters only occurred as
monumental capitals in the inscriptions, and were not part of the regular Latin alphabet
which later developed case distinctions.
Latin Extended-E: U+AB30–U+AB6F
Most of the letters in this block are used for German dialectology, based on the Böhmer-
Ascoli transcription system, more generally known as “Teuthonista.” The Teuthonista sys-
tem was extensively used in the 20th century to transcribe Germanic dialects. Teuthonista
or closely related systems were also used in Switzerland and Italy to transcribe Romance
dialects. For related characters, see the Combining Diacritical Marks Extended block,
which contains a number of specialized combining diacritics for use in Teuthonista.
The Latin Extended-E block also contains a few rarely used letters from other transcription
systems, including conventions used in Sino-Tibetan studies.
7.2 Greek
Greek: U+0370–U+03FF
The Greek script is used for writing the Greek language. The Greek script had a strong
influence on the development of the Latin, Cyrillic, and Coptic scripts.
The Greek script is written in linear sequence from left to right with the frequent use of
nonspacing marks. There are two styles of such use: monotonic, which uses a single mark
called tonos, and polytonic, which uses multiple marks. Greek letters come in uppercase
and lowercase pairs. Spaces are used to separate words and provide the primary line break-
ing opportunities. Archaic Greek texts do not use spaces.
Standards. The Unicode encoding of Greek is based on ISO/IEC 8859-7, which is equiva-
lent to the Greek national standard ELOT 928, designed for monotonic Greek. A number
of variant and archaic characters are taken from the bibliographic standard ISO 5428.
Polytonic Greek. Polytonic Greek, used for ancient Greek (classical and Byzantine) and
occasionally for modern Greek, may be encoded using either combining character
sequences or precomposed base plus diacritic combinations. For the latter, see the follow-
ing subsection, “Greek Extended: U+1F00–U+1FFF.”
Nonspacing Marks. Several nonspacing marks commonly used with the Greek script are
found in the Combining Diacritical Marks range (see Table 7-2).
Because the characters in the Combining Diacritical Marks block are encoded by shape,
not by meaning, they are appropriate for use in Greek where applicable. The character
U+0344 combining greek dialytika tonos should not be used. The combination of dia-
lytika plus tonos is instead represented by the sequence <U+0308 combining diaeresis,
U+0301 combining acute accent>.
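The following short Python sketch, using the standard unicodedata module, illustrates this equivalence; it is an informal example and not part of the formal text of the standard.

    import unicodedata

    # U+0344 decomposes canonically to <U+0308, U+0301>, so normalization
    # always replaces it with the preferred two-mark sequence.
    assert unicodedata.normalize("NFD", "\u0344") == "\u0308\u0301"
    # Any text containing U+0344 is therefore canonically equivalent to text
    # using the explicit <diaeresis, acute> sequence.
    assert (unicodedata.normalize("NFC", "\u03b9\u0344")
            == unicodedata.normalize("NFC", "\u03b9\u0308\u0301"))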
Multiple nonspacing marks applied to the same baseform character are encoded in inside-
out sequence. See the general rules for applying nonspacing marks in Section 2.11, Combin-
ing Characters.
The basic Greek accent written in modern Greek is called tonos. It is represented by an
acute accent (U+0301). The shape that the acute accent takes over Greek letters is generally
steeper than that shown over Latin letters in Western European typographic traditions,
and in earlier editions of this standard was mistakenly shown as a vertical line over the
vowel. Polytonic Greek has several contrastive accents, and the accent, or tonos, written
with an acute accent is referred to as oxia, in contrast to the varia, which is written with a
grave accent.
U+0342 combining greek perispomeni may appear as a circumflex, an inverted breve, a tilde, or occasionally a macron. Because of this variation in form, the perispomeni was encoded distinctly from U+0303 combining tilde.
U+0313 combining comma above and U+0343 combining greek koronis both take the
form of a raised comma over a baseform letter. U+0343 combining greek koronis was
included for compatibility reasons; U+0313 combining comma above is the preferred
form for general use. Greek uses guillemets for quotation marks; for Ancient Greek, the
quotations tend to follow local publishing practice. Because of the possibility of confusion
between smooth breathing marks and curly single quotation marks, the latter are best
avoided where possible. When either breathing mark is followed by an acute or grave
accent, the pair is rendered side-by-side rather than vertically stacked.
Accents are typically written above their base letter in an all-lowercase or all-uppercase
word; they may also be omitted from an all-uppercase word. However, in a titlecase word,
accents applied to the first letter are commonly written to the left of that letter. This is a
matter of presentation only—the internal representation is still the base letter followed by
the combining marks. It is not the stand-alone version of the accents, which occur before
the base letter in the text stream.
Iota. The nonspacing mark ypogegrammeni (also known as iota subscript in English) can
be applied to the vowels alpha, eta, and omega to represent historic diphthongs. This mark
appears as a small iota below the vowel. When applied to a single uppercase vowel, the iota
does not appear as a subscript, but is instead normally rendered as a regular lowercase iota
to the right of the uppercase vowel. This form of the iota is called prosgegrammeni (also
known as iota adscript in English). In completely uppercased words, the iota subscript
should be replaced by a capital iota following the vowel. Precomposed characters that con-
tain iota subscript or iota adscript also have special mappings. (See Section 5.18, Case Map-
pings.) Archaic representations of Greek words, which did not have lowercase or accents,
use the Greek capital letter iota following the vowel for these diphthongs. Such archaic rep-
resentations require special case mapping, which may not be automatically derivable.
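As an informal illustration of these special mappings (assuming a Python build whose case mappings include the relevant SpecialCasing data, as CPython 3.3 and later do):

    word = "\u1fb3"   # greek small letter alpha with ypogegrammeni
    # Full uppercasing replaces the iota subscript with a capital iota
    # following the vowel: <U+0391, U+0399>.
    assert word.upper() == "\u0391\u0399"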
Variant Letterforms. U+03A5 greek capital letter upsilon has two common forms:
one looks essentially like the Latin capital Y, and the other has two symmetric upper
branches that curl like rams’ horns, “ϒ”. The Y-form glyph has been chosen consistently for
use in the code charts, both for monotonic and polytonic Greek. For mathematical usage,
the rams’ horn form of the glyph is required to distinguish it from the Latin Y. A third form
is also encoded as U+03D2 greek upsilon with hook symbol (see Figure 7-4). The pre-
composed characters U+03D3 greek upsilon with acute and hook symbol and
U+03D4 greek upsilon with diaeresis and hook symbol should not normally be
needed, except where necessary for backward compatibility for legacy character sets.
[Figure 7-4: glyph variants of Greek capital upsilon—the Y-form, the rams’-horn form, and U+03D2 greek upsilon with hook symbol; glyph images not reproduced.]
Variant forms of several other Greek letters are encoded as separate characters in this
block. Often (but not always), they represent different forms taken on by the character
when it appears in the final position of a word. Examples include U+03C2 greek small
letter final sigma used in a final position and U+03D0 greek beta symbol, which is
the form that U+03B2 greek small letter beta would take on in a medial or final posi-
tion.
Of these variant letterforms, only final sigma should be used in encoding standard Greek
text to indicate a final sigma. It is also encoded in ISO/IEC 8859-7 and ISO 5428 for this
purpose. Because use of the final sigma is a matter of spelling convention, software should
not automatically substitute a final form for a nominal form at the end of a word. However,
when performing lowercasing, the final form needs to be generated based on the context.
See Section 3.13, Default Case Algorithms.
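A minimal sketch of this context-dependent lowercasing, assuming a CPython version (3.3 or later) that implements the Final_Sigma rule of the default case algorithms:

    # Capital sigma at the end of a word lowercases to final sigma (U+03C2);
    # elsewhere it lowercases to U+03C3.
    assert "\u03a3\u039f\u03a6\u039f\u03a3".lower() == "\u03c3\u03bf\u03c6\u03bf\u03c2"
    # An isolated capital sigma, with no preceding cased letter, is not final.
    assert "\u03a3".lower() == "\u03c3"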
In contrast, U+03D0 greek beta symbol, U+03D1 greek theta symbol, U+03D2 greek
upsilon with hook symbol, U+03D5 greek phi symbol, U+03F0 greek kappa symbol,
U+03F1 greek rho symbol, U+03F4 greek capital theta symbol, U+03F5 greek
lunate epsilon symbol, and U+03F6 greek reversed lunate epsilon symbol should
be used only in mathematical formulas—never in Greek text. If positional or other shape
differences are desired for these characters, they should be implemented by a font or ren-
dering engine.
Representative Glyphs for Greek Phi. Starting with The Unicode Standard, Version 3.0, and
the concurrent second edition of ISO/IEC 10646-1, the representative glyphs for U+03C6
φ greek small letter phi and U+03D5 ϕ greek phi symbol were swapped compared to earlier versions. In ordinary Greek text, the character U+03C6 is used exclusively, although this character has considerable glyphic variation, sometimes represented with a glyph more like the representative glyph shown for U+03C6 φ (the “loopy” form) and less often with a glyph more like the representative glyph shown for U+03D5 ϕ (the “straight” form).
For mathematical and technical use, the straight form of the small phi is an important sym-
bol and needs to be consistently distinguishable from the loopy form. The straight-form
phi glyph is used as the representative glyph for the symbol phi at U+03D5 to satisfy this
distinction.
The representative glyphs were reversed in versions of the Unicode Standard prior to Uni-
code 3.0. This resulted in the problem that the character explicitly identified as the mathe-
matical symbol did not have the straight form of the character that is the preferred glyph
for that use. Furthermore, it made it unnecessarily difficult for general-purpose fonts sup-
porting ordinary Greek text to add support for Greek letters used as mathematical symbols.
This resulted from the fact that many of those fonts already used the loopy-form glyph for
U+03C6, as preferred for Greek body text; to support the phi symbol as well, they would
have had to disrupt glyph choices already optimized for Greek text.
When mapping symbol sets or SGML entities to the Unicode Standard, it is important to
make sure that codes or entities that require the straight form of the phi symbol be mapped
to U+03D5 and not to U+03C6. Mapping to the latter should be reserved for codes or enti-
ties that represent the small phi as used in ordinary Greek text.
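A sketch of such a mapping table is shown below; the entity names are purely hypothetical and are chosen only to illustrate the distinction.

    # Hypothetical symbol-set mapping (names are illustrative, not taken from
    # any particular SGML entity set).
    PHI_MAPPING = {
        "text-phi": "\u03c6",       # ordinary Greek letter phi
        "straight-phi": "\u03d5",   # mathematical straight-form phi symbol
    }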
Fonts used primarily for Greek text may use either glyph form for U+03C6, but fonts that
also intend to support technical use of the Greek letters should use the loopy form to
ensure appropriate contrast with the straight form used for U+03D5.
Greek Letters as Symbols. The use of Greek letters for mathematical variables and opera-
tors is well established. Characters from the Greek block may be used for these symbols.
For compatibility purposes, a few Greek letters are separately encoded as symbols in other
character blocks. Examples include U+00B5 μ micro sign in the Latin-1 Supplement char-
acter block and U+2126 Ω ohm sign in the Letterlike Symbols character block. The ohm
sign is canonically equivalent to the capital omega, and normalization would remove any
distinction. Its use is therefore discouraged in favor of capital omega. The same equivalence
does not exist between micro sign and mu, and use of either character as a micro sign is
common. For Greek text, only the mu should be used.
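The following Python sketch, using the standard unicodedata module, illustrates the differing equivalences; it is an informal example only.

    import unicodedata

    # Ohm sign is canonically equivalent to capital omega and is replaced
    # by it under normalization.
    assert unicodedata.normalize("NFC", "\u2126") == "\u03a9"
    # Micro sign is only compatibility-equivalent to small mu: it survives
    # NFC, and folds to mu only under compatibility normalization.
    assert unicodedata.normalize("NFC", "\u00b5") == "\u00b5"
    assert unicodedata.normalize("NFKC", "\u00b5") == "\u03bc"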
Symbols Versus Numbers. The characters stigma, koppa, and sampi are used only as
numerals, whereas archaic koppa and digamma are used only as letters.
Compatibility Punctuation. Two specific modern Greek punctuation marks are encoded
in the Greek and Coptic block: U+037E “;” greek question mark and U+0387 “·” greek
ano teleia. The Greek question mark (or erotimatiko) has the shape of a semicolon, but
functions as a question mark in the Greek script. The ano teleia has the shape of a middle
dot, but functions as a semicolon in the Greek script.
These two compatibility punctuation characters have canonical equivalences to U+003B
semicolon and U+00B7 middle dot, respectively; as a result, normalized Greek text will
lose any distinctions between the Greek compatibility punctuation characters and the
common punctuation marks. Furthermore, ISO/IEC 8859-7 and most vendor code pages
for Greek simply make use of semicolon and middle dot for the punctuation in question.
Therefore, use of U+037E and U+0387 is not necessary for interoperating with legacy
Greek data, and their use is not generally encouraged for representation of Greek punctua-
tion.
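An informal Python illustration of these equivalences follows.

    import unicodedata

    # Both compatibility punctuation characters are canonically equivalent
    # to common punctuation and are replaced by it under normalization.
    assert unicodedata.normalize("NFC", "\u037e") == ";"       # greek question mark
    assert unicodedata.normalize("NFC", "\u0387") == "\u00b7"  # greek ano teleia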
Historic Letters. Historic Greek letters have been retained from ISO 5428.
Coptic-Unique Letters. In the Unicode Standard prior to Version 4.1, the Coptic script was
regarded primarily as a stylistic variant of the Greek alphabet. The letters unique to Coptic
were encoded in a separate range at the end of the Greek character block. Those characters
were to be used together with the basic Greek characters to represent the complete Coptic
alphabet. Coptic text was supposed to be rendered with a font using the Coptic style of
depicting the characters it shared with the Greek alphabet. Texts that mixed Greek and
Coptic languages using that encoding model could be rendered only by associating an
appropriate font by language.
The Unicode Technical Committee and ISO/IEC JTC1/SC2 determined that Coptic is
better handled as a separate script. Starting with Unicode 4.1, a new Coptic block added all
the letters formerly unified with Greek characters as separate Coptic characters. (See
Section 7.3, Coptic.) Implementations that supported Coptic under the previous encoding
model may, therefore, need to be modified. Coptic fonts may need to continue to support
the display of both the Coptic and corresponding Greek character with the same shape to
facilitate their use with older documents.
Related Characters. For math symbols, see Section 22.5, Mathematical Symbols. For addi-
tional punctuation to be used with this script, see C0 Controls and ASCII Punctuation
(U+0000..U+007F).
editions and scholarly studies of Greek inscriptions. Uppercase Greek letters from the
Greek block are also used for acrophonic numerals.
The Greek acrophonic number system is similar to the Roman one in that it does not use
decimal position, does not require a placeholder for zero, and has special symbols for 5, 50,
500, and so on. The system is language specific because of the acrophonic principle. In
some cases the same symbol represents different values in different geographic regions.
The symbols are also differentiated by the unit of measurement—for example, talents ver-
sus staters.
Other Numerical Symbols. Other numerical symbols encoded in the range
U+10175..U+1018A appear in a large number of ancient papyri. These are the standard symbols used for the representation of numbers, fractions, weights, and measures, and they have consistently been used in modern editions of Greek papyri as well as in various publications related to the study and interpretation of ancient documents. Several of these characters have con-
siderable glyphic variation. Some of these glyph variants are similar in appearance to other
characters.
Symbol for Zero. U+1018A greek zero sign occurs whenever a sexagesimal notation is
used in historical astronomical texts to record degrees, minutes and seconds, or hours,
minutes and seconds. The most common form of zero in the papyri is a small circle with a
horizontal stroke above it, but many variations exist. These are taken to be scribal varia-
tions and are considered glyph variants.
7.3 Coptic
Coptic: U+2C80–U+2CFF
The Coptic script is the final stage in the development of the Egyptian writing system. Cop-
tic was subject to strong Greek influences because Greek was more identified with the
Christian tradition, and the written demotic Egyptian no longer matched the spoken lan-
guage. The Coptic script was based on the Greek uncial alphabet, with several additional letters unique to Coptic. The Coptic language died out in the fourteenth century, but
it is maintained as a liturgical language by Coptic Christians. Coptic is written from left to
right in linear sequence; in modern use, spaces are used to separate words and provide the
primary line breaking opportunities.
Prior to Version 4.1, the Unicode Standard treated Coptic as a stylistic variant of Greek.
Seven letters unique to Coptic (14 characters with the case pairs) were encoded in the
Greek and Coptic block. In addition to these 14 characters, Version 4.1 added a Coptic
block containing the remaining characters needed for basic Coptic text processing. This
block also includes standard logotypes used in Coptic text as well as characters for Old
Coptic and Nubian.
Development of the Coptic Script. The best-known Coptic dialects are Sahidic and
Bohairic. Coptic scholarship recognizes a number of other dialects that use additional
characters. The repertoires of Sahidic and Bohairic reflect efforts to standardize the writing
of Coptic, but attempts to write the Egyptian language with the Greek script preceded that
standardization by several centuries. During the initial period of writing, a number of dif-
ferent solutions to the problem of representing non-Greek sounds were made, mostly by
borrowing letters from Demotic writing. These early efforts are grouped by Copticists
under the general heading of Old Coptic.
Casing. Coptic is considered a bicameral script. Historically, it was caseless, but it has
acquired case through the typographic developments of the last centuries. Already in Old
Coptic manuscripts, letters could be written larger, particularly at the beginning of para-
graphs, although the capital letters tend to have the most distinctive shapes in the Bohairic
tradition. To facilitate scholarly and other modern casing operations, Coptic has been
encoded as a bicameral script, including uniquely Old Coptic characters.
Font Styles. Bohairic Coptic uses only a subset of the letters in the Coptic repertoire. It also
uses a font style distinct from that for Sahidic. Prior to Version 5.0, the Coptic letters
derived from Demotic, encoded in the range U+03E2..U+03EF in the Greek and Coptic
block, were shown in the code charts in a Bohairic font style. Starting from Version 5.0, all
Coptic letters in the standard, including those in the range U+03E2..U+03EF, are shown in
the code charts in a Sahidic font style, instead.
Characters for Cryptogrammic Use. U+2CB7 coptic small letter cryptogrammic eie
and U+2CBD coptic small letter cryptogrammic ni are characters for cryptogram-
mic use. A common Coptic substitution alphabet that was used to encrypt texts had the dis-
advantageous feature whereby three of the letters (eie, ni, and fi) were substituted by
themselves. However, because eie and ni are two of the highest-frequency characters in
Coptic, Copts felt that the encryption was not strong enough, so they replaced those letters
with these cryptogrammic ones. Two additional cryptogrammic letters in less frequent use
are also encoded: U+2CEC coptic small letter cryptogrammic shei and U+2CEE
coptic small letter cryptogrammic gangia. Copticists preserve these letter substitu-
tions in modern editions of these encrypted texts and do not consider them to be glyph
variants of the original letters.
U+2CC0 coptic capital letter sampi has a numeric value of 900 and corresponds to
U+03E0 greek letter sampi. It is not found in abecedaria, but is used in cryptogrammic
contexts as a letter.
Crossed Shei. U+2CC3 coptic small letter crossed shei is found in Dialect I of Old Coptic, where it represents the sound /ç/. It is found alongside U+03E3 coptic small letter shei, which represents /š/. The diacritic is not productive.
Supralineation. In Coptic texts, a line is often drawn across the top of two or more charac-
ters in a row. There are two distinct conventions for this supralineation, each of which is
represented by different sequences of combining marks.
The first of these is a convention for abbreviation, in which words are shortened by
removal of certain letters. A line is then drawn across the tops of all of the remaining letters,
extending from the beginning of the first to the end of the last letter of the abbreviated
form. This convention is represented by following each character of the abbreviated form
with U+0305 combining overline. When rendered together, these combining overlines
should connect into a continuous line.
The other convention is to distinguish the spelling of certain common words or to high-
light proper names of divinities and heroes—a convention related to the use of cartouches
in hieroglyphic Egyptian. In this case the supralineation extends from the middle of the
first character in the sequence to the middle of the last character in the sequence. Instead of
using U+0305 combining overline for the entire sequence, one uses U+FE24 combining
macron left half after the first character, U+FE25 combining macron right half
after the last character, and U+FE26 combining conjoining macron after any interven-
ing characters. This gives the effect of a line starting and ending in the middle of letters,
rather than at their edges.
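The following sketch shows how the two conventions might be encoded for a short sequence of letters; the particular Coptic letters used here are arbitrary examples chosen only for illustration.

    # Abbreviation convention: every letter of the abbreviated form is
    # followed by U+0305 combining overline.
    abbreviation = "".join(ch + "\u0305" for ch in "\u2c93\u2c9f\u2c89")

    # Name/highlighting convention: half marks at the ends, conjoining
    # macrons over any intervening letters.
    highlighted = ("\u2c93\ufe24"    # first letter + macron left half
                   + "\u2c9f\ufe26"  # intervening letter + conjoining macron
                   + "\u2c89\ufe25") # last letter + macron right half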
Combining Diacritical Marks. Bohairic text uses a mark called jinkim to represent syllabic
consonants, which is indicated by either U+0307 combining dot above or U+0300 com-
bining grave accent. Other dialects, including Sahidic, use U+0304 combining macron
for the same purpose. A number of other generic diacritical marks are used with Coptic.
U+2CEF coptic combining ni above is a script-specific combining mark, typically used at
the end of a line to indicate a final ni after a vowel. In rendering, this mark typically hangs
over the space to the right of its base character.
The characters U+2CF0 coptic combining spiritus asper and U+2CF1 coptic com-
bining spiritus lenis are analogues of the Greek breathing marks. They are used rarely in
Coptic. When used, they typically occur over the letter U+2C8F coptic small letter
hate, sometimes to indicate that it is the borrowed Greek conjunction “or”, written with
the cognate Greek letter eta.
Punctuation. Coptic texts use common punctuation, including colon, full stop, semicolon
(functioning, as in Greek, as a question mark), and middle dot. Quotation marks are found
in edited texts. In addition, Coptic-specific punctuation occurs: U+2CFE coptic full
stop and U+2CFF coptic morphological divider. Several other historic forms of punc-
tuation are known only from Old Nubian texts.
Numerical Use of Letters. Numerals are indicated with letters of the alphabet, as in Greek.
Sometimes the numerical use is indicated specifically by marking a line above, represented
with U+0305 combining overline. U+0375 greek lower numeral sign or U+033F
combining double overline can be used to indicate multiples of 1,000, as shown in
Figure 7-5.
U+0374 greek numeral sign is used to indicate fractions. For example, a letter followed by the numeral sign can indicate the fractional value 1/3. There is, however, a special symbol for 1/2: U+2CFD coptic fraction one half.
7.4 Cyrillic
The Cyrillic script is one of several scripts that were ultimately derived from the Greek
script. The details of the history of that development and of the relationship between early
forms of writing systems for Slavic languages have been lost.
used for writing various Slavic languages, among which Russian is predominant. The earli-
est attestations of Cyrillic are for Old Church Slavonic manuscripts, dating to the 10th cen-
tury ce. Old Church Slavonic is also commonly referred to as Old Church Slavic, and is
abbreviated as OCS.
In the nineteenth and early twentieth centuries, Cyrillic was extended to write the non-
Slavic minority languages of Russia and neighboring countries.
Structure. The Cyrillic script is written in linear sequence from left to right with the occa-
sional use of nonspacing marks. Cyrillic letters have uppercase and lowercase pairs. Spaces
are used to separate words and provide the primary line breaking opportunities.
Historic Letterforms. The historic form of the Cyrillic alphabet—most notably that seen in
Old Church Slavonic manuscripts—is treated as a font style variation of modern Cyrillic.
The historic forms of the letters are relatively close to their modern appearance, and some
of the historic letters are still in modern use in languages other than Russian. For example,
U+0406 “I” cyrillic capital letter byelorussian-ukrainian i is used in modern
Ukrainian and Byelorussian, and is encoded amidst other modern Cyrillic extensions.
Some of the historic letterforms were used in modern typefaces in Russian and Bulgarian.
Prior to 1917, Russian made use of yat, fita, and izhitsa; prior to 1945, Bulgaria made use of
these three as well as big yus.
Glagolitic. The particular early Slavic writing known as Glagolitic is treated as a distinct
script from Cyrillic, rather than as a font style variation. The letterforms for Glagolitic,
even though historically related, appear unrecognizably different from most modern Cyril-
lic letters. Glagolitic was also limited to a certain historic period; it did not grow to match
the repertoire expansion of the Cyrillic script. See Section 7.5, Glagolitic.
Cyrillic: U+0400–U+04FF
Standards. The Cyrillic block of the Unicode Standard is based on ISO/IEC 8859-5.
Extended Cyrillic. These letters are used in alphabets for Turkic languages such as Azer-
baijani, Bashkir, Kazakh, and Tatar; for Caucasian languages such as Abkhasian, Avar, and
Chechen; and for Uralic languages such as Mari, Khanty, and Kildin Sami. The orthogra-
phies of some of these languages have often been revised in the past; some of them have
switched from Arabic to Latin to Cyrillic, and back again. Azerbaijani, for instance, is now
officially using a Turkish-based Latin script.
Abkhasian. The Cyrillic orthography for Abkhasian has been updated fairly frequently
over the course of the 20th and early 21st centuries. Some of these revisions involved
changes in letterforms, often for the diacritic descenders used under extended Cyrillic let-
ters for Abkhasian. The most recent such reform has been reflected in glyph changes for
Abkhaz-specific Cyrillic letters in the code charts. In particular, U+04BF cyrillic small
letter abkhasian che with descender, is now shown with a straight descender dia-
critic. In code charts for Version 5.1 and earlier, that character was displayed with a repre-
sentative glyph using an ogonek-type hook descender, more typical of historic
orthographies for Abkhasian. The glyph for U+04A9 cyrillic small letter abkhasian
ha was also updated.
Other changes for Abkhasian orthography represent actual respellings of text. Of particular
note, the character added in Version 5.2, U+0525 cyrillic small letter pe with
descender, is intended as a replacement for U+04A7 cyrillic small letter pe with
middle hook, which was used in older orthographies.
Palochka. U+04C0 “I” cyrillic letter palochka is used in Cyrillic orthographies for a
number of Caucasian languages, such as Adyghe, Avar, Chechen, and Kabardian. The
name palochka itself is based on the Russian word for “stick,” referring to the shape of the
letter. The glyph for palochka is usually indistinguishable from an uppercase Latin “I” or
U+0406 “I” cyrillic capital letter byelorussian-ukrainian i; however, in some ser-
ifed fonts it may be displayed without serifs to make it more visually distinct.
In use, palochka typically modifies the reading of a preceding letter, indicating that it is an
ejective. The palochka is generally caseless and should retain its form even in lowercased
Cyrillic text. However, there is some evidence of distinctive lowercase forms; for those
instances, U+04CF cyrillic small letter palochka may be used.
Broad Omega. The name of U+047D cyrillic small letter omega with titlo is
anomalous. It does not actually have a titlo, but instead represents a broad omega with a
great apostrof diacritic. (See U+A64D cyrillic small letter broad omega.) The great
apostrof is a stylized diacritical mark consisting of the soft breathing mark (see U+0486
combining cyrillic psili pneumata) and the Cyrillic kamora (see U+0311 combining
inverted breve). Functionally, U+047D is analogous to the Greek character U+1F66
greek small letter omega with psili and perispomeni. Both the Greek and the
Church Slavonic characters have identical functions—to record the exclamation “Oh!”
U+047D is also known as the Cyrillic beautiful omega.
Digraph Onik and Monograph Uk. U+0479 cyrillic small letter uk was intended for
representation of the Church Slavonic uk vowel, which sometimes is rendered as a digraph
onik form and sometimes as a monograph uk form. However, that ambiguity of rendering
is not optimal for the representation of Church Slavonic text. The current recommenda-
tion is to avoid the use of U+0479, as well as its corresponding uppercase U+0478. The
digraph onik has the preferred spelling consisting of the letter sequence <U+043E cyrillic
small letter o, U+0443 cyrillic small letter u>. The monograph uk should be repre-
sented instead by an unambiguous letter intended specifically for that form: U+A64B
cyrillic small letter monograph uk.
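Expressed as code point sequences (an informal illustration):

    digraph_onik = "\u043e\u0443"   # <cyrillic small letter o, cyrillic small letter u>
    monograph_uk = "\ua64b"         # U+A64B cyrillic small letter monograph uk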
Palatalization. U+0484 combining cyrillic palatalization is a diacritical mark used in
ancient manuscripts and in academic work to indicate that a consonant is softened, a phe-
nomenon called palatalization in Cyrillic studies. Although the shape of the diacritic is sim-
ilar, this should not be confused with the use of U+0311 combining inverted breve to
contrasted in Figure 7-6 for the sequence <U+2DE3 combining cyrillic letter de,
U+A675 combining cyrillic letter i>.
[Figure 7-6: ligated versus stacked rendering of the sequence <U+2DE3 combining cyrillic letter de, U+A675 combining cyrillic letter i>; glyph images not reproduced.]
A wide variety of composite titlo letters can be encountered in Old Church Slavonic manu-
scripts, including such combinations as ghe-o, de-ie, de-i, de-o, de-uk, el-i, em-i, es-te, and
many others. One of these combinations has been encoded atomically in Unicode as
U+2DF5 combining cyrillic letter es-te. However, the preferred representation of a
composite titlo es-te is the sequence <U+2DED combining cyrillic letter es, U+2DEE
combining cyrillic letter te>.
The glyphs in the code chart for the Cyrillic Extended-A block are based on the modern
Cyrillic letters to which these titlo letters correspond, but in Old Church Slavonic manu-
scripts, the actual glyphs used are related to the older forms of Cyrillic letters.
No separate uppercase letters are encoded for these historic variants; they pair with the existing
uppercase Cyrillic letters.
7.5 Glagolitic
Glagolitic: U+2C00–U+2C5F
Glagolitic, from the Slavic root glagol, meaning “word,” is an alphabet considered to have
been devised by Saint Cyril in or around 862 ce for his translation of the Scriptures and
liturgical books into Slavonic. The relatively few Glagolitic inscriptions and manuscripts
that survive from this early period are of great philological importance. Glagolitic was
eventually supplanted by the alphabet now known as Cyrillic.
Like Cyrillic, the Glagolitic script is written in linear sequence from left to right with no
contextual modification of the letterforms. Spaces are used to separate words and provide
the primary line breaking opportunities.
In parts of Croatia where a vernacular liturgy was used, Glagolitic continued in use until
modern times: the last Glagolitic missal was printed in Rome in 1893 with a second edition
in 1905. In these areas Glagolitic is still occasionally used as a decorative alphabet.
Glyph Forms. Glagolitic exists in two styles, known as round and square. Round Glagolitic
is the original style and more geographically widespread, although surviving examples are
less numerous. Square Glagolitic (and the cursive style derived from it) was used in Croatia
from the thirteenth century. There are a few documents written in a style intermediate
between the two. The letterforms used in the charts are round Glagolitic. Several of the let-
ters have variant glyph forms, which are not encoded separately.
Ordering. The ordering of the Glagolitic alphabet is largely derived from that of the Greek
alphabet, although nearly half the Glagolitic characters have no equivalent in Greek and
not every Greek letter has its equivalent in Glagolitic.
Punctuation and Diacritics. Glagolitic texts use common punctuation, including comma,
full stop, semicolon (functioning, as in Greek, as a question mark), and middle dot. In addi-
tion, several forms of multiple-dot, archaic punctuation occur, including U+2056 three
dot punctuation, U+2058 four dot punctuation, and U+2059 five dot punctua-
tion. Quotation marks are found in edited texts. Glagolitic also used numerous diacritical
marks, many of them shared in common with Cyrillic.
Numerical Use of Letters. Glagolitic letters have inherent numerical values. A letter may be
rendered with a line above or a tilde above to indicate the numeric usage explicitly. Alterna-
tively, U+00B7 middle dot may be used, flanking a letter on both sides, to indicate
numeric usage of the letter.
7.6 Armenian
Armenian: U+0530–U+058F
The Armenian script is used primarily for writing the Armenian language. It is written
from left to right. Armenian letters have uppercase and lowercase pairs. Spaces are used to
separate words and provide the primary line breaking opportunities.
The Armenian script was devised about 406 ce by Mesrop Maštoc‘ to give Armenians
access to Christian scriptural and liturgical texts, which were otherwise available only in
Greek and Syriac. The script has been used to write Classical or Grabar Armenian, Middle
Armenian, and both of the literary dialects of Modern Armenian: East and West Armenian.
Orthography. Mesrop’s original alphabet contained 30 consonants and 6 vowels in the fol-
lowing ranges:
U+0531..U+0554 Ա..Ք Ayb to K‘ē
U+0561..U+0584 ա..ք ayb to k‘ē
Armenian spelling was consistent during the Grabar period, from the fifth to the tenth cen-
turies ce; pronunciation began to change in the eleventh century. In the twelfth century,
the letters ō and fē were added to the alphabet to represent the diphthong [aw] (previously written աւ aw) and the foreign sound [f], respectively. The Soviet Armenian government
implemented orthographic reform in 1922 and again in 1940, creating a difference
between the traditional Mesropian orthography and what is known as Reformed orthogra-
phy. The 1922 reform limited the use of w to the digraph ow (or u) and treated this digraph
as a single letter of the alphabet.
User Community. The Mesropian orthography is presently used by West Armenian speak-
ers who live in the diaspora and, rarely, by East Armenian speakers whose origins are in
Armenia but who live in the diaspora. The Reformed orthography is used by East Arme-
nian speakers living in the Republic of Armenia and, occasionally, by West Armenian
speakers who live in countries formerly under the influence of the former Soviet Union.
Spell-checkers and other linguistic tools need to take the differences between these orthog-
raphies into account, just as they do for British and American English.
Punctuation. Armenian makes use of a number of punctuation marks also used in other
European scripts. Armenian words are delimited with spaces and may terminate on either
a space or a punctuation mark. U+0589 ։ armenian full stop, called verjakēt in Armenian, is used to end sentences. A shorter stop functioning like the semicolon (like the ano teleia in Greek, but normally placed on the baseline like U+002E full stop) is called mijakēt; it is represented by U+2024 . one dot leader. U+055D ՝ armenian comma is actually used more as a kind of colon than as a comma; it combines the functionality of both elision and pause. Its Armenian name is bowt’. In Armenian dialect materials, U+0308 combining diaeresis appears over uppercase U+0531 ayb and lowercase U+0561 ayb, and lowercase U+0585 oh and U+0578 vo.
7.7 Georgian
Georgian: U+10A0–U+10FF
Georgian Extended: U+1C90–U+1CBF
Georgian Supplement: U+2D00–U+2D2F
The Georgian script is used primarily for writing the Georgian language and its dialects. It
is also used for the Svan and Mingrelian languages and in the past was used for Abkhaz and
other languages of the Caucasus. It is written from left to right. Spaces are used to separate
words and provide the primary line breaking opportunities.
Script Forms. The script name “Georgian” in the Unicode Standard is used for what are
really two closely related scripts. The original Georgian writing system was an inscriptional
form called Asomtavruli, from which a manuscript form called Nuskhuri was derived.
Together these forms are categorized as Khutsuri (ecclesiastical), in which Asomtavruli is
used as the uppercase and Nuskhuri as the lowercase. This development of a bicameral
script parallels the evolution of the Latin alphabet, in which the original linear monumen-
tal style became the uppercase and manuscript styles of the same alphabet became the low-
ercase. The Khutsuri script is still used for liturgical purposes, but was replaced, through a
history now uncertain, by an alphabet called Mkhedruli (military), which is the form used
for nearly all modern Georgian writing. The Georgian Mkhedruli alphabet has been funda-
mentally caseless since its development.
The scholar Akaki Shanidze attempted to introduce a casing practice for Georgian in the
1950s, but this system failed to gain popularity. In his typographic departure, he used the
Asomtavruli forms to represent uppercase letters, alongside “lowercase” Mkhedruli.
Following this failed casing practice with Asomtavruli forms, Mtavruli forms developed as
a particular style of Mkhedruli in which the distinction between letters with ascenders and
descenders was not maintained. All letters written in the Mtavruli style appear with an
equal height standing on the baseline, similar to small caps in the Latin script.
Version 11.0 of the Unicode standard added a set of Mtavruli letters at U+1C90..U+1CBF.
These Mtavruli letters have a casing relationship defined with Mkhedruli letters: the
Mtavruli letters are the uppercase forms of the Mkhedruli letters, which now are consid-
ered lowercase forms.
Figure 7-7 uses Akaki Shanidze’s name to illustrate the various forms of Georgian text.
Both the modern Mkhedruli lowercase form and the Asomtavruli inscriptional form are
encoded in the Georgian block. The Nuskhuri script form is encoded in the Georgian Sup-
plement block, and the modern Mtavruli uppercase form is encoded in the Georgian
Extended block.
Case Forms. For most of modern Mkhedruli writing, Mtavruli has been used as an
emphatic or headline style. In Version 11.0 of the Unicode Standard, that usage was broad-
ened to define formal case pair mappings between these forms, with Mkhedruli serving as
lowercase and Mtavruli serving as uppercase. Georgian casing established in Version 11.0
does not extend to title casing, as the Georgian script does not have title casing for individ-
ual words or sentences. Mtavruli continues to be used as an emphatic and headline style.
The Unicode Standard also provides case mappings between the two Khutsuri forms:
Asomtavruli and Nuskhuri.
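A minimal Python sketch of these case pairings, assuming an interpreter whose Unicode data is Version 11.0 or later (for example, CPython 3.7 and later):

    # Mkhedruli letters uppercase to the corresponding Mtavruli letters.
    assert "\u10d0\u10d1".upper() == "\u1c90\u1c91"   # an, ban
    assert "\u1c90".lower() == "\u10d0"
    # The Khutsuri pairing: Nuskhuri lowercase, Asomtavruli uppercase.
    assert "\u2d00".upper() == "\u10a0"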
Punctuation. Modern Georgian text uses generic European conventions for punctuation.
See the common punctuation marks in the Basic Latin and General Punctuation blocks.
Historic Punctuation. Historic Georgian manuscripts, particularly text in the older, eccle-
siastical styles, use manuscript punctuation marks common to the Byzantine tradition.
These include single, double, and multiple dot punctuation. For a single dot punctuation
mark, U+00B7 middle dot or U+2E31 word separator middle dot may be used. His-
toric double and multiple dot punctuation marks can be found in the U+2056..U+205E
range in the General Punctuation block and in the U+2E2A..U+2E2D range in the Supple-
mental Punctuation block.
U+10FB georgian paragraph separator is a historic punctuation mark commonly used
in Georgian manuscripts to delimit text elements comparable to a paragraph level.
Although this punctuation mark may demarcate a paragraph in exposition, it does not
force an actual paragraph termination in the text flow. To cause a paragraph termination,
U+10FB must be followed by a newline character, as described in Section 5.8, Newline
Guidelines.
Prior to Version 6.0 the Unicode Standard recommended the use of U+0589 armenian
full stop as the two dot version of the full stop for historic Georgian documents. This is
no longer recommended because designs for Armenian fonts may be inconsistent with the
display of Georgian text, and because other, generic two dot punctuation characters are
available in the standard, such as U+205A two dot punctuation or U+003A colon.
For additional punctuation to be used with this script, see C0 Controls and ASCII Punctu-
ation (U+0000..U+007F) and General Punctuation (U+2000..U+206F).
7.8 Modifier Letters
phies outside the context of technical phonetic transcriptional systems. This subset of
modifier letters is also known as “modifier symbols.”
This distinction between gc = Lm and gc = Sk is reflected in other Unicode specifications
relevant to identifiers and word boundary determination. Modifier letters with gc = Lm are
included in the set definitions that result in the derived properties ID_Start and ID_Con-
tinue (and XID_Start and XID_Continue). As such, they are considered part of the default
definition of Unicode identifiers. Modifier symbols (gc = Sk), on the other hand, are not
included in those set definitions, and so are excluded by default from Unicode identifiers.
Modifier letters (gc = Lm) have the derived property Alphabetic, while modifier symbols
(gc = Sk) do not. Modifier letters (gc = Lm) also have the word break property value (wb =
ALetter), while modifier symbols (gc = Sk) do not. This means that for default determina-
tion of word break boundaries, modifier symbols will cause a word break, while modifier
letters proper will not.
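The distinction can be observed directly in an implementation; the following Python sketch is an informal illustration.

    import unicodedata

    assert unicodedata.category("\u02b0") == "Lm"  # modifier letter small h
    assert unicodedata.category("\u02c2") == "Sk"  # modifier letter left arrowhead
    # Default identifier syntax (XID_Start/XID_Continue) admits Lm but not Sk.
    assert ("a" + "\u02b0").isidentifier()
    assert not ("a" + "\u02c2").isidentifier()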
Blocks. Most general use modifier letters (and modifier symbols) were collected together
in the Spacing Modifier Letters block (U+02B0..U+02FF), the UPA-related Phonetic
Extensions block (U+1D00..U+1D7F), the Phonetic Extensions Supplement block
(U+1D80..U+1DBF), and the Modifier Tone Letters block (U+A700..U+A71F). However,
some script-specific modifier letters are encoded in the blocks appropriate to those scripts.
They can be identified by checking for their General_Category values.
Character Names. There is no requirement that the Unicode character names for modifier
letters contain the label “modifier letter”, although most of them do.
character names list. There are also instances where an IPA modifier letter is explicitly
equated in semantic value to an IPA nonspacing diacritic form.
Superscript Letters. Some of the modifier letters are superscript forms of other letters. The
most commonly occurring of these superscript letters are encoded in this block, but many
others, particularly for use in UPA, can be found in the Phonetic Extensions block
(U+1D00..U+1D7F) and in the Phonetic Extensions Supplement block
(U+1D80..U+1DBF). The superscript forms of the i and n letters can be found in the
Superscripts and Subscripts block (U+2070..U+209F). The fact that the latter two letters
contain the word “superscript” in their names instead of “modifier letter” is an historical
artifact of original sources for the characters, and is not intended to convey a functional
distinction in the use of these characters in the Unicode Standard.
Superscript modifier letters are intended for cases where the letters carry a specific mean-
ing, as in phonetic transcription systems, and are not a substitute for generic styling mech-
anisms for superscripting of text, as for footnotes, mathematical and chemical expressions,
and the like.
The superscript modifier letters are spacing letters, and should be distinguished from
superscripted combining Latin letters. The superscripted combining Latin letters, as for
example those encoded in the Combining Diacritical Marks block in the range
U+0363..U+036F, are associated with the Latin historic manuscript tradition, often repre-
senting various abbreviatory conventions in text.
Spacing Clones of Diacritics. Some corporate standards explicitly specify spacing and
nonspacing forms of combining diacritical marks, and the Unicode Standard provides
matching codes for these interpretations when practical. A number of the spacing forms
are included in the Basic Latin and Latin-1 Supplement blocks. The six common European
diacritics that do not have spacing forms encoded in those blocks are encoded as spacing
characters in the Spacing Modifier Letters block instead. These forms can have multiple
semantics, such as U+02D9 dot above, which is used as an indicator of the Mandarin
Chinese fifth (neutral) tone.
Rhotic Hook. U+02DE modifier letter rhotic hook is defined in IPA as a free-standing
modifier letter. In common usage, it is treated as a ligated hook on a baseform letter. Hence
U+0259 latin small letter schwa + U+02DE modifier letter rhotic hook may be
treated as equivalent to U+025A latin small letter schwa with hook.
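Note that this equivalence is a matter of convention rather than of normalization, as the following informal Python sketch shows.

    import unicodedata

    # U+025A has no canonical decomposition, so the two representations
    # remain distinct under normalization; treating them as equivalent is
    # left to the implementation.
    assert unicodedata.decomposition("\u025a") == ""
    assert unicodedata.normalize("NFC", "\u0259\u02de") != "\u025a"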
Tone Letters. U+02E5..U+02E9 comprises a set of basic tone letters defined in IPA and
commonly used in detailed tone transcriptions of African and other languages. Each tone
letter refers to one of five distinguishable tone levels. To represent contour tones, the tone
letters are used in combinations. The rendering of contour tones follows a regular set of
ligation rules that results in a graphic image of the contour (see Figure 7-8).
For example, the sequence “1 + 5” in the first row of Figure 7-8 indicates the sequence of
the lowest tone letter, U+02E9 modifier letter extra-low tone bar, followed by the
highest tone letter, U+02E5 modifier letter extra-high tone bar. In that sequence, the
tone letter is drawn with a ligation from the iconic position of the low tone to that of the
high tone to indicate the sharp rising contour. A sequence of three tone letters may also be
ligated, as shown in the last row of Figure 7-8, to indicate a low rising-falling contour tone.
oxia), it can appear nearly upright. U+030C combining caron is commonly rendered as
an apostrophe when used with certain letterforms. U+0326 combining comma below is
sometimes rendered as a turned comma above on a lowercase “g” to avoid conflict with the
descender. In many fonts, there is no clear distinction made between U+0326 combining
comma below and U+0327 combining cedilla.
Combining accents above the base glyph are usually adjusted in height for use with upper-
case versus lowercase forms. In the absence of specific font protocols, combining marks are
often designed as if they were applied to typical base characters in the same font. However,
this will result in suboptimal appearance in rendering and may cause security problems.
See Unicode Technical Report #36, “Unicode Security Considerations.”
For more information, see Section 5.13, Rendering Nonspacing Marks.
Overlaid Diacritics. A few combining marks are encoded to represent overlaid diacritics
such as U+0335 combining short stroke overlay (= “bar”) or hooks modifying the
shape of base characters, such as U+0322 combining retroflex hook below. Such over-
laid diacritics are not used in decompositions of characters in the Unicode Standard. Over-
laid combining marks for the indication of negation of mathematical symbols are an
exception to this rule and are discussed later in this section.
One should use the combining marks for overlaid diacritics sparingly and with care, as ren-
dering them on letters may create opportunities for spoofing and other confusion.
Sequences of a letter followed by an overlaid diacritic or hook character are not canonically
equivalent to any preformed encoded character with diacritic even though they may
appear the same. See “Non-decomposition of Certain Diacritics” in Section 2.12, Equivalent
Sequences for more discussion of the implications of overlaid diacritics for normalization
and for text matching operations.
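A brief Python illustration of this non-decomposition (the particular characters are chosen only as an example):

    import unicodedata

    # U+0268 latin small letter i with stroke has no decomposition, so it is
    # not canonically equivalent to <i, U+0335 combining short stroke overlay>,
    # even though the two may look alike.
    assert unicodedata.decomposition("\u0268") == ""
    assert unicodedata.normalize("NFC", "i\u0335") != "\u0268"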
Marks as Spacing Characters. By convention, combining marks may be exhibited in
(apparent) isolation by applying them to U+00A0 no-break space. This approach might
be taken, for example, when referring to the diacritical mark itself as a mark, rather than
using it in its normal way in text. Prior to Version 4.1 of the Unicode Standard, the stan-
dard also recommended the use of U+0020 space for display of isolated combining marks.
This is no longer recommended, however, because of potential conflicts with the handling
of sequences of U+0020 space characters in such contexts as XML.
In charts and illustrations in this standard, the combining nature of these marks is illustrated
by applying them to a dotted circle, as shown in the examples throughout this standard.
In a bidirectional context, using any character with neutral directionality (that is, with a
Bidirectional Class of ON, CS, and so on) as a base character, including U+00A0 no-break
space, a dotted circle, or any other symbol, can lead to unintended separation of the base
character from certain types of combining marks during bidirectional ordering. The result
is that the combining mark will be graphically applied to something other than the correct
base. This affects spacing combining marks (that is, with a General Category of Mc) but
not nonspacing combining marks. The unintended separation can be prevented by brack-
eting the combining character sequence with RLM or LRM characters as appropriate. For
more details on bidirectional reordering, see Unicode Standard Annex #9, “Unicode Bidi-
rectional Algorithm.”
Spacing Clones of Diacritical Marks. The Unicode Standard separately encodes clones of
many common European diacritical marks, primarily for compatibility with existing char-
acter set standards. These cloned accents and diacritics are spacing characters and can be
used to display the mark in isolation, without application to a no-break space. They are
cross-referenced to the corresponding combining mark in the names list in the Unicode
code charts. For example, U+02D8 breve is cross-referenced to U+0306 combining
breve. Most of these spacing clones also have compatibility decomposition mappings
involving U+0020 space, but implementers should be cautious in making use of those
decomposition mappings because of the complications that can arise from replacing a
spacing character with a space + combining mark sequence.
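For example (an informal Python sketch using the unicodedata module):

    import unicodedata

    # U+02D8 breve has only a compatibility decomposition to <space, U+0306>.
    assert unicodedata.decomposition("\u02d8") == "<compat> 0020 0306"
    assert unicodedata.normalize("NFC", "\u02d8") == "\u02d8"     # unchanged
    assert unicodedata.normalize("NFKD", "\u02d8") == " \u0306"   # space + combining breve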
Relationship to ISO/IEC 8859-1. ISO/IEC 8859-1 contains eight characters that are
ambiguous regarding whether they denote combining characters or separate spacing char-
acters. In the Unicode Standard, the corresponding code points (U+005E ^ circumflex
accent, U+005F _ low line, U+0060 ` grave accent, U+007E ~ tilde, U+00A8 ¨
diaeresis, U+00AF ¯ macron, U+00B4 ´ acute accent, and U+00B8 ¸ cedilla) are
used only as spacing characters. The Unicode Standard provides unambiguous combining
characters in the Combining Diacritical Marks block, which can be used to represent
accented Latin letters by means of composed character sequences.
U+00B0 ° degree sign is also occasionally used ambiguously by implementations of ISO/
IEC 8859-1 to denote a spacing form of a diacritic ring above a letter; in the Unicode Stan-
dard, that spacing diacritical mark is denoted unambiguously by U+02DA ° ring above.
U+007E “~” tilde is ambiguous between usage as a spacing form of a diacritic and as an
operator or other punctuation; it is generally rendered with a center line glyph, rather than
as a diacritic raised tilde. The spacing form of the diacritic tilde is denoted unambiguously
by U+02DC “˜” small tilde.
Diacritics Positioned Over Two Base Characters. IPA, pronunciation systems, some trans-
literation systems, and a few languages such as Tagalog use diacritics that are applied to a
sequence of two letters. This display of diacritics over two letters, also known as the use of
double diacritics, is most often noted for the Latin script, which is widely used for transcrip-
tion and transliteration. However, the use of double diacritics is not limited to the Latin
script.
In rendering, these marks of unusual size appear as wide diacritics spanning across the top
(or bottom) of the two base characters. The Unicode Standard contains a set of double-dia-
critic combining marks to represent such forms. Like all other combining nonspacing
marks, these marks apply to the previous base character, but they are intended to hang over
the following letter as well. For example, the character U+0360 combining double tilde
is intended to be displayed as depicted in Figure 7-9.
[Figure 7-9: U+0360 combining double tilde rendered across two base letters, for the sequences <U+006E, U+0360> and <U+006E, U+0360, U+0067>; glyph images not reproduced.]
The Unicode Standard also contains a set of combining half diacritical marks, which can be used as an alternative, but not generally recommended, way of representing diacritics over a sequence of two (or more) letters. See “Combining Half Marks” later in this section
and Figure 7-15.
The double-diacritical marks have a very high combining class—higher than all other non-
spacing marks except U+0345 iota subscript—and so always are at or near the end of a
combining character sequence when canonically reordered. In rendering, the double dia-
critic will float above other diacritics above (or below other diacritics below)—excluding
surrounding diacritics—as shown in Figure 7-10.
[Figure 7-10: the canonically equivalent sequences <U+0061, U+0302, U+0360, U+0063, U+0308> and <U+0061, U+0360, U+0302, U+0063, U+0308>, both rendered with the double tilde floating above the other accents; glyph images not reproduced.]
In Figure 7-10, the first line shows a combining character sequence in canonical order, with
the double-diacritic tilde following a circumflex accent. The second line shows an alterna-
tive order of the two combining marks that is canonically equivalent to the first line.
Because of this canonical equivalence, the two sequences should display identically, with
the double diacritic floating above the other diacritics applied to single base characters.
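This canonical equivalence can be checked mechanically; the following Python sketch is illustrative only.

    import unicodedata

    seq1 = "a\u0302\u0360c\u0308"   # circumflex before double tilde
    seq2 = "a\u0360\u0302c\u0308"   # the same marks in the other order
    # The double tilde has a higher combining class than the circumflex,
    # so canonical reordering makes the two sequences equivalent.
    assert unicodedata.combining("\u0360") == 234
    assert unicodedata.combining("\u0302") == 230
    assert (unicodedata.normalize("NFD", seq1)
            == unicodedata.normalize("NFD", seq2))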
Occasionally one runs across orthographic conventions that use a dot, an acute accent, or
other simple diacritic above a ligature tie—that is, U+0361 combining double inverted
breve. Because of the considerations of canonical order just discussed, one cannot repre-
sent such text simply by putting a combining dot above or combining acute directly after
U+0361 in the text. Instead, the recommended way of representing such text is to place
U+034F combining grapheme joiner (CGJ) between the ligature tie and the combining
mark that follows it, as shown in Figure 7-11.
[Figure 7-11: the sequence <U+0075, U+0361, U+034F, U+0301, U+0069>, with the acute accent rendered above the ligature tie; glyph images not reproduced.]
Because CGJ has a combining class of zero, it blocks reordering of the double diacritic to
follow the second combining mark in canonical order. The sequence of <CGJ, acute> is
then rendered with default stacking, placing it centered above the ligature tie. This conven-
tion can be used to create similar effects with combining marks above other double diacrit-
ics (or below double diacritics that render below base characters).
For more information on the combining grapheme joiner, see “Combining Grapheme
Joiner” in Section 23.2, Layout Controls.
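An informal Python sketch of the effect of CGJ on canonical reordering:

    import unicodedata

    without_cgj = "u\u0361\u0301i"
    with_cgj    = "u\u0361\u034f\u0301i"
    # Without CGJ, the acute (ccc 230) reorders before the ligature tie
    # (ccc 234) and then composes with the letter u.
    assert unicodedata.normalize("NFC", without_cgj) == "\u00fa\u0361i"
    # With CGJ (ccc 0) in between, the acute stays where it was placed.
    assert unicodedata.normalize("NFC", with_cgj) == with_cgj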
Diacritics Positioned Over Three or More Base Characters. Some transcriptional systems
extend the convention of double-diacritic display and show diacritics above (or below)
three or more base letters. There are no characters encoded in the Unicode Standard which
are specifically designated for plain text representation of triple diacritics. Instead, the rec-
ommendation of the Unicode Standard is to use text markup for such representation. The
application of modifying text marks to arbitrary spans of text exceeds the normal scope of
plain text and is usually better dealt with by conventions designed for rich text. In some
limited circumstances, the combining half mark diacritics can be used in combinations to
represent triple diacritics, but the display of half mark diacritics used in this way often is
unsatisfactory in plain text rendering.
Subtending Marks. An additional class of marks called subtending marks is positioned
under (or occasionally over or surrounding) a sequence of several other characters. For-
mally, these marks are not treated as combining marks (gc = M), but instead as format
characters (gc = Cf ). In the text representation, they precede the sequence of characters
they subtend, rather than follow a single base character, as combining marks do.
Although the terms subtending marks and prefixed format control characters have been
used for these special marks for a number of versions of the Unicode Standard, as of Ver-
sion 9.0 another more precise but equivalent term has been introduced for them: pre-
pended concatenation marks. That term focuses on the order of occurrence of the marks
(prepended to the sequence following them in the backing store), rather than the graphical
positioning of the visible mark in the final displayed rendering of the sequences. A binary
character property has also been introduced to refer to this class of marks as a whole: Pre-
pended_Concatenation_Mark. Proper display of these marks requires specialized render-
ing support, as the shapes of the marks may adjust depending on the length of the
following sequence of characters.
The use of subtending marks is most notably associated with the Arabic script. They typi-
cally occur before a sequence of digits and are then displayed with different styles of
extended swashes underneath the digits. In Arabic, these marks often indicate whether the
sequence of digits is to be interpreted as a number or a date, for example. Similar subtend-
ing marks are encoded for other scripts, including Syriac and Kaithi. (See Section 9.2, Ara-
bic, Section 9.3, Syriac, and Section 15.2, Kaithi for a number of examples and further
discussion.)
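As a simple illustration of the text order (the particular digits are arbitrary):

    import unicodedata

    # U+0600 arabic number sign precedes, in the backing store, the digits it
    # is displayed beneath; it is a format character, not a combining mark.
    marked_number = "\u0600\u0661\u0662\u0663\u0664"
    assert unicodedata.category("\u0600") == "Cf"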
Combining Marks with Ligatures. According to Section 3.6, Combination, for a simple combining character sequence such as <i, combining circumflex>, the nonspacing mark both applies to and depends on the base character i. If the i is preceded by a character that can ligate with it, additional considerations apply.
Figure 7-12 shows typical examples of the interaction of combining marks with ligatures. The sequence <f, i, combining circumflex> is canonically equivalent to <f, î>. This implies that both sequences should be rendered identically, if possible. The precise way in which the sequence is rendered depends on whether the f and i of the first sequence ligate. If so, the result of applying the circumflex should be the same as ligating an f with an î. The appearance depends on whatever typographical rules are established for this case, as illustrated in the first example of Figure 7-12. Note that the two characters f and î may not ligate, even if the sequence <f, i> does.
[Figure 7-12: renderings of the sequences <f, i, ◌̂>, <f, ◌̃, i, ◌̂>, and <f, ◌̂, i, ◌̃>, with and without ligation of f and i]
The second and third examples show that by default the sequence <f, ◌̃, i, ◌̂> is visually
distinguished from the sequence <f, ◌̂, i, ◌̃> by the relative placement of the accents. This is
true whether or not the <f, ◌̃> and the <i, ◌̂> ligate. Example 4 shows that the two
sequences are not canonically equivalent.
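These equivalence relationships can be verified with a normalization library. A minimal Python sketch using the standard library's unicodedata module, with the circumflex and tilde standing in for the accents of Figure 7-12:

import unicodedata

# <f, i, combining circumflex> is canonically equivalent to <f, i-circumflex>.
assert unicodedata.normalize("NFC", "fi\u0302") == unicodedata.normalize("NFC", "f\u00EE")

# The sequences with the accents exchanged (examples 2-4) are not
# canonically equivalent to each other.
a = "f\u0303i\u0302"                      # <f, tilde, i, circumflex>
b = "f\u0302i\u0303"                      # <f, circumflex, i, tilde>
assert unicodedata.normalize("NFD", a) != unicodedata.normalize("NFD", b)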
In some writing systems, established typographical rules further define the placement of
combining marks with respect to ligatures. As long as the rendering correctly reflects the
identity of the character sequence containing the marks, the Unicode Standard does not
prescribe such fine typographical details.
Compatibility characters such as the fi-ligature are not canonically equivalent to the
sequence of characters in their compatibility decompositions. Therefore, sequences like
<fi-ligature, ◌̂> may legitimately differ in visual representation from <f, i, ◌̂>, just as the
visual appearance of other compatibility characters may be different from that of the
sequence of characters in their compatibility decompositions. By default, a compatibility
character such as fi-ligature is treated as a single base glyph.
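The different treatment of canonical and compatibility equivalence can be illustrated with a short Python sketch (standard library unicodedata; illustrative only):

import unicodedata

fi = "\uFB01"                             # LATIN SMALL LIGATURE FI, a compatibility character

# Canonical normalization leaves the ligature character intact ...
assert unicodedata.normalize("NFC", fi) == fi
assert unicodedata.normalize("NFD", fi) == fi

# ... while compatibility normalization replaces it with its decomposition <f, i>.
assert unicodedata.normalize("NFKD", fi) == "fi"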
Standards. The combining diacritical marks are derived from a variety of sources, includ-
ing IPA, ISO 5426, and ISO 6937.
Underlining and Overlining. The characters U+0332 combining low line, U+0333 com-
bining double low line, U+0305 combining overline, and U+033F combining dou-
ble overline are intended to connect on the left and right. Thus, when used in
combination, they could have the effect of continuous lines above or below a sequence of
characters. However, because of their interaction with other combining marks and other
layout considerations such as intercharacter spacing, their use for underlining or overlin-
ing of text is discouraged in favor of using styled text.
[Figure: the canonically equivalent sequences <e, combining dot below, combining parentheses below> and <U+1EB9 latin small letter e with dot below, U+1ABD combining parentheses below>]
In contrast with the three combining parentheses diacritical marks above or below, which
combine with other diacritics, U+1ABE combining parentheses overlay is a regular
enclosing mark, intended to surround a base character. The exact placement of the overlay
U+1ABE with respect to a base character is not specified by the Unicode Standard, but may
be adjusted for a particular base character as needed in fonts. For example, in the context
of phonetic transcription for German dialectology, the combining character sequence
<U+014B latin small letter eng, U+1ABE combining parentheses overlay> could
be rendered with the parentheses placed lower to surround the descender of the letter eng.
[Figure: U+2261 identical to combined with U+20D2 combining long vertical line overlay]
Enclosing Marks. These nonspacing characters are supplied for compatibility with existing
standards, allowing individual base characters to be enclosed in several ways. For example,
U+2460 circled digit one can be expressed as U+0031 digit one “1” + U+20DD
combining enclosing circle. For additional examples, see Figure 2-17.
The combining enclosing marks surround their grapheme base and any intervening non-
spacing marks. These marks are intended for application to free-standing symbols. See
“Application of Combining Marks” in Section 3.6, Combination.
Users should be cautious when applying combining enclosing marks to other than free-
standing symbols—for example, when using a combining enclosing circle to apply to a let-
ter or a digit. Most implementations assume that application of any nonspacing mark will
not change the character properties of a base character. This means that even though the
intent might be to create a circled symbol (General_Category = So), most software will
continue to treat the base character as an alphabetic letter or a numeric digit. Note that
there is no canonical equivalence between a symbolic character such as U+24B6 circled
latin capital letter a and the sequence <U+0041 latin capital letter a, U+20DD
combining enclosing circle>, partly because of this difference in treatment of proper-
ties.
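The difference in property treatment described above can be observed directly. A minimal Python sketch using the standard library's unicodedata module (the particular characters are illustrative):

import unicodedata

circled_one = "\u2460"                    # CIRCLED DIGIT ONE (gc = No)
one_enclosed = "1\u20DD"                  # <DIGIT ONE, COMBINING ENCLOSING CIRCLE>
circled_a = "\u24B6"                      # CIRCLED LATIN CAPITAL LETTER A (gc = So)
a_enclosed = "A\u20DD"                    # <LATIN CAPITAL LETTER A, COMBINING ENCLOSING CIRCLE>

# No canonical equivalence: the precomposed symbols survive normalization unchanged.
assert unicodedata.normalize("NFD", circled_a) != unicodedata.normalize("NFD", a_enclosed)

# The enclosing mark does not change the properties of the base character.
print(unicodedata.category(circled_one), unicodedata.category(one_enclosed[0]))   # No Nd
print(unicodedata.category(circled_a), unicodedata.category(a_enclosed[0]))       # So Lu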
Chapter 8
Europe-II
Ancient and Other Scripts
This chapter describes ancient scripts of Europe, as well as other historic and limited-use
scripts of Europe not covered in Chapter 7, Europe-I. This includes the various ancient
Mediterranean scripts, other early alphabets and sets of runes, some poorly attested his-
toric scripts of paleographic interest, and more recently devised constructed scripts with
significant usage.
Unicode encodes a number of ancient scripts, which have not been in normal use for a mil-
lennium or more, as well as historic scripts, whose usage ended in recent centuries.
Although they are no longer used to write living languages, documents and inscriptions
using these scripts exist, both for extinct languages and for precursors of modern lan-
guages. The primary user communities for these scripts are scholars interested in studying
the scripts and the languages written in them. Some of the historic scripts are related to
each other as well as to modern alphabets.
The Linear A script is an ancient writing system used from approximately 1700–1450 bce
on and around the island of Crete. The script contains more than ninety signs in regular
use and a host of logograms. Surviving examples are inscribed on clay tablets, stone tables,
and metals. The language of the inscriptions has not yet been deciphered.
Both Linear B and Cypriot are syllabaries that were used to write Greek. Linear B is the
older of the two scripts, and there are some similarities between a few of the characters that
may not be accidental. Cypriot may descend from Cypro-Minoan, which in turn may
descend from Linear B.
The ancient Anatolian alphabets Lycian, Carian, and Lydian all date from the first millen-
nium bce, and were used to write various ancient Indo-European languages of western and
southwestern Anatolia. All are closely related to the Greek script.
Old Italic was derived from Greek and was used to write Etruscan and other languages in
Italy. It was borrowed by the Romans and is the immediate ancestor of the Latin script now
used worldwide. One of the Old Italic alphabets of northern Italy may have influenced the
development of the Runic script, which has a distinct angular appearance owing to its use
in carving inscriptions in stone and wood.
Old Hungarian is another historical runiform script, used to write the Hungarian language
in Central Europe. In recent decades it has undergone a significant revival in Hungary. It
has developed casing, and is now used with modern typography to print significant
amounts of material in the modern Hungarian language. It is laid out right-to-left.
The Ogham script is indigenous to Ireland. While its originators may have been aware of
the Latin or Greek scripts, it seems clear that the sound values of Ogham letters were suited
to the phonology of a form of Primitive Irish.
The Gothic script, like Cyrillic, was developed on the basis of Greek at a much later date
than Old Italic.
Elbasan, Caucasian Albanian, and Old Permic are all simple alphabetic scripts. Elbasan is
an historic alphabetic script invented in the middle of the eighteenth century to write Alba-
nian. It is named after the city where it originated. Caucasian Albanian dates from the early
fifth century and is related to the modern Udi language. Old Permic was devised in the four-
teenth century to write the Uralic languages Komi and Komi-Permyak. Its use for Komi
extended into the seventeenth century.
Shavian is a phonemic alphabet invented in the 1950s to write English. It was used to pub-
lish one book in 1962, but remains of some current interest.
8.1 Linear A
Linear A: U+10600–U+1077F
The Linear A script was used from approximately 1700–1450 bce. It was mainly used on
the island of Crete and surrounding areas to write a language which has not yet been iden-
tified. Unlike the later Linear B, which was used to write an early form of Greek, Linear A
appears on a variety of media, such as clay tablets, stone offering tables, gold and silver hair
pins, and pots.
Encoding. The repertoire of characters in the Unicode encoding of the Linear A script is
broadly based on the GORILA catalog by Godart and Olivier (1976–1985), which is the
basic set of signs used in decipherment efforts. All simple signs in that catalog are encoded
as single characters. Composite signs consisting of vertically stacked parts or touching
pieces are also encoded as single characters. Composite signs in the catalog which consist
of side-by-side pieces that are not touching are treated as digraphs; the parts are individu-
ally encoded as characters, but the composite sign is not separately encoded.
Structure. Linear A contains more than ninety syllabic signs in regular use and a host of
logograms. Some Linear A signs are also found in Linear B, although about 80% of the
logograms in Linear A do not appear in Linear B.
Character Names. The Linear A character names are based on the GORILA catalog num-
bers.
Directionality. Linear A was written from left to right, though occasionally it appears right
to left and, rarely, boustrophedon.
Numbers. Numbers in Linear A inscriptions are represented by characters in the Aegean
Numbers block. Numbers are usually arranged in sets of five or fewer that are stacked ver-
tically. The largest number recorded is 3,000. Linear A seems to use a series of unit frac-
tions. Seven fractions are regularly used and are included in the Linear A block.
8.2 Linear B
Linear B Syllabary: U+10000–U+1007F
The Linear B script is a syllabic writing system that was used on the island of Crete and
parts of the nearby mainland to write the oldest recorded variety of the Greek language.
Linear B clay tablets predate Homeric Greek by some 700 years; the latest tablets date from
the mid- to late thirteenth century bce. Major archaeological sites include Knossos, first
uncovered about 1900 by Sir Arthur Evans, and a major site near Pylos. The majority of
currently known inscriptions are inventories of commodities and accounting records.
Early attempts to decipher the script failed until Michael Ventris, an architect and amateur
decipherer, came to the realization that the language might be Greek and not, as previously
thought, a completely unknown language. Ventris worked together with John Chadwick,
and decipherment proceeded quickly. The two published a joint paper in 1953.
Linear B was written from left to right with no nonspacing marks. The script mainly con-
sists of phonetic signs representing the combination of a consonant and a vowel. There are
about 60 known phonetic signs, in addition to a few signs that seem to be mainly free vari-
ants (also known as Chadwick’s optional signs), a few unidentified signs, numerals, and a
number of ideographic signs, which were used mainly as counters for commodities. Some
ligatures formed from combinations of syllables were apparently used as well. Chadwick
gives several examples of these ligatures, the most common of which are included in the
Unicode Standard. Other ligatures are the responsibility of the rendering system.
Standards. The catalog numbers used in the Unicode character names for Linear B sylla-
bles are based on the Wingspread Convention, as documented in Bennett (1964). The let-
ter “B” is prepended arbitrarily, so that name parts will not start with a digit, thus
conforming to ISO/IEC 10646 naming rules. The same naming conventions, using catalog
numbers based on the Wingspread Convention, are used for Linear B ideograms.
subunits; the system of weights retains its own unique subunits. Though several of the
signs originate in Linear A, the measuring system of Linear B differs from that of Linear A.
Linear B relies on units and subunits, much like the imperial “quart,” “pint,” and “cup,”
whereas Linear A uses whole numbers and fractions. The absolute values of the measure-
ments have not yet been completely agreed upon.
Lydian is a simple alphabetic script of 26 letters. The vast majority of Lydian texts have
right-to-left directionality (the default direction); a very few texts are left-to-right and one
is boustrophedon. Most Lydian texts use U+0020 space as a word divider. Rare examples
have been found which use scriptio continua or which use dots to separate the words. In the
latter case, U+003A colon and U+00B7 middle dot (or U+2E31 word separator mid-
dle dot) can be used to represent the dots. U+1093F lydian triangular mark is
thought to indicate quotations, and is mirrored according to text directionality.
with a following -e sound, and liquids and sibilants (which can be pronounced more or less
on their own) were pronounced with a leading e- sound (so [k], [d] became [ke:], [de:],
while [l], [m] became [el], [em]). It is these names, according to Sampson, which were bor-
rowed by the Romans when they took their script from the Etruscans.
Directionality. Most Etruscan texts from the seventh to sixth centuries bce were written
from right-to-left, but left-to-right was not uncommon, and is found in approximately ten
percent of the texts from this period. From the fifth to the first centuries bce, right-to-left
was the standard, and left-to-right directionality was extremely rare. The other local variet-
ies of Old Italic also generally have right-to-left directionality. Boustrophedon appears
rarely, and not especially early (for instance, the Forum inscription dates to 550–500 bce).
Despite this, for reasons of implementation simplicity, many scholars prefer left-to-right
presentation of texts, as this is also their practice when transcribing the texts into Latin
script. Accordingly, the Old Italic script has a default directionality of strong left-to-right in
this standard. If the default directionality of the script is overridden to produce a right-to-
left presentation, the glyphs in Old Italic fonts should also be mirrored from the represen-
tative glyphs shown in the code charts. This kind of behavior is not uncommon in archaic
scripts; for example, archaic Greek letters may be mirrored when written from right to left
in boustrophedon.
Punctuation. The earliest inscriptions are written with no space between words in what is
called scriptio continua. There are numerous Etruscan inscriptions with dots separating
word forms, attested as early as the second quarter of the seventh century bce. This punc-
tuation is sometimes, but only rarely, used to mark certain types of syllables and not to sep-
arate words. From the sixth century bce, words were often separated by one, two, or three
dots spaced vertically above each other.
Numerals. Etruscan numerals are not well attested in the available materials, but are
employed in the same fashion as Roman numerals. Several additional numerals are
attested, but as their use is at present uncertain, they are not yet encoded in the Unicode
Standard.
Glyphs. The default glyphs in the code charts are based on the most common shapes found
for each letter. Most of these are similar to the Marsiliana abecedary (mid-seventh century
bce). Note that the phonetic values for U+10317 old italic letter eks [ks] and U+10319
old italic letter khe [kh] show the influence of western, Euboean Greek; eastern Greek
has U+03A7 greek capital letter chi [kh] and U+03A8 greek capital letter psi [ps]
instead.
The geographic distribution of the Old Italic script is shown in Figure 8-1. In the figure, the
approximate distribution of the ancient languages that used Old Italic alphabets is shown
in white. Areas for the ancient languages that used other scripts are shown in gray, and the
labels for those languages are shown in italics. In particular, note that the ancient Greek
colonies of the southern Italian and Sicilian coasts used the Greek script proper. Rome, of
course, is shown in gray, because Latin was written with the Latin alphabet, now encoded
in the Latin script.
[Figure 8-1. Distribution of the Old Italic script. Map labels: Raetic, Cisalpine Celtic, Venetic, Etruscan, N. Picene, S. Picene, Central Sabellian and Umbrian languages, Ligurian, Oscan, Messapic, Faliscan, Latin (Rome), Volscian, Greek, Elymian, Sicanian, Siculan]
8.6 Runic
Runic: U+16A0–U+16FF
The Runic script was historically used to write the languages of the early and medieval soci-
eties in the German, Scandinavian, and Anglo-Saxon areas. Use of the Runic script in vari-
ous forms covers a period from the first century to the nineteenth century. Some 6,000
Runic inscriptions are known. They form an indispensable source of information about the
development of the Germanic languages.
The Runic script is an historical script, whose most important use today is in scholarly and
popular works about the old Runic inscriptions and their interpretation. The Runic script
illustrates many technical problems that are typical for this kind of script. Unlike many
other scripts in the Unicode Standard, which predominantly serve the needs of the modern
user community—with occasional extensions for historic forms—the encoding of the
Runic script attempts to suit the needs of texts from different periods of time and from dis-
tinct societies that had little contact with one another.
The Runic Alphabet. Present-day knowledge about runes is incomplete. The set of graphe-
mically distinct units shows greater variation in its graphical shapes than most modern
scripts. The Runic alphabet changed several times during its history, both in the number
and the shapes of the letters contained in it. The shapes of most runes can be related to some
Latin capital letter, but not necessarily to a letter representing the same sound. The most
conspicuous difference between the Latin and the Runic alphabets is the order of the letters.
The Runic alphabet is known as the futhark from the name of its first six letters. The origi-
nal old futhark contained 24 runes:
ᚠ ᚢ ᚦ ᚨ ᚱ ᚲ ᚷ ᚹ ᚺ ᚾ ᛁ ᛃ ᛇ ᛈ ᛉ ᛊ ᛏ ᛒ ᛖ ᛗ ᛚ ᛜ ᛞ ᛟ
They are usually transliterated in this way:
f u þ a r k g w h n i j ï p z s t b e m l ŋ d o
In England and Friesland, seven more runes were added from the fifth to the ninth cen-
tury.
In the Scandinavian countries, the futhark changed in a different way; in the eighth cen-
tury, the simplified younger futhark appeared. It consists of only 16 runes, some of which
are used in two different forms. The long-branch form is shown here:
ᚠ ᚢ ᚦ ᚬ ᚱ ᚴ ᚼ ᚾ ᛁ ᛅ ᛋ ᛏ ᛒ ᛘ ᛚ ᛦ
f u þ o r k h n i a s t b m l ʀ
The use of runes continued in Scandinavia during the Middle Ages. During that time, the
futhark was influenced by the Latin alphabet and new runes were invented so that there
was full correspondence with the Latin letters.
Direction. Like other early writing systems, runes could be written either from left to right
or from right to left, or moving first in one direction and then the other (boustrophedon),
or following the outlines of the inscribed object. At times, characters appear in mirror
image, or upside down, or both. In modern scholarly literature, Runic is written from left
to right. Therefore, the letters of the Runic script have a default directionality of strong left-
to-right in this standard.
Representative Glyphs. The known inscriptions can include considerable variations of
shape for a given rune, sometimes to the point where the nonspecialist will mistake the
shape for a different rune. There is no dominant main form for some runes, particularly for
many runes added in the Anglo-Frisian and medieval Nordic systems. When transcribing
a Runic inscription into its Unicode-encoded form, one cannot rely on the idealized repre-
sentative glyph shape in the character charts alone. One must take into account to which of
the four Runic systems an inscription belongs and be knowledgeable about the permitted
form variations within each system. The representative glyphs were chosen to provide an
image that distinguishes each rune visually from all other runes in the same system. For
actual use, it might be advisable to use a separate font for each Runic system. Of particular
note is the fact that the glyph for U+16C4 runic letter ger is actually a rare form, as
the more common form is already used for U+16E1 runic letter ior.
Unifications. When a rune in an earlier writing system evolved into several different runes
in a later system, the unification of the earlier rune with one of the later runes was based on
similarity in graphic form rather than similarity in sound value. In cases where a substan-
tial change in the typical graphical form has occurred, though the historical continuity is
undisputed, unification has not been attempted. When runes from different writing sys-
tems have the same graphic form but different origins and denote different sounds, they
have been coded as separate characters.
Long-Branch and Short-Twig. Two sharply different graphic forms, the long-branch and
the short-twig form, were used for 9 of the 16 Viking Age Nordic runes. Although only one
form is used in a given inscription, there are runologically important exceptions. In some
cases, the two forms were used to convey different meanings in later use in the medieval
system. Therefore the two forms have been separated in the Unicode Standard.
Staveless Runes. Staveless runes are a third form of the Viking Age Nordic runes, a kind of
Runic shorthand. The number of known inscriptions is small and the graphic forms of
many of the runes show great variability between inscriptions. For this reason, staveless
runes have been unified with the corresponding Viking Age Nordic runes. The corre-
sponding Viking Age Nordic runes must be used to encode these characters—specifically
the short-twig characters, where both short-twig and long-branch characters exist.
Punctuation Marks. The wide variety of Runic punctuation marks has been reduced to
three distinct characters based on simple aspects of their graphical form, as very little is
known about any difference in intended meaning between marks that look different. Any
other punctuation marks have been unified with shared punctuation marks elsewhere in
the Unicode Standard.
Golden Numbers. Runes were used as symbols for Sunday letters and golden numbers on
calendar staves used in Scandinavia during the Middle Ages. To complete the number series
1–19, three more calendar runes were added. They are included after the punctuation marks.
Encoding. The order of the Runic characters follows the traditional futhark order, with
variants and derived runes being inserted directly after the corresponding ancestor.
Runic character names are based as much as possible on the sometimes several traditional
names for each rune, often with the Latin transliteration at the end of the name.
8.8 Gothic
Gothic: U+10330–U+1034F
The Gothic script was devised in the fourth century by the Gothic bishop, Wulfila (311–
383 ce), to provide his people with a written language and a means of reading his transla-
tion of the Bible. Written Gothic materials are largely restricted to fragments of Wulfila’s
translation of the Bible; these fragments are of considerable importance in New Testament
textual studies. The chief manuscript, kept at Uppsala, is the Codex Argenteus or “the Silver
Book,” which is partly written in gold on purple parchment. Gothic is an East Germanic
language; this branch of Germanic has died out and thus the Gothic texts are of great
importance in historical and comparative linguistics. Wulfila appears to have used the
Greek script as a source for the Gothic, as can be seen from the basic alphabetical order.
Some of the character shapes suggest Runic or Latin influence, but this is apparently coin-
cidental.
Diacritics. The tenth letter U+10339 gothic letter eis is used with U+0308 combining
diaeresis when word-initial, when syllable-initial after a vowel, and in compounds with a
verb as second member as shown below:
𐍃𐍅𐌴 𐌲𐌰𐌼𐌴𐌻𐌹𐌸 𐌹̈𐍃𐍄 𐌹̈𐌽 𐌴𐍃𐌰𐌹̈𐌹̈𐌽 𐍀𐍂𐌰𐌿𐍆𐌴𐍄𐌰𐌿
swe gameliþ ïst ïn esaïïn praufetau
“as is written in Isaiah the prophet”
To indicate contractions or omitted letters, U+0305 combining overline is used.
Numerals. Gothic letters, like those of other early Western alphabets, can be used as num-
bers; two of the characters have only a numeric value and are not used alphabetically. To
indicate numeric use of a letter, it is either flanked on both sides by U+00B7 middle dot
or followed by both U+0304 combining macron and U+0331 combining macron
below, as shown in the following example:
[Example: the Gothic letter with numeric value five, either flanked by middle dots or written with macron above and macron below, means “5”]
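The two conventions can be written out as plain code point sequences. In the Python sketch below, the choice of U+10334 gothic letter aihvus as the letter with numeric value five is an assumption (Gothic numeric values follow the Greek model); only the marking mechanism is taken from the text above.

# Assumption: U+10334 stands in for the Gothic letter with numeric value five.
letter_five = "\U00010334"

flanked = "\u00B7" + letter_five + "\u00B7"         # flanked by middle dots
marked  = letter_five + "\u0304" + "\u0331"         # combining macron + combining macron below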
Punctuation. Gothic manuscripts are written with no space between words in what is
called scriptio continua. Sentences and major phrases are often separated by U+0020 space,
U+00B7 middle dot, or U+003A colon.
8.9 Elbasan
Elbasan: U+10500–U+1052F
The earliest alphabet devised for the Albanian language was created around 1750 for the
Elbasan Gospel manuscript, which is the only known example of the script. This manu-
script, preserved at the State Archives in Tirana, records the earliest-known Albanian-lan-
guage text in an original alphabet. Most of the letters in the Elbasan alphabet seem to be
new creations, although some of their shapes may have been influenced by Greek and
Cyrillic.
Structure. Elbasan is a simple alphabetic script written from left to right horizontally. The
alphabet consists of forty letters.
Three characters have an inherent diacritical dot: U+10505 elbasan letter nde is used
to indicate a pre-nasalized U+10504 elbasan letter de /d/; U+10511 elbasan letter lle
is used to indicate a geminate U+10510 elbasan letter le /l/; U+1051A elbasan letter
rre is used to indicate a geminate U+10519 elbasan letter re /r/. In many cases the dot
on nde is written like a small ne. In one instance in the manuscript gje is written with a
dot above to indicate a prenasalized sound.
Two different letters are used for /n/: U+10513 elbasan letter ne is used generally, and
U+10514 elbasan letter na is typically used in prenasalized position. Two letters, which
are rare and appear in Greek loanwords, are used for /y/, U+10525 elbasan letter ghe
and U+10526 elbasan letter ghamma.
Accents and Other Marks. The Elbasan manuscript contains breathing accents, similar to
those used in Greek. Those accents do not appear regularly in the orthography and have
not been fully analyzed yet. Raised vertical marks also appear in the manuscript, but are not
specific to the script. Generic combining characters from the Combining Diacritical Marks
block can be used to render these accents and other marks.
Names. The names used for the characters in the Elbasan block are based on those of the
modern Albanian alphabet.
Numerals and Punctuation. There are no script-specific numerals or punctuation marks.
A separating dot and spaces appear in the Elbasan manuscript, and may be rendered with
U+00B7 middle dot and U+0020 space, respectively. For numerals, a Greek-like system
of letter and combining overline is in use. Overlines also appear above certain letters in
abbreviations, such as Z to indicate Zot (Lord). The overlines in numerals and abbrevia-
tions can be represented with U+0305 combining overline.
Numerals. Script-specific numerals are not known. Letters of the alphabet can be marked
with U+0483 combining cyrillic titlo to indicate numeric use.
Punctuation. Old Permic does not have any script-specific punctuation, but uses middle
dot, colon, and apostrophe. Spaces are used to separate words in manuscripts.
8.12 Ogham
Ogham: U+1680–U+169F
Ogham is an alphabetic script devised to write a very early form of Irish. Monumental
Ogham inscriptions are found in Ireland, Wales, Scotland, England, and on the Isle of
Man. Many of the Scottish inscriptions are undeciphered and may be in Pictish. It is prob-
able that Ogham (Old Irish “Ogam”) was widely written in wood in early times. The main
flowering of “classical” Ogham, rendered in monumental stone, was in the fifth and sixth
centuries ce. Such inscriptions were mainly employed as territorial markers and memori-
als; the more ancient examples are standing stones.
The script was originally written along the edges of stone where two faces meet; when writ-
ten on paper, the central “stemlines” of the script can be said to represent the edge of the
stone. Inscriptions written on stemlines cut into the face of the stone, instead of along its
edge, are known as “scholastic” and are of a later date (post-seventh century). Notes were
also commonly written in Ogham in manuscripts as recently as the sixteenth century.
Structure. The Ogham alphabet consists of 26 distinct characters (feda), the first 20 of
which are considered to be primary and the last 6 (forfeda) supplementary. The four pri-
mary series are called aicmí (plural of aicme, meaning “family”). Each aicme was named
after its first character, (Aicme Beithe, Aicme Uatha, meaning “the B Family,” “the H Fam-
ily,” and so forth). The character names used in this standard reflect the spelling of the
names in modern Irish Gaelic, except that the acute accent is stripped from Úr, Éabhadh,
Ór, and Ifín, and the mutation of nGéadal is not reflected.
Rendering. Ogham text is read beginning from the bottom left side of a stone, continuing
upward, across the top, and down the right side (in the case of long inscriptions). Monu-
mental Ogham was incised chiefly in a bottom-to-top direction, though there are examples
of left-to-right bilingual inscriptions in Irish and Latin. Manuscript Ogham accommo-
dated the horizontal left-to-right direction of the Latin script, and the vowels were written
as vertical strokes as opposed to the incised notches of the inscriptions. Ogham should
therefore be rendered on computers from left to right or from bottom to top (never starting
from top to bottom).
Forfeda (Supplementary Characters). In printed and in manuscript Ogham, the fonts are
conventionally designed with a central stemline, but this convention is not necessary. In
implementations without the stemline, the character U+1680 ogham space mark should
be given its conventional width and simply left blank like U+0020 space. U+169B ogham
feather mark and U+169C ogham reversed feather mark are used at the beginning
and the end of Ogham text, particularly in manuscript Ogham. In some cases, only the
Ogham feather mark is used, which can indicate the direction of the text.
The word latheirt ᚛ᚂᚐᚈᚆᚓᚔᚏᚈ᚜ shows the use of the feather marks. This word was writ-
ten in the margin of a ninth-century Latin grammar and means “massive hangover,” which
may be the scribe’s apology for any errors in his text.
8.13 Shavian
Shavian: U+10450–U+1047F
The playwright George Bernard Shaw (1856–1950) was an outspoken critic of the idiosyn-
crasies of English orthography. In his will, he directed that Britain’s Public Trustee seek out
and publish an alphabet of no fewer than 40 letters to provide for the phonetic spelling of
English. The alphabet finally selected was designed by Kingsley Read and is variously
known as Shavian, Shaw’s alphabet, and the Proposed British Alphabet. Also in accordance
with Shaw’s will, an edition of his play, Androcles and the Lion, was published and distrib-
uted to libraries, containing the text both in the standard Latin alphabet and in Shavian.
As with other attempts at spelling reform in English, the alphabet has met with little suc-
cess. Nonetheless, it has its advocates and users. The normative version of Shavian is taken
to be the version in Androcles and the Lion.
Structure. The alphabet consists of 48 letters and 1 punctuation mark. The letters have no
case. The digits and other punctuation marks are the same as for the Latin script. The one
additional punctuation mark is a “name mark,” used to indicate proper nouns. U+00B7
middle dot should be used to represent the “name mark.” The letter names are intended
to be indicative of their sounds; thus the sound /p/ is represented by U+10450 shavian
letter peep.
The first 40 letters are divided into four groups of 10. The first 10 and second 10 are 180-
degree rotations of one another; the letters of the third and fourth groups often show a sim-
ilar relationship of shape.
The first 10 letters are tall letters, which ascend above the x-height and generally represent
unvoiced consonants. The next 10 letters are “deep” letters, which descend below the base-
line and generally represent voiced consonants. The next 20 are the vowels and liquids.
Again, each of these letters usually has a close phonetic relationship to the letter in its
matching set of 10.
The remaining 8 letters are technically ligatures, the first 6 involving vowels plus /r/.
Because ligation is not optional, these 8 letters are included in the encoding.
Collation. The problem of collation is not addressed by the alphabet’s designers.
Chapter 9
Middle East-I
Modern and Liturgical Scripts
The scripts in this chapter have a common origin in the ancient Phoenician alphabet. They
include:
The Hebrew script is used in Israel and for languages of the Diaspora. The Arabic script is
used to write many languages throughout the Middle East, North Africa, and certain parts
of Asia. The Syriac script is used to write a number of Middle Eastern languages. These
three also function as major liturgical scripts, used worldwide by various religious groups.
The Samaritan script is used in small communities in Israel and the Palestinian Territories
to write the Samaritan Hebrew and Samaritan Aramaic languages. The Mandaic script was
used in southern Mesopotamia in classical times for liturgical texts by adherents of the
Mandaean gnostic religion. The Classical Mandaic and Neo-Mandaic languages are still in
limited current use in modern Iran and Iraq and in the Mandaean diaspora.
The Middle Eastern scripts are mostly abjads, with small character sets. Words are demar-
cated by spaces. These scripts include a number of distinctive punctuation marks. In addi-
tion, the Arabic script includes traditional forms for digits, called “Arabic-Indic digits” in
the Unicode Standard.
Text in these scripts is written from right to left. Implementations of these scripts must
conform to the Unicode Bidirectional Algorithm (see Unicode Standard Annex #9, “Uni-
code Bidirectional Algorithm”). For more information about writing direction, see
Section 2.10, Writing Direction. There are also special security considerations that apply to
bidirectional scripts, especially with regard to their use in identifiers. For more information
about these issues, see Unicode Technical Report #36, “Unicode Security Considerations.”
Arabic, Syriac and Mandaic are cursive scripts even when typeset, unlike Hebrew and
Samaritan, where letters are unconnected. Most letters in Arabic, Syriac and Mandaic
assume different forms depending on their position in a word. Shaping rules for the ren-
dering of text are specified in Section 9.2, Arabic, Section 9.3, Syriac and Section 9.5, Man-
daic. Shaping rules are not required for Hebrew because only five letters have position-
dependent final forms, and these forms are separately encoded.
Historically, Middle Eastern scripts did not write short vowels. Nowadays, short vowels are
represented by marks positioned above or below a consonantal letter. Vowels and other
pronunciation (“vocalization”) marks are encoded as combining characters, so support for
vocalized text necessitates use of composed character sequences. Yiddish and Syriac are
normally written with vocalization; Hebrew, Samaritan, and Arabic are usually written
unvocalized.
9.1 Hebrew
Hebrew: U+0590–U+05FF
The Hebrew script is used for writing the Hebrew language as well as Yiddish, Judezmo
(Ladino), and a number of other languages. Vowels and various other marks are written as
points, which are applied to consonantal base letters; these marks are usually omitted in
Hebrew, except for liturgical texts and other special applications. Five Hebrew letters
assume a different graphic form when they occur last in a word.
Directionality. The Hebrew script is written from right to left. Conformant implementa-
tions of Hebrew script must use the Unicode Bidirectional Algorithm (see Unicode Stan-
dard Annex #9, “Unicode Bidirectional Algorithm”).
Cursive. The Unicode Standard uses the term cursive to refer to writing where the letters of
a word are connected. A handwritten form of Hebrew is known as cursive, but its rounded
letters are generally unconnected, so the Unicode definition does not apply. Fonts based on
cursive Hebrew exist. They are used not only to show examples of Hebrew handwriting,
but also for display purposes.
Standards. ISO/IEC 8859-8—Part 8. Latin/Hebrew Alphabet. The Unicode Standard
encodes the Hebrew alphabetic characters in the same relative positions as in ISO/IEC
8859-8; however, there are no points or Hebrew punctuation characters in that ISO stan-
dard.
Vowels and Other Pronunciation Marks. These combining marks, generically called
points in the context of Hebrew, indicate vowels or other modifications of consonantal let-
ters. General rules for applying combining marks are given in Section 2.11, Combining
Characters, and Section 3.6, Combination. Additional Hebrew-specific behavior is
described below.
Hebrew points can be separated into four classes: dagesh, shin dot and sin dot, vowels, and
other pronunciation marks.
Dagesh, U+05BC hebrew point dagesh or mapiq, has the form of a dot that appears
inside the letter that it affects. It is not a vowel but rather a diacritic that affects the pronun-
ciation of a consonant. The same base consonant can also have a vowel and/or other dia-
critics. Dagesh is the only element that goes inside a letter.
The dotted Hebrew consonant shin is explicitly encoded as the sequence U+05E9 hebrew
letter shin followed by U+05C1 hebrew point shin dot. The shin dot is positioned on
the upper-right side of the undotted base letter. Similarly, the dotted consonant sin is
explicitly encoded as the sequence U+05E9 hebrew letter shin followed by U+05C2
hebrew point sin dot. The sin dot is positioned on the upper-left side of the base letter.
The two dots are mutually exclusive. The base letter shin can also have a dagesh, a vowel,
and other diacritics. The two dots are not used with any other base character.
Vowels all appear below the base character that they affect, except for holam, U+05B9
hebrew point holam, which appears above left. The following points represent vowels:
U+05B0..U+05BB, and U+05C7.
The remaining three points are pronunciation marks: U+05BD hebrew point meteg,
U+05BF hebrew point rafe, and U+FB1E hebrew point judeo-spanish varika.
Meteg, also known as siluq, goes below the base character; rafe and varika go above it. The
varika, used in Judezmo, is a glyphic variant of rafe.
Shin and Sin. Separate characters for the dotted letters shin and sin are not included in this
block. When it is necessary to distinguish between the two forms, they should be encoded
as U+05E9 hebrew letter shin followed by the appropriate dot, either U+05C1 hebrew
point shin dot or U+05C2 hebrew point sin dot. (See preceding discussion.) This
practice is consistent with Israeli standard encoding.
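A short Python sketch (standard library unicodedata; illustrative only) confirms that the precomposed presentation forms for dotted shin and sin normalize to exactly these recommended sequences:

import unicodedata

shin_dotted = "\u05E9\u05C1"              # <shin, shin dot>
sin_dotted  = "\u05E9\u05C2"              # <shin, sin dot>

# U+FB2A and U+FB2B (Alphabetic Presentation Forms) decompose canonically
# to the recommended sequences.
assert unicodedata.normalize("NFD", "\uFB2A") == shin_dotted
assert unicodedata.normalize("NFD", "\uFB2B") == sin_dotted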
Final (Contextual Variant) Letterforms. Variant forms of five Hebrew letters are encoded
as separate characters in this block, as in Hebrew standards including ISO/IEC 8859-8.
These variant forms are generally used in place of the nominal letterforms at the end of
words. Certain words, however, are spelled with nominal rather than final forms, particu-
larly names and foreign borrowings in Hebrew and some words in Yiddish. Because final
form usage is a matter of spelling convention, software should not automatically substitute
final forms for nominal forms at the end of words. The positional variants should be coded
directly and rendered one-to-one via their own glyphs—that is, without contextual analy-
sis.
Yiddish Digraphs. The digraphs are considered to be independent characters in Yiddish.
The Unicode Standard has included them as separate characters so as to distinguish cer-
tain letter combinations in Yiddish text—for example, to distinguish the digraph double
vav from an occurrence of a consonantal vav followed by a vocalic vav. The use of digraphs
is consistent with standard Yiddish orthography. Other letters of the Yiddish alphabet,
such as pasekh alef, can be composed from other characters, although alphabetic presenta-
tion forms are also encoded.
Punctuation. Most punctuation marks used with the Hebrew script are not given indepen-
dent codes (that is, they are unified with Latin punctuation) except for the few cases where
the mark has a unique form in Hebrew—namely, U+05BE hebrew punctuation maqaf,
U+05C0 hebrew punctuation paseq (also known as legarmeh), U+05C3 hebrew punc-
tuation sof pasuq, U+05F3 hebrew punctuation geresh, and U+05F4 hebrew punc-
tuation gershayim. For paired punctuation such as parentheses, the glyphs chosen to
represent U+0028 left parenthesis and U+0029 right parenthesis will depend on the
direction of the rendered text. See Section 4.7, Bidi Mirrored, for more information. For
additional punctuation to be used with the Hebrew script, see Section 6.2, General Punctu-
ation.
Cantillation Marks. Cantillation marks are used in publishing liturgical texts, including
the Bible. There are various historical schools of cantillation marking; the set of marks
included in the Unicode Standard follows the Israeli standard SI 1311.2.
Positioning. Marks may combine with vowels and other points, and complex typographic
rules dictate how to position these combinations.
The vertical placement (meaning above, below, or inside) of points and marks is very well
defined. The horizontal placement (meaning left, right, or center) of points is also very well
defined. The horizontal placement of marks, by contrast, is not well defined, and conven-
tion allows for the different placement of marks relative to their base character.
When points and marks are located below the same base letter, the point always comes first
(on the right) and the mark after it (on the left), except for the marks yetiv, U+059A
hebrew accent yetiv, and dehi, U+05AD hebrew accent dehi. These two marks come
first (on the right) and are followed (on the left) by the point.
These rules are followed when points and marks are located above the same base letter:
• If the point is holam, all cantillation marks precede it (on the right) except
pashta, U+0599 hebrew accent pashta.
• Pashta always follows (goes to the left of ) points.
• Holam on a sin consonant (shin base + sin dot) follows (goes to the left of ) the
sin dot. However, the two combining marks are sometimes rendered as a single
assimilated dot.
• Shin dot and sin dot are generally represented closer vertically to the base letter
than other points and marks that go above it.
Meteg. Meteg, U+05BD hebrew point meteg, frequently co-occurs with vowel points
below the consonant. Typically, meteg is placed to the left of the vowel, although in some
manuscripts and printed texts it is positioned to the right of the vowel. The difference in
positioning is not known to have any semantic significance; nevertheless, some authors
wish to retain the positioning found in source documents.
The alternate vowel-meteg ordering can be represented in terms of alternate ordering of
characters in encoded representation. However, because of the fixed-position canonical
combining classes to which meteg and vowel points are assigned, differences in ordering of
such characters are not preserved under normalization. The combining grapheme joiner
can be used within a vowel-meteg sequence to preserve an ordering distinction under nor-
malization. For more information, see the description of U+034F combining grapheme
joiner in Section 23.2, Layout Controls.
For example, to display meteg to the left of (after, for a right-to-left script) the vowel point
sheva, U+05B0 hebrew point sheva, the sequence of meteg following sheva can be used:
<sheva, meteg>
Because these marks are canonically ordered, this sequence is preserved under normaliza-
tion. Then, to display meteg to the right of the sheva, the sequence with meteg preceding
sheva with an intervening CGJ can be used:
<meteg, CGJ, sheva>
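The effect of the fixed-position combining classes, and of CGJ as a blocker, can be checked with a normalization library. A minimal Python sketch using the standard library's unicodedata module; the base letter bet is an arbitrary choice for illustration.

import unicodedata

bet, sheva, meteg, cgj = "\u05D1", "\u05B0", "\u05BD", "\u034F"

# <sheva, meteg> is already in canonical order (ccc 10 < 22) and is preserved.
left_meteg = bet + sheva + meteg
assert unicodedata.normalize("NFC", left_meteg) == left_meteg

# Without CGJ, <meteg, sheva> is reordered by normalization to <sheva, meteg> ...
assert unicodedata.normalize("NFC", bet + meteg + sheva) == left_meteg

# ... but an intervening CGJ (ccc = 0) blocks reordering, preserving the distinction.
right_meteg = bet + meteg + cgj + sheva
assert unicodedata.normalize("NFC", right_meteg) == right_meteg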
A further complication arises for combinations of meteg with hataf vowels: U+05B1
hebrew point hataf segol, U+05B2 hebrew point hataf patah, and U+05B3 hebrew
point hataf qamats. These vowel points have two side-by-side components. Meteg can be
placed to the left or the right of a hataf vowel, but it also is often placed between the two
components of the hataf vowel. A three-way positioning distinction is needed for such
cases.
The combining grapheme joiner can be used to preserve an ordering that places meteg to
the right of a hataf vowel, as described for combinations of meteg with non-hataf vowels,
such as sheva.
Placement of meteg between the components of a hataf vowel can be conceptualized as a
ligature of the hataf vowel and a nominally positioned meteg. With this in mind, the liga-
tion-control functionality of U+200D zero width joiner and U+200C zero width non-
joiner can be used as a mechanism to control the visual distinction between a nominally
positioned meteg to the left of a hataf vowel versus the medially positioned meteg within
the hataf vowel. That is, zero width joiner can be used to request explicitly a medially posi-
tioned meteg, and zero width non-joiner can be used to request explicitly a left-positioned
meteg. Just as different font implementations may or may not display an “fi” ligature by
default, different font implementations may or may not display meteg in a medial position
when combined with hataf vowels by default. As a result, authors who want to ensure left-
position versus medial-position display of meteg with hataf vowels across all font imple-
mentations may use joiner characters to distinguish these cases.
Thus the following encoded representations can be used for different positioning of meteg
with a hataf vowel, such as hataf patah:
left-positioned meteg: <hataf patah, ZWNJ, meteg>
medially positioned meteg: <hataf patah, ZWJ, meteg>
right-positioned meteg: <meteg, CGJ, hataf patah>
In no case is use of ZWNJ, ZWJ, or CGJ required for representation of meteg. These recom-
mendations are simply provided for interoperability in those instances where authors wish
to preserve specific positional information regarding the layout of a meteg in text.
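Written out as code point sequences (a sketch; hataf patah, U+05B2, stands in for any of the hataf vowels):

HATAF_PATAH, METEG = "\u05B2", "\u05BD"
ZWNJ, ZWJ, CGJ = "\u200C", "\u200D", "\u034F"

meteg_left   = HATAF_PATAH + ZWNJ + METEG     # left-positioned meteg
meteg_medial = HATAF_PATAH + ZWJ + METEG      # medially positioned meteg
meteg_right  = METEG + CGJ + HATAF_PATAH      # right-positioned meteg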
Atnah Hafukh and Qamats Qatan. In some older versions of Biblical text, a distinction is
made between the accents U+05A2 hebrew accent atnah hafukh and U+05AA
hebrew accent yerah ben yomo. Many editions from the last few centuries do not retain
this distinction, using only yerah ben yomo, but some users in recent decades have begun to
reintroduce this distinction. Similarly, a number of publishers of Biblical or other religious
texts have introduced a typographic distinction for the vowel point qamats corresponding
to two different readings. The original letterform used for one reading is referred to as
qamats or qamats gadol; the new letterform for the other reading is qamats qatan. Not all
users of Biblical Hebrew use atnah hafukh and qamats qatan. If the distinction between
accents atnah hafukh and yerah ben yomo is not made, then only U+05AA hebrew accent
yerah ben yomo is used. If the distinction between vowels qamats gadol and qamats qatan
is not made, then only U+05B8 hebrew point qamats is used. Implementations that sup-
port Hebrew accents and vowel points may not necessarily support the special-usage char-
acters U+05A2 hebrew accent atnah hafukh and U+05C7 hebrew point qamats
qatan.
Holam Male and Holam Haser. The vowel point holam represents the vowel phoneme
/o/. The consonant letter vav represents the consonant phoneme /w/, but in some words is
used to represent a vowel, /o/. When the point holam is used on vav, the combination usu-
ally represents the vowel /o/, but in a very small number of cases represents the consonant-
vowel combination /wo/. A typographic distinction is made between these two in many
versions of Biblical text. In most cases, in which vav + holam together represents the vowel
/o/, the point holam is centered above the vav and referred to as holam male. In the less fre-
quent cases, in which the vav represents the consonant /w/, some versions show the point
holam positioned above left. This is referred to as holam haser. The character U+05BA
hebrew point holam haser for vav is intended for use as holam haser only in those
cases where a distinction is needed. When the distinction is made, the character U+05B9
hebrew point holam is used to represent the point holam male on vav. U+05BA hebrew
point holam haser for vav is intended for use only on vav; results of combining this
character with other base characters are not defined. Not all users distinguish between the
two forms of holam, and not all implementations can be assumed to support U+05BA
hebrew point holam haser for vav.
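Because the two holam characters are distinct and have no canonical relationship, the distinction survives normalization; a brief Python sketch (illustrative only):

import unicodedata

vav = "\u05D5"
holam_male  = vav + "\u05B9"              # point holam on vav (vowel /o/)
holam_haser = vav + "\u05BA"              # point holam haser for vav (/wo/)

assert unicodedata.normalize("NFC", holam_male) != unicodedata.normalize("NFC", holam_haser)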
Puncta Extraordinaria. In the Hebrew Bible, dots are written in various places above or
below the base letters that are distinct from the vowel points and accents. These dots are
referred to by scholars as puncta extraordinaria, and there are two kinds. The upper punc-
tum, the more common of the two, has been encoded since Unicode 2.0 as U+05C4
hebrew mark upper dot. The lower punctum is used in only one verse of the Bible, Psalm
27:13, and is encoded as U+05C5 hebrew mark lower dot. The puncta generally differ
in appearance from dots that occur above letters used to represent numbers; the number
dots should be represented using U+0307 combining dot above and U+0308 combining
diaeresis.
Nun Hafukha. The nun hafukha is a special symbol that appears to have been used for
scribal annotations, although its exact functions are uncertain. It is used a total of nine
times in the Hebrew Bible, although not all versions include it, and there are variations in
the exact locations in which it is used. There is also variation in the glyph used: it often has
the appearance of a rotated or reversed nun and is very often called inverted nun; it may
also appear similar to a half tet or have some other form.
Currency Symbol. The new sheqel sign (U+20AA) is encoded in the Currency Symbols block.
U+FB29 hebrew letter alternative plus sign is used more often in handwriting than in print, but it does occur
in school textbooks. It is used by those who wish to avoid cross symbols, which can have
religious and historical connotations.
U+FB20 hebrew letter alternative ayin is an alternative form of ayin that may replace
the basic form U+05E2 hebrew letter ayin when there is a diacritical mark below it. The
basic form of ayin is often designed with a descender, which can interfere with a mark
below the letter. U+FB20 is encoded for compatibility with implementations that substi-
tute the alternative form in the character data, as opposed to using a substitute glyph at
rendering time.
Use of Wide Letters. Wide letterforms are used in handwriting and in print to achieve even
margins. The wide-form letters in the Unicode Standard are those that are most commonly
“stretched” in justification. If Hebrew text is to be rendered with even margins, justification
should be left to the text-formatting software.
These alphabetic presentation forms are included for compatibility purposes. For the pre-
ferred encoding, see the Hebrew presentation forms, U+FB1D..U+FB4F.
For letterlike symbols, see U+2135..U+2138.
9.2 Arabic
Arabic: U+0600–U+06FF
The Arabic script is used for writing the Arabic language and has been extended to repre-
sent a number of other languages, such as Persian, Urdu, Pashto, Sindhi, and Uyghur, as
well as many African languages. Urdu is often written with the ornate Nastaliq script vari-
ety. Some languages, such as Indonesian/Malay, Turkish, and Ingush, formerly used the
Arabic script but now employ the Latin or Cyrillic scripts. Other languages, such as Kurd-
ish, Azerbaijani, Kazakh, and Uzbek have competing Arabic and Latin or Cyrillic orthogra-
phies in different countries.
The Arabic script is cursive, even in its printed form (see Figure 9-1). As a result, the same
letter may be written in different forms depending on how it joins with its neighbors. Vow-
els and various other marks may be written as combining marks called tashkil, which are
applied to consonantal base letters. In normal writing, however, these marks are omitted.
[Figure 9-1. Arabic text: memory representation, after reordering, and after joining]
Directionality. The Arabic script is written from right to left. Conformant implementa-
tions of Arabic script must use the Unicode Bidirectional Algorithm to reorder the memory
representation for display (see Unicode Standard Annex #9, “Unicode Bidirectional Algo-
rithm”).
Standards. ISO/IEC 8859-6—Part 6. Latin/Arabic Alphabet. The Unicode Standard
encodes the basic Arabic characters in the same relative positions as in ISO/IEC 8859-6.
ISO/IEC 8859-6, in turn, is based on ECMA-114, which was based on ASMO 449.
Encoding Principles. The basic set of Arabic letters is well defined. Each letter receives only
one Unicode character value in the basic Arabic block, no matter how many different con-
textual appearances it may exhibit in text. Each Arabic letter in the Unicode Standard may
be said to represent the inherent semantic identity of the letter. A word is spelled as a
sequence of these letters. The representative glyph shown in the Unicode character chart
for an Arabic letter is usually the form of the letter when standing by itself. It is simply used
to distinguish and identify the character in the code charts and does not restrict the glyphs
used to represent it. See “Arabic Cursive Joining,” “Arabic Ligatures,” and “Arabic Joining
Groups” in the following text for an extensive discussion of how cursive joining and posi-
tional variants of Arabic letters are handled by the Unicode Standard.
The following principles guide the encoding of the various types of marks which are
applied to the basic Arabic letter skeletons:
1. Ijam: Diacritical marks applied to basic letter forms to derive new (usually
consonant) letters for extended Arabic alphabets are not separately encoded as
combining marks. Instead, each letter plus ijam combination is encoded as a
separate, atomic character. These letter plus ijam characters are never given
decompositions in the standard. Ijam generally take the form of one-, two-,
three- or four-dot markings above or below the basic letter skeleton, although
other diacritic forms occur in extensions of the Arabic script in Central and
South Asia and in Africa. In discussions of Arabic in Unicode, ijam are often
also referred to as nukta, because of their functional similarity to the nukta dia-
critical marks which occur in many Indic scripts.
2. Tashkil: Marks functioning to indicate vocalization of text, as well as other
types of phonetic guides to correct pronunciation, are separately encoded as
combining marks. These include several subtypes: harakat (short vowel
marks), tanwin (postnasalized or long vowel marks), shaddah (consonant gem-
ination mark), and sukun (to mark lack of a following vowel). A basic Arabic
letter plus any of these types of marks is never encoded as a separate, precom-
posed character, but must always be represented as a sequence of letter plus
combining mark. Additional marks invented to indicate non-Arabic vowels,
used in extensions of the Arabic script, are also encoded as separate combining
marks.
3. Maddah: The maddah is a particular case of a harakat mark which has excep-
tional treatment in the standard. In most modern languages using the Arabic
script, it occurs only above alef, and in that combination represents the sound
/ʔaː/. In Koranic Arabic, maddah occurs above waw or yeh to note vowel elon-
gation. For this reason, and because of the shared use of maddah between Arabic and Syr-
iac scripts, the precomposed combination U+0622 arabic letter alef with
madda above is encoded; however, the combining mark U+0653 arabic
maddah above is also encoded. U+0622 is given a canonical decomposition to
the sequence of alef followed by the combining maddah (see the example
following this list). Some historical non-
Arabic orthographies have also used maddah as an ijam. U+0653 should be
used to represent those texts.
4. Hamza: The hamza may occur above or below other letters. Its treatment in
the Unicode Standard is also exceptional and rather complex. The general prin-
ciple is that when such a hamza is used to indicate an actual glottal stop (or the
/je/ sound used in Persian and Urdu for ezafe), it should be represented with a
separate combining mark, either U+0654 arabic hamza above or U+0655
arabic hamza below. However, when the hamza mark is used as a diacritic to
derive a separate letter as an extension of the Arabic script, then the basic letter
skeleton plus the hamza mark is represented by a single, precomposed charac-
ter. See “Combining Hamza Above” later in this section for discussion of the
complications for particular characters.
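The canonical equivalence described in item 3 above can be checked informally with Python's unicodedata module; this is an illustrative sketch, not part of the specification:

    import unicodedata

    # U+0622 has a canonical decomposition to <alef, combining maddah above>,
    # so the precomposed letter and the sequence normalize to the same form.
    assert unicodedata.decomposition("\u0622") == "0627 0653"
    assert unicodedata.normalize("NFC", "\u0627\u0653") == "\u0622"
    assert unicodedata.normalize("NFD", "\u0622") == "\u0627\u0653"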
These connecting forms commonly occur in some abbreviations, such as the marker for hijri dates, which consists of an initial form of heh.
The use of a non-joiner between two letters prevents those letters from forming a cursive
connection with each other when rendered, as shown in Figure 9-3.
Examples requiring the use of a non-joiner include the Persian plural suffix, some Persian
proper names, and Ottoman Turkish vowels. This use of non-joiners is important for rep-
resentation of text in such languages, and ignoring or removing them will result in text with
a different meaning, or in meaningless text.
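A hypothetical, non-normative sketch of the Persian plural case follows; the code simply constructs the two character sequences, with and without U+200C, and the word chosen ("ketab") is only an example:

    # Hypothetical example: the Persian word "ketab" (book) followed by the plural
    # suffix "ha". With U+200C between them, the beh and the heh do not join.
    ZWNJ = "\u200C"
    ketab = "\u06A9\u062A\u0627\u0628"        # keheh, teh, alef, beh
    suffix = "\u0647\u0627"                   # heh, alef
    joined = ketab + suffix                   # rendered with an unwanted cursive join
    separated = ketab + ZWNJ + suffix         # rendered with the join suppressed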
Joiners and non-joiners may also occur in combinations. The effects of such combinations
are shown in Figure 9-4. For further discussion of joiners and non-joiners, see Section 23.2,
Layout Controls.
Tashkil Nonspacing Marks. Tashkil are marks that indicate vowels or other modifications
of consonant letters. In English, these marks are often referred to as “points.” They may
also be called harakat, although technically, harakat refers to the subset of tashkil which
denote short vowels. The code charts depict these tashkil in relation to a dotted circle, indi-
cating that this character is intended to be applied via some process to the character that
precedes it in the text stream (that is, the base character). General rules for applying nons-
pacing marks are given in Section 7.9, Combining Marks. The few marks that are placed
after (to the left of ) the base character are treated as ordinary spacing characters in the Uni-
code Standard. The Unicode Standard does not specify a sequence order in case of multi-
ple tashkil applied to the same Arabic base character. For more information about the
canonical ordering of nonspacing marks, see Section 2.11, Combining Characters, and
Section 3.11, Normalization Forms.
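As an informal illustration of canonical ordering, the tashkil have fixed-position canonical combining classes, so normalization places kasra before shadda regardless of input order. The following Python sketch is non-normative:

    import unicodedata

    # Kasra has canonical combining class 32 and shadda has class 33, so
    # canonical ordering under normalization sorts kasra before shadda.
    assert unicodedata.combining("\u0650") == 32   # ARABIC KASRA
    assert unicodedata.combining("\u0651") == 33   # ARABIC SHADDA
    src = "\u0628\u0651\u0650"                     # beh + shadda + kasra
    assert unicodedata.normalize("NFC", src) == "\u0628\u0650\u0651"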
The placement and rendering of vowel and other marks in Arabic strongly depends on the
typographical environment or even the typographical style. For example, in the Unicode
code charts, the default position of U+0651 arabic shadda is with the glyph placed
above the base character, whereas for U+064D arabic kasratan the glyph is placed
below the base character, as shown in the first example in Figure 9-5. However, computer
fonts often follow an approach that originated in metal typesetting and combine the kas-
ratan with shadda in a ligature placed above the text, as shown in the second example in
Figure 9-5.
The shapes of the various tashkil marks may also depend on the style of writing. For exam-
ple, dammatan can be written in at least three different styles:
• using a shape similar to that shown in the charts
• using two dammas, one of which is turned
• using two dammas vertically stacked
U+064C arabic dammatan can be rendered in any of those three shapes. U+08F1 arabic
open dammatan is an alternative dammatan character for use in Quran orthographies
which have two distinct forms of dammatan that convey a semantic difference.
Arabic-Indic Digits. The names for the forms of decimal digits vary widely across different
languages. The decimal numbering system originated in India (Devanagari ०१२३ …) and
was subsequently adopted in the Arabic world with a different appearance (Arabic
٠١٢٣ …). The Europeans adopted decimal numbers from the Arabic world, although once
again the forms of the digits changed greatly (European 0123…). The European forms
were later adopted widely around the world and are used even in many Arabic-speaking
countries in North Africa. In each case, the interpretation of decimal numbers remained
the same. However, the forms of the digits changed to such a degree that they are no longer
recognizably the same characters. Because of the origin of these characters, the European
decimal numbers are widely known as “Arabic numerals” or “Hindu-Arabic numerals,”
whereas the decimal numbers in use in the Arabic world are widely known there as “Hindi
numbers.”
The Unicode Standard includes Indic digits (including forms used with different Indic
scripts), Arabic digits (with forms used in most of the Arabic world), and European digits
(now used internationally). Because of this decision, the traditional names could not be
retained without confusion. In addition, there are two main variants of the Arabic digits:
those used in Afghanistan, India, Iran, and Pakistan (here called Eastern Arabic-Indic) and
those used in other parts of the Arabic world. In summary, the Unicode Standard uses the
names shown in Table 9-1. A different set of number forms, called Rumi, was used in his-
torical materials from Egypt to Spain, and is discussed in the subsection on “Rumi
Numeral Symbols” in Section 22.3, Numerals.
The glyphs used for several of the digits vary among locales using the Eastern Arabic-Indic digits. Table 9-2 illustrates this variation with some
example glyphs for digits in languages of Afghanistan, India, Iran, and Pakistan. While
some usage of the Persian glyph for U+06F7 extended arabic-indic digit seven can be
documented for Sindhi, the form shown in Table 9-2 is predominant.
The Unicode Standard provides a single, complete sequence of digits for Persian, Sindhi,
and Urdu to account for the differences in appearance and directional treatment when ren-
dering them. The Kashmiri digits have the same appearance as those for Urdu. (For a com-
plete discussion of directional formatting of numbers in the Unicode Standard, see
Unicode Standard Annex #9, “Unicode Bidirectional Algorithm.”)
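The common numeric interpretation of the different digit sets can be illustrated informally with Python's unicodedata module (non-normative):

    import unicodedata

    # Arabic-Indic and Eastern Arabic-Indic digits are decimal digits with the
    # same numeric interpretation as the European forms.
    assert unicodedata.digit("\u0664") == 4        # ARABIC-INDIC DIGIT FOUR
    assert unicodedata.digit("\u06F4") == 4        # EXTENDED ARABIC-INDIC DIGIT FOUR
    assert int("\u0661\u0662\u0663") == 123        # int() accepts any decimal digits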
Extended Arabic Letters. Arabic script is used to write major languages, such as Persian
and Urdu, but it has also been used to transcribe some lesser-used languages, such as Balu-
chi and Lahnda, which have little tradition of printed typography. As a result, the Unicode
Standard encodes multiple forms of some Extended Arabic letters because the character
forms and usages are not well documented for a number of languages. For additional
extended Arabic letters, see the Arabic Supplement block, U+0750..U+077F and the Arabic
Extended-A block, U+08A0..U+08FF.
Koranic Annotation Signs. These characters are used in the Koran to mark pronunciation
and other annotation. Several additional Koranic annotation signs are encoded in the Ara-
bic Extended-A block, U+08A0..U+08FF.
Additional Vowel Marks. When the Arabic script is adopted as the writing system for a lan-
guage other than Arabic, it is often necessary to represent vowel sounds or distinctions not
made in Arabic. In some cases, conventions such as the addition of small dots above and/or
below the standard Arabic fatha, damma, and kasra signs have been used.
Classical Arabic has only three canonical vowels (/a/, /i/, /u/), whereas languages such as
Urdu and Persian include other contrasting vowels such as /o/ and /e/. For this reason, it is
imperative that speakers of these languages be able to show the difference between /e/ and
/i/ (U+0656 arabic subscript alef), and between /o/ and /u/ (U+0657 arabic
inverted damma). At the same time, the use of these two diacritics in Arabic is redundant,
merely emphasizing that the underlying vowel is long.
U+065F arabic wavy hamza below is an additional vowel mark used in Kashmiri. It can
appear in combination with many characters. The particular combination of an alef with
this vowel mark should be written with the sequence <U+0627 arabic letter alef,
U+065F arabic wavy hamza below>, rather than with the character U+0673 arabic
letter alef with wavy hamza below, which has been deprecated and which is not
canonically equivalent. However, implementations should be aware that there may be
existing legacy Kashmiri data in which U+0673 occurs.
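The lack of canonical equivalence mentioned above can be checked informally; the following non-normative sketch assumes Python's unicodedata module:

    import unicodedata

    # U+0673 is deprecated and is not canonically equivalent to the recommended
    # sequence <alef, wavy hamza below>; normalization leaves both forms unchanged.
    seq = "\u0627\u065F"                           # alef + wavy hamza below (recommended)
    assert unicodedata.decomposition("\u0673") == ""
    assert unicodedata.normalize("NFC", seq) == seq
    assert unicodedata.normalize("NFC", "\u0673") == "\u0673"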
Honorifics. Marks known as honorifics represent phrases expressing the status of a person
and are in widespread use in the Arabic-script world. Most have a specifically religious
meaning. In effect, these marks are combining characters at the word level, rather than
being associated with a single base character. The normal practice is that such marks be
used at the end of words. In manuscripts, depending on the letter shapes present in the
name and the calligraphic style in use, the honorific mark may appear over a letter in the
middle of the word. If an exact representation of a manuscript is desired, the honorific
mark may be represented as following that letter. The normalization algorithm does not
move such word-level combining characters to the end of the word.
Spacing honorifics are also in wide use both in the Arabic script and among Muslim com-
munities writing in other scripts. See “Word Ligatures” under Arabic Presentation Forms-A
later in this section for more information.
Arabic Mathematical Symbols. A few Arabic mathematical symbols are encoded in this
block. The Arabic mathematical radix signs, U+0606 arabic-indic cube root and
U+0607 arabic-indic fourth root, differ from simple mirrored versions of U+221B
cube root and U+221C fourth root, in that the digit portions of the symbols are written
with Arabic-Indic digits and are not mirrored. U+0608 arabic ray is a letterlike symbol
used in Arabic mathematics.
Date Separator. U+060D arabic date separator is used in Pakistan and India between
the numeric date and the month name when writing out a date. This sign is distinct from
U+002F solidus, which is used, for example, as a separator in currency amounts.
Full Stop. U+061E arabic triple dot punctuation mark is encoded for traditional
orthographic practice using the Arabic script to write African languages such as Hausa,
Wolof, Fulani, and Mandinka. These languages use arabic triple dot punctuation
mark as a full stop.
Currency Symbols. U+060B afghani sign is a currency symbol used in Afghanistan. The
symbol is derived from an abbreviation of the name of the currency, which has become a
symbol in its own right. U+FDFC rial sign is a currency symbol used in Iran. Unlike the
afghani sign, U+FDFC rial sign is considered a compatibility character, encoded for
compatibility with Iranian standards. Ordinarily in Persian “rial” is simply spelled out as
the sequence of letters, <0631, 06CC, 0627, 0644>.
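As an informal check of the compatibility status of U+FDFC, the following non-normative Python sketch shows its compatibility decomposition matching the spelled-out sequence:

    import unicodedata

    # U+FDFC RIAL SIGN is a compatibility character; its compatibility
    # decomposition is the ordinary spelled-out sequence reh, farsi yeh, alef, lam.
    assert unicodedata.normalize("NFKC", "\uFDFC") == "\u0631\u06CC\u0627\u0644"
    assert unicodedata.normalize("NFC", "\uFDFC") == "\uFDFC"   # canonical forms keep it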
Signs Spanning Numbers. Several other special signs are written in association with num-
bers in the Arabic script. All of these signs can span multiple-digit numbers, rather than
just a single digit. They are not formally considered combining marks in the sense used by
the Unicode Standard, although they clearly interact graphically with their associated
sequence of digits. In the text representation they precede the sequence of digits that they
span, rather than follow a base character, as would be the case for a combining mark. Their
General_Category value is Cf (format character). Unlike most other format characters,
however, they should be rendered with a visible glyph, even in circumstances where no
suitable digit or sequence of digits follows them in logical order. The characters have the
Bidi_Class value of Arabic_Number to make them appear in the same run as the numbers
following them.
A few similar signs spanning numbers or letters are associated with scripts other than Ara-
bic. See the discussion of U+070F syriac abbreviation mark in Section 9.3, Syriac, and
the discussion of U+110BD kaithi number sign and U+110CD kaithi number sign
above in Section 15.2, Kaithi. All of these prefixed format controls, including the non-Ara-
bic ones, are given the property value Prepended_Concatenation_Mark = True, to identify
them as a class. They also have special behavior in text segmentation. (See Unicode Stan-
dard Annex #29, “Unicode Text Segmentation.”)
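These property values can be inspected informally with Python's unicodedata module (non-normative; the Prepended_Concatenation_Mark property itself is not exposed by that module):

    import unicodedata

    # U+0600 ARABIC NUMBER SIGN is a format character (General_Category Cf)
    # whose Bidi_Class is AN, so it stays in the same directional run as the
    # digits that follow it in logical order.
    assert unicodedata.category("\u0600") == "Cf"
    assert unicodedata.bidirectional("\u0600") == "AN"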
U+0600 arabic number sign signals the beginning of a number. It is followed by a
sequence of one or more Arabic digits and is rendered below the digits of the number. The
length of its rendered display may vary with the number of digits. The sequence terminates
with the occurrence of any non-digit character.
U+0601 arabic sign sanah indicates a year (that is, as part of a date). This sign is also ren-
dered below the digits of the number it precedes. Its appearance is a vestigial form of the
Arabic word for year, /sanatu/ (seen noon teh-marbuta), but it is now a sign in its own right
and is widely used to mark a numeric year even in non-Arabic languages where the Arabic
word would not be known. The use of the year sign is illustrated in Figure 9-6.
U+0602 arabic footnote marker is a specialized variant of number sign. Its use indi-
cates that the number so marked represents a footnote number in the text.
U+0603 arabic sign safha is another specialized variant of number sign. It marks a page
number.
U+0604 arabic sign samvat is a specialized variant of date sign used specifically to write
dates of the Śaka era. The shape of the glyph is a stylized abbreviation of the word samvat,
the name of this calendar. It is seen in the Urdu orthography, where it contrasts with con-
ventions used to display dates from the Gregorian or Islamic calendars.
U+0605 arabic number mark above is a specialized variant of number sign. It is used in
Arabic text with Coptic numbers, such as in early astronomical tables. Unlike the other
Arabic number signs, it extends across the top of the sequence of digits, and is used with
Coptic digits, rather than with Arabic digits. (See also the discussion of supralineation and
the numerical use of letters in Section 7.3, Coptic.)
U+06DD arabic end of ayah is another sign used to span numbers, but its rendering is
somewhat different. Rather than extending below the following digits, this sign encloses the
digit sequence. This sign is used conventionally to indicate numbered verses in the Koran.
U+08E2 arabic disputed end of ayah is a specialized variant of the end of ayah. It is seen
occasionally in Koranic text to mark a verse for which there is scholarly disagreement
about the location of the end of the verse.
Poetic Verse Sign. U+060E arabic poetic verse sign is a special symbol often used to
mark the beginning of a poetic verse. Although it is similar to U+0602 arabic footnote
marker in appearance, the poetic sign is simply a symbol. In contrast, the footnote marker
is a format control character that has complex rendering in conjunction with following
digits. U+060F arabic sign misra is another symbol used in poetry.
In Table 9-3, right and left refer to visual order, so a Joining_Type value of Right_Joining
indicates that a character cursively joins to a character displayed to its right in visual order.
(For a discussion of the meaning of Joining_Type values in the context of a vertically ren-
dered script, see “Cursive Joining” in Section 14.4, Phags-pa.) The Arabic characters with
Joining_Type = Right_Joining are exemplified in more detail in Table 9-9, and those with
Joining_Type = Dual_Joining are shown in Table 9-8. When characters do not join or
cause joining (such as damma), they are classified as transparent.
The Phags-pa and Manichaean scripts have a few Left_Joining characters, which are other-
wise unattested in the Arabic and Syriac scripts. See Section 10.5, Manichaean. For a discus-
sion of the meaning of Joining_Type values in the context of a vertically rendered script,
see “Cursive Joining” in Section 14.4, Phags-pa.
Table 9-4 defines derived superclasses of the primary Arabic joining types; those derived
types are used in the cursive joining rules. In this table, right and left refer to visual order.
Joining Rules. The following rules describe the joining behavior of Arabic letters in terms
of their display (visual) order. In other words, the positions of letterforms in the included
examples are presented as they would appear on the screen after the Bidirectional Algo-
rithm has reordered the characters of a line of text.
An implementation may choose to restate the following rules according to logical order so
as to apply them before the Bidirectional Algorithm’s reordering phase. In this case, the
words right and left as used in this section would become preceding and following.
In the following rules, if X refers to a character, then various glyph types representing that
character are referred to as shown in Table 9-5.
R1 Transparent characters do not affect the joining behavior of base (spacing) characters.
R2 A right-joining character X that has a right join-causing character on the right will adopt the form Xr.
R3 A left-joining character X that has a left join-causing character on the left will adopt the form Xl.
R4 A dual-joining character X that has a right join-causing character on the right and a left join-causing character on the left will adopt the form Xm.
R5 A dual-joining character X that has a right join-causing character on the right and no left join-causing character on the left will adopt the form Xr.
R6 A dual-joining character X that has a left join-causing character on the left and no right join-causing character on the right will adopt the form Xl.
R7 If none of the preceding rules applies to a character X, then it will adopt the non-
joining form Xn.
The cursive joining behavior described here for the Arabic script is also generally applica-
ble to other cursive scripts such as Syriac. Specific circumstances may modify the applica-
tion of the rules just described.
As noted earlier in this section, the zero width non-joiner may be used to prevent join-
ing, as in the Persian plural suffix or Ottoman Turkish vowels.
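A minimal, non-normative sketch of contextual form selection, restated in logical order as permitted above, is shown below. The joining-type table and the function contextual_forms are hypothetical simplifications; a real implementation would take its data from ArabicShaping.txt and would also handle left-joining characters and ligatures:

    # Joining types follow ArabicShaping.txt: R = right-joining, D = dual-joining,
    # U = non-joining, T = transparent, C = join-causing. Left-joining (L) is
    # omitted here because no Arabic letters have that type. The table below is
    # a hand-picked illustration, not the complete data.
    JOINING_TYPE = {
        "\u0627": "R",  # alef
        "\u0628": "D",  # beh
        "\u0644": "D",  # lam
        "\u0646": "D",  # noon
        "\u0648": "R",  # waw
        "\u0621": "U",  # hamza
        "\u064F": "T",  # damma (combining marks are transparent, rule R1)
        "\u200D": "C",  # zero width joiner
        "\u200C": "U",  # zero width non-joiner
    }

    def contextual_forms(text):
        """Return isolated/initial/medial/final for each cursive letter."""
        types = [JOINING_TYPE.get(ch, "U") for ch in text]
        # Indices of non-transparent characters; transparent marks are skipped (R1).
        base = [i for i, t in enumerate(types) if t != "T"]
        forms = {}
        for pos, i in enumerate(base):
            t = types[i]
            if t not in ("D", "R"):
                continue
            prev_t = types[base[pos - 1]] if pos > 0 else None
            next_t = types[base[pos + 1]] if pos + 1 < len(base) else None
            joins_prev = prev_t in ("D", "C")                     # preceding char joins forward
            joins_next = t == "D" and next_t in ("D", "R", "C")   # X joins toward following
            if joins_prev and joins_next:
                forms[i] = "medial"      # Xm (rule R4)
            elif joins_prev:
                forms[i] = "final"       # Xr (rules R2, R5)
            elif joins_next:
                forms[i] = "initial"     # Xl (rule R6)
            else:
                forms[i] = "isolated"    # Xn (rule R7)
        return forms

    # "bab" (beh, alef, beh) yields initial, final, and isolated forms respectively.
    print(contextual_forms("\u0628\u0627\u0628"))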
Arabic Ligatures
Ligature Classes. The lam-alef type of ligature is extremely common in the Arabic script. These ligatures occur in almost all font designs, except for a few modern styles. When supported by the style of the font, lam-alef ligatures are considered obligatory: all character sequences rendered in that font that match the rules specified in the following discussion must form these ligatures. Many other Arabic ligatures are discretionary.
Their use depends on the font design.
For the purpose of describing the obligatory Arabic ligatures, certain characters fall into
two joining groups, as shown in Table 9-6. The complete list is available in ArabicShap-
ing.txt in the Unicode Character Database.
Ligature Rules. The following rules describe the formation of obligatory ligatures. They
are applied after the preceding joining rules. As for the joining rules just discussed, the fol-
lowing rules describe ligature behavior of Arabic letters in terms of their display (visual)
order.
In the ligature rules, if X and Y refer to characters, then various glyph types representing
combinations of these characters are referred to as shown in Table 9-7.
The Joining_Type and Joining_Group values for all Arabic characters are explicitly speci-
fied in ArabicShaping.txt in the Unicode Character Database. For convenience in refer-
ence, the Joining_Type values are extracted and listed in DerivedJoiningType.txt and the
Joining_Group values are extracted and listed in DerivedJoiningGroup.txt.
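A small, non-normative sketch of reading these values from a local copy of ArabicShaping.txt follows; the path and the helper load_arabic_shaping are assumptions for illustration:

    # Fields in ArabicShaping.txt are code point; schematic name; joining type;
    # joining group, separated by semicolons; '#' begins a comment.
    def load_arabic_shaping(path="ArabicShaping.txt"):
        table = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.split("#", 1)[0].strip()
                if not line:
                    continue
                code, _name, jtype, jgroup = [field.strip() for field in line.split(";")]
                table[int(code, 16)] = (jtype, jgroup)
        return table

    # Example, consistent with the discussion above: U+0628 is dual-joining, group BEH.
    shaping = load_arabic_shaping()
    print(shaping[0x0628])   # expected: ('D', 'BEH')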
Dual-Joining. Table 9-8 exemplifies dual-joining Arabic characters and illustrates the
forms taken by the letter skeletons and their ijam marks in context. Dual-joining characters
have four distinct forms, for isolated, final, medial, and initial contexts, respectively. The
name for each joining group is based on the name of a representative letter that is used to
illustrate the shaping behavior. All other Arabic characters are merely variations on these
basic shapes, with diacritics added, removed, moved, or replaced. For instance, the beh
joining group applies not only to U+0628 arabic letter beh, which has a single dot
below the skeleton, but also to U+062A arabic letter teh, which has two dots above the
skeleton, and to U+062B arabic letter theh, which has three dots above the skeleton, as
well as to the Persian and Urdu letter U+067E arabic letter peh, which has three dots
below the skeleton. The joining groups in the table are organized by shape and not by stan-
dard Arabic alphabetical order. Note that characters in some joining groups have dots in
some contextual forms, but not others, or their dots may move to a different position.
These joining groups include nya, farsi yeh, and burushaski yeh barree.
The dual-joining joining groups exemplified in Table 9-8 include african noon, nya (Jawi nya), farsi yeh, burushaski yeh barree (dual-joining, as opposed to yeh barree), feh, african feh, qaf, african qaf, meem, heh, knotted heh, heh goal (including hamza on heh goal), kaf, swash kaf, gaf (including keheh), and lam.
Right-Joining. Table 9-9 exemplifies right-joining Arabic characters, illustrating the forms
they take in context. Right-joining characters have only two distinct forms, for isolated and
final contexts, respectively.
The right-joining joining groups exemplified in Table 9-9 include alef, waw, and straight waw (Tatar straight waw).
Some characters occur only at the end of words or morphemes when correctly spelled;
these are called trailing characters. Examples include teh marbuta and dammatan. When
trailing characters are joining (such as teh marbuta), they are classified as right-joining,
even when similarly shaped characters are dual-joining. Other characters, such as alef
maksura, are considered trailing in modern Arabic, but are dual-joining in Koranic Arabic
and languages like Uyghur. These are classified as dual-joining.
Letter heh. For U+0647 arabic letter heh, the representative glyph shown in the code charts is a form often used to reduce the chance of misidentifying heh as U+0665 arabic-indic digit five, which has a very similar shape.
bic letter heh and U+06C1 arabic letter heh goal both look like U+06D5 arabic
letter ae.
U+06BE arabic letter heh doachashmee is used to represent any heh-like letter that
appears with stems at both sides in all contextual forms. The exact contextual shapes of the
letter depend on the language and the style of writing. The forms shown in Table 9-8 for
knotted heh are used in certain styles of writing in South Asia. Other South Asian styles
may use different medial and final forms. The style used in China and Central Asia for lan-
guages such as Uyghur uses medial and final forms for heh doachashmee that are visually
similar to the medial form of heh shown in Table 9-8.
Letter yeh. There are many complications in the shaping of the Arabic letter yeh. These
complications have led to the encoding of several different characters for yeh in the Uni-
code Standard, as well as the definition of several different joining groups involving yeh.
The relationships between those characters and joining groups for yeh are explained here.
U+06CC arabic letter farsi yeh is used in Persian, Urdu, Pashto, Azerbaijani, Kurdish,
and various minority languages written in the Arabic script, and also Koranic Arabic. It
behaves differently from most Arabic letters, in a way surprising to native Arabic language
speakers. The letter has two horizontal dots below the skeleton in initial and medial forms,
but no dots in final and isolated forms. Compared to the two Arabic language yeh forms,
farsi yeh is exactly like U+0649 arabic letter alef maksura in final and isolated forms,
but exactly like U+064A arabic letter yeh in initial and medial forms, as shown in
Table 9-10.
Characters in the joining group yeh behave in a different manner. Just as U+064A arabic letter yeh retains two dots below in
all contextual forms, other characters in the joining group yeh retain whatever mark
appears below their isolated form in all other contexts. For example, U+0777 arabic let-
ter farsi yeh with extended arabic-indic digit four below carries an Urdu-style
digit four as a diacritic below the yeh skeleton, and retains that diacritic in all positions, as
shown in the fourth row of Table 9-10. Note that the joining group cannot always be
derived from the character name alone. The complete list of characters with the joining
group yeh or farsi yeh is available in ArabicShaping.txt in the Unicode Character Data-
base.
In the orthographies of Arabic and Persian, the yeh barree has always been treated as a sty-
listic variant of yeh in final and isolated positions. When the Perso-Arabic writing system
was adapted and extended for use with the Urdu language, yeh barree was adopted as a dis-
tinct letter to accommodate the richer vowel repertoire of Urdu. South Asian languages
such as Urdu and Kashmiri use yeh barree to represent the /e/ vowel. This contrasts with
the /i/ vowel, which is usually represented in those languages by U+06CC arabic letter
farsi yeh. The encoded character U+06D2 arabic letter yeh barree is classified as a
right-joining character, as shown in Table 9-10. On that basis, when the /e/ vowel needs to
be represented in initial or medial positions with a yeh shape in such languages, one should
use U+06CC arabic letter farsi yeh. In the unusual circumstances where one wishes to
distinctly represent the /e/ vowel in word-initial or word-medial positions, a higher level
protocol should be used.
For the Burushaski language, two characters that take the form of yeh barree with a dia-
critic, U+077A arabic letter yeh barree with extended arabic-indic digit two
above and U+077B arabic letter yeh barree with extended arabic-indic digit
three above, are classified as dual-joining. These characters have a separate joining group
called burushaski yeh barree, as shown for U+077A in the last row of Table 9-10.
U+0620 arabic letter kashmiri yeh is used in Kashmiri text to indicate that the preced-
ing consonantal sound is palatalized. The letter has the form of a yeh with a diacritic small
circle below. It has the yeh joining group, with the shapes as shown in the fifth row of
Table 9-10. However, when Kashmiri is written in Nastaliq style, the final and isolated
forms of kashmiri yeh usually appear as truncated yeh shapes (o) without the diacritic ring.
U+08AC arabic letter rohingya yeh is used in the Arabic orthography for the
Rohingya language of Myanmar. It represents a medial ya, corresponding to the use of
U+103B myanmar consonant sign medial ya in the Myanmar script. It is a right-joining
letter, but never occurs in isolated form. It only occurs after certain consonants, forming a
conjunct letter with those consonants.
Noon Ghunna. The letter noon ghunna is used to mark nasalized vowels at the ends of
words and some morphemes in Urdu, Balochi, and other languages of Southern Asia. It is
represented by U+06BA arabic letter noon ghunna. The noon ghunna has the shape of
a dotless noon and typically appears only in final and isolated contexts in these languages.
In the middle of words and morphemes, the normal noon, U+0646 arabic letter noon,
is used instead. To avoid ambiguity, sometimes a special mark, U+0658 arabic mark
noon ghunna, is added to the dotted noon to indicate nasalization.
U+06BA arabic letter noon ghunna is also used as a dotless noon for the noon skeleton
in all four of its contextual forms. As such, it is used in representation of early Arabic and
Koranic Arabic texts. Rendering systems should display U+06BA as a dual-joining letter,
with all four contextual forms shown dotless, regardless of the language of the text.
Advanced text entry applications for Urdu, Balochi, and other languages using noon
ghunna may include specialized logic for its handling. For example, they might detect mid-
word usage of the noon ghunna key and emit the regular dotted noon character (U+0646)
instead, as appropriate for spelling in that context.
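One possible (hypothetical) sketch of such input logic follows; the helper on_letter_key and the sample key sequence are illustrative only and do not represent any particular keyboard implementation:

    NOON_GHUNNA = "\u06BA"   # ARABIC LETTER NOON GHUNNA (dotless)
    NOON = "\u0646"          # ARABIC LETTER NOON (dotted)

    def on_letter_key(buffer, new_letter):
        """Append a letter; if a noon ghunna turns out to be word-medial, fix it.

        When another letter is typed directly after a noon ghunna, the noon
        ghunna is no longer word-final, so it is replaced by the regular
        dotted noon, as described above.
        """
        if buffer.endswith(NOON_GHUNNA) and new_letter.isalpha():
            buffer = buffer[:-1] + NOON
        return buffer + new_letter

    # Typing noon ghunna and then another letter yields a dotted noon mid-word.
    word = on_letter_key("\u0645\u06CC", NOON_GHUNNA)   # arbitrary sample: meem, farsi yeh
    word = on_letter_key(word, "\u06C1")                # heh goal typed next
    assert NOON in word and NOON_GHUNNA not in word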
Letters for Warsh. There is a set of widespread orthographic conventions for Arabic writ-
ing in West and Northwest Africa known as Warsh. Among other differences from the bet-
ter-known Hafs orthography of the Middle East, there are significant differences in Warsh
regarding the placement of ijam dots on several important Arabic letters. Several “African”
letters are encoded in the Arabic Extended-A block specifically to account for these differ-
ences in dot placement.
The specialized letters for Warsh share the skeleton with the corresponding, regular Arabic
letters. However, they differ in the placement of dots. The Warsh letters have no dots in
final or isolated positional contexts. This is illustrated by U+08BD arabic letter african
noon. Unlike U+0646 arabic letter noon, which displays a dot above in all positional
contexts, african noon displays a dot above in initial and medial position, and no dot in
final or isolated position. This contrast can be clearly seen in Table 9-8.
U+08BB arabic letter african feh and U+08BC arabic letter african qaf also lose
all dots in final or isolated position, but exhibit a somewhat different pattern for initial and
medial position. The basic skeletons for feh and for qaf are identical for those letters in ini-
tial and medial position. In the Hafs orthography, the feh takes a single dot above in all
positions, while the qaf takes two dots above in all positions. The Warsh orthography dis-
tinguishes the two letters differently: the feh takes a single dot below in initial or medial
position, while the qaf takes a single dot above in initial or medial position. These contex-
tual differences in the placement of the dots for these letters can also be seen in Table 9-8.
Joining Groups in Other Scripts. Other scripts besides Arabic also have cursive joining
behavior and associated per-character values for Joining_Type and Joining_Group. Those
values are also listed in ArabicShaping.txt in the Unicode Character Database, in sections
devoted to each particular script. See the script descriptions for such scripts in the core
specification—for example, Syriac and Manichaean—for detailed discussions of cursive
joining behavior and tables of joining groups for those scripts.
For the Arabic script, Joining_Group values are assigned for each distinct letter skeleton in
all instances—even for the small number of cases, such as heh goal, where only a single
character is associated with that Joining_Group. This is appropriate for Arabic, because the
script has cosmopolitan use, and many letters have been modified with various nukta dia-
critics to form new letters for non-Arabic languages using the script. This pattern of com-
prehensive assignment of Joining_Group values to all letter skeletons also applies for the
Syriac and Manichaean scripts.
For other cursive joining scripts with less well-defined joining groups, all letters are simply
assigned the value No_Joining_Group. This does not necessarily mean that no identifiable
letter skeletons occur, but rather that no complete analysis has been done that would indi-
cate more than one letterform uses a shared skeleton for cursive joining. Examples include:
Mongolian, Phags-Pa, Psalter Pahlavi, Sogdian, and Adlam.
Starting with Unicode 11.0, even in cases where a newly encoded script with cursive join-
ing behavior includes some characters which share letter skeletons, most characters are
given the No_Joining_Group value. This applies, for example, to the Hanifi Rohingya
script, which has a few explicit Joining_Group values, but for which all other characters
have the No_Joining_Group value.
In the future, characters with the No_Joining_Group value in scripts with cursive joining
behavior may end up being given explicit new Joining_Group values, where further analy-
sis clearly demonstrates use of shared skeletons in cursive joining, or where new, diacriti-
cally modified letters are added to the encoding for that script.
Combining Hamza
Combining Hamza Above. U+0654 arabic hamza above is intended both for the repre-
sentation of hamza semantics in combination with certain Arabic letters, and as a diacriti-
cal mark occasionally used in combinations to derive extended Arabic letters. There are a
number of complications regarding its use, which interact with the rules for the rendering
of Arabic letter yeh and which result from the need to keep Unicode normalization stable.
U+0654 arabic hamza above should not be used with U+0649 arabic letter alef mak-
sura. Instead, the precomposed U+0626 arabic letter yeh with hamza above should
be used to represent a yeh-shaped base with no dots in any positional form, and with a
hamza above. Because U+0626 is canonically equivalent to the sequence <U+064A arabic
letter yeh, U+0654 arabic hamza above>, when U+0654 is applied to U+064A arabic
letter yeh, the yeh should lose its dots in all positional forms, even though yeh retains its
dots when combined with other marks.
A separate, non-decomposable character, U+08A8 arabic letter yeh with two dots
below and hamza above, is used to represent a yeh-shaped base with a hamza above, but
with retention of dots in all positions. This letter is used in the Fulfulde language in Cameroon, to represent a palatal implosive.
In most other cases when a hamza is needed as a mark above for an Arabic letter, U+0654
arabic hamza above can be freely used in combination with basic Arabic letters. Three
exceptions are the extended Arabic letters U+0681 arabic letter hah with hamza
above, U+076C arabic letter reh with hamza above, and U+08A1 arabic letter
beh with hamza above, where the hamza mark is functioning as an ijam (diacritic),
rather than as a normal hamza. In those three cases, the extended Arabic letters have no canonical decompositions, and a sequence of the base letter followed by U+0654 is not equivalent to the precomposed letter.
The first five entries in Table 9-11 show the cases where the hamza above can be freely
used, and where there is a canonical equivalence to the precomposed characters. The last
four entries show the exceptions, where use of the hamza above is inappropriate, and
where only the precomposed characters should be used.
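The contrast between the freely composable cases and the exceptions can be checked informally with Python's unicodedata module (non-normative):

    import unicodedata

    # U+0626 is canonically equivalent to <yeh, hamza above>, so the sequence
    # composes under NFC. U+08A1 has no decomposition, so <beh, hamza above>
    # remains a two-character sequence and is not equivalent to the letter.
    assert unicodedata.normalize("NFC", "\u064A\u0654") == "\u0626"
    assert unicodedata.decomposition("\u08A1") == ""
    assert unicodedata.normalize("NFC", "\u0628\u0654") == "\u0628\u0654"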
The Arabic block also includes several extended Arabic characters whose origin was to rep-
resent dialectal or other poorly attested alternative forms of the Soraní alphabet extensions.
U+0692 arabic letter reh with small v is a dialectal variant of U+0695 which places
the small v diacritic above the letter rather than below it. U+0694 is another variant of
U+0695. U+06B6 and U+06B7 are poorly attested variants of U+06B5, and U+06CA is a
poorly attested variant of U+06C6. None of these alternative forms is required (or desired)
for a regular implementation of the Kurdish Soraní orthography.
set precisely. Fonts will often include only a subset of these glyphs, and they may also
include glyphs outside of this set. The included glyphs are generally not accessible as char-
acters and are used only by rendering engines.
Ornate Parentheses. The alternative, ornate forms of parentheses (U+FD3E ornate left
parenthesis and U+FD3F ornate right parenthesis) for use with the Arabic script are
considered traditional Arabic punctuation, rather than compatibility characters. These
ornate parentheses are exceptional in rendering in bidirectional text; for legacy reasons,
they do not have the Bidi_Mirrored property. Thus, unlike other parentheses, they do not
automatically mirror when rendered in a bidirectional context.
Nuktas. Various patterns of single or multiple dots or other small marks are used diacriti-
cally to extend the core Arabic set of letters to represent additional sounds in other lan-
guages written with the Arabic script. Such dot patterns are known as ijam or nuktas. In the
Unicode Standard, extended Arabic characters with nuktas are simply encoded as fully-
formed base characters. However, there is an occasional need in pedagogical materials
about the Arabic script to exhibit the various nuktas in isolation. The range of characters
U+FBB2..U+FBC1 provides a set of symbols for this purpose. These are ordinary, spacing
symbols with right-to-left directionality. They are not combining marks, and are not
intended for the construction of new Arabic letters by use in combining character
sequences. The Arabic pedagogical symbols do not partake of any Arabic shaping behavior.
Their Joining_Type is Non_Joining, so if used in juxtaposition with an Arabic letter skele-
ton, they will break the cursive connection and render after the letter, instead of above or
below it.
For clarity in display, those with the names including the word “above” should have glyphs
that render high above the baseline, and those with names including “below” should be at
or below the baseline.
Word Ligatures. The signs and symbols encoded at U+FDF0..U+FDFD are word ligatures
sometimes treated as a unit. Most of them are encoded for compatibility with older charac-
ter sets and are rarely used, except the following:
U+FDF2 arabic ligature allah isolated form is a very common ligature, used to dis-
play the name of God. When the formation of the allah ligature is desired, the recom-
mended way to represent the word would be <alef, lam, lam, shadda, superscript alef, heh>
<0627, 0644, 0644, 0651, 0670, 0647>. In non-Arabic languages, other forms of heh, such as
heh goal (U+06C1), may also form the ligature. Extra care should be taken not to form the
ligature in the absence of the shadda and the superscript alef, as the sequences <alef, lam,
lam, heh> and <alef, lam, lam, shadda, heh> exist in Persian and other languages with dif-
ferent meanings or pronunciations, where the formation of the ligature would be incorrect
and inappropriate.
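The following non-normative sketch shows the recommended sequence and, for contrast, the compatibility decomposition recorded for U+FDF2 in the Unicode Character Database, which omits the shadda and superscript alef:

    import unicodedata

    # The recommended spelling that triggers the allah ligature, as given above.
    allah = "\u0627\u0644\u0644\u0651\u0670\u0647"   # alef, lam, lam, shadda, superscript alef, heh

    # U+FDF2 is a compatibility ligature; its compatibility decomposition is the
    # bare four-letter spelling.
    assert unicodedata.normalize("NFKD", "\uFDF2") == "\u0627\u0644\u0644\u0647"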
U+FDFA arabic ligature sallallahou alayhe wasallam and U+FDFB arabic liga-
ture jallajalalouhou are honorifics, commonly used after the name of the prophet
Muhammad or God. Their usage is comparable to the honorifics found at
U+0610..U+0613, except that these are spacing characters. The same characters are some-
times used by Muslims writing in other scripts such as Latin and Cyrillic.
9.3 Syriac
Syriac: U+0700–U+074F
Syriac Language. The Syriac language belongs to the Aramaic branch of the Semitic family
of languages. The earliest datable Syriac writing dates from the year 6 ce. Syriac is the
active liturgical language of many communities in the Middle East (Syrian Orthodox,
Assyrian, Maronite, Syrian Catholic, and Chaldaean) and Southeast India (Syro-Malabar
and Syro-Malankara). It is also the native language of a considerable population in these
communities.
Syriac is divided into two dialects. West Syriac is used by the Syrian Orthodox, Maronites,
and Syrian Catholics. East Syriac is used by the Assyrians (that is, Ancient Church of the
East) and Chaldaeans. The two dialects are very similar and have almost no differences in
grammar and vocabulary. They differ in pronunciation and use different dialectal forms of
the Syriac script.
Languages Using the Syriac Script. A number of modern languages and dialects employ
the Syriac script in one form or another. They include the following:
1. Literary Syriac. The primary usage of Syriac script.
2. Neo-Aramaic dialects. The Syriac script is widely used for modern Aramaic lan-
guages, next to Hebrew, Cyrillic, and Latin. A number of Eastern Modern Ara-
maic dialects known as Swadaya (also called vernacular Syriac, modern Syriac,
modern Assyrian, and so on, and spoken mostly by the Assyrians and Chaldae-
ans of Iraq, Turkey, and Iran) and the Central Aramaic dialect, Turoyo (spoken
mostly by the Syrian Orthodox of the Tur Abdin region in southeast Turkey),
belong to this category of languages.
3. Garshuni (Arabic written in the Syriac script). It is currently used for writing
Arabic liturgical texts by Syriac-speaking Christians. Garshuni employs the
Arabic set of vowels and overstrike marks.
4. Christian Palestinian Aramaic (also known as Palestinian Syriac). This dialect
is no longer spoken.
5. Other languages. The Syriac script was used in various historical periods for
writing Armenian and some Persian dialects. Syriac speakers employed it for
writing Arabic, Ottoman Turkish, and Malayalam. Six special characters used
for Persian and Sogdian were added in Version 4.0 of the Unicode Standard.
Shaping. The Syriac script is cursive and has shaping rules that are similar to those for Ara-
bic. The Unicode Standard does not include any presentation form characters for Syriac.
Directionality. The Syriac script is written from right to left. Conformant implementations
of Syriac script must use the Unicode Bidirectional Algorithm (see Unicode Standard
Annex #9, “Unicode Bidirectional Algorithm”).
Syriac Type Styles. Syriac texts employ several type styles. Because all type styles use the
same Syriac characters, even though their shapes vary to some extent, the Unicode Stan-
dard encodes only a single Syriac script.
1. Estrangela type style. Estrangela (a word derived from Greek strongulos, mean-
ing “rounded”) is the oldest type style. Ancient manuscripts use this writing
style exclusively. Estrangela is used today in West and East Syriac texts for writ-
ing headers, titles, and subtitles. It is the current standard in writing Syriac texts
in Western scholarship.
2. Serto or West Syriac type style. This type style is the most cursive of all Syriac
type styles. It emerged around the eighth century and is used today in West Syr-
iac texts, Turoyo (Central Neo-Aramaic), and Garshuni.
3. East Syriac type style. Its early features appear as early as the sixth century; it
developed into its own type style by the twelfth or thirteenth century. This type
style is used today for writing East Syriac texts as well as Swadaya (Eastern Neo-
Aramaic). It is also used today in West Syriac texts for headers, titles, and subti-
tles alongside the Estrangela type style.
4. Christian Palestinian Aramaic. Manuscripts of this dialect employ a script that
is akin to Estrangela. It can be considered a subcategory of Estrangela.
The Unicode Standard provides for usage of the type styles mentioned above. It also
accommodates letters and diacritics used in Neo-Aramaic, Christian Palestinian Aramaic,
Garshuni, Persian, and Sogdian languages. Examples are supplied in the Serto type style,
except where otherwise noted.
Character Names. Character names follow the East Syriac convention for naming the let-
ters of the alphabet. Diacritical points use a descriptive naming—for example, syriac dot
above.
Syriac Abbreviation Mark. U+070F syriac abbreviation mark (SAM) is a zero-width
formatting code that has no effect on the shaping process of Syriac characters. The SAM
specifies the beginning point of a Syriac abbreviation, which is a line drawn horizontally
above one or more characters, at the end of a word or of a group of characters followed by
a character other than a Syriac letter or diacritical mark. A Syriac abbreviation may contain
Syriac diacritics.
Ideally, the Syriac abbreviation is rendered by a line that has a dot at each end and the cen-
ter, as shown in the examples. While not preferable, it has become acceptable for comput-
ers to render the Syriac abbreviation as a line without the dots. The line is acceptable for
the presentation of Syriac in plain text, but the presence of dots is recommended in liturgi-
cal texts.
The Syriac abbreviation is used for letter numbers and contractions. A Syriac abbreviation
generally extends from the last tall character in the word until the end of the word. A com-
mon exception to this rule is found with letter numbers that are preceded by a preposition
character, as seen in Figure 9-7.
(Figure 9-7 shows a letter number, preceded by a preposition, equal to 15.)
A SAM is placed before the character where the abbreviation begins. The Syriac abbrevia-
tion begins over the character following the SAM and continues until the end of the word.
Use of the SAM is demonstrated in Figure 9-8.
Note: Modern East Syriac texts employ a punctuation mark for contractions of this sort.
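A hypothetical plain-text representation of a letter number marked with the SAM follows; the particular letters (beth as the preposition, then yudh and he as the number) are chosen only for illustration:

    # Hypothetical illustration of SAM placement: the mark precedes the first
    # character of the abbreviation, and the abbreviation line is rendered from
    # that character to the end of the word.
    SAM = "\u070F"                            # SYRIAC ABBREVIATION MARK
    word = "\u0712" + SAM + "\u071D\u0717"    # beth (preposition), SAM, yudh, he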
Ligatures and Combining Characters. Only one ligature is included in the Syriac block:
U+071E syriac letter yudh he. This combination is used as a unique character in the
same manner as an “æ” ligature. A number of combining diacritics unique to Syriac are
encoded, but combining characters from other blocks are also used, especially from the
Arabic block.
Diacritical Marks and Vowels. The function of the diacritical marks varies. They indicate
vowels (as in Arabic and Hebrew), mark grammatical attributes (for example, verb versus
noun, interjection), or guide the reader in the pronunciation and/or reading of the given
text.
“The reader of the average Syriac manuscript or book is confronted with
a bewildering profusion of points. They are large, of medium size and
small, arranged singly or in twos and threes, placed above the word,
below it, or upon the line.”
There are two vocalization systems. The first, attributed to Jacob of Edessa (633–708 ce),
utilizes letters derived from Greek that are placed above (or below) the characters they
modify. The second is the more ancient dotted system, which employs dots in various
shapes and locations to indicate vowels. East Syriac texts exclusively employ the dotted sys-
tem, whereas West Syriac texts (especially later ones and in modern times) employ a mix-
ture of the two systems.
Diacritical marks are nonspacing and are normally centered above or below the character.
Exceptions to this rule follow:
1. U+0741 syriac qushshaya and U+0742 syriac rukkakha are used only with
the letters beth, gamal (in its Syriac and Garshuni forms), dalath, kaph, pe, and
taw.
• The qushshaya indicates that the letter is pronounced hard and unaspirated.
• The rukkakha indicates that the letter is pronounced soft and aspirated. When
the rukkakha is used in conjunction with the dalath, it is printed slightly to the
right of the dalath’s dot below.
2. In Modern Syriac usage, when a word contains a rish and a seyame, the dot of
the rish and the seyame are replaced by a rish with two dots above it.
3. The feminine dot is usually placed to the left of a final taw.
Punctuation. Most punctuation marks used with Syriac are found in the Latin-1 and Ara-
bic blocks. The other marks are encoded in this block.
Digits. Modern Syriac employs European numerals, as does Hebrew. The ordering of dig-
its follows the same scheme as in Hebrew.
Harklean Marks. The Harklean marks are used in the Harklean translation of the New
Testament. U+070B syriac harklean obelus and U+070D syriac harklean asteri-
scus mark the beginning of a phrase, word, or morpheme that has a marginal note.
U+070C syriac harklean metobelus marks the end of such sections.
Dalath and Rish. Prior to the development of pointing, early Syriac texts did not distin-
guish between a dalath and a rish, which are distinguished in later periods with a dot below
the former and a dot above the latter. Unicode provides U+0716 syriac letter dotless
dalath rish as an ambiguous character.
Semkath. Unlike other letters, the joining mechanism of semkath varies through the
course of history from right-joining to dual-joining. It is necessary to enter a U+200C zero
width non-joiner character after the semkath to obtain the right-joining form where
required. Two common variants of this character exist: U+0723 syriac letter semkath
and U+0724 syriac letter final semkath. They occur interchangeably in the same doc-
ument, similar to the case of Greek sigma.
Vowel Marks. The so-called Greek vowels may be used above or below letters. As West Syr-
iac texts employ a mixture of the Greek and dotted systems, both versions are accounted
for here.
Miscellaneous Diacritics. Miscellaneous general diacritics are used in Syriac text. Their
usage is explained in Table 9-12.
U+0303, U+0330 These are used in Swadaya to indicate letters not found in Syriac.
U+0304, U+0320 These are used for various purposes ranging from phonological to grammatical
to orthographic markers.
U+0307, U+0323 These points are used for various purposes—grammatical, phonological, and
otherwise. They differ typographically and semantically from the qushshaya,
rukkakha points, and the dotted vowel points.
U+0308 This is the plural marker. It is also used in Garshuni for the Arabic teh marbuta.
U+030A, U+0325 These are two other forms for the indication of qushshaya and rukkakha. They
are used interchangeably with U+0741 syriac qushshaya and U+0742 syr-
iac rukkakha, especially in West Syriac grammar books.
U+0324 This diacritical mark is found in ancient manuscripts. It has a grammatical and
phonological function.
U+032D This is one of the digit markers.
U+032E This is a mark used in late and modern East Syriac texts as well as in Swadaya to
indicate a fricative pe.
Use of Characters of the Arabic Block. Syriac makes use of several characters from the
Arabic block, including U+0640 arabic tatweel. Modern texts use U+060C arabic
comma, U+061B arabic semicolon, and U+061F arabic question mark. The shadda
(U+0651) is also used in the core part of literary Syriac on top of a waw in the word “O”.
Arabic harakat are used in Garshuni to indicate the corresponding Arabic vowels and dia-
critics.
Syriac Shaping
Minimum Rendering Requirements. Rendering requirements for Syriac are similar to
those for Arabic. The remainder of this section specifies a minimum set of rules that pro-
vides legible Syriac joining and ligature substitution behavior.
Joining Types. Each Syriac letter must be depicted by one of a number of possible contex-
tual glyph forms. The appropriate form is determined on the basis of the cursive joining
behavior of that character as it interacts with the cursive joining behavior of adjacent char-
acters. The basic joining types are identical to those specified for the Arabic script, and are
specified in the file ArabicShaping.txt in the Unicode Character Database. However, there
are additional contextual rules which govern the shaping of U+0710 syriac letter alaph
in final position. The additional glyph types associated with final alaph are listed in
Table 9-13.
In the following rules, alaph refers to U+0710 syriac letter alaph, which has Join-
ing_Group = Alaph.
These rules are intended to augment joining rules for Syriac which would otherwise paral-
lel the joining rules specified for Arabic in Section 9.2, Arabic. Characters with Joining_-
Type = Transparent are skipped over when applying the Syriac rules for shaping of alaph.
In other words, the Syriac parallel for Arabic joining rule R1 would take precedence over
the alaph joining rules.
S1 An alaph that has a left-joining character to its right and a non-joining character
to its left will take the form of Afj.
S2 An alaph that has a non-left-joining character to its right, except for a character
with Joining_Group = Dalath_Rish, and a non-joining character to its left will
take the form of Afn.
S3 An alaph that has a character with Joining_Group = Dalath_Rish to its right and a
non-joining character to its left will take the form of Afx.
The example in rule S3 is shown in the East Syriac font style.
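A minimal, non-normative sketch of rules S1 through S3, restated in logical order, is shown below; the helper final_alaph_form and its string arguments are illustrative assumptions, and real joining data would come from ArabicShaping.txt:

    # Glyph selection for a word-final alaph depends on the nearest preceding
    # non-transparent character (the character to its right in visual order).
    ALAPH = "\u0710"

    def final_alaph_form(prev_jtype, prev_jgroup):
        """Return the glyph type for a word-final alaph per rules S1-S3."""
        if prev_jtype in ("D", "C"):          # preceding character joins forward: rule S1
            return "Afj"                      # joined final alaph
        if prev_jgroup == "DALATH RISH":      # rule S3
            return "Afx"
        return "Afn"                          # rule S2: non-joined final alaph

    print(final_alaph_form("D", "BETH"))         # 'Afj'
    print(final_alaph_form("R", "DALATH RISH"))  # 'Afx'
    print(final_alaph_form("R", "WAW"))          # 'Afn'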
Malayalam LLA. U+0868 syriac letter malayalam lla normally connects to the right,
but because it joins on both sides in some manuscripts, it is designated dual-joining. To
represent right-joining lla, the ZWNJ should be employed to make sure it does not connect
to the left-side letter.
Syriac Character Joining Groups. Syriac characters can be subdivided into shaping
groups, based on the behavior of their letter skeletons when shaped in context. The Uni-
code character property that specifies these groups is called Joining_Group, and is speci-
fied in ArabicShaping.txt in the Unicode Character Database. It is described in the
subsection on character joining groups in Section 9.2, Arabic.
Table 9-14 exemplifies dual-joining Syriac characters and illustrates the forms taken by the
letter skeletons in context. This table and the subsequent table use the Serto (West Syriac)
font style, whereas the Unicode code charts are in the Estrangela font style.
The dual-joining Syriac joining groups exemplified in Table 9-14 include yudh, kaph, khaph (Sogdian), lamadh, mim, nun, semkath, final_semkath, e, pe, reversed_pe, fe (Sogdian), qaph, and shin.
In addition to the skeleton patterns shown in Table 9-14, six of the Garshuni characters
encoded in the Syriac Supplement block (U+0860, U+0862..U+0865, U+0868) are also
dual-joining, and have their own joining group values. U+0868 syriac letter malayalam
lla, in particular, normally connects only to the right, but occasionally occurs connected
on both sides. That letter is given the dual-joining property value. For instances when a
right-joining lla occurs in a manuscript, it may be represented with the sequence <0868,
ZWNJ>.
Table 9-15 exemplifies right-joining Syriac characters, illustrating the forms they take in
context. Right-joining characters have only two distinct forms, for isolated and final con-
texts, respectively.
The right-joining Syriac joining groups exemplified in Table 9-15 include he, syriac_waw, zain, zhain (Sogdian), yudh_he, sadhe, and taw.
In addition to the skeleton patterns shown in Table 9-15, three of the Garshuni characters
encoded in the Syriac Supplement block (U+0867, U+0869, U+086A) are also right-join-
ing, and have their own joining group values.
U+0710 syriac letter alaph has the Joining_Group = Alaph and is a right-joining char-
acter. However, as specified above in rules S1, S2, and S3, its glyph is subject to additional
contextual shaping. Table 9-16 illustrates all of the glyph forms for alaph in each of the
three major Syriac type styles.
Ligature Classes. As in other scripts, ligatures in Syriac vary depending on the font style.
Table 9-17 identifies the principal valid ligatures for each font style. When applicable, these
ligatures are obligatory, unless denoted with an asterisk (*).
9.4 Samaritan
Samaritan: U+0800–U+083F
The Samaritan script is used today by small Samaritan communities in Israel and the Pal-
estinian Territories to write the Samaritan Hebrew and Samaritan Aramaic languages, pri-
marily for religious purposes. The Samaritan religion is related to an early form of Judaism,
but the Samaritans did not leave Palestine during the Babylonian exile, so the script
evolved from the linear Old Hebrew script, most likely directly descended from Phoenician
(see Section 10.3, Phoenician). In contrast, the more recent square Hebrew script associated
with Judaism derives from the Imperial Aramaic script (see Section 10.4, Imperial Aramaic)
used widely in the region during and after the Babylonian exile, and thus well-known to
educated Hebrew speakers of that time.
Like the Phoenician and Hebrew scripts, Samaritan has 22 consonant letters. The conso-
nant letters do not form ligatures, nor do they have explicit final forms as some Hebrew
consonants do.
Directionality. The Samaritan script is written from right to left. Conformant implementa-
tions of Samaritan script must use the Unicode Bidirectional Algorithm. For more infor-
mation, see Unicode Standard Annex #9, “Unicode Bidirectional Algorithm.”
Vowel Signs. Vowel signs are optional in Samaritan, just as points are optional in Hebrew.
Combining marks are used for vowels that follow a consonant, and are rendered above and
to the left of the base consonant. With the exception of o and short a, vowels may have up to
three lengths (normal, long, and overlong), which are distinguished by the size of the cor-
responding vowel sign. Sukun is centered above the corresponding base consonant and
indicates that no vowel follows the consonant.
Two vowels, i and short a, may occur in a word-initial position preceding any consonant.
In this case, the separate spacing versions U+0828 samaritan modifier letter i and
U+0824 samaritan modifier letter short a should be used instead of the normal com-
bining marks.
When U+0824 samaritan modifier letter short a follows a letter used numerically, it
indicates thousands, similar to the use of U+05F3 hebrew punctuation geresh for the
same purpose in Hebrew.
Consonant Modifiers. The two marks, U+0816 samaritan mark in and U+0817 samari-
tan mark in-alaf, are used to indicate a voiced pharyngeal fricative [ʕ]. These occur
immediately following their base consonant and preceding any vowel signs, and are ren-
dered above and to the right of the base consonant.
U+0818 samaritan mark occlusion “strengthens” the consonant, for example changing
/w/ to /b/. U+0819 samaritan mark dagesh indicates consonant gemination. The occlu-
sion and dagesh marks may both be applied to the same consonant, in which case the occlu-
sion mark should precede the dagesh in logical order, and the dagesh is rendered above the
occlusion mark. The occlusion mark is also used to designate personal names to distinguish
them from homographs.
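The required ordering can be constructed or checked programmatically. The sketch below uses U+0800 samaritan letter alaf merely as an example base consonant.

import unicodedata

# Sketch: applying both the occlusion mark and the dagesh to one consonant.
# The occlusion mark (U+0818) precedes the dagesh (U+0819) in logical order;
# U+0800 SAMARITAN LETTER ALAF is used here only as an illustrative base.
BASE      = "\u0800"   # samaritan letter alaf (illustrative choice)
OCCLUSION = "\u0818"   # samaritan mark occlusion
DAGESH    = "\u0819"   # samaritan mark dagesh

marked_consonant = BASE + OCCLUSION + DAGESH   # occlusion first, then dagesh

# Both marks are nonspacing combining characters (General_Category Mn).
assert unicodedata.category(OCCLUSION) == "Mn"
assert unicodedata.category(DAGESH) == "Mn"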
Epenthetic yut represents a kind of glide-vowel which interacts with another vowel. It was
originally used only with the consonants alaf, iy, it, and in, in combination with a vowel
sign. The combining U+081B samaritan mark epenthetic yut should be used for this
purpose. A later development allows epenthetic yut to occur without being fixed to one of the
four consonants listed above; in that usage the mark behaves as a spacing character capable
of bearing its own diacritical mark, and U+081A samaritan modifier letter epenthetic
yut should be used instead to represent it.
Punctuation. Samaritan uses a large number of punctuation characters. U+0830 samari-
tan punctuation nequdaa and U+0831 samaritan punctuation afsaaq (“interrup-
tion”) are similar to the Hebrew sof pasuq and were originally used to separate sentences,
and later to mark lesser breaks within a sentence. They have also been described respec-
tively as “semicolon” and “pause.” Samaritan also uses a smaller dot as a word separator,
which can be represented by U+2E31 word separator middle dot. U+083D samaritan
punctuation sof mashfaat is equivalent to the full stop. U+0832 samaritan punctua-
tion anged (“restraint”) indicates a break somewhat less strong than an afsaaq. U+083E
samaritan punctuation annaau (“rest”) is stronger than the afsaaq and indicates that a
longer time has passed between actions narrated in the sentences it separates.
U+0839 samaritan punctuation qitsa is similar to the annaau but is used more fre-
quently. The qitsa marks the end of a section, and may be followed by a blank line to fur-
ther make the point. It has many glyph variants. One important variant, U+0837
samaritan punctuation melodic qitsa, differs significantly from any of the others, and
indicates the end of a sentence “which one should read melodically.”
Many of the punctuation characters are used in combination with each other, for example:
afsaaq + nequdaa or nequdaa + afsaaq, qitsa + nequdaa, and so on.
U+0836 samaritan abbreviation mark follows an abbreviation. U+082D samaritan
mark nequdaa is an editorial mark which indicates that there is a variant reading of the
word.
Other Samaritan punctuation characters mark some prosodic or performative attributes of
the text preceding them, as summarized in Table 9-18.
9.5 Mandaic
Mandaic: U+0840–U+085F
The origins of the Mandaic script are unclear, but it is thought to have evolved between the
second and seventh century ce from a cursivized form of the Aramaic script (as did the Syr-
iac script) or from the Parthian chancery script. It was developed by adherents of the Man-
daean gnostic religion of southern Mesopotamia to write the dialect of Eastern Aramaic
they used for liturgical purposes, which is referred to as Classical Mandaic.
The religion has survived into modern times, with more than 50,000 Mandaeans in several
communities worldwide (most having left what is now Iraq). In addition to the Classical
Mandaic still used within some of these communities, a variety known as Neo-Mandaic or
Modern Mandaic has developed and is spoken by a small number of people. Mandaeans
consider their script sacred, with each letter having specific mystic properties, and the
script has changed very little over time.
Letter It. The character U+0847 mandaic letter it is a pharyngeal, pronounced [hu].
It can appear at the end of personal names or at the end of words to indicate the third per-
son singular suffix.
Structure. Mandaic is unusual among Semitic scripts in being a true alphabet; the letters
halqa, ushenna, aksa, and in are used to write both long and short forms of vowels, instead
of functioning as consonants also used to write long vowels (matres lectionis), in the man-
ner characteristic of other Semitic scripts. This is possible because some consonant sounds
represented by the corresponding letters in other Semitic scripts are not used in the Man-
daic language.
The character U+0856 mandaic letter dushenna, also called adu, has a morphemic
function. It is used to write the relative pronoun and the genitive exponent di. Dushenna is
a digraph derived from an old ligature for ad + aksa. It is thus an addition to the usual
Semitic set of 22 characters. The Mandaic alphabet is traditionally represented as the 23
letters halqa through dushenna, with halqa appended again at the end to form a symboli-
cally-important cycle of 24 letters.
Two additional Mandaic characters are encoded in the Unicode Standard: U+0857 man-
daic letter kad is derived from an old ligature of ak + dushenna; it is a digraph used to
write the word kd, which means “when, as, like”. The second additional character, U+0858
mandaic letter ain, is a borrowing from U+0639 arabic letter ain.
Three diacritical marks are used in teaching materials to differentiate vowel quality; they
may be omitted from ordinary text. U+0859 mandaic affrication mark is used to extend
the character set for foreign sounds (whether affrication, lenition, or another sound).
U+085A mandaic vocalization mark is used to distinguish vowel quality of halqa, ush-
enna, and aksa. U+085B mandaic gemination mark is used to indicate what native writ-
ers call a “hard” pronunciation.
Chapter 10
Middle East-II
Ancient Scripts
This chapter covers a number of ancient scripts of the Middle East. All of these scripts were
written right to left.
Old North Arabian and Old South Arabian are two branches of the South Semitic script
family used in and around Arabia from about the tenth century bce to the sixth century ce.
The Old South Arabian script was used around the southwestern part of the Arabian pen-
insula for 1,200 years beginning around the 8th century bce. Carried westward, it was
adapted for writing the Ge’ez language, and evolved into the root of the modern Ethiopic
script.
The Phoenician alphabet was used in various forms around the Mediterranean. It is ances-
tral to Latin, Greek, Hebrew, and many other scripts—both modern and historical.
The Imperial Aramaic script evolved from Phoenician and was the source of many other
scripts, such as the square Hebrew and the Arabic script. Imperial Aramaic was used to
write the Aramaic language beginning in the eighth century bce, and was the principal
administrative language of the Assyrian empire and then the official language of the Achae-
menid Persian empire. Inscriptional Parthian, Inscriptional Pahlavi, and Avestan are also
derived from Imperial Aramaic, and were used to write various Middle Persian languages.
Psalter Pahlavi is a cursive alphabetic script used to write the Middle Persian language
during the 6th or 7th century ce. It is a historically conservative variety of Pahlavi used by
Christians in the Neo-Persian empire.
The Manichaean script is a cursive alphabetic script related to Syriac, as well as Palmyrene
Aramaic. The script was used by those practicing the Manichaean religion, which was
founded during the third century ce in Babylonia, and spread widely over the next four
centuries before later vanishing.
The Elymaic script was used to write Achaemenid Aramaic in the state of Elymais, which
flourished from the second century bce to the early third century ce and was located in the
southwestern portion of modern-day Iran. Elymaic derives from the Aramaic script and is
closely related to Parthian and Mandaic.
The Nabataean script developed from the Aramaic script and was used to write the lan-
guage of the Nabataean kingdom. The script was in wide use from the second century bce
to the fourth century ce. It is generally considered the precursor of the Arabic script.
The Palmyrene script was derived from the customary forms of Aramaic developed during
the Achaemenid empire. The script was used for writing the Palmyrene dialect of West Ara-
maic, and is known from inscriptions and documents found mainly in the city of Palmyra
and other cities in the region of Syria, dating from 44 bce to about 280 ce.
The Hatran script belongs to the North Mesopotamian branch of the Aramaic scripts, and
was used for writing a dialect of the Aramaic language. The script is known from inscrip-
tions discovered in the ancient city of Hatra, in present-day Iraq, dating from 98–97 bce
until circa 241 ce.
Numbers are built up through juxtaposition of these characters in a manner similar to that
of Roman numerals, as shown in Table 10-2. When 10, 50, or 100 occur preceding 1000
they serve to indicate multiples of 1000. The example numbers shown in Table 10-2 are
rendered in a right-to-left direction in the last column.
Character Names. Character names are based on those of corresponding letters in north-
west Semitic.
10.3 Phoenician
Phoenician: U+10900–U+1091F
The Phoenician alphabet and its successors were widely used over a broad area surround-
ing the Mediterranean Sea. Phoenician evolved over the period from about the twelfth cen-
tury bce until the second century bce, with the last neo-Punic inscriptions dating from
about the third century ce. Phoenician came into its own from the ninth century bce. An
older form of the Phoenician alphabet is a forerunner of the Greek, Old Italic (Etruscan),
Latin, Hebrew, Arabic, and Syriac scripts among others, many of which are still in modern
use. It has also been suggested that Phoenician is the ultimate source of Kharoshthi and of
the Indic scripts descending from Brahmi.
Phoenician is an historic script, and as for many other historic scripts, which often saw
continuous change in use over periods of hundreds or thousands of years, its delineation as
a script is somewhat problematic. This issue is particularly acute for historic Semitic
scripts, which share basically identical repertoires of letters, which are historically related
to each other, and which were used to write closely related Semitic languages.
In the Unicode Standard, the Phoenician script is intended for the representation of text in
Paleo-Hebrew, Archaic Phoenician, Phoenician, Early Aramaic, Late Phoenician cursive,
Phoenician papyri, Siloam Hebrew, Hebrew seals, Ammonite, Moabite, and Punic. The
line from Phoenician to Punic is taken to constitute a single continuous branch of script
evolution, distinct from that of other related but separately encoded Semitic scripts.
The earliest Hebrew language texts were written in the Paleo-Hebrew alphabet, one of the
forms of writing considered to be encompassed within the Phoenician script as encoded in
the Unicode Standard. The Samaritans who did not go into exile continued to use Paleo-
Hebrew forms, eventually developing them into the distinct Samaritan script. (See
Section 9.4, Samaritan.) The Jews in exile gave up the Paleo-Hebrew alphabet and instead
adopted Imperial Aramaic writing, which was a descendant of the Early Aramaic form of
the Phoenician script. (See Section 10.4, Imperial Aramaic.) Later, they transformed Impe-
rial Aramaic into the “Jewish Aramaic” script now called (Square) Hebrew, separately
encoded in the Hebrew block in the Unicode Standard. (See Section 9.1, Hebrew.)
Some scholars conceive of the language written in the Paleo-Hebrew form of the Phoeni-
cian script as being quintessentially Hebrew and consistently transliterate it into Square
Hebrew. In such contexts, Paleo-Hebrew texts are often considered to simply be Hebrew,
and because the relationship between the Paleo-Hebrew letters and Square Hebrew letters
is one-to-one and quite regular, the transliteration is conceived of as simply a font change.
Other scholars of Phoenician transliterate texts into Latin. The encoding of the Phoenician
script in the Unicode Standard does not invalidate such scholarly practice; it is simply
intended to make it possible to represent Phoenician, Punic, and similar textual materials
directly in the historic script, rather than as specialized font displays of transliterations in
modern Square Hebrew.
Directionality. Phoenician is written horizontally from right to left. The characters of the
Phoenician script are all given strong right-to-left directionality.
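This property is visible through standard character-property APIs; for example, in Python (U+10900 phoenician letter alf is used as the sample letter):

import unicodedata

# Phoenician letters carry the strong right-to-left Bidi_Class value "R".
alf = "\U00010900"
print(unicodedata.name(alf))            # PHOENICIAN LETTER ALF
print(unicodedata.bidirectional(alf))   # 'R' (strong right-to-left)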
Punctuation. Inscriptions and other texts in the various forms of the Phoenician script
generally have no space between words. Dots are sometimes found between words in later
exemplars—for example, in Moabite inscriptions—and U+1091F phoenician word sep-
arator should be used to represent this punctuation. The appearance for this word sep-
arator is somewhat variable; in some instances it may appear as a short vertical bar, instead
of a rounded dot.
Stylistic Variation. The letters for Phoenician proper and especially for Punic have very
exaggerated descenders. These descenders help distinguish the main line of Phoenician
script evolution toward Punic, as contrasted with the Hebrew forms, where the descenders
instead grew shorter over time.
Numerals. Phoenician numerals are built up from six elements used in combination.
These include elements for one, two, and three, and then separate elements for ten, twenty,
and one hundred. Numerals are constructed essentially as tallies, by repetition of the vari-
ous elements. The numbers for two and three are graphically composed of multiples of the
tally mark for one, but because in practice the values for two or three are clumped together
in display as entities separate from one another they are encoded as individual characters.
This same structure for numerals can be seen in some other historic scripts ultimately
descendant from Phoenician, such as Imperial Aramaic and Inscriptional Parthian.
Like the letters, Phoenician numbers are written from right to left: OOOPPQ means 143 (100 +
20 + 20 + 3). This practice differs from modern Semitic scripts like Hebrew and Arabic,
which use decimal numbers written from left to right.
Character Names. The names used for the characters here are those reconstructed by The-
odor Nöldeke in 1904, as given in Powell (1996).
10.4 Imperial Aramaic
Imperial Aramaic: U+10840–U+1085F
Numerals. Values in the range 1–99 are represented by a string of characters whose values are in the
range 1–20; the numeric value of the string is the sum of the numeric values of the charac-
ters. The string is written using the minimum number of characters, with the most signifi-
cant values first. For example, 55 is represented as 20 + 20 + 10 + 3 + 2. Characters for 100,
1000, and 10000 are prefixed with a multiplier represented by a string whose value is in the
range 1–9. The Inscriptional Parthian, Inscriptional Pahlavi, Nabataean, Palmyrene, and
Hatran scripts use a similar system for forming numeric values.
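The additive structure described above can be sketched as a simple decomposition into element values. The function below is illustrative only; it does not map values to the Imperial Aramaic numeral characters themselves.

def aramaic_number_elements(n: int):
    """Illustrative decomposition of n (1..99999) into the additive element
    values used by Imperial Aramaic numerals: 1, 2, 3, 10, 20, plus the rank
    signs 100, 1000, and 10000, each preceded by a 1..9 multiplier.  Values
    are listed most significant first; mapping to the actual numeral
    characters is left to the code chart."""
    def additive(m):                     # build 1..99, or a 1..9 multiplier
        out = []
        for v in (20, 10, 3, 2, 1):
            while m >= v:
                out.append(v)
                m -= v
        return out

    elements = []
    for rank in (10000, 1000, 100):
        mult, n = divmod(n, rank)
        if mult:
            elements += additive(mult) + [rank]   # multiplier precedes the rank sign
    elements += additive(n)
    return elements

assert aramaic_number_elements(55) == [20, 20, 10, 3, 2]   # example from the text
assert aramaic_number_elements(300) == [3, 100]            # 3 x 100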
10.5 Manichaean
Manichaean: U+10AC0–U+10AFF
The Manichaean religion was founded during the third century ce in Babylonia, then part
of the Sassanid Persian empire. It spread widely over the next four centuries, as far west as
north Africa and as far east as China, but had mostly vanished by the fourteenth century.
From 762 until around 1000 it was a state religion in the Uyghur kingdom.
The Manichaean script was used by adherents of Manichaeism, and was based on or influ-
enced by the Estrangela form of Syriac, as well as Palmyrene Aramaic. It is said to have
been invented by Mani, but may be older. Because of the wide spread of Manichaeism and
Mani’s decision to spread his teachings in any language available, the Manichaean script
was used to write a variety of languages with some variation in character repertoire: the
Iranian languages Middle and Early Modern Persian, Parthian, Sogdian, and Bactrian, as
well as the Turkic language Uyghur and, to a lesser extent, the Indo-European language
Tocharian.
Directionality. The Manichaean script is written from right to left. Conformant imple-
mentations of Manichaean script must use the Unicode Bidirectional Algorithm (see Uni-
code Standard Annex #9, “Unicode Bidirectional Algorithm”).
Structure. Manichaean is alphabetic, written with spaces between words. The alphabet
includes 24 base letters, two more than Aramaic. There are a total of 36 letters. Ten of these
are formed by adding one or two dots above the base letter to represent a spirant or other
modified sound. There is also a sign representing the conjunction ud.
In addition, two diacritical marks are used to indicate abbreviations, elisions, or plural
forms. Manichaean text paid careful attention to the layout of characters, often stretching
or shrinking letters, using abbreviations, or eliminating vowels (indicated with elision
dots) to achieve desired line widths and to avoid breaking words across lines. Sogdian writ-
ten in Manichaean script also sometimes shows the use of doubled vowels to fill out a line.
To graphically extend a word, U+0640 arabic tatweel may be used.
Shaping. Manichaean has shaping rules and rendering requirements that are similar to
those for Syriac and Arabic, with joining forms as shown in Table 10-4, Table 10-5,
Table 10-6 and Table 10-7. In these tables, Xn, Xr, Xm, and Xl designate the isolated, final,
medial, and initial forms respectively. The dotted letters are not shown separately, because
their joining behavior is the same as the corresponding un-dotted letter. Note that Man-
ichaean has two letters with the rare Joining_Type of Left_Joining.
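The joining types shown in these tables are machine-readable in the Unicode Character Database. The following sketch assumes a local copy of the UCD data file ArabicShaping.txt and simply filters its data lines; it is illustrative rather than normative.

# Sketch: list Manichaean letters whose Joining_Type is L (Left_Joining),
# using a local copy of the UCD file ArabicShaping.txt.  Each data line has
# the form:  code; schematic name; joining type; joining group
def left_joining_manichaean(path="ArabicShaping.txt"):
    results = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.split("#", 1)[0].strip()   # drop comments
            if not line:
                continue
            fields = [x.strip() for x in line.split(";")]
            if len(fields) != 4:
                continue
            code, name, joining_type, joining_group = fields
            if "MANICHAEAN" in name and joining_type == "L":
                results.append((code, joining_group))
    return results

# print(left_joining_manichaean())   # expected: the two Left_Joining letters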
Five Manichaean letters—daleth, he, mem, nun, resh—have alternate forms whose occur-
rence cannot be predicted from context, although the alternate forms tend to occur most
often at the end of lines. These forms are represented using standardized variation
sequences and are shown in the tables that follow.
Table 10-4 lists the dual-joining Manichaean letters. In this and the following tables, the
standardized variation sequences are indicated in the joining group column in separate
rows showing the relevant joining group plus the variation selector.
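In plain text an alternate form is requested by appending a variation selector to the letter. The sketch below assumes that the sequences use U+FE00 variation selector-1; the authoritative list is the UCD file StandardizedVariants.txt, which should be consulted for the exact pairs.

# Sketch: requesting the registered alternate form of a Manichaean letter
# with a standardized variation sequence.  The exact set of valid sequences
# is defined in StandardizedVariants.txt; U+FE00 VARIATION SELECTOR-1 is
# assumed here.
VS1 = "\uFE00"

def alternate_form(letter: str) -> str:
    """Append VS1 to select the registered alternate glyph, if any."""
    return letter + VS1

# Fonts that do not support the sequence simply display the default glyph.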
Manichaean has two obligatory ligatures for sadhe followed by yodh or nun. These are
shown in Table 10-8.
Punctuation. U+10AF3 manichaean punctuation dot within dot is used to indicate smaller units of text in a prose text or
the end of a half-verse in a verse text. U+10AF4 manichaean punctuation dot is used to
indicate sub-units of text, logical parts of a sentence or units in a list.
10.7 Avestan
Avestan: U+10B00–U+10B3F
The Avestan script was created around the fifth century ce to record the canon of the
Avesta, the principal collection of Zoroastrian religious texts. The Avesta had been trans-
mitted orally in the Avestan language, which was by then extinct except for liturgical pur-
poses. The Avestan script was also used to write the Middle Persian language, which is
called Pazand when written in Avestan script. The Avestan script was derived from Book
Pahlavi, but provided improved phonetic representation by adding consonants and a com-
plete set of vowels—the latter probably due to the influence of the Greek script. It is an
alphabetic script of 54 letters, including one that is used only for Pazand.
Directionality. The Avestan script is written from right to left. Conformant implementa-
tions of Avestan script must use the Unicode Bidirectional Algorithm. For more informa-
tion, see Unicode Standard Annex #9, “Unicode Bidirectional Algorithm”.
Shaping Behavior. Four ligatures are commonly used in manuscripts of the Avesta, as
shown in Table 10-10. U+200C zero width non-joiner can be used to prevent ligature
formation.
Punctuation. Archaic Avestan texts use a dot to separate words. The texts generally use a
more complex grouping of dots or other marks to indicate boundaries between larger units
such as clauses and sentences, but this is not systematic. In contemporary critical editions
of Avestan texts, some scholars have systematized and differentiated the usage of various
Avestan punctuation marks. The most notable example is Karl F. Geldner’s 1880 edition of
the Avesta.
The Unicode Standard encodes a set of Avestan punctuation marks based on the system
established by Geldner. U+10B3A tiny two dots over one dot punctuation functions
as an Avestan colon, U+10B3B small two dots over one dot punctuation as an Aves-
tan semicolon, and U+10B3C large two dots over one dot punctuation as an Avestan
end of sentence mark; these indicate breaks of increasing finality. U+10B3E large two
rings over one ring punctuation functions as an Avestan end of section, and may be
doubled (sometimes with a space between) for extra finality. U+10B39 avestan abbrevia-
tion mark is used to mark abbreviation and repetition. U+10B3D large one dot over
two dots punctuation and U+10B3F large one ring over two rings punctuation
are found in Avestan texts, but are not used by Geldner.
Minimal representation of Avestan requires two separators: one to separate words and a
second mark used to delimit larger units, such as clauses or sentences. Contemporary edi-
tions of Avestan texts show the word separator dot in a variety of vertical positions: it may
appear in a midline position or on the baseline. Dots such as U+2E31 word separator
middle dot, U+00B7 middle dot, or U+002E full stop can be used to represent this.
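An implementation that only needs word segmentation may treat these dots as equivalent separators, as in the following minimal sketch.

import re

# Sketch: split Avestan text on any of the dots that editions use as the
# word separator (U+2E31, U+00B7, or U+002E), treating them equivalently.
WORD_SEPARATORS = "\u2E31\u00B7\u002E"

def split_avestan_words(text: str) -> list:
    return [w for w in re.split("[" + re.escape(WORD_SEPARATORS) + "]", text) if w]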
10.8 Elymaic
Elymaic: U+10FE0–U+10FFF
The Elymaic script, also called “Elymaean,” was used to write Achaemenid Aramaic in the
ancient state of Elymais, which flourished from the second century bce to the early third
century ce and was located in the southwestern portion of modern-day Iran. Elymaic
derives from the Aramaic script and is closely related to Parthian and Mandaic. The script
is found on inscriptions and coins.
Directionality. The Elymaic script is written from right to left. Conformant implementa-
tions of the Elymaic script must use the Unicode Bidirectional Algorithm. For more infor-
mation, see Unicode Standard Annex #9, “Unicode Bidirectional Algorithm.”
Structure. Elymaic is encoded as a non-joining abjad. Although some sources show adja-
cent letters connecting or overlapping, the overall script does not contain intrinsic cursive
behavior. However, Elymaic includes one ligature: U+10FF6 elymaic ligature zayin-
yodh.
Character Names and Glyphs. The Elymaic character names are based on those for Impe-
rial Aramaic because the native names for the characters are unknown. The representative
glyphs in the code charts are based on the stone inscriptions at Tang-e Sarvak in southwest
Iran.
Punctuation. There is no script-specific punctuation for Elymaic. Although word bound-
aries are not generally indicated, some inscriptions have spaces between words. Modern
editors tend to use U+0020 space for word separation.
Numerals. There are no known script-specific numerals.
10.9 Nabataean
Nabataean: U+10880–U+108AF
The Nabataean script developed from the Aramaic script and was used to write the lan-
guage of the Nabataean kingdom. The script was in wide use from the second century bce
to the fourth century ce, well after the Roman province of Arabia Petraea was formed.
Nabataean is generally considered to be the precursor of the Arabic script. The Namara
inscription, dating from the fourth century ce and believed to be one of the oldest Arabic
texts, was written in the Nabataean script.
The glyphs of the Nabataean script are more ornate than those of other scripts derived
from Aramaic, and flourishes can be found in some inscriptions. As the script evolved, a
range of ligatures was introduced. Because their usage is irregular, no joining behavior is
specified for Nabataean.
Structure. The Nabataean script consists of 22 consonants. Nine consonants have final
forms and are treated similarly to the final letters of the Hebrew script. The final forms are
encoded separately because their occurrence in text is not predictable. For more informa-
tion about the use of distinctly encoded final consonants in Semitic scripts, see Section 9.1,
Hebrew.
Directionality. Both words and numbers in the Nabataean script are written from right to
left in horizontal lines. Conformant implementations of the script must use the Unicode
Bidirectional Algorithm. For more information on bidirectional layout, see Unicode Stan-
dard Annex #9, “Unicode Bidirectional Algorithm.”
Numerals. Nabataean has script-specific numeral characters, with strong right-to-left
directionality. Nabataean numbers are built up using sequences of characters for 1, 2, 3, 4,
5, 10, 20, and 100 in a manner similar to the way numbers are built up for Imperial Ara-
maic, which is shown in Table 10-3. A cruciform variant of the numeral 4 is encoded sepa-
rately at U+108AB.
Punctuation. There is no script-specific punctuation in Nabataean. The inscriptions usu-
ally have no space between words, but modern editors tend to use U+0020 space for word
separation.
10.10 Palmyrene
Palmyrene: U+10860–U+1087F
The Palmyrene script was derived by modification of the customary forms of Aramaic
developed during the Achaemenid empire. The script was used for writing the Palmyrene
dialect of West Aramaic, and is known from inscriptions and documents found mainly in
the city of Palmyra and other cities in the region of Syria, dating from 44 bce to about 280
ce.
Palmyrene has both a monumental and a cursive form. Earlier inscriptions show more
rounded forms, while later inscriptions tend to regularize the letterforms. Most pre-Uni-
code fonts for Palmyrene have followed the monumental style. Ligatures exist in both
forms of the script, but are not used consistently.
At a certain point, some Palmyrene letterforms became confused and a distinguishing dia-
critical dot was introduced, although not regularly or systematically, as seen in the glyphic
variation of consonants daleth and resh across the various styles of the script. Sometimes
the two glyphs appear with different skeletons, which is sufficient to distinguish them;
sometimes they have the same skeleton and are differentiated by a dot; and sometimes they
appear with the same skeleton and no dot, in which case they are indistinguishable. In the
Unicode code charts, a dot distinguishes the daleth and resh glyphs.
Structure. The Palmyrene script consists of 22 consonants. The consonant nun has a final
form variant, encoded as a separate character, U+1086D palmyrene letter final nun,
and used similarly to the counterpart Hebrew consonant. For information about the use of
distinctly encoded final consonants in Semitic scripts, see Section 9.1, Hebrew.
Directionality. Both words and numbers in the Palmyrene script are written from right to
left in horizontal lines. Conformant implementations of the script must use the Unicode
Bidirectional Algorithm. For more information on bidirectional layout, see Unicode Stan-
dard Annex #9, “Unicode Bidirectional Algorithm.”
Numerals. Palmyrene has script-specific numeral characters, with strong right-to-left
directionality. Palmyrene numbers are built up using sequences of characters for 1, 2, 3, 4,
5, 10, 20, and 100 in a manner similar to the way numbers are built up for Imperial Ara-
maic, which is shown in Table 10-3. The glyphs for the numerals 10 and 100, which had
been distinct in Aramaic, coalesced into the same glyph in Palmyrene. The two numerals
are generally distinguished by their position in sequences representing numbers rather
than their shape. A single character is encoded at U+1087E palmyrene number ten and
should be used for both numerals.
Symbols. Two symbols are encoded at U+10877 palmyrene left-pointing fleuron and
U+10878 palmyrene right-pointing fleuron. They usually appear next to numbers.
Punctuation. There is no script-specific punctuation in Palmyrene. The inscriptions usu-
ally have no space between words, but modern editors tend to use U+0020 space for word
separation.
10.11 Hatran
Hatran: U+108E0–U+108FF
The Hatran abjad belongs to the North Mesopotamian branch of the Aramaic scripts, and
was used for writing a dialect of the Aramaic language. Hatran writing was discovered in
the ancient city of Hatra in present-day Iraq. The inscriptions found there date from 98–97
bce until circa 241 ce, when the city of Hatra was destroyed. Many of the known texts in
Hatran are graffiti, but there are some longer texts.
Structure. The Hatran script consists of 22 consonants, encoded as 21 characters. The con-
sonants daleth and resh are indistinguishable by shape and are encoded as a single charac-
ter, U+108E3 hatran letter daleth-resh. Ligatures can occur—for example, the letter
beth often joins or touches the letter following it—but are not used consistently.
Directionality. Both words and numbers in the Hatran script are written from right to left
in horizontal lines. Conformant implementations of the script must use the Unicode Bidi-
rectional Algorithm. For more information on bidirectional layout, see Unicode Standard
Annex #9, “Unicode Bidirectional Algorithm.”
Numerals. Hatran has script-specific characters for numerals, with strong right-to-left
directionality. Hatran numbers are built up using sequences of characters for 1, 5, 10, 20,
and 100 in a manner similar to the way numbers are built up for Imperial Aramaic, which
is shown in Table 10-3. The numbers 2, 3, and 4 are formed from sequences of repeated
characters for the numeral 1, and are not separately encoded.
Punctuation. There is no script-specific punctuation encoded for Hatran. The inscriptions
sometimes have spaces between words; modern editors tend to insert U+0020 space for
word separation even if there were no spaces in the original text.
Chapter 11
Cuneiform and Hieroglyphs
Three ancient cuneiform scripts are described in this chapter: Ugaritic, Old Persian, and
Sumero-Akkadian. The largest and oldest of these is Sumero-Akkadian. The other two
scripts are not derived directly from the Sumero-Akkadian tradition but had common writ-
ing technology, consisting of wedges indented into clay tablets with reed styluses. Ugaritic
texts are about as old as the earliest extant Biblical texts. Old Persian texts are newer, dating
from the fifth century bce.
Egyptian Hieroglyphs were used for more than 3,000 years from the end of the fourth mil-
lennium bce.
Meroitic hieroglyphs and Meroitic cursive were used from around the second century bce
to the fourth century ce to write the Meroitic language of the Nile valley kingdom known
as Kush or Meroë. Meroitic cursive was for general use, and its appearance was based on
Egyptian demotic. Meroitic hieroglyphs were used for inscriptions, and their appearance
was based on Egyptian hieroglyphs.
Anatolian Hieroglyphs date to the second and first millennia bce, and were used to write
the Luwian language, an Indo-European language, in the area of present-day Turkey and
environs.
11.1 Sumero-Akkadian
Cuneiform: U+12000–U+123FF
Sumero-Akkadian Cuneiform is a logographic writing system with a strong syllabic com-
ponent. It was written from left to right on clay tablets.
Early History of Cuneiform. The earliest stage of Mesopotamian Cuneiform as a complete
system of writing is first attested in Uruk during the so-called Uruk IV period (circa 3500–
3200 bce) with an initial repertoire of about 700 characters or “signs” as Cuneiform schol-
ars customarily call them.
Late fourth millennium ideographic tablets were also found at Susa and several other sites
in western Iran, in Assyria at Nineveh (northern Iraq), at Tell Brak (northwestern Syria),
and at Habuba Kabira in Syria. The writing system developed in Sumer (southeastern Iraq)
was repeatedly exported to peripheral regions in the third, second, and first millennia bce.
Local variations in usage are attested, but the core of the system is the Sumero-Akkadian
writing system.
Writing emerged in Sumer simultaneously with a sudden growth in urbanization and an
attendant increase in the scope and scale of administrative needs. A large proportion of the
elements of the early writing system repertoire was devised to represent quantities and
commodities for bureaucratic purposes.
At this earliest stage, signs were mainly pictographic, in that a relatively faithful facsimile of
the thing signified was traced, although some items were strictly ideographic and repre-
sented by completely arbitrary abstractions, such as the symbol for sheep. Some scholars
believe that the abstract symbols were derived from an earlier “token” system of account-
ing, but there is no general agreement on this point. Where the pictographs are concerned,
interpretation was relatively straightforward. The head of a bull was used to denote “cat-
tle”; an ear of barley was used to denote “barley.” In some cases, pictographs were also
interpreted logographically, so that meaning was derived from the symbol by close concep-
tual association. For example, the representation of a bowl might mean “bowl,” but it could
indicate concepts associated with bowls, such as “food.” Renditions of a leg might variously
suggest “leg,” “stand,” or “walk.”
By the next chronological period of south Mesopotamian history (the Uruk III period,
3200–2900 bce), logographic usage seems to have become much more widespread. In
addition, individual signs were combined into more complex designs to express other con-
cepts. For example, a head with a bowl next to it was used to denote “eat” or “drink.” This
is the point during script development at which one can truly speak of the first Sumerian
texts. In due course, the early graphs underwent change, conditioned by factors such as the
most widely available writing medium and writing tools, and the need to record informa-
tion more quickly and efficiently from the standpoint of the bureaucracy that spawned the
system.
Clay was the obvious writing medium in Sumer because it was widely available and easily
molded into cushion- or pillow-shaped tablets. Writing utensils were easily made for it by
sharpening pieces of reed. Because it was awkward and slow to inscribe curvilinear lines in
a piece of clay with a sharpened reed (called a stylus), scribes tended to approximate the
pictographs by means of short, wedge-shaped impressions made with the edge of the sty-
lus. These short, mainly straight shapes gave rise to the modern word “cuneiform” from the
Latin cuneus, meaning “wedge.” Cuneiform proper was common from about 2700 bce,
although experts use the term “cuneiform” to include the earlier forms as well.
Geographic Range. The Sumerians did not live in complete isolation, and there is very
early evidence of another significant linguistic group in the area immediately north of
Sumer known as Agade or Akkad. Those peoples spoke a Semitic language whose dialects
are subsumed by scholars under the heading “Akkadian.” In the long run, the Akkadian
speakers became the primary users and promulgators of Cuneiform script. Because of their
trade involvement with their neighbors, Cuneiform spread through Babylonia (the
umbrella term for Sumer and Akkad) to Elam, Assyria, eastern Syria, southern Anatolia,
and even Egypt. Ultimately, many languages came to be written in Cuneiform script, the
most notable being Sumerian, Akkadian (including Babylonian, Assyrian, Eblaite),
Elamite, Hittite, and Hurrian.
Periods of script usage are defined according to geography and primary linguistic repre-
sentation, as shown in Table 11-1.
Sources and Coverage. The base character repertoire for the Cuneiform block was distilled
from the list of Ur III signs compiled by the Cuneiform Digital Library Initiative (UCLA)
in union with the list constructed independently by Miguel Civil. This repertoire is com-
prehensive from the Ur III period onward. Old Akkadian and Archaic Cuneiform are not
covered by the repertoire in this block. Signs specific to the Early Dynastic period are
encoded separately in the Early Dynastic Cuneiform block.
Simple Signs. Most Cuneiform signs are simple units; each sign of this type is represented
by a single character in the standard.
Complex and Compound Signs. Some Cuneiform signs are categorized as either complex
or compound signs. Complex signs are made up of a primary sign with one or more sec-
ondary signs written within it or conjoined to it, such that the whole is generally treated by
scholars as a unit; this includes linear sequences of two or more signs or wedge-clusters
where one or more of those clusters have not been clearly identified as characters in their
own right. Complex signs, which present a relative visual unity, are assigned single individ-
ual code points irrespective of their components.
Compound signs are linear sequences of two or more signs or wedge-clusters generally
treated by scholars as a single unit, when each and every such wedge-cluster exists as a
clearly identified character in its own right. Compound signs are encoded as sequences of
their component characters. Signs that shift from compound to complex, or vice versa,
generally have been treated according to their Ur III manifestation.
Mergers and Splits. Over the long history of Cuneiform, a number of signs have simplified
and merged; in other cases, a single sign has diverged and developed into more than one
distinct sign. The choice of signs for encoding as characters was made at the point of max-
imum differentiation in the case of either mergers or splits to enable the most comprehen-
sive set for the representation of text in any period.
Fonts. Fonts for the representation of Cuneiform text may need to be designed distinctly
for optimal use for different historic periods. For example, in the late third millennium
bce, the head of the glyph of the lower right-hand stroke in a ring of four strokes changed
its orientation. In earlier times it sloped down to the left, as shown in the glyph for
U+1212D, but was later replaced by a stroke in which the head sloped up to the right, as
shown in the glyph for U+12423. The glyphs in the code charts do not use a consistent style
for these kinds of historic features.
Fonts for some periods will contain duplicate glyphs depending on the status of merged or
split signs at that point of the development of the writing system.
Glyph Variants Acquiring Independent Semantic Status. Glyph variants such as U+122EC
cuneiform sign ta asterisk, a Middle Assyrian form of the sign U+122EB cune-
iform sign ta, which in Neo-Assyrian usage has its own logographic interpretation, have
been assigned separate code positions. They are to be used only when the new interpreta-
tion applies.
Formatting. Cuneiform was often written between incised lines or in blocks surrounded
by drawn boxes known as case rules. These boxes and lines are considered formatting and
are not part of the script. Case ruling and the like are not to be treated as punctuation.
Ordering. The characters are encoded in the Unicode Standard in Latin alphabetical order
by primary sign name. Complex signs based on the primary sign are organized according
to graphic principles; in some cases, these correspond to the native analyses.
Other Standards. There is no standard legacy encoding of Cuneiform primarily because it
was not possible to encode the huge number of characters in the pre-Unicode world of 8-
bit fonts.
11.2 Ugaritic
Ugaritic: U+10380–U+1039F
The city state of Ugarit was an important seaport on the Phoenician coast (directly east of
Cyprus, north of the modern town of Minet el-Beida) from about 1400 bce until it was
completely destroyed in the twelfth century bce. The site of Ugarit, now called Ras Shamra
(south of Latakia on the Syrian coast), was apparently continuously occupied from Neo-
lithic times (circa 5000 bce). It was first uncovered by a local inhabitant while plowing a
field in 1928 and subsequently excavated by Claude Schaeffer and Georges Chenet begin-
ning in 1929, in which year the first of many tablets written in the Ugaritic script were dis-
covered. They later proved to contain extensive portions of an important Canaanite
mythological and religious literature that had long been sought and that revolutionized
Biblical studies. The script was first deciphered in a remarkably short time jointly by Hans
Bauer, Edouard Dhorme, and Charles Virolleaud.
The Ugaritic language is Semitic, variously regarded by scholars as being a distinct lan-
guage related to Akkadian and Canaanite, or a Canaanite dialect. Ugaritic is generally writ-
ten from left to right horizontally, sometimes using U+1039F ugaritic word divider.
In the city of Ugarit, this script was also used to write the Hurrian language. The letters
U+1039B ugaritic letter i, U+1039C ugaritic letter u, and U+1039D
ugaritic letter ssu are used for Hurrian.
Variant Glyphs. There is substantial variation in glyph representation for Ugaritic. Glyphs
for U+10398 ugaritic letter thanna, U+10399 ugaritic letter ghain, and
U+1038F ugaritic letter dhal differ somewhat between modern reference sources,
as do some transliterations. U+10398 ugaritic letter thanna is most often displayed
with a glyph that looks like an occurrence of U+10393 ugaritic letter ain overlaid
with U+10382 ugaritic letter gamla.
Ordering. The ancient Ugaritic alphabetical order, which differs somewhat from the mod-
ern Hebrew order for similar characters, has been used to encode Ugaritic in the Unicode
Standard.
Character Names. Some of the Ugaritic character names have been reconstructed; others
appear in an early fragmentary document.
11.4 Egyptian Hieroglyphs
Egyptian Hieroglyphs: U+13000–U+1342F
Egyptian Hieroglyph Format Controls: U+13430–U+1343F
The order of signs in the code charts follows Gardiner's classification. The Gardiner categories are
shown in headers in the names list accompanying the code charts.
Some individual characters may have been identified as belonging to other classes since
their original category was assigned, but the ordering in the Unicode Standard simply fol-
lows the original category and catalog values.
Enclosures. The two principal names of the king, the nomen and prenomen, were nor-
mally written inside a cartouche: a pictographic representation of a coil of rope, as shown
in Figure 11-1.
In the Unicode representation of hieroglyphic text, the beginning and end of the cartouche
are represented by separate paired characters, somewhat like parentheses. Rendering of a
full cartouche surrounding a name is handled by the font.
There are several pairs of characters for the different types of enclosures used in Egyptian
Hieroglyphic texts.
Numerals. Egyptian numbers are encoded following the same principles used for the
encoding of Aegean and Cuneiform numbers. Gardiner does not supply a full set of
numerals with catalog numbers in his Egyptian Grammar, but does describe the system of
numerals in detail, so that it is possible to deduce the required set of numeric characters.
Two conventions of representing Egyptian numerals are supported in the Unicode Stan-
dard. The first relates to the way in which hieratic numerals are represented. Individual
signs for each of the 1s, the 10s, the 100s, the 1000s, and the 10,000s are encoded, because
in hieratic these are written as units, often quite distinct from the hieroglyphic shapes into
which they are transliterated. The other convention is based on the practice of the Manuel
de Codage, and is comprised of five basic text elements used to build up Egyptian numerals.
There is some overlap between these two systems.
The Manuel de Codage (MdC), a convention for transcribing Egyptian hieroglyphic text,
used ASCII characters to indicate the spatial organization of hieroglyphs. Four of the Egyp-
tian Hieroglyph format controls derive from MdC usage:
• U+13430 egyptian hieroglyph vertical joiner indicates a vertical join, and
corresponds to MdC use of a colon.
• U+13431 egyptian hieroglyph horizontal joiner indicates a horizontal
join, and corresponds to MdC use of an asterisk.
• U+13437 egyptian hieroglyph begin segment and U+13438 egyptian
hieroglyph end segment indicate grouping, and correspond to MdC use of
opening and closing parentheses, respectively.
A quadrat layout of one hieroglyph above another is represented by inserting U+13430
egyptian hieroglyph vertical joiner between two hieroglyphs, where the first logical
glyph in the sequence is the upper of the two hieroglyphs shown in the first example of
Figure 11-2. Similarly, U+13431 egyptian hieroglyph horizontal joiner joins two
adjacent hieroglyphs horizontally. The horizontal ordering of the joined glyphs matches
the logical ordering of the two hieroglyphs, as shown in the second example in Figure 11-2.
The column labeled “Symbolic” in Figure 11-2 (and subsequent figures) emulates the way
such quadrats are represented using the MdC conventions. Thus “A1” is the symbolic
abbreviation used in MdC for U+13000 egyptian hieroglyph a001 (a seated man). MdC
simply uses a few ASCII characters (“:”, “*”, “+”) for the operators that combine signs into
sequences expressing the full quadrats. So, the MdC representation of the first example in
Figure 11-2 would be “A1:O1”. The symbolic representation in Figure 11-2 instead uses the
dotted box glyph convention to represent the actual Unicode Egyptian Hieroglyph format
controls, as for example, U+13430 egyptian hieroglyph vertical joiner.
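The correspondence between the MdC operators and the format controls can be illustrated with a small converter. This sketch handles only the ':' and '*' operators, and its sign table contains only A1 (U+13000, cited above); any further entries would have to be supplied from the code charts.

# Sketch: convert a simple MdC-style group such as "A1:A1" or "A1*A1" into
# a Unicode character sequence using the Egyptian Hieroglyph format
# controls.  Only the ":" (vertical) and "*" (horizontal) operators are
# handled, and only the sign A1 is mapped.
VERTICAL_JOINER   = "\U00013430"   # EGYPTIAN HIEROGLYPH VERTICAL JOINER
HORIZONTAL_JOINER = "\U00013431"   # EGYPTIAN HIEROGLYPH HORIZONTAL JOINER

SIGNS = {"A1": "\U00013000"}       # EGYPTIAN HIEROGLYPH A001 (seated man)

def mdc_group_to_unicode(group: str) -> str:
    out = []
    token = ""
    for ch in group:
        if ch in ":*":
            out.append(SIGNS[token])
            out.append(VERTICAL_JOINER if ch == ":" else HORIZONTAL_JOINER)
            token = ""
        else:
            token += ch
    out.append(SIGNS[token])
    return "".join(out)

# "A1:A1" yields <A001, vertical joiner, A001>: one sign stacked above the other.
stacked = mdc_group_to_unicode("A1:A1")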
Four control characters are used in similar fashion to insert a following hieroglyph into the
corner of a preceding hieroglyph. The control U+13432 egyptian hieroglyph insert at
top start places a following hieroglyph within the frame of the preceding hieroglyph in
the corner at the top edge and starting side, as shown in the first example of Figure 11-3.
Similarly, U+13433 egyptian hieroglyph insert at bottom start causes a following
hieroglyph to display in the bottom-starting corner within the frame of the preceding
hieroglyph. U+13434 egyptian hieroglyph insert at top end causes a following hiero-
glyph to display in the top-ending corner within the frame of the preceding hieroglyph.
U+13435 egyptian hieroglyph insert at bottom end causes a following hieroglyph to
display in the bottom-ending corner within the frame of the preceding hieroglyph.
Figure 11-3 shows examples of this use.
Hieroglyphs may also overlay other hieroglyphs. This arrangement is controlled by
U+13436 egyptian hieroglyph overlay middle. This control character causes a follow-
ing hieroglyph to overlay on top of a preceding hieroglyph, as shown in the last example in
Figure 11-3.
Complex Clusters. The basic joining controls may be used in conjunction with one
another to render more complex clusters, as shown in the first example in Figure 11-4.
The two characters, U+13437 egyptian hieroglyph begin segment and U+13438 egyp-
tian hieroglyph end segment, are used to group signs in complex clusters comprising
different levels of joining controls, as shown in the second example in Figure 11-4.
Some rendering systems may support multiple levels of the segment controls for use in the
most complex hieroglyphic sign arrangements, as shown in the third example in
Figure 11-4.
Some Egyptian hieroglyphs with complex structures have previously been encoded as sin-
gle characters. When glyphs for these single characters are available in the font, the previ-
ously encoded characters may be displayed for the corresponding cluster.
11.5 Meroitic
Meroitic Hieroglyphs: U+10980–U+1099F
Meroitic Cursive: U+109A0–U+109FF
Meroitic hieroglyphs and Meroitic cursive were used from around the second century bce
to the fourth century ce to write the Meroitic language of the Nile valley kingdom known
as Kush or Meroë. The kingdom originated south of Egypt around 850 bce, with its capital
at Napata, located in modern-day northern Sudan. At that time official inscriptions used
the Egyptian language and script. Around 560 bce the capital was relocated to Meroë,
about 600 kilometers upriver. As the use of Egyptian language and script declined with the
greater distance from Egypt, two native scripts developed for writing Meroitic:
• Meroitic cursive was for general use, and its appearance was based on Egyptian
demotic.
• Meroitic hieroglyphs were used for inscriptions on royal monuments and tem-
ples, and their appearance was based on Egyptian hieroglyphs. (See
Section 11.4, Egyptian Hieroglyphs for more information.)
After the fourth century ce, the Meroitic language was gradually replaced by Nubian, and
by the sixth century the Meroitic scripts had been superseded by the Coptic script, which
picked up three additional symbols from Meroitic cursive to represent Nubian.
Although the values of the script characters were deciphered around 1911 by the English
Egyptologist F. L. Griffith, the Meroitic language is still not understood except for names
and a few other words. It is not known to be related to any other language. It may be related
to Nubian.
Structure. Unlike the Egyptian scripts, the Meroitic scripts are almost purely alphabetic.
There are 15 basic consonants; if not followed by an explicit vowel letter, they are read with
an inherent a. There are four vowels: e, i, o, and a. The a vowel is only used for initial a. In
addition, for unknown reasons, there are explicit letters for the syllables ne, te, se, and to.
This may have been due to dialect differences, or to the possible use of n, t, and s as final
consonants in some cases.
Meroitic cursive also uses two logograms for rmt and imn, derived from Egyptian demotic.
Directionality. Horizontal writing is almost exclusively right-to-left, matching the direc-
tion in which the hieroglyphs depicting people and animals are looking. This is unlike
Egyptian hieroglyphs, which are read into the faces of the glyphs for people and animals.
Meroitic hieroglyphs are also written vertically in columns.
Shaping. In Meroitic cursive, the letter for i usually connects to a preceding consonant.
There is no other connecting behavior.
Punctuation. The Meroitic scripts were among the earliest to use word division—not
always consistently—to separate basic sentence elements, such as noun phrases, verb
forms, and so on. For this purpose Meroitic hieroglyphs use three vertical dots, represented
by U+205D tricolon. When Meroitic hieroglyphs are presented in vertical columns, the
orientation of the three dots shifts to become three horizontal dots. This can be repre-
sented either with U+2026 horizontal ellipsis, or in more sophisticated rendering, by
glyphic rotation of U+205D tricolon. Meroitic cursive uses two vertical dots, represented
by U+003A colon.
Symbols. Two ankh-like symbols are used with Meroitic hieroglyphs.
Meroitic Cursive Numbers. Meroitic numbers are found only in Meroitic Cursive. The
system consists of numbers one through nine and bases for ranks: tens, hundreds, thou-
sands, ten thousands, and hundred thousands. The numbers for 100 and higher are sys-
tematically formed by attaching the numbers for one through nine as a multiplier to the
respective base for each rank. There is also a notation for a fractional system based on
twelfths, which simply uses one to eleven dots to represent each fraction.
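The rank-and-multiplier structure can be sketched as a plain decimal decomposition. The function below is illustrative only: it does not map to the Meroitic Cursive number characters, and it reports the tens and units as (value, rank) pairs even though each of those is written with a single sign.

# Sketch: decompose an integer into the (multiplier, rank) pairs underlying
# Meroitic Cursive numbers, in which a sign for 1-9 is attached to a base
# sign for each higher rank.  No mapping to code points is attempted.
RANKS = (100_000, 10_000, 1_000, 100, 10, 1)

def meroitic_number_parts(n: int):
    parts = []
    for rank in RANKS:
        digit = n // rank
        if digit:
            parts.append((digit, rank))
            n -= digit * rank
    return parts

assert meroitic_number_parts(3456) == [(3, 1000), (4, 100), (5, 10), (6, 1)]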
11.6 Anatolian Hieroglyphs
Anatolian Hieroglyphs: U+14400–U+1467F
Annotations. Latin names are used traditionally to describe characters used logographi-
cally and appear as annotations in the names list. Those characters which have a Luwian
phonetic value or are logosyllabic are identified in the annotations. When a plus sign
appears between two elements in the annotation, the elements are considered a single
graphic unit, whereas a period between the two elements indicates the two elements are
considered graphically separate.
U+1447E anatolian hieroglyph a107a
= bos+mi
U+14480 anatolian hieroglyph a107b
= bos.mi
Punctuation. In some texts, word division is indicated by U+145B5 anatolian hiero-
glyph a386 or its variant U+145B6 anatolian hieroglyph a386a. U+145CE anatolian
hieroglyph a410 begin logogram mark and U+145CF anatolian hieroglyph a410a
end logogram mark sometimes occur in text to mark logograms.
The characters U+145F7 anatolian hieroglyph a450 and U+144EF anatolian hiero-
glyph a209 are occasionally used to fill blank spaces, often at the end of a word. Spaces are
used in modern renditions of hieroglyphic text.
Numbers. Some of the hieroglyphic signs have been interpreted as having numeric values.
These include values for 1–5, 8–10, 12, 100, and 1000. However, all of the Anatolian hiero-
glyphs have the General_Category = Other_Letter and no specific numeric values for them
are assigned in the Unicode Character Database.
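This can be confirmed with character-property APIs, provided the runtime's Unicode data covers Unicode 8.0 or later; U+145B5, mentioned above, serves as the example.

import unicodedata

# All Anatolian hieroglyphs carry General_Category Lo (Other_Letter) and no
# Numeric_Value, even the signs with a traditional numeric reading.
sign = "\U000145B5"   # ANATOLIAN HIEROGLYPH A386
assert unicodedata.category(sign) == "Lo"
try:
    unicodedata.numeric(sign)
except ValueError:
    print("no numeric value assigned")   # expected outcome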
Rendering. Just as for Egyptian hieroglyphs, only the basic text elements of the script are
encoded. A higher-level protocol is required for the display of Anatolian hieroglyphs in a
nonlinear layout.
Chapter 12
South and Central Asia-I
Official Scripts of India
The scripts of South Asia share so many common features that a side-by-side comparison
of a few will often reveal structural similarities even in the modern letterforms. With minor
historical exceptions, they are written from left to right. They are all abugidas in which
most symbols stand for a consonant plus an inherent vowel (usually the sound /a/). Word-
initial vowels in many of these scripts have distinct symbols, and word-internal vowels are
usually written by juxtaposing a vowel sign in the vicinity of the affected consonant.
Absence of the inherent vowel, when that occurs, is frequently marked with a special sign.
In the Unicode Standard, this sign is denoted by the Sanskrit word virāma. In some lan-
guages, another designation is preferred. In Hindi, for example, the word hal refers to the
character itself, and halant refers to the consonant that has its inherent vowel suppressed;
in Tamil, the word pulli is used. The virama sign nominally serves to suppress the inherent
vowel of the consonant to which it is applied; it is a combining character, with its shape
varying from script to script.
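As a concrete sketch, using Devanagari (described in the next section), the cluster kta is encoded as ka followed by virama followed by ta.

# Sketch: the dead-consonant (virama) convention, using Devanagari.
# KA (U+0915) + VIRAMA (U+094D) + TA (U+0924) encodes the cluster "kta";
# the virama suppresses the inherent /a/ of KA, and a suitable font renders
# the three code points as a single conjunct form.
KA, VIRAMA, TA = "\u0915", "\u094D", "\u0924"
kta = KA + VIRAMA + TA
print(kta)   # displayed as a conjunct when the font supports it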
Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the south,
from Pakistan in the west to the easternmost islands of Indonesia, are derived from the
ancient Brahmi script. The oldest lengthy inscriptions of India, the edicts of Ashoka from
the third century bce, were written in two scripts, Kharoshthi and Brahmi. These are both
ultimately of Semitic origin, probably deriving from Aramaic, which was an important
administrative language of the Middle East at that time. Kharoshthi, written from right to
left, was supplanted by Brahmi and its derivatives. The descendants of Brahmi spread with
myriad changes throughout the subcontinent and outlying islands. There are said to be
some 200 different scripts deriving from it. By the eleventh century, the modern script
known as Devanagari was in ascendancy in India proper as the major script of Sanskrit lit-
erature.
The North Indian branch of scripts was, like Brahmi itself, chiefly used to write Indo-Euro-
pean languages such as Pali and Sanskrit, and eventually the Hindi, Bengali, and Gujarati
languages, though it was also the source for scripts for non-Indo-European languages such
as Tibetan, Mongolian, and Lepcha.
The South Indian scripts are also derived from Brahmi and, therefore, share many struc-
tural characteristics. These scripts were first used to write Pali and Sanskrit but were later
adapted for use in writing non-Indo-European languages—namely, the languages of the
Dravidian family of southern India and Sri Lanka. Because of their use for Dravidian lan-
guages, the South Indian scripts developed many characteristics that distinguish them
from the North Indian scripts. South Indian scripts were also exported to southeast Asia
and were the source of scripts such as Tai Tham (Lanna) and Myanmar, as well as the insu-
lar scripts of the Philippines and Indonesia.
The shapes of letters in the South Indian scripts took on a quite distinct look from the shapes
of letters in the North Indian scripts. Some scholars suggest that this occurred because writ-
ing materials such as palm leaves encouraged changes in the way letters were written.
The major official scripts of India proper, including Devanagari, are documented in this
chapter. They are all encoded according to a common plan, so that comparable characters
are in the same order and relative location. This structural arrangement, which facilitates
transliteration to some degree, is based on the Indian national standard (ISCII) encoding
for these scripts.
The first six columns in each script are isomorphic with the ISCII-1988 encoding, except
that the last 11 positions (U+0955..U+095F in Devanagari, for example), which are unas-
signed or undefined in ISCII-1988, are used in the Unicode encoding. The seventh column
in each of these scripts, along with the last 11 positions in the sixth column, represent addi-
tional character assignments in the Unicode Standard that are matched across some or all
of the scripts. For example, positions U+xx66..U+xx6F and U+xxE6..U+xxEF code the
Indic script digits for each script. The eighth column for each script is reserved for script-
specific additions that do not correspond from one Indic script to the next.
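One consequence of this parallel arrangement is that the digits of each script occupy the same offsets within their blocks: U+0966..U+096F for Devanagari, U+09E6..U+09EF for Bengali, U+0A66..U+0A6F for Gurmukhi, and so on. The following Python sketch, which covers only the Devanagari and Bengali blocks and is not part of any standardized mapping, relies on that fact:

    DEVANAGARI_ZERO = 0x0966
    BENGALI_ZERO = 0x09E6

    def deva_to_beng_digits(text):
        """Map Devanagari digits to the Bengali digits at the same offsets."""
        out = []
        for ch in text:
            offset = ord(ch) - DEVANAGARI_ZERO
            out.append(chr(BENGALI_ZERO + offset) if 0 <= offset <= 9 else ch)
        return ''.join(out)

    print(deva_to_beng_digits('१२३'))  # '১২৩'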
While the arrangement of the encoding for the scripts of India is based on ISCII, this does
not imply that the rendering behavior of South Indian scripts in particular is the same as
that of Devanagari or other North Indian scripts. Implementations should ensure that ade-
quate attention is given to the actual behavior of those scripts; they should not assume that
they work just as Devanagari does. Each block description in this chapter describes the
most important aspects of rendering for a particular script as well as unique behaviors it
may have.
Many of the character names in this group of scripts represent the same sounds, and com-
mon naming conventions are used for the scripts of India.
12.1 Devanagari
Devanagari: U+0900–U+097F
The Devanagari script is used for writing classical Sanskrit and its modern historical deriv-
ative, Hindi. Extensions to the Sanskrit repertoire are used to write other related languages
of India (such as Marathi) and of Nepal (Nepali). In addition, the Devanagari script is used
to write the following languages: Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha,
Chhattisgarhi, Garhwali, Gondi (Betul, Chhindwara, and Mandla dialects), Harauti, Ho,
Jaipuri, Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari,
Newari, Palpa, and Santali.
All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan script, and the
Southeast Asian scripts, are historically connected with the Devanagari script as descen-
dants of the ancient Brahmi script. The entire family of scripts shares a large number of
structural features.
The principles of the Indic scripts are covered in some detail in this introduction to the
Devanagari script. The remaining introductions to the Indic scripts are abbreviated but
highlight any differences from Devanagari where appropriate.
Standards. The Devanagari block of the Unicode Standard is based on ISCII-1988 (Indian
Script Code for Information Interchange). The ISCII standard of 1988 differs from and is
an update of earlier ISCII standards issued in 1983 and 1986.
The Unicode Standard encodes Devanagari characters in the same relative positions as
those coded in positions A0–F4 (hexadecimal) in the ISCII-1988 standard. The same character code lay-
out is followed for eight other Indic scripts in the Unicode Standard: Bengali, Gurmukhi,
Gujarati, Oriya, Tamil, Telugu, Kannada, and Malayalam. This parallel code layout
emphasizes the structural similarities of the Brahmi scripts and follows the stated intention
of the Indian coding standards to enable one-to-one mappings between analogous coding
positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer, Myanmar,
and other scripts depart to a greater extent from the Devanagari structural pattern, so the
Unicode Standard does not attempt to provide any direct mappings for these scripts to the
Devanagari order.
In November 1991, at the time The Unicode Standard, Version 1.0, was published, the
Bureau of Indian Standards published a new version of ISCII in Indian Standard (IS)
13194:1991. This new version partially modified the layout and repertoire of the ISCII-
1988 standard. Because of these events, the Unicode Standard does not precisely follow the
layout of the current version of ISCII. Nevertheless, the Unicode Standard remains a sup-
erset of the ISCII-1991 repertoire. Modern, non-Vedic texts encoded with ISCII-1991 may
be automatically converted to Unicode code points and back to their original encoding
without loss of information. The Vedic extension characters defined in IS 13194:1991
Annex G—Extended Character Set for Vedic are now fully covered by the Unicode Standard,
but the conversions between ISCII and Unicode code points in some cases are more com-
plex than for modern texts.
Encoding Principles. The writing systems that employ Devanagari and other Indic scripts
constitute abugidas—a cross between syllabic writing systems and alphabetic writing sys-
tems. The effective unit of these writing systems is the orthographic syllable, consisting of a
consonant and vowel (CV) core and, optionally, one or more preceding consonants, with a
canonical structure of (((C)C)C)V. The orthographic syllable need not correspond exactly
with a phonological syllable, especially when a consonant cluster is involved, but the writ-
ing system is built on phonological principles and tends to correspond quite closely to pro-
nunciation.
The orthographic syllable is built up of alphabetic pieces, the actual letters of the Devana-
gari script. These pieces consist of three distinct character types: consonant letters, inde-
pendent vowels, and dependent vowel signs. In a text sequence, these characters are stored
in logical (phonetic) order. Consonant letters by themselves constitute a CV unit, where the
V is an inherent vowel, whose exact phonetic value may vary by writing system. Indepen-
dent vowels also constitute a CV unit, where the C is considered to be null.
A dependent vowel sign is used to represent a V in CV units where C is not null and V is not
the inherent vowel. CV units are not represented by sequences of a consonant followed by
virama followed by independent vowel. In some cases, a phonological diphthong (such as
Hindi जाओ /jāo/) is actually written as two orthographic CV units, where the second of
these units is an independent vowel letter, whose C is considered to be null.
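As an informal illustration of this (((C)C)C)V structure, the following Python sketch segments Devanagari text into orthographic syllables. It is a rough approximation only: the character classes are assumptions chosen for illustration, and ZWJ/ZWNJ, Vedic signs, and many of the details discussed later in this section are ignored.

    import re

    CONS  = r'[\u0915-\u0939\u0958-\u095F\u0978-\u097F]\u093C?'  # consonant, optional nukta
    VOWEL = r'[\u0904-\u0914\u0960\u0961\u0972-\u0977]'          # independent vowel letter
    MATRA = r'[\u093A\u093B\u093E-\u094C\u094E\u094F]'           # dependent vowel sign
    HAL   = r'\u094D'                                            # virama
    BINDU = r'[\u0900-\u0902]?\u0903?'                           # optional bindu(s), visarga

    SYLLABLE = re.compile(rf'(?:(?:{CONS}{HAL})*{CONS}(?:{MATRA})?|{VOWEL}){BINDU}')

    def syllables(text):
        """Greedy left-to-right split into orthographic syllables; characters
        that do not fit the pattern are passed through unchanged."""
        out, i = [], 0
        while i < len(text):
            m = SYLLABLE.match(text, i)
            out.append(m.group() if m else text[i])
            i = m.end() if m else i + 1
        return out

    print(syllables('निर्वाण'))  # ['नि', 'र्वा', 'ण']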
Some Devanagari consonant letters have alternative presentation forms whose choice
depends on neighboring consonants. This variability is especially notable for U+0930
devanagari letter ra, which has numerous different forms, both as the initial element
and as the final element of a consonant cluster. Only the nominal forms, rather than the
contextual alternatives, are depicted in the code charts.
The traditional Sanskrit/Devanagari alphabetic encoding order for consonants follows
articulatory phonetic principles, starting with velar consonants and moving forward to
bilabial consonants, followed by liquids and then fricatives. ISCII and the Unicode Stan-
dard both observe this traditional order.
Independent Vowel Letters. The independent vowels in Devanagari are letters that stand
on their own. The writing system treats independent vowels as orthographic CV syllables in
which the consonant is null. The independent vowel letters are used to write syllables that
start with a vowel.
Dependent Vowel Signs (Matras). The dependent vowels serve as the common manner of
writing noninherent vowels and are generally referred to as vowel signs, or as matras in
Sanskrit. The dependent vowels do not stand alone; rather, they are visibly depicted in
combination with a base letterform. A single consonant or a consonant cluster may have a
dependent vowel applied to it to indicate the vowel quality of the syllable, when it is differ-
ent from the inherent vowel. Explicit appearance of a dependent vowel in a syllable over-
rides the inherent vowel of a single consonant letter.
The greatest variation among different Indic scripts is found in the way that the dependent
vowels are applied to base letterforms. Devanagari has a collection of nonspacing depen-
dent vowel signs that may appear above or below a consonant letter, as well as spacing
dependent vowel signs that may occur to the right or to the left of a consonant letter or
consonant cluster. Other Indic scripts generally have one or more of these forms, but what
is a nonspacing mark in one script may be a spacing mark in another. Also, some of the
Indic scripts have single dependent vowels that are indicated by two or more glyph compo-
nents—and those glyph components may surround a consonant letter both to the left and
to the right or may occur both above and below it.
In modern usage the Devanagari script has only one character denoting a left-side depen-
dent vowel sign: U+093F devanagari vowel sign i. In the historic Prishthamatra orthog-
raphy, Devanagari also made use of one additional left-side dependent vowel sign: U+094E
devanagari vowel sign prishthamatra e. Other Indic scripts either have no such vowel
signs (Telugu and Kannada) or include as many as three of these signs (Bengali, Tamil, and
Malayalam).
Vowel Letters. Vowel letters are encoded atomically in Unicode, even if they can be ana-
lyzed visually as consisting of multiple parts. Table 12-1 shows vowel letters that can be
analyzed, the single code point that should be used to represent them in text, and the
sequence of code points resulting from analysis that should not be used.
Virama (Halant). Devanagari employs a sign known in Sanskrit as the virama or vowel
omission sign. In Hindi, it is called hal or halant, and that term is used in referring to the
virama or to a consonant with its vowel suppressed by the virama. The terms are used
interchangeably in this section.
The virama sign, U+094D devanagari sign virama, nominally serves to cancel (or kill)
the inherent vowel of the consonant to which it is applied. When a consonant has lost its
inherent vowel by the application of virama, it is known as a dead consonant; in contrast, a
live consonant is one that retains its inherent vowel or is written with an explicit dependent
vowel sign. In the Unicode Standard, a dead consonant is defined as a sequence consisting
of a consonant letter followed by a virama. The default rendering for a dead consonant is to
position the virama as a combining mark bound to the consonant letterform.
For example, if Cn denotes the nominal form of consonant C, and Cd denotes the dead con-
sonant form, then a dead consonant is encoded as shown in Figure 12-1.
Cn + VIRAMAn → Cd
It could be assumed that a dead consonant may be combined with a vowel letter or sign to
represent a CV orthographic syllable. Some non-Unicode implementations have used this
approach; however, this is not done in implementations of the Unicode Standard. Instead,
a CV orthographic syllable is represented with a (live) consonant followed by a dependent
vowel. A dead consonant should not be followed either by an independent vowel letter or
by a dependent vowel sign in an attempt to create an alternative representation of a CV
orthographic syllable.
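A minimal illustration in Python, using KA and the vowel I:

    KA       = '\u0915'  # DEVANAGARI LETTER KA
    VIRAMA   = '\u094D'  # DEVANAGARI SIGN VIRAMA
    I_SIGN   = '\u093F'  # DEVANAGARI VOWEL SIGN I (dependent)
    I_LETTER = '\u0907'  # DEVANAGARI LETTER I (independent)

    correct   = KA + I_SIGN             # कि <0915, 093F>: live consonant + dependent vowel sign
    incorrect = KA + VIRAMA + I_LETTER  # <0915, 094D, 0907>: dead consonant + independent vowel (do not use)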
Atomic Representation of Consonant Letters. Consonant letters are encoded atomically
in Unicode, even if they can be analyzed visually as consisting of multiple parts. In particu-
lar, consonant half forms are dead-consonant forms that often resemble a full consonant
form minus a vertical stem. This vertical stem is visually similar to the vowel sign denoting
/ā/, U+093E devanagari vowel sign aa. Table 12-2 shows atomic consonant letters in
Devanagari that could be graphically analyzed this way, the single code point that should
be used to represent them in text, and the sequence of code points resulting from analysis
that should not be used.
ऩ (use U+0929): not <0929, 094D, 093E>, <0929, 094D, 200D, 093E>, <0928, 093C, 094D, 093E>, <0928, 093C, 094D, 200D, 093E>
प (use U+092A): not <092A, 094D, 093E>, <092A, 094D, 200D, 093E>
ख़ (use U+0959): not <0959, 094D, 093E>, <0959, 094D, 200D, 093E>, <0916, 093C, 094D, 093E>, <0916, 093C, 094D, 200D, 093E>
ग़ (use U+095A): not <095A, 094D, 093E>, <095A, 094D, 200D, 093E>, <0917, 093C, 094D, 093E>, <0917, 093C, 094D, 200D, 093E>
ज़ (use U+095B): not <095B, 094D, 093E>, <095B, 094D, 200D, 093E>, <091C, 093C, 094D, 093E>, <091C, 093C, 094D, 200D, 093E>
य़ (use U+095F): not <095F, 094D, 093E>, <095F, 094D, 200D, 093E>, <092F, 093C, 094D, 093E>, <092F, 093C, 094D, 200D, 093E>
ॹ (use U+0979): not <0979, 094D, 093E>, <0979, 094D, 200D, 093E>
The principle of using atomic consonant representations, rather than representations ana-
lyzing the consonant into a half form plus stem, also applies to other Indic scripts, such as
Gujarati and Bengali.
Consonant Conjuncts. The Indic scripts are noted for a large number of consonant con-
junct forms that serve as orthographic abbreviations (ligatures) of two or more adjacent
letterforms. This abbreviation takes place only in the context of a consonant cluster. An
orthographic consonant cluster is defined as a sequence of characters that represents one or
more dead consonants (denoted Cd) followed by a normal, live consonant letter (denoted
Cl).
Under normal circumstances, a consonant cluster is depicted with a conjunct glyph if such
a glyph is available in the current font. In the absence of a conjunct glyph, the one or more
dead consonants that form part of the cluster are depicted using half-form glyphs. In the
absence of half-form glyphs, the dead consonants are depicted using the nominal conso-
nant forms combined with visible virama signs (see Figure 12-2).
KAd + KAl → K.KAn
RAd + KAl → KAl + RAsup
प् + स् + य (PAd + SAd + YAl, a cluster with two dead consonants)
A well-designed Indic script font may contain hundreds of conjunct glyphs, but they are
not encoded as Unicode characters because they are the result of ligation of distinct letters.
Indic script rendering software must be able to map appropriate combinations of charac-
ters in context to the appropriate conjunct glyphs in fonts.
A dead consonant conjunct may have an appearance like a half form, because the vertical
stem of the last consonant is removed. As a result, a live consonant conjunct could be ana-
lyzed visually as consisting of the dead, consonant-conjunct half form plus the vowel sign
/ā/. As in the case of consonant letters, the live form should not be represented using a half
form followed by U+093E devanagari vowel sign aa. Table 12-3 shows some examples
of live consonant conjuncts that exhibit this visual pattern, but that should not be repre-
sented with fully analyzed sequences. Table 12-3 also shows the sequence of code points
that should be used to represent these conjuncts in text, and the sequence of code points
resulting from analysis that should not be used.
Note that these are illustrative examples only. There are many consonant conjuncts that
could be visually analyzed in the same way, and the same principle applies to all such cases:
these should not be represented as dead conjunct plus vowel sign sequences. The principle
of using atomic consonant representations, rather than representations analyzing the con-
sonant into a half form plus stem, also applies to other Indic scripts, such as Gujarati and
Bengali.
Explicit Virama (Halant). Normally a virama character serves to create dead consonants
that are, in turn, combined with subsequent consonants to form conjuncts. This behavior
usually results in a virama sign not being depicted visually. Occasionally, this default
behavior is not desired when a dead consonant should be excluded from conjunct forma-
tion, in which case the virama sign is visibly rendered. To accomplish this goal, the Uni-
code Standard adopts the convention of placing the character U+200C zero width non-
joiner immediately after the encoded dead consonant that is to be excluded from conjunct
formation. In this case, the virama sign is always depicted as appropriate for the consonant
to which it is attached.
For example, as shown in Figure 12-4, inserting zero width non-joiner after the dead
consonant KAd prevents the default formation of the conjunct form K.SSAn; the sequence
KAd + ZWNJ + SSAl is instead rendered as KA with a visible virama, followed by SSA.
Explicit Half-Consonants. When a dead consonant participates in forming a conjunct, the
dead consonant form is often absorbed into the conjunct form, such that it is no longer dis-
tinctly visible. In other contexts, the dead consonant may remain visible as a half-consonant
form. In general, a half-consonant form is distinguished from the nominal consonant form
by the loss of its inherent vowel stem, a vertical stem appearing to the right side of the con-
sonant form. In other cases, the vertical stem remains but some part of its right-side geom-
etry is missing.
In certain cases, it is desirable to prevent a dead consonant from assuming full conjunct
formation yet still not appear with an explicit virama. In these cases, the half-form of the
consonant is used. To explicitly encode a half-consonant form, the Unicode Standard
adopts the convention of placing the character U+200D zero width joiner immediately
after the encoded dead consonant. The zero width joiner denotes a nonvisible letter that
presents linking or cursive joining behavior on either side (that is, to the previous or fol-
lowing letter). Therefore, in the present context, the zero width joiner may be consid-
ered to present a context to which a preceding dead consonant may join so as to create the
half-form of the consonant.
For example, if Ch denotes the half-form glyph of consonant C, then a half-consonant form
is represented as shown in Figure 12-5.
KAd + ZWJ + SSAl → KAh + SSAn
In the absence of the zero width joiner, the sequence in Figure 12-5 would normally pro-
duce the full conjunct form K.SSAn.
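The joiner conventions described above can be written out directly. A minimal Python illustration, using KA and SSA (whose default conjunct is K.SSA, क्ष):

    KA, SSA, VIRAMA = '\u0915', '\u0937', '\u094D'
    ZWNJ, ZWJ = '\u200C', '\u200D'

    conjunct  = KA + VIRAMA + SSA         # default: the full conjunct K.SSA, if the font has it
    explicit  = KA + VIRAMA + ZWNJ + SSA  # dead KA excluded from conjunct formation; virama is visible
    half_form = KA + VIRAMA + ZWJ + SSA   # half-form of KA followed by the nominal SSA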
This encoding of half-consonant forms also applies in the absence of a base letterform.
That is, this technique may be used to encode independent half-forms, as shown in
Figure 12-6.
Cd + ZWJ → Ch
Other Indic scripts have similar half-forms for the initial consonants of a conjunct. Some,
such as Oriya, also have similar half-forms for the final consonants; those are represented
as shown in Figure 12-7.
As the rendering of conjuncts and half-forms depends on the availability of glyphs in the
font, the following fallback strategy should be employed (a sketch in code follows the list):
• If the coded character sequence would normally render with a full conjunct,
but such a conjunct is not available, the fallback rendering is to use half-forms.
If those are not available, the fallback rendering should use an explicit (visible)
virama.
• If the coded character sequence would normally render with a half-form (it
contains a ZWJ), but half-forms are not available, the fallback rendering should
use an explicit (visible) virama.
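The selection logic can be sketched as follows. This is an illustrative model only, not the algorithm of any particular rendering engine; the glyph names and the glyphs set are hypothetical.

    def render_dead_consonant(cons, next_cons, glyphs):
        """Choose glyphs for a dead consonant followed by another consonant,
        using the fallback order described above; `glyphs` is a hypothetical
        set of glyph names available in the current font."""
        conjunct = f'{cons}.{next_cons}'
        if conjunct in glyphs:                  # full conjunct available
            return [conjunct]
        if f'{cons}.half' in glyphs:            # otherwise fall back to the half-form
            return [f'{cons}.half', next_cons]
        return [cons, 'virama', next_cons]      # otherwise show an explicit (visible) virama

    print(render_dead_consonant('ka', 'ssa', {'ka.half'}))  # ['ka.half', 'ssa']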
Rendering Devanagari
Rules for Rendering. This section provides more formal and detailed rules for minimal
rendering of Devanagari as part of a plain text sequence. It describes the mapping between
Unicode characters and the glyphs in a Devanagari font. It also describes the combining
and ordering of those glyphs.
These rules provide minimal requirements for legibly rendering interchanged Devanagari
text. As with any script, a more complex procedure can add rendering characteristics,
depending on the font and application.
In a font that is capable of rendering Devanagari, the number of glyphs is
greater than the number of Devanagari characters.
Notation. In the next set of rules, the following notation applies:
Cn Nominal glyph form of consonant C as it appears in the code
charts.
Cl A live consonant, depicted identically to Cn.
Cd Glyph depicting the dead consonant form of consonant C.
Ch Glyph depicting the half-consonant form of consonant C.
Ln Nominal glyph form of a conjunct ligature consisting of two or
more component consonants. A conjunct ligature composed of
two consonants X and Y is also denoted X.Yn.
RAsup A nonspacing combining mark glyph form of U+0930 devana-
gari letter ra positioned above or attached to the upper part
of a base glyph form. This form is also known as repha.
RAsub A nonspacing combining mark glyph form of U+0930 devana-
gari letter ra positioned below or attached to the lower part
of a base glyph form.
Vvs Glyph depicting the dependent vowel sign form of a vowel V.
VIRAMAn The nominal glyph form of the nonspacing combining mark
depicting U+094D devanagari sign virama.
A virama character is not always depicted. When it is depicted, it adopts this nonspacing
mark form.
Dead Consonant Rule. The following rule logically precedes the application of any other
rule to form a dead consonant. Once formed, a dead consonant may be subject to other
rules described next.
R1 When a consonant Cn precedes a VIRAMAn, it is considered to be a dead consonant
Cd; a consonant that does not precede VIRAMAn is a live consonant Cl.
Cn + VIRAMAn → Cd
Consonant RA Rules. The character U+0930 devanagari letter ra takes one of a num-
ber of visual forms depending on its context in a consonant cluster. By default, this letter is
depicted with its nominal glyph form (as shown in the code charts). In some contexts, it is
depicted using one of two nonspacing glyph forms that combine with a base letterform.
R2 If the dead consonant RAd precedes a consonant, then it is replaced by the super-
script nonspacing mark RAsup , which is positioned so that it applies to the logically
subsequent element in the memory representation.
RAd + KAl → KAl + RAsup
RAd₁ + RAd₂ → RAd₂ + RAsup₁
R3 If the superscript mark RAsup is to be applied to a dead consonant and that dead
consonant is combined with another consonant to form a conjunct ligature, then
the mark is positioned so that it applies to the conjunct ligature form as a whole.
RAd + Xd + Yl → X.Yn + RAsup
R4 If the superscript mark RAsup is to be applied to a dead consonant that is subse-
quently replaced by its half-consonant form, then the mark is positioned so that it
applies to the form that serves as the base of the consonant cluster.
RAd + Xd + Yl → Xh + Yn + RAsup
R5 In conformance with the ISCII standard, the half-consonant form RRAh is repre-
sented as eyelash-RA. This form of RA is commonly used in writing Marathi and
Newari.
RRAn + VIRAMAn → RRAh
R5a For compatibility with The Unicode Standard, Version 2.0, if the dead consonant
RAd precedes zero width joiner, then the half-consonant form RAh , depicted as
eyelash-RA, is used instead of RAsup .
RAd + ZWJ → RAh (eyelash-RA)
R6 Except for the dead consonant RAd , when a dead consonant Cd precedes the live
consonant RAl , then Cd is replaced with its nominal form Cn , and RA is replaced by
the subscript nonspacing mark RAsub , which is positioned so that it applies to Cn.
Cd + RAl → Cn + RAsub
R7 For certain consonants, the mark RAsub may graphically combine with the conso-
nant to form a conjunct ligature form. These combinations, such as the one shown
here, are further addressed by the ligature rules described shortly.
Cd + RAl → Cn + RAsub → C.RAn
R8 If a dead consonant (other than RAd ) precedes RAd , then the substitution of RA for
RAsub is performed as described above; however, the VIRAMA that formed RAd
remains so as to form a dead consonant conjunct form.
Æ + ⁄† → à + ˛ + † → d†
A dead consonant conjunct form that contains an absorbed RAd may subsequently
combine to form a multipart conjunct form.
(C.RA)d + Yl → C.RA.Yn
Modifier Mark Rules. In addition to vowel signs, three other types of combining marks
may be applied to a component of an orthographic syllable or to the syllable as a whole:
nukta, bindus, and svaras.
R9 The nukta sign, which modifies a consonant form, is placed immediately after the
consonant in the memory representation and is attached to that consonant in ren-
dering. If the consonant represents a dead consonant, then NUKTA should precede
VIRAMA in the memory representation.
क + ़ + ् → क़् (KAn + NUKTAn + VIRAMAn → dead form of the nukta consonant)
R10 Other modifying marks, in particular bindus and svaras, apply to the
orthographic syllable as a whole and should follow (in the memory representa-
tion) all other characters that constitute the syllable. The bindus should follow any
vowel signs, and the svaras should come last. The relative placement of these
marks is horizontal rather than vertical; the horizontal rendering order may vary
according to typographic concerns.
KAn + AAvs + CANDRABINDUn
क + ा + ँ → काँ
Ligature Rules. Subsequent to the application of the rules just described, a set of rules gov-
erning ligature formation apply. The precise application of these rules depends on the
availability of glyphs in the current font being used to display the text.
R11 If a dead consonant immediately precedes another dead consonant or a live conso-
nant, then the first dead consonant may join the subsequent element to form a
two-part conjunct ligature form.
Xd + Yl → X.Yn
R12 A conjunct ligature form can itself behave as a dead consonant and enter into fur-
ther, more complex ligatures.
(X.Y)d + Zl → X.Y.Zn
A conjunct ligature form can also produce a half-form.
(X.Y)d + Zl → X.Yh + Zn
R13 If a nominal consonant or conjunct ligature form precedes RAsub as a result of the
application of rule R6, then the consonant or ligature form may join with RAsub to
form a multipart conjunct ligature (see rule R6 for more information).
Cn + RAsub → C.RAn
R14 In some cases, other combining marks will combine with a base consonant, either
attaching at a nonstandard location or changing shape. In minimal rendering,
there are only two cases: RAl with Uvs or UUvs .
RAl + Uvs → रु; RAl + UUvs → रू
Memory Representation and Rendering Order. The storage of plain text in Devanagari
and all other Indic scripts generally follows phonetic order; that is, a CV syllable with a
dependent vowel is always encoded as a consonant letter C followed by a vowel sign V in
the memory representation. This order is employed by the ISCII standard and corresponds
to both the phonetic order and the keying order of textual data (see Figure 12-9).
KAn + Ivs → Ivs + KAn in glyph order (क + ि → कि)
Because Devanagari and other Indic scripts have some dependent vowels that must be
depicted to the left side of their consonant letter, the software that renders the Indic scripts
must be able to reorder elements in mapping from the logical (character) store to the pre-
sentational (glyph) rendering. For example, if Cn denotes the nominal form of consonant
C, and Vvs denotes a left-side dependent vowel sign form of vowel V, then a reordering of
glyphs with respect to encoded characters occurs as just shown.
R15 When the dependent vowel Ivs is used to override the inherent vowel of a syllable, it
is always written to the extreme left of the orthographic syllable. If the
orthographic syllable contains a consonant cluster, then this vowel is always
depicted to the left of that cluster.
Xd + Yl + Ivs → X.Yn + Ivs, with Ivs displayed to the left of the whole cluster
R16 The presence of an explicit virama (either caused by a ZWNJ or by the absence of a
conjunct in the font) blocks this reordering, and the dependent vowel Ivs is ren-
dered after the rightmost such explicit virama.
Xd + ZWNJ + Yl + Ivs → Xn with a visible virama, followed by Ivs + Yn in glyph order
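A minimal Python sketch of this reordering, covering only the vowel sign I and only the ZWNJ-induced case of an explicit virama (the font-dependent case cannot be modeled here); the function and its simple string-based display model are illustrative assumptions, not a shaping algorithm:

    I_MATRA, VIRAMA, ZWNJ = '\u093F', '\u094D', '\u200C'

    def visual_order(cluster):
        """Move a trailing vowel sign I to the front of the cluster for display
        (rule R15), unless an explicit virama, approximated here as VIRAMA + ZWNJ,
        blocks the reordering (rule R16)."""
        if not cluster.endswith(I_MATRA):
            return cluster
        body = cluster[:-1]
        cut = body.rfind(VIRAMA + ZWNJ)
        if cut == -1:                   # no explicit virama: I moves to the far left
            return I_MATRA + body
        split = cut + 2                 # I is rendered after the rightmost explicit virama
        return body[:split] + I_MATRA + body[split:]

    # R15: <KA, VIRAMA, RA, I>: the I glyph is placed to the left of the whole क्र cluster.
    print(visual_order('\u0915\u094D\u0930\u093F'))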
Alternative Forms of Cluster-Initial RA. In addition to reph (rule R2) and eyelash (rule
R5a), a cluster-initial RA may also take its nominal form while the following consonant
takes a reduced form. This behavior is required by languages that make a morphological
distinction between “reph on YA” and “RA with reduced YA”, such as Braj Bhasha. To trig-
ger this behavior, a ZWJ is placed immediately before the virama to request a reduced form
of the following consonant, while preventing the formation of reph, as shown in the third
example below.
र + ् + य → reph over YA
र + ् + ZWJ + य → eyelash-RA followed by YA (see rule R5a)
र + ZWJ + ् + य → RA in its nominal form with a reduced YA
Similar, special rendering behavior of cluster-initial RA is noted in other scripts of India.
See, for example, “Interaction of Repha and Ya-phalaa” in Section 12.2, Bengali (Bangla),
“Reph” in Section 12.7, Telugu, and “Consonant Clusters Involving RA” in Section 12.8,
Kannada.
Sample Half-Forms. Table 12-4 shows examples of half-consonant forms that are com-
monly used with the Devanagari script. These forms are glyphs, not characters. They may
be encoded explicitly using zero width joiner as shown. In normal conjunct formation,
they may be used spontaneously to depict a dead consonant in combination with subse-
quent consonant forms.
Cn + VIRAMAn + ZWJ → Ch (one entry for each consonant that has a half-form)
Sample Ligatures. Table 12-5 shows examples of conjunct ligature forms that are com-
monly used with the Devanagari script. These forms are glyphs, not characters. Not every
writing system that employs this script uses all of these forms; in particular, many of these
forms are used only in writing Sanskrit texts. Furthermore, individual fonts may provide
fewer or more ligature forms than are depicted here.
Xn + VIRAMAn + Yn → X.Yn (a conjunct ligature glyph for each listed pair); a few entries
show ligatures of a consonant with a following vowel sign, such as रु and रू.
र + ृ / ॄ / ॢ / ॣ → rendered either as र with the vocalic vowel sign or with a reph over the
corresponding independent letter ऋ / ॠ / ऌ / ॡ
The graphical forms displayed above with the reph (RAsup) should not be represented by
sequences of RA + virama + independent vowel, as such sequences violate the general
encoding principles of the script. CV orthographic syllables are not represented by conso-
nant + virama + independent vowel.
The practice of writing these phonological sequences as a reph on an independent vocalic
liquid letter is also observed in other Indic scripts, such as Bengali, Gujarati, Oriya, Telugu,
Kannada, and Bhaiksuki.
Sample Half-Ligature Forms. In addition to half-form glyphs of individual consonants,
half-forms are used to depict conjunct ligature forms. A sample of such forms is shown in
Table 12-7. These forms are glyphs, not characters. They may be encoded explicitly using
zero width joiner as shown. In normal conjunct formation, they may be used sponta-
neously to depict a conjunct ligature in combination with subsequent consonant forms.
Xn + VIRAMAn + Yn + VIRAMAn + ZWJ → X.Yh (the half-form of the conjunct ligature X.Y)
Punctuation. The danda (U+0964 devanagari danda) and double danda (U+0965 devanagari
double danda) are encoded in the Devanagari block but are intended for common use with
the various scripts of this chapter, including Devanagari, Bengali, Gujarati, and so on.
However, analogous punctuation marks for other Brahmi-derived
scripts are separately encoded, particularly for scripts used primarily outside of India.
Many modern languages written in the Devanagari script intersperse punctuation derived
from the Latin script. Thus U+002C comma and U+002E full stop are freely used in writ-
ing Hindi, and the danda is usually restricted to more traditional texts. However, the
danda may be preserved when such traditional texts are transliterated into the Latin script.
Other Symbols. U+0970 ॰ devanagari abbreviation sign appears after letters or combi-
nations of letters and marks the sequence as an abbreviation. It is intended specifically for
Devanagari script-based abbreviations, such as the Devanagari rupee sign. Other symbols
and signs most commonly occurring in Vedic texts are encoded in the Devanagari
Extended and Vedic Extensions blocks and are discussed in the text that follows.
The svasti (or well-being) signs often associated with the Hindu, Buddhist, and Jain tradi-
tions are encoded in the Tibetan block. See Section 13.4, Tibetan for further information.
Example Meaning
तला sole
तलाऽ pond
Letters for Bihari Languages. A number of the Devanagari vowel letters have been used to
write the Bihari languages Bhojpuri, Magadhi, and Maithili, as listed in Table 12-9.
Letter Short a. The character U+0904 devanagari letter short a is used to denote a
short e in the Awadhi language, an Indo-Aryan language spoken in the north Indian state of
Uttar Pradesh and southern Nepal. A publisher in Lucknow, Uttar Pradesh also uses it in
Hindi translations and Devanagari transliterations of the Kannada, Telugu, Tamil, Malay-
alam and Kashmiri languages.
Prishthamatra Orthography. In the historic Prishthamatra orthography, the vowel signs
for e, ai, o, and au are represented using U+094E devanagari vowel sign prishthama-
tra e (which goes on the left side of the consonant) alone or in combination with one of
U+0947 devanagari vowel sign e, U+093E devanagari vowel sign aa or U+094B
devanagari vowel sign o. Table 12-10 shows those combinations applied to ka. In the
underlying representation of text, U+094E should be first in the sequence of dependent
vowel signs after the consonant, and may be followed by U+0947, U+093E or U+094B.
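In terms of code points, the representable combinations are the following; this Python fragment is purely illustrative, and which of e, ai, o, or au each sequence denotes is given in Table 12-10:

    KA            = '\u0915'  # DEVANAGARI LETTER KA
    PRISHTHAMATRA = '\u094E'  # DEVANAGARI VOWEL SIGN PRISHTHAMATRA E
    E_SIGN, AA_SIGN, O_SIGN = '\u0947', '\u093E', '\u094B'

    # U+094E always comes first after the consonant; it may then be followed
    # by U+0947, U+093E, or U+094B.
    sequences = [
        KA + PRISHTHAMATRA,
        KA + PRISHTHAMATRA + E_SIGN,
        KA + PRISHTHAMATRA + AA_SIGN,
        KA + PRISHTHAMATRA + O_SIGN,
    ]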
Cantillation Marks for the Sāmaveda. One of the four major Vedic texts is the Sāmaveda. The
text is both recited (Sāmaveda-Saṃhitā) and sung (Sāmagāna), and is marked differently
for the purposes of each. Cantillation marks are used to indicate length, tone, and other
features in the recited text of the Sāmaveda, and in the Kauthuma and Rāṇāyanīya traditions of
Sāmagāna. These marks are encoded as a series of combining digits, alphabetic characters,
and avagraha in the range U+A8E0..U+A8F1. The marks are rendered directly over the
base letter. They are represented in text immediately after the syllable they modify.
In certain cases, two marks may occur over a letter: U+A8E3 combining devanagari
digit three may be followed by U+A8EC combining devanagari letter ka, for exam-
ple. Although no use of U+A8E8 combining devanagari digit eight has been found in
the Sāmagāna, it is included to provide a complete set of 0–9 digits. The combining marks
encoded for the Sāmaveda do not include characters that may appear as subscripts and
superscripts in the Jaiminīya tradition of Sāmagāna, which used interlinear annotation.
Interlinear annotation may be rendered using Ruby and may be represented by means of
markup or other higher-level protocols.
Nasalization Marks. The spacing marks in the range U+A8F2..U+A8F7 include the
term candrabindu in their names and indicate nasalization. These marks are all aligned
with the headline. Note that U+A8F2 devanagari sign spacing candrabindu is lower
than the U+0901 devanagari sign candrabindu.
Editorial Marks. A set of editorial marks is encoded in the range U+A8F8..U+A8FB for use
with Devanagari. U+A8F9 devanagari gap filler signifies an intentional gap that would
ordinarily be filled with text. In contrast, U+A8FB devanagari headstroke indicates
illegible gaps in the original text. The glyph for devanagari headstroke should be
designed so that it does not connect to the headstroke of the letters beside it, which will
make it possible to indicate the number of illegible syllables in a given space. U+A8F8
devanagari sign pushpika acts as a filler in text, and is commonly flanked by double dan-
das. U+A8FA devanagari caret, a zero-width spacing character, marks the insertion
point of omitted text, and is placed at the insertion point between two orthographic sylla-
bles. It can also be used to indicate word division.
which a pause is disallowed. The block also contains several Vedic signs for ardhavisarga,
jihvamuliya, upadhmaniya and atikrama.
Tone Marks. The Vedic tone marks are all combining marks. The tone marks are grouped
together in the code charts based upon the tradition in which they appear: they are used in
the four core texts of the Vedas (Sāmaveda, Yajurveda, Rigveda, and Atharvaveda) and in
the prose text on Vedic ritual (Śatapathabrāhmaṇa). The character U+1CD8 vedic tone
candra below is also used to identify the short vowels e and o. In this usage, the pre-
scribed order is the Indic syllable (aksara), followed by U+1CD8 vedic tone candra
below and the tone mark (svara). When a tone mark is placed below, it appears below the
vedic tone candra below.
In addition to the marks encoded in this block, Vedic texts may use other nonspacing
marks from the General Diacritics block and other blocks. For example, U+20F0 combin-
ing asterisk above would be used to represent a mark of that shape above a Vedic letter.
Diacritics for the Visarga. A set of combining marks that serve as diacritics for the visarga
is encoded in the range U+1CE2..U+1CE8. These marks indicate that the visarga has a par-
ticular tone. For example, the combination U+0903 devanagari sign visarga plus
U+1CE2 vedic sign visarga svarita represents a svarita visarga. The upward-shaped
diacritic is used for the udātta (high-toned), the downward-shaped diacritic for anudātta
(low-toned), and the midline glyph indicates the svarita (modulated tone).
In Vedic manuscripts the tonal mark (that is, the horizontal bar, upward curve and down-
ward curve) appears in colored ink, while the two dots of the visarga appear in black ink.
The characters for accents can be represented using separate characters, to make it easier
for color information to be maintained by means of markup or other higher-level proto-
cols.
Nasalization Marks. A set of spacing marks and one combining mark, U+1CED vedic
sign tiryak, are encoded in the range U+1CE9..U+1CF1. They describe phonetic distinc-
tions in the articulation of nasals. The gomukha characters from U+1CE9..U+1CEC may
be combined with U+0902 devanagari sign anusvara or U+0901 devanagari sign
candrabindu. U+1CF1 vedic sign anusvara ubhayato mukha may indicate a visarga
with a tonal mark as well as a nasal. The three characters, U+1CEE vedic sign hexiform
long anusvara, U+1CEF vedic sign long anusvara, and U+1CF0 vedic sign rthang
long anusvara, are all synonymous and indicate a long anusvāra after a short vowel.
U+1CED vedic sign tiryak is the only combining character in this set of nasalization
marks. While it appears similar to the U+094D devanagari sign virama, it is used to ren-
der glyph variants of nasal marks that occur in manuscripts and printed texts.
Ardhavisarga. U+1CF2 vedic sign ardhavisarga is a character that marks either the jih-
vāmūlīya, a velar fricative occurring only before the unvoiced velar stops ka and kha, or the
upadhmānīya, a bilabial fricative occurring only before the unvoiced labial stops pa and
pha. Ardhavisarga is a spacing character. It is represented in text in visual order before the
consonant it modifies.
12.2 Bengali (Bangla)
There is an exception to this general pattern for the representation of Bengali independent
vowel letters, for the Bengali script orthography of Kokborok, a major language of Tripura
state in Northeast India. Kokborok has diphthongs which can occur as initial letters. To
reflect existing practice, these diphthongs are represented with two character sequences,
rather than as atomic characters, as shown in Table 12-12. Rendering systems which sup-
port display of the Kokborok orthography need to be aware of these exceptional sequences.
The sequence for vowel letter aw uses U+09D7 bengali au length mark, also noted in
the following discussion of two-part vowel signs.
Two-Part Vowel Signs. The Bengali script, along with a number of other Indic scripts,
makes use of two-part dependent vowel signs. In these dependent vowels (matras) one-half
of the vowel is displayed on each side of a consonant letter or cluster—for example,
U+09CB bengali vowel sign o and U+09CC bengali vowel sign au. To provide com-
patibility with existing implementations of the scripts that use two-part vowel signs, the
Unicode Standard explicitly encodes the right half of these vowel signs. For example,
U+09D7 bengali au length mark represents the right-half glyph component of
U+09CC bengali vowel sign au. In Bengali orthography, the au length mark is always
used in conjunction with the left part and does not have a meaning on its own.
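The relationship between the two-part signs and their pieces is canonical, which can be checked with Python's unicodedata module (a small illustration, not part of the encoding model itself):

    import unicodedata

    o_sign  = '\u09CB'  # BENGALI VOWEL SIGN O
    au_sign = '\u09CC'  # BENGALI VOWEL SIGN AU

    # NFD splits each two-part sign into its left piece (U+09C7) plus the
    # right piece: U+09BE for O, and U+09D7 (the au length mark) for AU.
    print(['%04X' % ord(c) for c in unicodedata.normalize('NFD', o_sign)])   # ['09C7', '09BE']
    print(['%04X' % ord(c) for c in unicodedata.normalize('NFD', au_sign)])  # ['09C7', '09D7']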
Special Characters. U+09F2..U+09F9 are a series of Bengali additions for writing currency
and fractions.
Historic Characters. The characters vocalic rr, vocalic l and vocalic ll, both in their inde-
pendent and dependent forms (U+098C, U+09C4, U+09E0..U+09E3), are only used to
write Sanskrit words in the Bengali script.
Characters for Assamese. Assamese employs two letters not used for the Bengali language.
The Assamese letter ra is represented in Unicode by U+09F0 ৰ bengali letter ra with
middle diagonal, and the Assamese letter wa is represented by U+09F1 ৱ bengali let-
ter ra with lower diagonal.
Assamese uses a conjunct character called kssa. Although kssa is often considered a sepa-
rate letter of the alphabet, it is not separately encoded. The conjunct is represented by the
sequence <U+0995 ক bengali letter ka, U+09CD ্ bengali sign virama, U+09B7 ষ
bengali letter ssa>. This same sequence is also used to represent the Bengali letter
khinya (or khiya).
Assamese uses two additional consonant-vowel ligatures formed with U+09F0 bengali
letter ra with middle diagonal, which are not used for the Bengali language. These
consonant-vowel ligatures are shown in the “ligated” column in Table 12-13.
Rendering Behavior. Like other Brahmic scripts in the Unicode Standard, Bengali uses the
hasant to form conjunct characters. For example, U+09B8 স bengali letter sa +
U+09CD ্ bengali sign virama + U+0995 ক bengali letter ka yields the conjunct স্ক
SKA. For general principles regarding the rendering of the Bengali script, see the rules for
rendering in Section 12.1, Devanagari.
Consonant-Vowel Ligatures. Some Bengali consonant plus vowel combinations have two
distinct visual presentations. The first visual presentation is a traditional ligated form, in
which the vowel combines with the consonant in a novel way. In the second presentation,
the vowel is joined to the consonant but retains its nominal form, and the combination is
not considered a ligature. These consonant-vowel combinations are illustrated in
Table 12-14.
The ligature forms of these consonant-vowel combinations are traditional. They are used
in handwriting and some printing. The “non-ligated” forms are more common; they are
used in newspapers and are associated with modern typefaces. However, the traditional lig-
atures are preferred in some contexts.
No semantic distinctions are made in Bengali text on the basis of the two different presen-
tations of these consonant-vowel combinations. However, some users consider it import-
ant that implementations support both forms and that the distinction be representable in
plain text. This may be accomplished by using U+200D zero width joiner and U+200C
zero width non-joiner to influence ligature glyph selection. (See “Cursive Connection
and Ligatures” in Section 23.2, Layout Controls.) Joiners are rarely needed in this situation.
The rendered appearance will typically be the result of a font choice.
A given font implementation can choose whether to treat the ligature forms of the conso-
nant-vowel combinations as the defaults for rendering. If the non-ligated form is the
default, then ZWJ can be inserted to request a ligature, as shown in Figure 12-12.
গ + ু → গু (0997 + 09C1: ga + u)
গ + ZWJ + ু → gu ligature (0997 + 200D + 09C1: ga + u ligature)
If the ligated form is the default for a given font implementation, then ZWNJ can be
inserted to block a ligature, as shown in Figure 12-13.
গ + ু → gu ligature (0997 + 09C1: ga + u ligature)
গ + ZWNJ + ু → গু (0997 + 200C + 09C1: ga + u)
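A minimal Python illustration of the two requests, using ga and the vowel sign u as in the figures:

    GA, U_SIGN = '\u0997', '\u09C1'  # BENGALI LETTER GA, BENGALI VOWEL SIGN U
    ZWJ, ZWNJ = '\u200D', '\u200C'

    default_form     = GA + U_SIGN         # the font decides between ligated and non-ligated
    request_ligature = GA + ZWJ + U_SIGN   # ask for the traditional gu ligature
    block_ligature   = GA + ZWNJ + U_SIGN  # ask for the non-ligated form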
Khiya. The letter ক্ষ, known as khiya or khinya, is often considered a distinct letter of the
Bangla alphabet. However, it is not encoded separately. It is represented by the sequence
<U+0995 ক bengali letter ka, U+09CD ্ bengali sign virama, U+09B7 ষ bengali
letter ssa>.
Khanda Ta. In Bengali, a dead consonant ta makes use of a special form, U+09CE bengali
letter khanda ta. This form is used in all contexts except where it is immediately fol-
lowed by one of the consonants: ta, tha, na, ba, ma, ya, or ra.
Khanda ta cannot bear a vowel matra or combine with a following consonant to form a
conjunct aksara. It can form a conjunct aksara only with a preceding dead consonant ra,
with the latter being displayed with a repha glyph placed on the khanda ta.
Versions of the Unicode Standard prior to Version 4.1 recommended that khanda ta be
represented as the sequence <U+09A4 bengali letter ta, U+09CD bengali sign
virama, U+200D zero width joiner> in all circumstances. U+09CE bengali letter
khanda ta should instead be used explicitly in newly generated text, but users are cau-
tioned that instances of the older representation may exist.
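A naive migration of the older representation can be sketched in Python. It is illustrative only; blindly replacing the sequence assumes the text used <TA, VIRAMA, ZWJ> solely for khanda ta, as the older recommendation prescribed:

    KHANDA_TA     = '\u09CE'              # BENGALI LETTER KHANDA TA
    OLD_KHANDA_TA = '\u09A4\u09CD\u200D'  # <ta, virama, zwj>, the pre-4.1 representation

    def modernize_khanda_ta(text):
        """Replace the older <TA, VIRAMA, ZWJ> sequence with U+09CE."""
        return text.replace(OLD_KHANDA_TA, KHANDA_TA)

    print(modernize_khanda_ta('\u09A4\u09CD\u200D') == KHANDA_TA)  # True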
The Bengali syllable tta illustrates the usage of khanda ta when followed by ta. The syllable
tta is normally represented with the sequence <U+09A4 ta, U+09CD hasant, U+09A4 ta>.
That sequence will normally be displayed using a single glyph tta ligature, as shown in the
first example in Figure 12-14.
ত + ্ + ত → ত্ত (09A4 + 09CD + 09A4: ta-ta ligature)
ত + ্ + ZWNJ + ত → ত্‌ত (09A4 + 09CD + 200C + 09A4: ta, hasant, ta)
ৎ + ত → ৎত (09CE + 09A4: khanda-ta, ta)
It is also possible for the sequence <ta, hasant, ta> to be displayed with a full ta glyph com-
bined with a hasant glyph, followed by another full ta glyph vu. The choice of form actu-
ally displayed depends on the display engine, based on the availability of glyphs in the font.
The Unicode Standard also provides an explicit way to show the hasant glyph. To do so, a
zero width non-joiner is inserted after the hasant. That sequence is always displayed
with the explicit hasant, as shown in the second example in Figure 12-14.
When the syllable tta is written with a khanda ta, however, the character U+09CE bengali
letter khanda ta is used and no hasant is required, as khanda ta is already a dead conso-
nant. The rendering of khanda ta is illustrated in the third example in Figure 12-14.
Ya-phalaa. Ya-phalaa is a presentation form of U+09AF য bengali letter ya. Repre-
sented by the sequence <U+09CD ্ bengali sign virama, U+09AF য bengali letter
ya>, ya-phalaa has a special post-base form. When combined with U+09BE া bengali vowel
sign aa, it is used for transcribing [æ] as in the “a” in the English word “bat.” The ya-pha-
laa appears, for example, in the word pronounced [ræt] “rash,” which provides a minimal
pair with [rat] “a whole lot.”
Ya-phalaa can be applied to initial vowels as well:
অ্যা = <0985, 09CD, 09AF, 09BE> (a- hasant ya -aa)
এ্যা = <098F, 09CD, 09AF, 09BE> (e- hasant ya -aa)
If a candrabindu or other combining mark needs to be added in the sequence, it comes at
the end of the sequence. For example:
অ্যাঁ = <0985, 09CD, 09AF, 09BE, 0981> (a- hasant ya -aa candrabindu)
Further examples:
অ + ্ + য + া → অ্যা
এ + ্ + য + া → এ্যা
ত + ্ + য + া → ত্যা
Interaction of Repha and Ya-phalaa. The formation of the repha form is defined in
Section 12.1, Devanagari, “Rules for Rendering,” R2. Basically, the repha is formed when a
ra that has the inherent vowel killed by the hasant begins a syllable. This scenario is shown
in the following example:
র + ্ + ম → র্ম, as in কর্ম (karma)
The ya-phalaa is a post-base form of ya and is formed when the ya is the final consonant of
a syllable cluster. In this case, the previous consonant retains its base shape and the hasant
is combined with the following ya. This scenario is shown in the following example:
ক + ্ + য → ক্য, as in বাক্য (bakyô)
র + ZWJ + ্ + য → ra with ya-phalaa (09B0 + 200D + 09CD + 09AF)
When the first character of the cluster is not a ra, the ya-phalaa is the normal rendering of
a ya, and a ZWJ is not necessary but can be present. Such a convention would make it pos-
sible, for example, for input methods to consistently associate ya-phalaa with the sequence
<ZWJ, hasant, ya>.
Jihvamuliya and Upadhmaniya. In Bengali, the voiceless velar and bilabial fricatives are
represented by U+1CF5 x vedic sign jihvamuliya and U+1CF6 y vedic sign upadh-
maniya, respectively. When the signs appear with a following homorganic voiceless stop
consonant, they can be rendered in a font as a stacked ligature without a virama:
ᳵ + ক → stacked ligature (jihvamuliya above ka)
ᳶ + প → stacked ligature (upadhmaniya above pa)
The sequences can also be represented linearly by inserting a U+200C zero width non-
joiner after the jihvamuliya or upadhmaniya, but before the following consonant:
ᳵ + ZWNJ + ক → linear rendering
ᳶ + ZWNJ + প → linear rendering
Dependent vowel signs can also be added to the stack or linear sequence. Consonant clus-
ters containing U+1CF5 vedic sign jihvamuliya and U+1CF6 vedic sign upadhmaniya
can occur with more than two consonants, such as ẖkra and ḫpra.
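A minimal Python illustration of the two representations; the choice of ka and pa here simply follows the homorganic-stop requirement described above:

    JIHVAMULIYA, UPADHMANIYA = '\u1CF5', '\u1CF6'
    KA, PA, ZWNJ = '\u0995', '\u09AA', '\u200C'

    stacked_jk = JIHVAMULIYA + KA         # may be rendered as a stacked ligature
    linear_jk  = JIHVAMULIYA + ZWNJ + KA  # linear (side-by-side) rendering

    stacked_up = UPADHMANIYA + PA
    linear_up  = UPADHMANIYA + ZWNJ + PA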
Punctuation. Bengali uses punctuation marks shared across many Indic scripts, including
the danda and double danda marks. In Bangla these are called the dahri and double dahri.
For a description of these common punctuation marks, see Section 12.1, Devanagari.
Truncation. The orthography of the Bangla language makes use of U+02BC “ ’ ” modifier
letter apostrophe to indicate the truncation of words. This sign is called urdha-comma.
Examples illustrating the use of U+02BC “ ’ ” modifier letter apostrophe are shown in
Table 12-15.
12.3 Gurmukhi
Gurmukhi: U+0A00–U+0A7F
The Gurmukhi script is a North Indian script used to write the Punjabi (or Panjabi) lan-
guage of the Punjab state of India. Gurmukhi, which literally means “proceeding from the
mouth of the Guru,” is attributed to Angad, the second Sikh Guru (1504–1552 ce). It is
derived from an older script called Landa and is closely related to Devanagari structurally.
The script is closely associated with Sikhs and Sikhism, but it is used on an everyday basis
in East Punjab. (West Punjab, now in Pakistan, uses the Arabic script.)
Encoding Principles. The Gurmukhi block is based on ISCII-1988, which makes it parallel
to Devanagari. Gurmukhi, however, has a number of peculiarities described here.
The additional consonants (called pairin bindi; literally, “with a dot in the foot,” in Pun-
jabi) are primarily used to differentiate Urdu or Persian loan words. They include U+0A36
gurmukhi letter sha and U+0A33 gurmukhi letter lla, but do not include U+0A5C
gurmukhi letter rra, which is genuinely Punjabi. For unification with the other scripts,
ISCII-1991 considers rra to be equivalent to dda+nukta, but this decomposition is not
considered in Unicode. At the same time, ISCII-1991 does not consider U+0A36 to be
equivalent to <0A38, 0A3C>, or U+0A33 to be equivalent to <0A32, 0A3C>.
Two different marks can be associated with U+0902 devanagari sign anusvara:
U+0A02 gurmukhi sign bindi and U+0A70 gurmukhi tippi. Present practice is to use
bindi only with the dependent and independent forms of the vowels aa, ii, ee, ai, oo, and au,
and with the independent vowels u and uu; tippi is used in the other contexts. Older texts
may depart from this requirement. ISCII-1991 uses only one encoding point for both
marks.
U+0A71 gurmukhi addak is a special sign to indicate that the following consonant is
geminate. ISCII-1991 does not have a specific code point for addak and encodes it as a
cluster. For example, the word ਪੱਗ pagg, “turban,” can be represented with the sequence
<0A2A, 0A71, 0A17> (or <pa, addak, ga>) in Unicode, while in ISCII-1991 it would be <pa,
ga, virama, ga>.
U+0A75 ੵ gurmukhi sign yakash probably originated as a subjoined form of U+0A2F ਯ
gurmukhi letter ya. However, because its usage is relatively rare and not entirely pre-
dictable, it is encoded as a separate character. Some modern fonts render yakash with a
glyph that varies from the traditional shape found in the code charts. This character
should occur after the consonant to which it attaches and before any vowel sign.
U+0A51 ੑ gurmukhi sign udaat occurs in older texts and indicates a high tone. This
character should occur after the consonant to which it attaches and before any vowel sign.
Punjabi does not have complex combinations of consonant sounds. Furthermore, the
orthography is not strictly phonetic, and sometimes the inherent /a/ sound is not pro-
nounced. For example, the word ਗੁਰਮੁਖੀ gurmukhī is represented with the sequence
<0A17, 0A41, 0A30, 0A2E, 0A41, 0A16, 0A40>, which could be transliterated as guramukhī;
this lack of pronunciation is systematic at the end of a word. As a result, the virama sign is
seldom used with the Gurmukhi script.
In older texts, such as the Sri Guru Granth Sahib (the Sikh holy book), one can find typo-
graphic clusters with a vowel sign attached to a vowel letter, or with two vowel signs
attached to a consonant. The most common cases are the vowel sign u attached to a vowel
letter, and both the vowel signs o and u attached to a consonant, as in goubinda; this is used to
indicate the metrical shortening of /o/ or the lengthening of /u/ depending on the context.
Other combinations are attested as well, such as ਗ੍ਹਿਾਨ ghiana, represented by the sequence
<U+0A17, U+0A4D, U+0A39, U+0A3F, U+0A3E, U+0A28>.
Because of the combining classes of the characters U+0A4B gurmukhi vowel sign oo
and U+0A41 gurmukhi vowel sign u, the sequences <consonant, U+0A4B, U+0A41>
and <consonant, U+0A41, U+0A4B> are not canonically equivalent. To avoid ambiguity in
representation, the first sequence, with U+0A4B before U+0A41, should be used in such
cases. More generally, when a consonant or independent vowel is modified by multiple
vowel signs, the sequence of the vowel signs in the underlying representation of the text
should be: left, top, bottom, right.
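Because both signs have canonical combining class 0, normalization does not reorder them, so the two orders remain distinct code point sequences. A quick check with Python's unicodedata module:

    import unicodedata

    KA = '\u0A15'               # GURMUKHI LETTER KA
    OO, U = '\u0A4B', '\u0A41'  # GURMUKHI VOWEL SIGN OO, GURMUKHI VOWEL SIGN U

    recommended = KA + OO + U   # oo before u, as recommended above
    other_order = KA + U + OO

    print(unicodedata.combining(OO), unicodedata.combining(U))  # 0 0
    print(unicodedata.normalize('NFC', recommended) ==
          unicodedata.normalize('NFC', other_order))            # False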
Vowel Letters. Vowel letters are encoded atomically in Unicode, even if they can be ana-
lyzed visually as consisting of multiple parts. Table 12-16 shows the letters that can be ana-
lyzed, the single code point that should be used to represent them in text, and the sequence
of code points resulting from analysis that should not be used.
Tones. The Punjabi language is tonal, but the Gurmukhi script does not contain any spe-
cific signs to indicate tones. Instead, the voiced aspirates (gha, jha, ddha, dha) and the let-
ter ha combine consonantal and tonal functions.
Ordering. U+0A73 gurmukhi ura and U+0A72 gurmukhi iri are the first and third “let-
ters” of the Gurmukhi syllabary, respectively. They are used as bases or bearers for some of
the independent vowels, while U+0A05 gurmukhi letter a is both the second “letter”
and the base for the remaining independent vowels. As a result, the collation order for Gur-
mukhi is based on a seven-by-five grid:
• The first row is U+0A73 ura, U+0A05 a, U+0A72 iri, U+0A38 sa, U+0A39 ha.
• This row is followed by five main rows of consonants, grouped according to the
point of articulation, as is traditional in all South and Southeast Asian scripts.
• The semiconsonants follow in the seventh row: U+0A2F ya, U+0A30 ra,
U+0A32 la, U+0A35 va, U+0A5C rra.
• The letters with nukta, added later, are presented in a subsequent eighth row if
needed.
Rendering Behavior. For general principles regarding the rendering of the Gurmukhi
script, see the rules for rendering in Section 12.1, Devanagari. In many aspects, Gurmukhi
is simpler than Devanagari. In modern Punjabi, there are no half-consonants, no half-
forms, no repha (upper form of U+0930 devanagari letter ra), and no real ligatures.
Rules R2–R5, R11, and R14 do not apply. Conversely, the behavior for subscript RA (rules
R6–R8 and R13) applies to U+0A39 gurmukhi letter ha and U+0A35 gurmukhi let-
ter va, which also have subjoined forms, called pairin in Punjabi. The subjoined form for
RA is like a knot, while the subjoined HA and VA are written the same as the base form,
without the top bar, but are reduced in size. As described in rule R13, they attach at the bot-
tom of the base consonant, and will “push” down any attached vowel sign for U or UU.
When U+0A2F gurmukhi letter ya follows a dead consonant, it assumes a different
form called addha in Punjabi, without the leftmost part, and the dead consonant returns to
the nominal form, as shown in Table 12-17.
[Table 12-17 (glyphs not reproduced): ma + virama + ha → mha (pairin ha); pa + virama + ra → pra (pairin ra); da + virama + va → dva (pairin va); da + virama + ya → dya (addha ya).]
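As an illustration of how such clusters are encoded, the following Python sketch builds two of the sequences named above; which glyph actually appears (the pairin or addha form) is entirely up to the font and shaping engine.

    MA, DA, YA, HA = "\u0A2E", "\u0A26", "\u0A2F", "\u0A39"   # GURMUKHI MA, DA, YA, HA
    VIRAMA = "\u0A4D"                                         # GURMUKHI SIGN VIRAMA

    mha = MA + VIRAMA + HA   # ma + virama + ha: ha takes its subjoined (pairin) form
    dya = DA + VIRAMA + YA   # da + virama + ya: ya takes its post-base (addha) form
    print([f"U+{ord(c):04X}" for c in mha])   # ['U+0A2E', 'U+0A4D', 'U+0A39']
    print([f"U+{ord(c):04X}" for c in dya])   # ['U+0A26', 'U+0A4D', 'U+0A2F']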
[Further examples (glyphs not reproduced): sa followed by virama and a second consonant, giving sga (pairin ga), sca (pairin ca), sta (pairin ta), sda (pairin da), sna (pairin na), sya (pairin ya), and sma (addha ma).]
Older texts also exhibit another feature that is not found in modern Gurmukhi—namely,
the use of a half- or reduced form for the first consonant of a cluster, whereas the modern
practice is to represent the second consonant in a half- or reduced form. Joiners can be
used to request this older rendering, as shown in Table 12-19. The reduced form of an ini-
tial U+0A30 gurmukhi letter ra is similar to the Devanagari superscript RA (repha), but
this usage is rare, even in older texts.
[Table 12-19 (glyphs not reproduced): sva and rva written plain, with ZWJ after the virama, and with ZWNJ after the virama.]
A rendering engine for Gurmukhi should make accommodations for the correct position-
ing of the combining marks (see Section 5.13, Rendering Nonspacing Marks, and particu-
larly Figure 5-11). This is important, for example, in the correct centering of the marks
above and below U+0A28 gurmukhi letter na and U+0A20 gurmukhi letter ttha,
which are laterally symmetrical. It is also important to avoid collisions between the various
upper marks, vowel signs, bindi, and/or addak.
Other Symbols. The religious symbol khanda sometimes used in Gurmukhi texts is
encoded at U+262C adi shakti in the Miscellaneous Symbols block. U+0A74 gurmukhi
ek onkar, which is also a religious symbol, can have different presentation forms, which
do not change its meaning. The font used in the code charts shows a highly stylized form;
simpler forms look like the digit one, followed by a sign based on ura, along with a long
upper tail.
Punctuation. Danda and double danda marks as well as some other unified punctuation
used with Gurmukhi are found in the Devanagari block. See Section 12.1, Devanagari, for
more information. Punjabi also uses Latin punctuation.
12.4 Gujarati
Gujarati: U+0A80–U+0AFF
The Gujarati script is a North Indian script closely related to Devanagari. It is most obvi-
ously distinguished from Devanagari by not having a horizontal bar for its letterforms, a
characteristic of the older Kaithi script to which Gujarati is related. The Gujarati script is
used to write the Gujarati language of the Gujarat state in India.
Vowel Letters. Vowel letters are encoded atomically in Unicode, even if they can be ana-
lyzed visually as consisting of multiple parts. Table 12-20 shows the letters that can be ana-
lyzed, the single code point that should be used to represent them in text, and the sequence
of code points resulting from analysis that should not be used.
Rendering Behavior. For rendering of the Gujarati script, see the rules for rendering in
Section 12.1, Devanagari. Like other Brahmic scripts in the Unicode Standard, Gujarati
uses the virama to form conjunct characters. The virama is informally called khoḍo, which
means “lame” in Gujarati. Many conjunct characters, as in Devanagari, lose the vertical
stroke; there are also vertical conjuncts. U+0AB0 gujarati letter ra takes special forms
when it combines with other consonants, as shown in Table 12-21.
Marks for Transliteration of Arabic. The combining marks encoded in the range
U+0AFA..U+0AFF are used for the transliteration of the Arabic script into Gujarati. This
system of transliteration was devised in the late 19th century, and is used by Ismaili Khoja
communities. These marks occur both in manuscripts and in printed materials.
The three forms of nukta encoded in the range U+0AFD..U+0AFF are diacritics, placed
above regular Gujarati letters to create new letters corresponding to Arabic letters for non-
Gujarati sounds. U+0AFF gujarati sign two-circle nukta above is used only with
U+0A9D gujarati letter jha, to transliterate the Arabic zah. U+0AFE gujarati sign
circle nukta above is used with U+0A9D gujarati letter jha to transliterate the Ara-
bic thal and with U+0AB8 gujarati letter sa to transliterate the Arabic theh. U+0AFD
gujarati sign three-dot nukta above occurs with a number of different Gujarati let-
ters, to transliterate a variety of Arabic letters.
U+0AFA gujarati sign sukun, U+0AFB gujarati sign shadda, and U+0AFC gujarati
sign maddah are used to transliterate the Arabic sukun, shadda, and maddah above,
respectively. These marks may be applied to a Gujarati letter which also uses one of the
three above-base nukta diacritic marks. In such cases, the nukta occurs first in the combin-
ing sequence, followed by the sukun, shadda, or maddah mark. However, instead of being
rendered above the nukta mark on the letter, the sukun, shadda, or maddah mark is ren-
dered to the left of the nukta mark.
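The combining order can be illustrated with a short Python sketch; the particular letter and marks used here (jha with the circle nukta and the shadda) are chosen only as an example.

    JHA          = "\u0A9D"   # GUJARATI LETTER JHA
    CIRCLE_NUKTA = "\u0AFE"   # GUJARATI SIGN CIRCLE NUKTA ABOVE
    SHADDA       = "\u0AFB"   # GUJARATI SIGN SHADDA

    # The nukta comes first in the combining sequence, followed by the
    # shadda, even though the shadda is rendered to the left of the nukta.
    seq = JHA + CIRCLE_NUKTA + SHADDA
    print([f"U+{ord(c):04X}" for c in seq])   # ['U+0A9D', 'U+0AFE', 'U+0AFB']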
Punctuation. Words in Gujarati are separated by spaces. Danda and double danda marks
as well as some other unified punctuation used with Gujarati are found in the Devanagari
block; see Section 12.1, Devanagari.
12.5 Oriya (Odia)
Oriya: U+0B00–U+0B7F
Rendering Behavior. For rendering of the Oriya script, see the rules for rendering in
Section 12.1, Devanagari. Like other Brahmic scripts in the Unicode Standard, Oriya uses
the virama to suppress the inherent vowel. Oriya has a visible virama, often being a length-
ening of a part of the base consonant:
[Example (glyphs not reproduced): ta + virama + ya → tya.]
Consonant Forms. In the initial position in a cluster, RA is reduced and placed above the
following consonant, while it is also reduced in the second position:
[Examples (glyphs not reproduced): ra + virama + pa → rpa; pa + virama + ra → pra.]
Nasal and stop clusters may be written with conjuncts, or the anusvara may be used.
Oriya VA and WA. These two letters are extensions to the basic Oriya alphabet. Because
Sanskrit yx vana becomes Oriya qz bana in orthography and pronunciation, an
extended letter U+0B35 r oriya letter va was devised by dotting U+0B2C p oriya let-
ter ba for use in academic and technical text. For example, basic Oriya script cannot dis-
tinguish Sanskrit wy bava from ww baba or yy vava, but this distinction can be made
with the modified version of ba. In some older sources, the glyph N is sometimes found for
va; in others, P and Q have been shown, which in a more modern type style would be R.
The letter va is not in common use today.
In a consonant conjunct, subjoined U+0B2C p oriya letter ba is usually—but not
always—pronounced [wa]:
U+0B15 ka + U+0B4D virama + U+0B2C ba → [kwa]
U+0B2E ma + U+0B4D virama + U+0B2C ba → [mba]
The extended Oriya letter U+0B71 T oriya letter wa is sometimes used in Perso-Arabic
or English loan words for [w]. It appears to have originally been devised as a ligature of V o
and p ba, but because ligatures of independent vowels and consonants are not normally
used in Oriya, this letter has been encoded as a single character that does not have a
decomposition. It is used initially in words or orthographic syllables to represent the for-
eign consonant; as a native semivowel, virama + ba is used because that is historically
accurate. Glyph variants of wa are S, U, and VW.
Punctuation and Symbols. Danda and double danda marks as well as some other unified
punctuation used with Oriya are found in the Devanagari block; see Section 12.1, Devana-
gari. The mark U+0B70 oriya isshar is placed before names of persons who are deceased.
The sacred syllable om is formed by U+0B13 oriya letter o and U+0B01 oriya sign
candrabindu. Ligation of the two glyphs can be encouraged or discouraged by the use of
U+200D zero width joiner or U+200C zero width non-joiner between the two char-
acters, as seen in Table 12-25. In the absence of a joiner, both the non-ligated and the
ligated forms are acceptable renderings.
Fraction Characters. As for many other scripts of India, Oriya has characters used to denote fractional values. These were more commonly used before the advent of decimal weights, measures, and currencies. Oriya uses six signs: three for quarter values (1/4, 1/2, 3/4) and three for sixteenth values (1/16, 1/8, and 3/16). These are used additively, with quarter values appearing before sixteenths. Thus U+0B72 oriya fraction one quarter followed by U+0B75 oriya fraction one sixteenth represents the value 5/16.
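The additive value of such a run of fraction signs can be computed from the numeric values recorded in the Unicode Character Database, as in the following Python sketch (illustrative only).

    import unicodedata
    from fractions import Fraction

    QUARTER   = "\u0B72"   # ORIYA FRACTION ONE QUARTER   (1/4)
    SIXTEENTH = "\u0B75"   # ORIYA FRACTION ONE SIXTEENTH (1/16)

    def oriya_fraction_value(text):
        """Sum the numeric values of a run of Oriya fraction signs."""
        return sum((Fraction(unicodedata.numeric(ch)) for ch in text), Fraction(0))

    print(oriya_fraction_value(QUARTER + SIXTEENTH))   # 5/16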
12.6 Tamil
Tamil: U+0B80–U+0BFF
The Tamil script is descended from the South Indian branch of Brahmi. It is used to write
the Tamil language of the Tamil Nadu state in India as well as minority languages such as
Irula, the Dravidian language Badaga, and the Indo-European language Saurashtra. Tamil
is also used in Sri Lanka, Singapore, and parts of Malaysia.
The Tamil script has fewer consonants than the other Indic scripts. When representing the
“missing” consonants in transcriptions of languages such as Sanskrit or Saurashtra, super-
script European digits are often used, so pa with superscript 2 = pha, pa with superscript 3 = ba, and pa with superscript 4 = bha. The characters U+00B2, U+00B3, and U+2074 can be used to preserve this distinction in plain text. The
Grantha script is often also used by Tamil speakers to write Sanskrit because Grantha con-
tains these missing consonants.
The Tamil script also avoids the use of conjunct consonant forms, although a few conven-
tional conjuncts are used.
Virama (Puḷḷi). Because the Tamil encoding in the Unicode Standard is based on ISCII-
1988 (Indian Script Code for Information Interchange), it makes use of the abugida model.
An abugida treats the basic consonants as containing an inherent vowel, which can be can-
celed by the use of a visible mark, called a virama in Sanskrit. In most Brahmi-derived
scripts, the placement of a virama between two consonants implies the deletion of the
inherent vowel of the first consonant and causes a conjoined or subjoined consonant clus-
ter. In those scripts, zero width non-joiner is used to display a visible virama, as shown
previously in the Devanagari example in Figure 12-4.
The situation is quite different for Tamil because the script uses very few consonant con-
juncts. An orthographic cluster consisting of multiple consonants (represented by <C1,
U+0BCD tamil sign virama, C2, ...>) is normally displayed with explicit viramas, which
are called puḷḷi in Tamil. The puḷḷi is typically rendered as a dot centered above the character. It occasionally appears as a small circle instead of a dot, but this glyph variant should be
handled by the font, and not be represented by the similar-appearing U+0B82 tamil sign
anusvara.
The conjuncts kssa and shrii are traditionally displayed by conjunct ligatures, as illustrated
for kssa in Figure 12-15, but nowadays tend to be displayed using an explicit puḷḷi as well. To explicitly display a puḷḷi for such sequences, zero width non-joiner can be inserted after the puḷḷi in the sequence of characters.
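For example, the kssa cluster can be encoded either way, as in the following Python sketch; which rendering a font actually produces remains font-dependent.

    KA, SSA = "\u0B95", "\u0BB7"   # TAMIL LETTER KA, TAMIL LETTER SSA
    PULLI   = "\u0BCD"             # TAMIL SIGN VIRAMA (pulli)
    ZWNJ    = "\u200C"             # ZERO WIDTH NON-JOINER

    ligature_possible = KA + PULLI + SSA          # may display as the kssa ligature
    explicit_pulli    = KA + PULLI + ZWNJ + SSA   # requests ka with a visible pulli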
Rendering of the Tamil Script. The Tamil script is complex and requires special rules for
rendering. The following discussion describes the most important features of Tamil ren-
dering behavior. As with any script, a more complex procedure can add rendering charac-
teristics, depending on the font and application.
In a font that is capable of rendering Tamil, the number of glyphs is greater
than the number of Tamil characters.
Tamil Vowels
Vowel Letters. Vowel letters are encoded atomically in Unicode, even if they can be ana-
lyzed visually as consisting of multiple parts. Table 12-26 shows the letters that can be ana-
lyzed, the single code point that should be used to represent them in text, and the sequence
of code points resulting from analysis that should not be used.
Independent Versus Dependent Vowels. In the Tamil script, the dependent vowel signs are
not equivalent to a sequence of virama + independent vowel. For example:
[Example (glyphs not reproduced): <consonant, dependent vowel sign> is not equivalent to <consonant, virama, independent vowel>.]
Left-Side Vowels. The Tamil vowels U+0BC6, U+0BC7, and U+0BC8 are reordered in front of the consonant to which they are applied. When occurring in a syllable, these vowels are rendered to the left side of their consonant, as shown in Figure 12-16.
Two-Part Vowels. The Tamil two-part dependent vowels each consist of a left part and a right part, and can be represented either by a single code point or by the canonically equivalent sequence of the two parts. The representation as a single code point is the preferred form and the form in common use for Tamil. In the process of rendering, these two-part vowels are transformed into the two separate glyphs of their parts, which are then subject to vowel reordering, as shown in Figure 12-18.
Tamil Ligatures
A number of ligatures are conventionally used in Tamil. Most ligatures involve the shape
taken by a consonant plus vowel sequence. A wide variety of modern Tamil words are writ-
ten without a conjunct form, with a fully visible puḷḷi.
Ligatures with Vowel i. The vowel signs i and ii form ligatures with the consonant tta, as shown in examples 1 and 2 of Figure 12-21. These vowels often change shape or
position slightly so as to join cursively with other consonants, as shown in examples 3 and
4 of Figure 12-21.
[Figure 12-21 (glyphs not reproduced): 1. tta + i → ṭi; 2. tta + ii → ṭī; 3. la + i → li; 4. la + ii → lī.]
Ligatures with Vowel u. The vowel signs u and uu normally ligate with their conso-
nant, as shown in Table 12-27. In the first column, the basic consonant is shown; the sec-
ond column illustrates the ligation of that consonant with the u vowel sign; and the third
column illustrates the ligation with the uu vowel sign.
[Examples (glyphs not reproduced): ja + u → ju; ja + uu → jū.]
Ligatures with ra. Based on typographical preferences, the consonant ra may change shape when it ligates. Such a change, if it occurs, will happen only when the changed form of U+0BB0 tamil letter ra would not be confused with the nominal form of U+0BBE tamil vowel sign aa (namely, when ra is combined with the puḷḷi or with the vowel signs i or ii). This change in shape is illustrated in Figure 12-23.
[Figure 12-23 (glyphs not reproduced): ra + puḷḷi → r; ra + i → ri; ra + ii → rī.]
However, various governmental bodies mandate that the basic shape of the consonant ra should be used for these ligatures as well, especially in school textbooks. Media and literary publications in Malaysia and Singapore mostly use the unchanged form of ra. Sri Lanka, on the other hand, specifies the use of the changed forms shown in Figure 12-24.
Tamil Ligature shri. Prior to Unicode 4.1, the best mapping to represent the ligature shri
was to the sequence <U+0BB8, U+0BCD, U+0BB0, U+0BC0>. Unicode 4.1 in 2005 added
the character U+0BB6 tamil letter sha and as a consequence, the best mapping became
<U+0BB6, U+0BCD, U+0BB0, U+0BC0>. Both the old and the new sequences are displayed with the shri ligature.
Ligatures with aa in Traditional Tamil Orthography. In traditional Tamil orthography,
the vowel sign aa optionally ligates with the consonants ṇa, ṟa, and ṉa, as illustrated in Figure 12-25.
[Figure 12-25 (glyphs not reproduced): ṇa + aa → ṇā; ṟa + aa → ṟā; ṉa + aa → ṉā.]
These ligations also affect the right-hand part of two-part vowels, as shown in Figure 12-26.
[Figure 12-26 (glyphs not reproduced): ṇa, ṟa, and ṉa combined with the two-part vowels o and oo, with the ligated aa part on the right.]
[Figure 12-27 (glyphs not reproduced): in traditional orthography the vowel ai takes a changed shape when combined with certain consonants.]
By contrast, in modern Tamil orthography, this vowel does not change its shape, as shown
in Figure 12-28.
[Figure 12-28 (glyphs not reproduced): ṇa combined with ai, with the nominal shape of ai.]
Tamil aytham. The character U+0B83 tamil sign visarga is normally called aytham in
Tamil. It is historically related to the visarga in other Indic scripts, but has become an ordi-
nary spacing letter in Tamil. The aytham occurs in native Tamil words, but is frequently
used as a modifying prefix before consonants used to represent foreign sounds. In particu-
lar, it is used in the spelling of words borrowed into Tamil from English or other languages.
Punctuation. Danda and double danda marks as well as some other unified punctuation
used with Tamil are found in the Devanagari block; see Section 12.1, Devanagari.
Numbers. Modern Tamil decimal digits are encoded at U+0BE6..U+0BEF. Note that some
digits are confusable with letters, as shown in Table 12-28. In some Tamil fonts, the digits
for two and eight look exactly like the letters u and a, respectively. In other fonts, as shown
here, the shapes for the digits two and eight are adjusted to minimize confusability.
Tamil also has distinct numerals for ten, one hundred, and one thousand at
U+0BF0..U+0BF2 used for historical numbers.
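These numeric properties are recorded in the Unicode Character Database and are visible, for instance, through Python's unicodedata module; the following lines are a small illustration only.

    import unicodedata

    print(int("\u0BE7\u0BE8\u0BE9"))       # 123: Tamil digits one, two, three
    print(unicodedata.numeric("\u0BF0"))   # 10.0   TAMIL NUMBER TEN
    print(unicodedata.numeric("\u0BF1"))   # 100.0  TAMIL NUMBER ONE HUNDRED
    print(unicodedata.numeric("\u0BF2"))   # 1000.0 TAMIL NUMBER ONE THOUSAND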
Use of Nukta. In addition to Tamil, several other languages of southern India are written
using the Tamil script. For example, Irula is written with the Tamil script. Some of these
languages contain sounds distinct from those normally written for the Tamil language. In
such cases, the writing systems of these languages apply diacritic nukta marks to Tamil let-
ters to represent their distinct sounds. For example, Irula uses a double dot nukta below for
some sounds. That nukta can be represented with U+1133C grantha sign nukta.
12.7 Telugu
Telugu: U+0C00–U+0C7F
The Telugu script is a South Indian script used to write the Telugu language of the Andhra
Pradesh state in India as well as minority languages such as Gondi (Adilabad and Koi dia-
lects) and Lambadi. The script is also used in Maharashtra, Odisha (Orissa), Madhya
Pradesh, and West Bengal. The Telugu script became distinct by the thirteenth century ce
and shares ancestors with the Kannada script.
Vowel Letters. Vowel letters are encoded atomically in Unicode, even if they can be ana-
lyzed visually as consisting of multiple parts. Table 12-30 shows the letters that can be ana-
lyzed, the single code point that should be used to represent them in text, and the sequence
of code points resulting from analysis that should not be used.
Rendering Behavior. Telugu script rendering is similar to that of other Brahmic scripts in
the Unicode Standard—in particular, the Tamil script. Unlike Tamil, however, the Telugu
script writes conjunct characters with subscript letters. Many Telugu letters have a v-
shaped headstroke, which is a structural mark corresponding to the horizontal bar in
Devanagari and the arch in Oriya script. When a virama (called virāmamu in Telugu) or
certain vowel signs are added to a letter with this headstroke, it is replaced:
U+0C15 ka + U+0C4D virama + U+200C zero width non-joiner → (k)
U+0C15 ka + U+0C3F vowel sign i → (ki)
Telugu consonant clusters are most commonly represented by a subscripted, and often
transformed, consonant glyph for the second element of the cluster:
U+0C17 ga + U+0C4D virama + U+0C17 ga → (gga)
U+0C15 ka + U+0C4D virama + U+0C15 ka → (kka)
U+0C15 ka + U+0C4D virama + U+0C2F ya → (kya)
U+0C15 ka + U+0C4D virama + U+0C37 ssa → (kssa)
Nakāra-Pollu. The sequence <U+0C28 telugu letter na, U+0C4D telugu sign virama> can have two representations in Telugu text. The first is the “regular” or “new style” form, which takes its shape from the glyphs in the sequence <U+0C28 telugu letter na, U+0C4D telugu sign virama>. Older texts display the other vowel-less form, called nakāra-pollu. The two forms are semantically identical. Fonts should render the sequence <U+0C28 telugu letter na, U+0C4D telugu sign virama> with either the old-style glyph or the new-style glyph. The character U+200C zero width non-joiner can be used to prevent interaction of this sequence with following consonants, as shown in Table 12-31.
Reph. In modern Telugu, U+0C30 telugu letter ra behaves in the same manner as most
other initial consonants in a consonant cluster. That is, the ra appears in its nominal form,
and the second consonant takes the C2-conjoining or subscripted form:
U+0C30 ra + U+0C4D virama + U+0C2E ma → (rma, with subscripted ma)
However, in older texts, U+0C30 telugu letter ra takes the reduced (or reph) form A
when it appears first in a consonant cluster, and the following consonant maintains its
nominal form:
U+0C30 ra + U+0C4D virama + U+0C2E ma → (rma, with reph)
U+200D zero width joiner is placed immediately after the virama to render the reph
explicitly in modern texts:
U+0C30 ra + U+0C4D virama + U+200D zwj + U+0C2E ma → (rma, with reph)
To prevent display of a reph, U+200D zero width joiner is placed after the ra, but pre-
ceding the virama:
U+0C30 ra + U+200D zwj + U+0C4D virama + U+0C2E ma → (rma, without reph)
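The three representations of the cluster rma just described can be written out as follows; this Python sketch is illustrative only, and the rendered result depends on the font.

    RA, MA = "\u0C30", "\u0C2E"   # TELUGU LETTER RA, TELUGU LETTER MA
    VIRAMA = "\u0C4D"             # TELUGU SIGN VIRAMA
    ZWJ    = "\u200D"             # ZERO WIDTH JOINER

    modern_form   = RA + VIRAMA + MA         # nominal ra + subscripted ma
    explicit_reph = RA + VIRAMA + ZWJ + MA   # ZWJ after the virama requests the reph
    no_reph       = RA + ZWJ + VIRAMA + MA   # ZWJ before the virama prevents the reph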
Special Characters. U+0C55 telugu length mark is provided as an encoding for the
second element of the vowel U+0C47 telugu vowel sign ee. U+0C56 telugu ai length
mark is provided as an encoding for the second element of the vowel
U+0C48 telugu vowel sign ai. The length marks are both nonspacing characters. For a
detailed discussion of the use of two-part vowels, see “Two-Part Vowels” in Section 12.6,
Tamil.
Fractions. Prior to the adoption of the metric system, Telugu fractions were used as part of
the system of measurement. Telugu fractions are quaternary (base-4), and use eight marks,
which are conceptually divided into two sets. The first set represents odd-numbered nega-
tive powers of four in fractions. The second set represents even-numbered negative powers
of four in fractions. Different zeros are used with each set. The zero from the first set is
known as hakki, U+0C78 telugu fraction digit zero for odd powers of four. The
zero for the second set is U+0C66 telugu digit zero.
Punctuation. Danda and double danda are used primarily in the domain of religious texts
to indicate the equivalent of a comma and full stop, respectively. The danda and double
danda marks as well as some other unified punctuation used with Telugu are found in the
Devanagari block; see Section 12.1, Devanagari.
12.8 Kannada
Kannada: U+0C80–U+0CFF
The Kannada script is a South Indian script. It is used to write the Kannada (or Kanarese)
language of the Karnataka state in India and to write minority languages such as Tulu. The
Kannada language is also used in many parts of Tamil Nadu, Kerala, Andhra Pradesh, and
Maharashtra. This script is very closely related to the Telugu script both in the shapes of the
letters and in the behavior of conjunct consonants. The Kannada script also shares many fea-
tures common to other Indic scripts. See Section 12.1, Devanagari, for further information.
The Unicode Standard follows the ISCII layout for encoding, which also reflects the tradi-
tional Kannada alphabetic order.
Consonant Conjuncts. Kannada is also noted for a large number of consonant conjunct forms that serve as ligatures of two or more adjacent forms.
Implementations of the Kannada script need to be aware that sequences involving independent vowels followed by virama and U+0CDE are valid and required in orthographies for Badaga. Examples of the use of subjoined U+0CDE to indicate retroflexion, both for independent vowel letters and for dependent vowels, are shown in Figure 12-29.
[Figure 12-29 (glyphs not reproduced): <U+0C89, U+0CCD, U+0CDE> and <U+0CAF, U+0CCD, U+0CDE, U+0CC6>.]
Rendering Kannada
Plain text in Kannada is generally stored in phonetic order; that is, a CV syllable with a
dependent vowel is always encoded as a consonant letter C followed by a vowel sign V in
the memory representation. This order is employed by the ISCII standard and corresponds
to the phonetic and keying order of textual data. Unlike in Devanagari and some other
Indian scripts, all of the dependent vowels in Kannada are depicted to the right of their
consonant letters. Hence there is no need to reorder the elements in mapping from the log-
ical (character) store to the presentation (glyph) rendering, and vice versa.
Explicit Virama (Halant). Normally, a halant character creates dead consonants, which in
turn combine with subsequent consonants to form conjuncts. This behavior usually results
in a halant sign not being depicted visually. Occasionally, this default behavior is not
desired when a dead consonant should be excluded from conjunct formation, in which
case the halant sign is visibly rendered. To accomplish this, U+200C zero width non-
joiner is introduced immediately after the encoded dead consonant that is to be excluded
from conjunct formation. See Section 12.1, Devanagari, for examples.
Vowelless NA. The sequence <U+0CA8 kannada letter na, U+0CCD kannada sign virama> can have two representations in Kannada text. The first is the “regular” or “new style” form, which takes its shape from the glyphs in the sequence <U+0CA8 kannada letter na, U+0CCD kannada sign virama>. Older texts display the other vowel-less form. The two forms are semantically identical. Fonts should render the sequence <U+0CA8 kannada letter na, U+0CCD kannada sign virama> with either the old-style glyph or the new-style glyph. The character U+200C zero width non-joiner can be used to prevent interaction of this sequence with the following consonants, as shown in Table 12-33.
See the discussion of the analogous rendering of na in Telugu, called nakāra-pollu, in
Section 12.7, Telugu.
Consonant Clusters Involving RA. Whenever a consonant cluster is formed with the
U+0CB0 D kannada letter ra as the first component of the consonant cluster, the letter
ra is depicted with two different presentation forms: one as the initial element and the
other as the final display element of the consonant cluster.
U+0CB0 ra + U+0CCD halant + U+0C95 ka → rka
U+0CB0 ra + zwj + U+0CCD halant + U+0C95 ka → rka
U+0C95 ka + U+0CCD halant + U+0CB0 ra → kra
Jihvamuliya and Upadhmaniya. Voiceless velar and bilabial fricatives in Kannada are rep-
resented by U+0CF1 kannada sign jihvamuliya and U+0CF2 kannada sign upadh-
maniya, respectively. When the signs appear with a following homorganic voiceless stop
consonant, the combination should be rendered in the font as a stacked ligature, without a
virama:
U+0CF1 ೱ jihvamuliya + U+0C95 ಕ ka → (stacked ligature)
U+0CF2 ೲ upadhmaniya + U+0CAB ಫ pha → (stacked ligature)
Modifier Mark Rules. In addition to the vowel signs, one or more types of combining
marks may be applied to a component of a written syllable or the syllable as a whole. If the
consonant represents a dead consonant, then the nukta should precede the halant in the
memory representation. The nukta is represented by a double-dot mark, U+0CBC E kan-
nada sign nukta. Two such modified consonants are used in the Kannada language: one
representing the syllable za and one representing the syllable fa.
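Because the nukta has canonical combining class 7 and the halant has class 9, normalization also yields the order just described, as the following Python sketch illustrates (ja is used here as the example base; with the nukta it conventionally represents za).

    import unicodedata

    JA     = "\u0C9C"   # KANNADA LETTER JA
    NUKTA  = "\u0CBC"   # KANNADA SIGN NUKTA  (combining class 7)
    HALANT = "\u0CCD"   # KANNADA SIGN VIRAMA (combining class 9)

    print(unicodedata.combining(NUKTA), unicodedata.combining(HALANT))   # 7 9
    # A sequence typed with the halant first is canonically reordered so
    # that the nukta precedes the halant, matching the rule stated above.
    print(unicodedata.normalize("NFC", JA + HALANT + NUKTA) == JA + NUKTA + HALANT)   # True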
Avagraha Sign. A spacing mark, U+0CBD F kannada sign avagraha, is used when ren-
dering Sanskrit texts.
Punctuation. Danda and double danda marks as well as some other unified punctuation
used with this script are found in the Devanagari block; see Section 12.1, Devanagari.
12.9 Malayalam
Malayalam: U+0D00–U+0D7F
The Malayalam script is a South Indian script used to write the Malayalam language of the
Kerala state. Malayalam is a Dravidian language like Kannada, Tamil, and Telugu.
Throughout its history, it has absorbed words from Tamil, Sanskrit, Arabic, and English.
The shapes of Malayalam letters closely resemble those of Tamil. Malayalam, however, has
a very full and complex set of conjunct consonant forms.
Vowel Letters. Vowel letters are encoded atomically in Unicode, even if they can be ana-
lyzed visually as consisting of multiple parts. Table 12-34 shows the letters that can be ana-
lyzed, the single code point that should be used to represent them in text, and the sequence
of code points resulting from analysis that should not be used.
Two-Part Vowels. The Malayalam script uses several two-part vowel characters. In modern
times, the dominant practice is to write the dependent form of the au vowel using only “w”,
which is placed on the right side of the consonant it modifies; such texts are represented in
Unicode using U+0D57 malayalam au length mark. In the past, this dependent form
was written using both “v” on the left side and “w” on the right side; U+0D4C malayalam
vowel sign au can be used for documents following this earlier tradition. This historical
simplification started much earlier than the orthographic reforms mentioned in the text
that follows.
For a detailed discussion of the use of two-part vowels, see “Two-Part Vowels” in
Section 12.6, Tamil.
Historic Characters. The four characters, avagraha, vocalic rr sign, vocalic l sign, and
vocalic ll sign, are only used to write Sanskrit words in the Malayalam script. The avagraha
is the most common of the four. The vocalic l sign is also commonly used in Sanskrit words.
Two specific forms of viramas are found in historical materials. The U+0D3B malayalam
sign vertical bar virama was used to indicate a pure consonant when transliterating
foreign words, while the U+0D3C malayalam sign circular virama was employed to
indicate a pure consonant in native Malayalam texts.
Suriyani Malayalam. The Suriyani dialect of Malayalam is written using the Syriac script.
It is also called Garshuni (Karshoni) or Syriac Malayalam. This usage requires eleven addi-
tional letters encoded in the Syriac Supplement block (U+0860..U+086F) to represent the
sounds of Malayalam. The dialect was widely used by the St. Thomas Christians living in
Kerala, India, in the 19th century.
Rendering Malayalam
Candrakkala. As is the case for many other Brahmi-derived scripts in the Unicode Stan-
dard, Malayalam uses a virama character to form consonant conjuncts. The virama sign
itself is known as candrakkala in Malayalam. Table 12-36 provides a variety of examples of
consonant conjuncts. There are both horizontal and vertical conjuncts, some of which
ligate, and some of which are merely juxtaposed.
[Table 12-36 (glyphs not reproduced): examples include kka, jña, ppa, ccha, bba, nya, and pra.]
When the candrakkala sign is visibly shown in Malayalam, it indicates either the suppres-
sion of the preceding vowel or its replacement with a neutral vowel sound. This sound is
often called “half-u” or samvruthokaram. In traditional orthography it is displayed with a
vowel sign -u followed by candrakkala, and in modern orthography it is displayed with a
candrakkala alone. In all cases, the candrakkala sign is represented by the character
U+0D4D malayalam sign virama, which follows any vowel sign that may be present and
precedes any anusvara that may be present. Examples are shown in Table 12-37.
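The ordering rule can be shown with a small Python sketch; the base consonant ta is an arbitrary choice for the example.

    TA          = "\u0D24"   # MALAYALAM LETTER TA
    SIGN_U      = "\u0D41"   # MALAYALAM VOWEL SIGN U
    CANDRAKKALA = "\u0D4D"   # MALAYALAM SIGN VIRAMA (candrakkala)
    ANUSVARA    = "\u0D02"   # MALAYALAM SIGN ANUSVARA

    traditional_half_u = TA + SIGN_U + CANDRAKKALA             # vowel sign, then candrakkala
    modern_half_u      = TA + CANDRAKKALA                      # candrakkala alone
    with_anusvara      = TA + SIGN_U + CANDRAKKALA + ANUSVARA  # anusvara comes last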
Explicit Candrakkala. The sequence <C1, virama, ZWNJ, C2>, where C1 and C2 are con-
sonants, may be used to request display with an explicit visible candrakkala, instead of the
default conjunct form. See Table 12-38 for an example. This convention is consistent with
the use of this sequence in other Indic scripts.
Requesting Traditional Ligatures. The sequence <C1, ZWJ, virama, C2> may be used to
request traditional ligatures, even if the current font defaults to the conjuncts appropriate
for the reformed orthography. When such sequences occur, a closed or cursively connected
ligature should be displayed, if available. See Table 12-38 for examples. This convention is
consistent with the use of this sequence in some other Indic scripts, such as Kannada,
Oriya, and Telugu.
Requesting Open Forms of Conjuncts. The sequence <C1, ZWNJ, virama, C2> may be
used to request open ligatures or those used in the reformed orthography, even if the cur-
rent font defaults to the conjuncts appropriate for the traditional orthography. When such
sequences occur, an open or disconnected conjunct form should be displayed, if available.
See Table 12-38 for examples. Note that such sequences are defined for Malayalam only,
and are left undefined for other Indic scripts.
[Table 12-38 (glyphs not reproduced): kra, ska, tsa, rva, and yya shown with their default conjunct forms and with the forms requested by the ZWJ and ZWNJ sequences described above.]
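The three request sequences described above can be written out as follows, using ka and ra (the cluster kra) as the consonant pair; this Python sketch is illustrative, and the displayed form still depends on the font.

    KA, RA = "\u0D15", "\u0D30"   # MALAYALAM LETTER KA, MALAYALAM LETTER RA
    VIRAMA = "\u0D4D"             # MALAYALAM SIGN VIRAMA (candrakkala)
    ZWJ, ZWNJ = "\u200D", "\u200C"

    default_form         = KA + VIRAMA + RA          # whatever the font defaults to
    explicit_candrakkala = KA + VIRAMA + ZWNJ + RA   # <C1, virama, ZWNJ, C2>
    traditional_ligature = KA + ZWJ + VIRAMA + RA    # <C1, ZWJ, virama, C2>
    open_reformed_form   = KA + ZWNJ + VIRAMA + RA   # <C1, ZWNJ, virama, C2>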
Anusvara. The anusvara can be seen multiple times after vowels, whether independent let-
ters or dependent vowel signs, as in <0D08, 0D02, 0D02, 0D02, 0D02>. Vowel
signs can also be seen after digits, as in <0033, 0035, 0035, 0D3E, 0D02>. More gen-
erally, rendering engines should be prepared to handle Malayalam letters (including vowel
letters), digits (both European and Malayalam), dashes, U+00A0 no-break space and
U+25CC dotted circle as base characters for the Malayalam vowel signs, U+0D4D
malayalam sign virama, U+0D02 malayalam sign anusvara, and U+0D03 malay-
alam sign visarga. They should also be prepared to handle multiple combining marks on
those bases.
Dot Reph. U+0D4E malayalam letter dot reph is used to represent the dead conso-
nant form of U+0D30 malayalam letter ra, when it is displayed as a dot or small vertical
stroke above the consonant that follows it in logical order. It has the character properties of
a letter rather than those of a combining mark, but special behavior is required in imple-
mentations. Conceptually, dot reph is analogous to the sequence <ra, virama> which, in
many Indic scripts, is rendered as a reph mark over the following consonant. This same
behavior is expected for dot reph: it should be rendered as a mark over the following con-
sonant. In standard Malayalam, the sequence <ra, virama> would normally occur only
within the sequence <ra, virama, ya>, which should be rendered as the nominal form of ra
with a conjoining form of ya.
The sequence <ra, virama, ZWJ> is not used to represent the dot reph, because that
sequence has considerable preexisting usage to represent the chillu form of ra, prior to the
encoding of the chillu form as a distinct character, U+0D7C malayalam letter chillu
rr.
The Malayalam dot reph was in common print usage until 1970, but has fallen into disuse.
Words that formerly used dot reph on a consonant are now spelled instead with a chillu-rr
form preceding the consonant. (See the following discussion of chillu characters.) The dot
reph form is predominantly used by those who completed elementary education in Malay-
alam prior to 1970.
Chillu Forms. The six characters, U+0D7A..U+0D7F, encode dead consonants (those
without an inherent vowel) known as chillu or cillakṣaram. In Malayalam language text,
chillu forms never start a word. Occasionally, chillu forms may take vowels or be elements
of conjuncts. The chillu forms nn, -n, -rr, -l, and -ll are quite common; chillu-k is relatively
rare in contemporary usage.
For backward-compatibility issues regarding the representation of chillu forms, see the dis-
cussion of legacy chillu sequences later in this section.
Special Cases Involving rra. There are a number of textual representation and reading
issues involving the letter rra. These issues are discussed here and tables of explicit exam-
ples are presented.
The letter rra is normally read /ṟa/. Repetition of that sound is naturally written by repeating the letter. Each occurrence can bear a vowel sign.
The same side-by-side repetition of the letter rra is also used for /ṯṯa/, which can be unambiguously represented by the stacked form <rra, virama, rra>. The sequence of two rra letters fundamentally behaves as a digraph in this instance. The digraph can bear a vowel sign, in which case the digraph as a whole acts
graphically as an atom: a left vowel part goes to the left of the digraph and a right vowel part
goes to the right of the digraph. Historically, the side-by-side form was used until around
1960 when the stacked form began appearing and supplanted the side-by-side form.
As a consequence, the side-by-side sequence of two rra letters is ambiguous in reading. The reader must generally use the context to understand whether it is read /ṟaṟa/ or /ṯṯa/. It is only when a vowel part appears between the two letters that the reading cannot be /ṯṯa/. Note that similar
situations are common in many other orthographies. For example, th in English can be a
digraph (cathode) or two separate letters (cathouse); gn in French can be a digraph
(oignon) or two separate letters (gnome).
The sequence <0D31, 0D31> is rendered with the two letters side by side, regardless of the reading of that text. The sequence <0D31, 0D4D, 0D31> is rendered as the stacked form. In both cases, vowel signs can be used as appropriate, as shown in Table 12-39.
A very similar situation exists for the combination of chillu-n and rra. When used side by side, the pair can be read either /ṉṟa/ or /ṉṯa/, while the stacked form is always read /ṉṯa/.
The sequence <0D7B, 0D31> is rendered with the two letters side by side, regardless of the reading of that text. The sequence <0D7B, 0D4D, 0D31> is rendered as the stacked form. In both cases, vowel signs can be used as appropriate, as shown in Table 12-40.
Legacy Chillu Sequences. Prior to Unicode Version 5.1, the representation of text with
chillu forms was problematic, and not clearly described in the text of the standard. Because
older data will use different representation for chillu forms, implementations must be pre-
pared to handle both kinds of data. For chillu forms considered in isolation, the following
table shows the relationship between their representation in Version 5.0 and earlier, and
the recommended representation starting with Version 5.1. Note that only the first five
chillu forms listed in Table 12-41 were represented in legacy text by <virama, ZWJ>
sequences. The other chillu forms are only represented as atomically encoded chillu char-
acters.
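A hedged sketch of such handling in Python follows: legacy <consonant, virama, ZWJ> sequences for the five chillu forms in question are replaced by the atomic chillu characters. The base-consonant pairings shown are the conventional ones and are given here only for illustration; Table 12-41 is the normative reference, and the replacement applies to chillu forms considered in isolation.

    # Legacy <consonant, virama (U+0D4D), ZWJ (U+200D)> -> atomic chillu character.
    LEGACY_TO_ATOMIC = {
        "\u0D23\u0D4D\u200D": "\u0D7A",   # nna -> MALAYALAM LETTER CHILLU NN
        "\u0D28\u0D4D\u200D": "\u0D7B",   # na  -> MALAYALAM LETTER CHILLU N
        "\u0D30\u0D4D\u200D": "\u0D7C",   # ra  -> MALAYALAM LETTER CHILLU RR
        "\u0D32\u0D4D\u200D": "\u0D7D",   # la  -> MALAYALAM LETTER CHILLU L
        "\u0D33\u0D4D\u200D": "\u0D7E",   # lla -> MALAYALAM LETTER CHILLU LL
    }

    def migrate_chillus(text):
        """Replace legacy chillu sequences with the atomically encoded chillus."""
        for legacy, atomic in LEGACY_TO_ATOMIC.items():
            text = text.replace(legacy, atomic)
        return text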
Chapter 13
South and Central Asia-II 13
Other Modern Scripts
This chapter describes the following other modern scripts in South and Central Asia:
The Thaana script is used to write Dhivehi, the language of the Republic of Maldives, an
island nation in the middle of the Indian Ocean.
Sinhala is an official script of Sri Lanka, where it is used to write the majority language, also
known as Sinhala.
The Newa script, also known as Nepaalalipi in Nepal and as Newar in English-speaking
countries, is a Brahmi-based script that dates to the tenth century ce. It was actively used in
central Nepal until the late 18th century. Newa is presently used to write the Nepal Bhasa
language, a Tibeto-Burman language spoken in the Kathmandu Valley of Nepal and in the
Indian state of Sikkim.
The Mongolian script was developed as an adaptation of the Old Uyghur alphabet around
the beginning of the thirteenth century, during the reign of Genghis Khan. It is used in
both China and Mongolia.
The Tibetan script is used for writing the Tibetan language in several countries and regions
throughout the Himalayas. The approach to the encoding of Tibetan in the Unicode Stan-
dard differs from that for most Brahmi-derived scripts. Instead of using a virama-based
model for consonant conjuncts, it uses a subjoined consonant model.
Limbu is a Brahmi-derived script primarily used to write the Limbu language, spoken
mainly in eastern Nepal, Sikkim, and in the Darjeeling district of West Bengal. Its encoding
follows a variant of the Tibetan model, making use of subjoined medial consonants, but
also explicitly encoded syllable-final consonants.
Lepcha is the writing system for the Lepcha language, spoken in Sikkim and in the Darjeel-
ing district of the West Bengal state of India. Lepcha is directly derived from the Tibetan
script, but all of the letters were rotated by ninety degrees.
13.1 Thaana
Thaana: U+0780–U+07BF
The Thaana script is used to write the modern Dhivehi language of the Republic of Mal-
dives, a group of atolls in the Indian Ocean. Like the Arabic script, Thaana is written from
right to left and uses vowel signs, but it is not cursive. The basic Thaana letters have been
extended by a small set of dotted letters used to transcribe Arabic. The use of modified
Thaana letters to write Arabic began in the middle of the 20th century. Loan words from
Arabic may be written in the Arabic script, although this custom is not very prevalent
today. (See Section 9.2, Arabic.)
While Thaana’s glyphs were borrowed in part from Arabic (letters haa through vaavu were
based on the Arabic-Indic digits, for example), and while vowels and sukun are marked
with combining characters as in Arabic, Thaana is properly considered an alphabet, rather
than an abjad, because writing the vowels is obligatory.
Directionality. The Thaana script is written from right to left. Conformant implementa-
tions of Thaana script must use the Unicode Bidirectional Algorithm (see Unicode Stan-
dard Annex #9, “Unicode Bidirectional Algorithm”).
Vowels. Consonants are always written with either a vowel sign (U+07A6..U+07AF) or the
null vowel sign (U+07B0 thaana sukun). U+0787 thaana letter alifu with the null
vowel sign denotes a glottal stop. The placement of the Thaana vowel signs is shown in
Table 13-1.
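A rough validation sketch based on this rule follows (Python, illustrative only): every Thaana letter should be followed by a vowel sign or the sukun. The code point ranges are taken from the block description; real text may of course also contain spaces, punctuation, and characters from other scripts.

    def thaana_vowelling_ok(text):
        """Check that each Thaana letter is followed by a vowel sign or sukun."""
        letters = range(0x0780, 0x07A6)   # Thaana letters haa .. waavu
        signs   = range(0x07A6, 0x07B1)   # vowel signs U+07A6..U+07AF and sukun U+07B0
        cps = [ord(c) for c in text]
        for i, cp in enumerate(cps):
            if cp in letters and (i + 1 == len(cps) or cps[i + 1] not in signs):
                return False
        return True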
Arabic numeric punctuation is used with digits, whether Arabic or
European.
Punctuation. The Thaana script uses spaces between words. It makes use of a mixture of
Arabic and European punctuation, though rules of usage are not clearly defined. Sentence-
final punctuation is now generally shown with a single period (U+002E “.” full stop) but
may also use a sequence of two periods (U+002E followed by U+002E). Phrases may be
separated with a comma (usually U+060C arabic comma) or with a single period
(U+002E). Colons, dashes, and double quotation marks are also used in the Thaana script.
In addition, Thaana makes use of U+061F arabic question mark and U+061B arabic
semicolon.
Character Names and Arrangement. The character names are based on the names used in
the Republic of Maldives. The character name at U+0794, yaa, is found in some sources as
yaviyani, but the former name is more common today. Characters are listed in Thaana
alphabetical order from haa to ttaa for the Thaana letters, followed by the extended char-
acters in Arabic alphabetical order from hhaa to waavu.
13.2 Sinhala
Sinhala: U+0D80–U+0DFF
The Sinhala script, also known as Sinhalese or Singhalese, is used to write the Sinhala lan-
guage, the majority language of Sri Lanka. It is also used to write the Pali and Sanskrit lan-
guages. The script is a descendant of Brahmi and resembles the scripts of South India in form
and structure.
Sinhala differs from other languages of the region in that it has a series of prenasalized
stops that are distinguished from the combination of a nasal followed by a stop. In other
words, both forms occur and are written differently—for example, <U+0D85, U+0DAC> aňḍa “sound” versus <U+0D85, U+0DAB, U+0DCA, U+0DA9> aṇḍa “egg.” Sinhala also has distinct signs for both a short and a long low front
vowel whose sound [æ] is similar to the initial vowel in the English word “apple.” The inde-
pendent forms of these vowels are encoded at U+0D87 and U+0D88; the corresponding
dependent forms are U+0DD0 and U+0DD1.
Because of these extra letters, the encoding for Sinhala does not precisely follow the pattern
established for the other Indic scripts (for example, Devanagari). It does use the same gen-
eral structure, making use of phonetic order, matra reordering, and use of the virama
(U+0DCA sinhala sign al-lakuna) to indicate orthographic consonant clusters. Sinhala
does not use half-forms in the Devanagari manner, but does use many ligatures.
Vowel Letters. Vowel letters are encoded atomically in Unicode, even if they can be ana-
lyzed visually as consisting of multiple parts. Table 13-2 shows the letters that can be ana-
lyzed, the single code point that should be used to represent them in text, and the sequence
of code points resulting from analysis that should not be used.
Other Letters for Tamil. The Sinhala script may also be used to write Tamil. In this case,
some additional combinations may be required. Some letters, such as U+0DBB sinhala
letter rayanna and U+0DB1 sinhala letter dantaja nayanna, may be modified by
adding the equivalent of a nukta. There is, however, no nukta presently encoded in the Sin-
hala block.
Rendering. Rendering Sinhala is similar to that of other Brahmic scripts in the Unicode Standard (in particular, the Tamil script); however, consonant forms are encoded differently.
Virama (Al-lakuna). Unless combined with a U+200D zero width joiner, an al-lakuna
is always visible and does not join consonants to form orthographic consonant clusters.
Consonant Forms. Each consonant may be represented as any of the following forms:
• a live consonant
• a dead consonant with a visible al-lakuna
• a consonant as a part of a touching conjunct
• a consonant as either a reduced form or a part of a ligated conjunct
The use of ZWJ in Sinhala differs from that of typical Indic scripts. The order of an al-
lakuna and a ZWJ between two consonants is not related to which consonant will take a
reduced form, but instead affects the style of orthographic consonant clusters.
The sequence <ZWJ, al-lakuna> joins consonants to form orthographic consonant clus-
ters in the style of touching conjuncts.
The sequence <al-lakuna, ZWJ> joins consonants to form orthographic consonant clus-
ters in the style of reduced forms or ligated conjuncts. Three reduced forms are commonly
recognized: repaya, the above-base form of ra when it is the first consonant in an
orthographic consonant cluster; yansaya and rakaaraansaya, the post-base form of ya and
the below-base form of ra, respectively, when they follow another consonant in an
orthographic consonant cluster.
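The two orders can be written out as in the following Python sketch, using ka and ya; the <al-lakuna, ZWJ> order would request the post-base yansaya form of ya (illustrative only).

    KA, YA    = "\u0D9A", "\u0DBA"   # SINHALA LETTER ALPAPRAANA KAYANNA, YAYANNA
    AL_LAKUNA = "\u0DCA"             # SINHALA SIGN AL-LAKUNA
    ZWJ       = "\u200D"

    visible_al_lakuna  = KA + AL_LAKUNA + YA         # no joiner: al-lakuna stays visible
    touching_conjunct  = KA + ZWJ + AL_LAKUNA + YA   # <ZWJ, al-lakuna>
    ligated_or_reduced = KA + AL_LAKUNA + ZWJ + YA   # <al-lakuna, ZWJ> (yansaya style)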
Punctuation. Sinhala currently uses Western-style punctuation marks. U+0DF4 sinhala punctuation kunddaliya was used historically as a full stop. U+0964 devanagari
danda is used to represent the dandas which occur in historic Sanskrit or Pali texts written
in the Sinhala script.
Digits. Modern Sinhala text uses Western digits. The set of digits in the range U+0DE6 to
U+0DEF was used into the twentieth century, primarily to write horoscopes. That set of
astrological digits is known as Sinhala Lith Illakkam, and includes a form for zero.
13.3 Newa
Newa: U+11400–U+1147F
The Newa script, also known as Nepaalalipi in Nepal and as Newar in English-speaking
countries, is a Brahmi-based script that dates to the tenth century ce. The script is attested
in inscriptions, coins, manuscripts, books, and other publications.
Newa was actively used in central Nepal until the latter half of the 18th century, when the
Newa dynasties were overthrown and the use of the script began to decline. In 1905 the
script was banned, but the ban was lifted in 1951. Today Newa is used to write the Nepal
Bhasa language, a Tibeto-Burman language spoken predominantly in the Kathmandu Val-
ley of Nepal and in the Indian state of Sikkim. It also is used to write Sanskrit and Nepali.
Historically, Newa has been used for Maithili, Bengali, and Hindi. At present, the Nepal
Bhasa language is most often written in the Devanagari script.
Structure. Like other Brahmi-derived Indic scripts, Newa is an abugida and makes use of a
virama. The script is written left-to-right.
Vowels. Vowel length is usually indicated by the dependent vowel signs. The visarga may
also be used to show vowel length. Some vowels are used only for Sanskrit and are not
needed for the representation of Nepal Bhasa.
Virama and Conjuncts. Conjuncts are represented with U+11442 newa sign virama. The
conjuncts in Newa are rendered in reduced form, which appear in vertical conjuncts, as
half forms, or post-base forms of the letters. Half-forms are used for writing horizontal
conjuncts, and generally used only for consonants with right descenders. Explicit half-
forms can be produced by writing U+200D zero width joiner after the virama. Post-base
forms are contextual forms used when the letter appears last in a vertical conjunct. The cur-
rent preference is for producing vertical conjuncts, but manuscripts show more variation,
such as conjuncts in a horizontal or cascading shape.
Murmured Resonant Consonants. Six consonant letters are encoded to represent mur-
mured resonants in the Nepal Bhasa language, as shown in Table 13-3. The murmured res-
onants are analyzed as individual letters in the modern orthography, and are separately
encoded. Similar-appearing conjuncts involving the consonant ha in Sanskrit text should
be represented as conjuncts, using a sequence of <C1, virama, C2>, consistent with San-
skrit practice for other Indic scripts.
Rendering. Combinations of certain consonants and vowel signs may have special render-
ing requirements. For example, the vowel signs ai, o, and au have two-part contextual
forms when the vowels occur after the consonants ga, nya, ttha, nna, tha, dha, and sha. In
addition, several consonant letters have glyphic variants. These include ga, jha, nya, ra, and
sha.
Ligatures. The consonant clusters kXa and jña are represented by the sequences <U+1140E
newa letter ka, U+11442 newa sign virama, U+11432 newa letter ssa> and
<U+11416 newa letter ja, U+11442 newa sign virama, U+11423 newa letter na>,
respectively. The consonants ja and ra are also written as ligatures when combined with
U+11438 newa vowel sign u or U+11439 newa vowel sign uu.
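Written out in Python for illustration, the two sequences given above look as follows; the third line applies the explicit half-form convention (ZWJ after the virama) described earlier in this section.

    KA, SSA = "\U0001140E", "\U00011432"   # NEWA LETTER KA, NEWA LETTER SSA
    JA, NA  = "\U00011416", "\U00011423"   # NEWA LETTER JA, NEWA LETTER NA
    VIRAMA  = "\U00011442"                 # NEWA SIGN VIRAMA
    ZWJ     = "\u200D"

    kssa = KA + VIRAMA + SSA                   # rendered as the kssa ligature
    jna  = JA + VIRAMA + NA                    # rendered as the jña ligature
    kssa_half_form = KA + VIRAMA + ZWJ + SSA   # requests the explicit half-form of ka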
Digits. Newa has a full set of decimal digits located at U+11450 to U+11459.
Punctuation. Newa makes use of script-specific dandas, U+1144B newa danda and
U+1144C newa double danda. Other Newa punctuation marks include U+1144D newa
comma, which is used as a phrase separator, and U+1145D newa insertion sign. The
punctuation mark U+1144E newa gap filler indicates a break or fills a gap in a line at the
margin. The character U+1145B newa placeholder mark also is used to fill a gap in a
line, but may be used to mark the end of text. U+1144F newa abbreviation sign is
employed to indicate an abbreviation.
Unicode characters in other blocks may be used for other punctuation marks that occur in
Newa texts. A flower mark, used to identify the end of a text section, can be represented by
U+2055 flower punctuation mark. This mark typically occurs with U+1144C newa
double danda on either side. To indicate a deletion, U+1DFB combining deletion
mark can be used.
Other Signs. To indicate nasalization, U+11443 newa sign candrabindu and U+11444
newa sign anusvara are used. U+11445 newa sign visarga represents post-vocalic aspi-
ration or can be used to mark vowel length. U+11446 newa sign nukta is used to indicate
sounds for which distinct characters in Newa do not exist, such as in loanwords. The char-
acter U+11447 newa sign avagraha marks elision of a word-initial a in Sanskrit as the
result of sandhi. U+11448 newa sign final anusvara has different uses. In certain manu-
scripts, it indicates nasalization, whereas in other sources, it is a form of punctuation, sim-
ilar to a semicolon.
Newa includes two invocation signs, U+11449 newa om and U+1144A newa siddhi. The
sign om may also be written using the sequence <U+1140C newa letter o, U+11443
newa sign candrabindu>. The Newa sign siddhi is written at the beginning of text, often
beside om. It represents siddhirastu, “may there be success.”
13.4 Tibetan
Tibetan: U+0F00–U+0FFF
The Tibetan script is used for writing Tibetan in several countries and regions throughout
the Himalayas. Aside from Tibet itself, the script is used in Ladakh, Nepal, and northern
areas of India bordering Tibet where large Tibetan-speaking populations now reside. The
Tibetan script is also used in Bhutan to write Dzongkha, the official language of that coun-
try. In Bhutan, as well as in some scholarly traditions, the Tibetan script is called the Bodhi
script, and the particular version written in Bhutan is known as Joyi (mgyogs yig). In addi-
tion, Tibetan is used as the language of philosophy and liturgy by Buddhist traditions
spread from Tibet into the Mongolian cultural area that encompasses Mongolia, Buriatia,
Kalmykia, and Tuva.
The Tibetan scripting and grammatical systems were originally defined together in the
sixth century by royal decree when the Tibetan King Songtsen Gampo sent 16 men to India
to study Indian languages. One of those men, Thumi Sambhota, is credited with creating
the Tibetan writing system upon his return, having studied various Indic scripts and gram-
mars. The king’s primary purpose was to bring Buddhism from India to Tibet. The new
script system was therefore designed with compatibility extensions for Indic (principally
Sanskrit) transliteration so that Buddhist texts could be represented properly. Because of
this origin, over the last 1,500 years the Tibetan script has been widely used to represent
Indic words, a number of which have been adopted into the Tibetan language retaining
their original spelling.
A note on Latin transliteration: Tibetan spelling is traditional and does not generally
reflect modern pronunciation. Throughout this section, Tibetan words are represented in
italics when transcribed as spoken, followed at first occurrence by a parenthetical translit-
eration; in these transliterations, the presence of the tsek (tsheg) character is expressed with
a hyphen.
Thumi Sambhota’s original grammar treatise defined two script styles. The first, called
uchen (dbu-can, “with head”), is a formal “inscriptional capitals” style said to be based on
an old form of Devanagari. It is the script used in Tibetan xylograph books and the one
used in the coding tables. The second style, called u-mey (dbu-med, or “headless”), is more
cursive and said to be based on the Wartu script. Numerous styles of u-mey have evolved
since then, including both formal calligraphic styles used in manuscripts and running
handwriting styles. All Tibetan scripts follow the same lettering rules, though there is a
slight difference in the way that certain compound stacks are formed in uchen and u-mey.
General Principles of the Tibetan Script. Tibetan grammar divides letters into consonants
and vowels. There are 30 consonants, and each consonant is represented by a discrete writ-
ten character. There are five vowel sounds, only four of which are represented by written
marks. The four vowels that are explicitly represented in writing are each represented with
a single mark that is applied above or below a consonant to indicate the application of that
vowel to that consonant. The absence of one of the four marks implies that the first vowel
sound (like a short “ah” in English) is present and is not modified to one of the four other
possibilities. Three of the four marks are written above the consonants; one is written
below.
Each word in Tibetan has a base or root consonant. The base consonant can be written sin-
gly or it can have other consonants added above or below it to make a vertically “stacked”
letter. Tibetan grammar contains a very complete set of rules regarding letter gender, and
these rules dictate which letters can be written in adjacent positions. The rules therefore
dictate which combinations of consonants can be joined to make stacks. Any combination
not allowed by the gender rules does not occur in native Tibetan words. However, when
transcribing other languages (for example, Sanskrit, Chinese) into Tibetan, these rules do
not operate. In certain instances other than transliteration, any consonant may be com-
bined with any other subjoined consonant. Implementations should therefore be prepared
to accept and display any combinations.
For example, the syllable spyir “general” is a typical example of a Tibetan
syllable that includes a stack comprising a head letter, two subscript letters, and a vowel
sign. Figure 13-1 shows the characters in the order in which they appear in the backing
store.
The model adopted to encode the Tibetan lettering set described above contains the fol-
lowing groups of items: Tibetan consonants, vowels, numerals, punctuation, ornamental
signs and marks, and Tibetan-transliterated Sanskrit consonants and vowels. Each of these
will be described in this section.
Both in this description and in Tibetan, the terms “subjoined” (-btags) and “head” (-mgo)
are used in different senses. In the structural sense, they indicate specific slots defined in
native Tibetan orthography. In spatial terms, they refer to the position in the stack; any-
thing in the topmost position is “head,” anything not in the topmost position is “sub-
joined.” Unless explicitly qualified, the terms “subjoined” and “head” are used here in their
spatial sense. For example, in a conjunct like “rka,” the letter in the root slot is “KA.”
Because it is not the topmost letter of the stack, however, it is expressed with a subjoined
character code, while “RA”, which is structurally in the head slot, is expressed with a nomi-
nal character code. In a conjunct “kra,” in which the root slot is also occupied with “KA”,
the “KA” is encoded with a nominal character code because it is in the topmost position in
the stack.
The Tibetan script has its own system of formatting, and details of that system relevant to
the characters encoded in this standard are explained herein. However, an increasing num-
ber of publications in Tibetan do not strictly adhere to this original formatting system. This
change is due to the partial move from publishing on long, horizontal, loose-leaf folios, to
publishing in vertically oriented, bound books. The Tibetan script also has a punctuation
set designed to meet needs quite different from the punctuation that has evolved for West-
ern scripts. With the appearance of Tibetan newspapers, magazines, school textbooks, and
Western-style reference books in the last 20 or 30 years, Tibetans have begun using things
like columns, indented blocks of text, Western-style headings, and footnotes. Some West-
ern punctuation marks, including brackets, parentheses, and quotation marks, are becom-
ing commonplace in these kinds of publication. With the introduction of more
sophisticated electronic publishing systems, there is also a renaissance in the publication of
voluminous religious and philosophical works in the traditional horizontal, loose-leaf for-
mat—many set in digital typefaces closely conforming to the proportions of traditional
hand-lettered text.
Consonants. The system described here has been devised to encode the Tibetan system of
writing consonants in both single and stacked forms.
All of the consonants are encoded a first time in the range U+0F40 through U+0F69. These
are the basic Tibetan consonants and, in addition, six compound consonants used to represent
the Indic consonants gha, jha, ḍha, dha, bha, and kṣa. These codes are used to represent
occurrences of either a stand-alone consonant or a consonant in the head position of a ver-
tical stack. Glyphs generated from these codes will always sit in the normal position start-
ing at and dropping down from the design baseline. All of the consonants are then encoded
a second time. These second encodings from U+0F90 through U+0FB9 represent conso-
nants in subjoined stack position.
To represent a single consonant in a text stream, one of the first “nominal” set of codes is
placed. To represent a stack of consonants in the text stream, a “nominal” consonant code
is followed directly by one or more of the subjoined consonant codes. The stack so formed
continues for as long as subjoined consonant codes are contiguously placed.
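As a rough illustration of this model (using the rka and kra conjuncts discussed earlier), the following Python sketch builds both sequences; it is a sketch only, and the helper name is ours, not the standard's.

# "rka": RA is topmost, so it is nominal; KA is below, so it is subjoined.
RKA = "\u0F62\u0F90"   # TIBETAN LETTER RA + TIBETAN SUBJOINED LETTER KA
# "kra": KA is topmost, so it is nominal; RA is below, so it is subjoined.
KRA = "\u0F40\u0FB2"   # TIBETAN LETTER KA + TIBETAN SUBJOINED LETTER RA

def is_subjoined(ch: str) -> bool:
    """True for code points in the subjoined consonant range U+0F90..U+0FB9."""
    return 0x0F90 <= ord(ch) <= 0x0FB9

for stack in (RKA, KRA):
    print([f"U+{ord(c):04X}" for c in stack], [is_subjoined(c) for c in stack])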
This encoding method was chosen over an alternative method that would have involved a
virama-based encoding, such as that used for Devanagari. There were two main reasons for this choice.
First, the virama is not normally used in the Tibetan writing system to create letter combi-
nations. There is a virama in the Tibetan script, but only because of the need to represent
Devanagari; called “srog-med”, it is encoded at U+0F84 tibetan mark halanta. The
virama is never used in writing Tibetan words and can be—but almost never is—used as a
substitute for stacking in writing Sanskrit mantras in the Tibetan script. Second, there is a
prevalence of stacking in native Tibetan, and the model chosen specifically results in
decreased data storage requirements. Furthermore, in languages other than Tibetan, there
are many cases where stacks occur that do not appear in Tibetan-language texts; it is thus
imperative to have a model that allows for any consonant to be stacked with any subjoined
consonant(s). Thus a model for stack building was chosen that follows the Tibetan
approach to creating letter combinations, but is not limited to a specific set of the possible
combinations.
Vowels. Each of the four basic Tibetan vowel marks is coded as a separate entity. These
code points are U+0F72, U+0F74, U+0F7A, and U+0F7C. For compatibility, a set of sev-
eral compound vowels for Sanskrit transcription is also provided in the other code points
between U+0F71 and U+0F7D. Most Tibetan users do not view these compound vowels as
single characters, and their use is limited to Sanskrit words. It is acceptable for users to
enter these compounds as a series of simpler elements and have software render them
appropriately. Canonical equivalences are specified for all of these compound vowels, with
the exception of U+0F77 tibetan vowel sign vocalic rr and U+0F79 tibetan vowel
sign vocalic ll, which for historic reasons have only compatibility equivalences specified.
These last two characters are deprecated, and their use is strongly discouraged.
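The equivalences just described can be observed with Python's unicodedata module; the decompositions noted in the comments are assumptions drawn from the Unicode Character Database rather than from this section.

import unicodedata

# U+0F73 and U+0F75 have canonical decompositions (for example,
# U+0F73 -> U+0F71 U+0F72); U+0F77 and U+0F79 have only compatibility
# decompositions, so NFC/NFD leave them unchanged while NFKD decomposes them.
for cp in (0x0F73, 0x0F75, 0x0F77, 0x0F79):
    ch = chr(cp)
    nfd = [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]
    nfkd = [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFKD", ch)]
    print(f"U+{cp:04X}", "NFD:", nfd, "NFKD:", nfkd)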
A vowel sign may be applied either to a stand-alone consonant or to a stack of consonants.
The vowel sign occurs in logical order after the consonant (or stack of consonants). Each of
the vowel signs is a nonspacing combining mark. The four basic vowel marks are rendered
either above or below the consonant. The compound vowel marks also appear either above
or below the consonant, but in some cases have one part displayed above and one part dis-
played below the consonant.
All of the symbols and punctuation marks have straightforward encodings. Further infor-
mation about many of them appears later in this section.
Coding Order. In general, the correct coding order for a stream of text will be the same as
the order in which Tibetans spell and in which the characters of the text would be written
by hand. For example, the correct coding order for the most complex Tibetan stack would
be
head position consonant
first subjoined consonant
... (intermediate subjoined consonants, if any)
last subjoined consonant
subjoined vowel a-chung (U+0F71)
standard or compound vowel sign, or virama
Where used, the character U+0F39 tibetan mark tsa -phru occurs immediately after the
consonant it modifies.
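The coding order above can be captured in a small helper; the following Python sketch simply concatenates the parts in the prescribed order and assumes the caller supplies valid code points (the function name and parameters are ours).

A_CHUNG = "\u0F71"   # TIBETAN VOWEL SIGN AA (the vowel-lengthening 'a-chung)

def build_stack(head, subjoined=(), a_chung=False, vowel=""):
    """head: nominal consonant; subjoined: subjoined consonants, top to bottom;
    a_chung: whether to append U+0F71 before the vowel; vowel: vowel sign or virama."""
    return head + "".join(subjoined) + (A_CHUNG if a_chung else "") + vowel

# The stack of spyir: head SA, subjoined PA and YA, vowel sign I.
# The final RA is a separate nominal letter following the stack.
spyir = build_stack("\u0F66", ("\u0FA4", "\u0FB1"), vowel="\u0F72") + "\u0F62"
print([f"U+{ord(c):04X}" for c in spyir])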
Allographical Considerations. When consonants are combined to form a stack, one of
them retains the status of being the principal consonant in the stack. The principal conso-
nant always retains its stand-alone form. However, consonants placed in the “head” and
“subjoined” positions to the main consonant sometimes retain their stand-alone forms and
sometimes are given new, special forms. Because of this fact, certain consonants are given a
further, special encoding treatment—namely, “wa” (U+0F5D), “ya” (U+0F61), and “ra”
(U+0F62).
Head Position “ra”. When the consonant “ra” is written in the “head” position (ra-mgo,
pronounced ra-go) at the top of a stack in the normal Tibetan-defined lettering set, the
shape of the consonant can change. It can either be a full-form shape or the full-form shape
but with the bottom stroke removed (looking like a short-stemmed letter “T”). This
requirement of “ra” in the head position where the glyph representing it can change shape
is correctly coded by using the stand-alone “ra” consonant (U+0F62) followed by the
appropriate subjoined consonant(s). For example, in the normal Tibetan ra-mgo combina-
tions, the “ra” in the head position is mostly written as the half-ra but in the case of “ra +
subjoined nya” must be written as the full-form “ra”. Thus the normal Tibetan ra-mgo
combinations are correctly encoded with the normal “ra” consonant (U+0F62) because it
can change shape as required. It is the responsibility of the font developer to provide the
correct glyphs for representing the characters where the “ra” in the head position will
change shape—for example, as in “ra + subjoined nya”.
Full-Form “ra” in Head Position. Some instances of “ra” in the head position require that
the consonant be represented as a full-form “ra” that never changes. This is not standard
usage for the Tibetan language itself, but rather occurs in transliteration and transcription.
Only in these cases should the character U+0F6A tibetan letter fixed-form ra be used
instead of U+0F62 tibetan letter ra. This “ra” will always be represented as a full-form
“ra consonant” and will never change shape to the form where the lower stroke has been
cut off. For example, the letter combination “ra + ya”, when appearing in transliterated
Sanskrit works, is correctly written with a full-form “ra” followed by either a modified sub-
joined “ya” form or a full-form subjoined “ya” form. Note that the fixed-form “ra” should
be used only in combinations where “ra” would normally transform into a short form but
the user specifically wants to prevent that change. For example, the combination “ra + sub-
joined nya” never requires the use of fixed-form “ra”, because “ra” normally retains its full
glyph form over “nya”. It is the responsibility of the font developer to provide the appropri-
ate glyphs to represent the encodings.
Subjoined Position “wa”, “ya”, and “ra”. All three of these consonants can be written in
subjoined position to the main consonant according to normal Tibetan grammar. In this
position, all of them change to a new shape. The “wa” consonant when written in sub-
joined position is not a full “wa” letter any longer but is literally the bottom-right corner of
the “wa” letter cut off and appended below it. For that reason, it is called a wa-zur (wa-zur
or “corner of a wa”) or, less frequently but just as validly, wa-ta (wa-btags) to indicate that
it is a subjoined “wa”. The consonants “ya” and “ra” when in the subjoined position are
called ya-ta (ya-btags) and ra-ta (ra-btags), respectively. To encode these subjoined conso-
nants that follow the rules of normal Tibetan grammar, the shape-changed, subjoined
forms U+0FAD tibetan subjoined letter wa, U+0FB1 tibetan subjoined letter ya,
and U+0FB2 tibetan subjoined letter ra should be used.
All three of these subjoined consonants also have full-form non-shape-changing counter-
parts for the needs of transliterated and transcribed text. For this purpose, the full sub-
joined consonants that do not change shape (encoded at U+0FBA, U+0FBB, and U+0FBC,
respectively) are used where necessary. The combinations of “ra + ya” are a good example
because they include instances of “ra” taking a short (ya-btags) form and “ra” taking a full-
form subjoined “ya”.
U+0FB0 tibetan subjoined letter -a (a-chung) should be used only in the very rare
cases where a full-sized subjoined ’a-chung letter is required. The small vowel lengthening
’a-chung encoded as U+0F71 tibetan vowel sign aa is far more frequently used in
Tibetan text, and it is therefore recommended that implementations treat this character
(rather than U+0FB0) as the normal subjoined ’a-chung.
Halanta (Srog-Med). Because two sets of consonants are encoded for Tibetan, with the
second set providing explicit ligature formation, there is no need for a “dead character” in
Tibetan. When a halanta (srog-med) is used in Tibetan, its purpose is to suppress the
inherent vowel “a”. If anything, the halanta should prevent any vowel or consonant from
forming a ligature with the consonant preceding the halanta. In Tibetan text, this character
should be displayed beneath the base character as a combining glyph and not used as a
(purposeless) dead character.
Line Breaking Considerations. Tibetan text is divided into units natively called tsek-bar
(“tsheg-bar”), an inexact translation of which is “syllable.” Tsek-bar is literally the unit of
text between tseks and is generally a consonant cluster with all of its prefixes, suffixes, and
vowel signs. It is not a “syllable” in the English sense.
Tibetan script has only two break characters. The primary break character is the standard
interword tsek (tsheg), which is encoded at U+0F0B. The second break character is the
space. Space or tsek characters in a stream of Tibetan text are not always break characters
and so need proper contextual handling.
The primary delimiter character in Tibetan text is the tsek (U+0F0B tibetan mark inter-
syllabic tsheg). In general, automatic line breaking processes may break after any occur-
rence of this tsek, except where it follows a U+0F44 tibetan letter nga (with or without
a vowel sign) and precedes a shay (U+0F0D), or where Tibetan grammatical rules do not
permit a break. (Normally, tsek is not written before shay except after “nga”. This type of
tsek-after-nga is called “nga-phye-tsheg” and may be expressed by U+0F0B or by the spe-
cial character U+0F0C, a nonbreaking form of tsek.) The Unicode names for these two
types of tsek are misnomers, retained for compatibility. The standard tsek U+0F0B tibetan
mark intersyllabic tsheg is always required to be a potentially breaking character,
whereas the “nga-phye-tsheg” is always required to be a nonbreaking tsek. U+0F0C
tibetan mark delimiter tsheg bstar is specifically not a “delimiter” and is not for gen-
eral use.
There are no other break characters in Tibetan text. Unlike English, Tibetan has no system
for hyphenating or otherwise breaking a word within the group of letters making up the
word. Tibetan text formatting does not allow text to be broken within a word.
Whitespace appears in Tibetan text, although it should be represented by U+00A0 no-
break space instead of U+0020 space. Tibetan text breaks lines after tsek instead of at
whitespace.
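A simplified Python sketch of these break rules follows; it covers only the tsek/shay/nga interaction spelled out above and ignores the further grammatical exceptions that the text mentions.

TSHEG = "\u0F0B"        # TIBETAN MARK INTERSYLLABIC TSHEG (breaking)
SHAY = "\u0F0D"         # TIBETAN MARK SHAD
NGA = "\u0F44"          # TIBETAN LETTER NGA
VOWEL_SIGNS = "\u0F72\u0F74\u0F7A\u0F7C"

def can_break_after(text: str, i: int) -> bool:
    """True if a line break is permitted after text[i] under the simplified rule.
    Only U+0F0B offers a break; U+0F0C and spaces never do in this sketch."""
    if text[i] != TSHEG:
        return False
    if i + 1 < len(text) and text[i + 1] == SHAY:
        # tsek before shay occurs only after nga (with or without a vowel sign),
        # and no break is taken there
        if text[:i].rstrip(VOWEL_SIGNS).endswith(NGA):
            return False
    return True

print(can_break_after(NGA + TSHEG + SHAY, 1))   # False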
Complete Tibetan text formatting is best handled by a formatter in the application and not
just by the code stream. If the interword and nonbreaking tseks are properly employed as
breaking and nonbreaking characters, respectively, and if all spaces are nonbreaking
spaces, then any application will still wrap lines correctly on that basis, even though the
breaks might be sometimes inelegant.
Tibetan Punctuation. The punctuation apparatus of Tibetan is relatively limited. The
principal punctuation characters are the tsek; the shay (transliterated “shad”), which is a
vertical stroke used to mark the end of a section of text; the space used sparingly as a space;
and two of several variant forms of the shay that are used in specialized situations requiring
a shay. There are also several other marks and signs but they are sparingly used.
The shay at U+0F0D marks the end of a piece of text called “tshig-grub”. The mode of
marking bears no commonality with English phrases or sentences and should not be
described as a delimiter of phrases. In Tibetan grammatical terms, a shay is used to mark
the end of an expression (“brjod-pa”) and a complete expression. Two shays are used at the
end of whole topics (“don-tshan”). Because some writers use the double shay with a differ-
ent spacing than would be obtained by coding two adjacent occurrences of U+0F0D, the
double shay has been coded at U+0F0E with the intent that it would have a larger spacing
between component shays than if two shays were simply written together. However, most
writers do not use an unusual spacing between the double shay, so the application should
allow the user to write two U+0F0D codes one after the other. Additionally, font designers
will have to decide whether to implement these shays with a larger than normal gap.
The U+0F11 rin-chen-pung-shay (rin-chen-spungs-shad) is a variant shay used in a specific
“new-line” situation. Its use was not defined in the original grammars but Tibetan tradition
gives it a highly defined use. The drul-shay (“sbrul-shad”) is likewise not defined by the
original grammars but has a highly defined use; it is used for separating sections of mean-
ing that are equivalent to topics (“don-tshan”) and subtopics. A drul-shay is usually sur-
rounded on both sides by the equivalent of about three spaces (though no rule is specified).
Hard spaces will be needed for these instances because the drul-shay should not appear at
the beginning of a new line and the whole structure of spacing-plus-shay should not be
broken up, if possible.
Tibetan texts use a yig-go (“head mark,” yig-mgo) to indicate the beginning of the front of a
folio, there being no other certain way, in the loose-leaf style of traditional Tibetan books,
to tell which is the front of a page. The head mark can and does vary from text to text; there
are many different ways to write it. The common type of head mark has been provided for
with U+0F04 tibetan mark initial yig mgo mdun ma and its extension U+0F05
tibetan mark closing yig mgo sgab ma. An initial mark yig-go can be written alone or
combined with as many as three closing marks following it. When the initial mark is writ-
ten in combination with one or more closing marks, the individual parts of the whole must
stay in proper registration with each other to appear authentic. Therefore, it is strongly rec-
ommended that font developers create precomposed ligature glyphs to represent the vari-
ous combinations of these two characters. The less common head marks mainly appear in
Nyingmapa and Bonpo literature. Three of these head marks have been provided for with
U+0F01, U+0F02, and U+0F03; however, many others have not been encoded. Font devel-
opers will have to deal with the fact that many types of head marks in use in this literature
have not been encoded, cannot be represented by a replacement that has been encoded,
and will be required by some users.
Two characters, U+0F3C tibetan mark ang khang gyon and U+0F3D tibetan mark
ang khang gyas, are paired punctuation; they are typically used together to form a roof
over one or more digits or words. In this case, kerning or special ligatures may be required
for proper rendering. The right ang khang may also be used much as a single closing paren-
thesis is used in forming lists; again, special kerning may be required for proper rendering.
The marks U+0F3E tibetan sign yar tshes and U+0F3F tibetan sign mar tshes are
paired signs used to combine with digits; special glyphs or compositional metrics are
required for their use.
A set of frequently occurring astrological and religious signs specific to Tibetan is encoded
between U+0FBE and U+0FCF.
U+0F34, which means “et cetera” or “and so on,” is used after the first few tsek-bar of a
recurring phrase. U+0FBE (often three times) indicates a refrain.
U+0F36 and U+0FBF are used to indicate where text should be inserted within other text
or as references to footnotes or marginal notes.
Svasti Signs. The svasti signs encoded in the range U+0FD5..U+0FD8 are widely used
sacred symbols associated with Hinduism, Buddhism, and Jainism. They are often printed
in religious texts, marriage invitations, and decorations, and are considered symbols of
good luck and well-being. In the Hindu tradition in India, the dotted forms are often used.
The svasti signs are used to mark religious flags in Jainism and also appear on Buddhist
temples, or as map symbols to indicate the location of Buddhist temples throughout Asia.
These signs are encoded in the Tibetan block, but are intended for general use; they occur
with many other scripts in Asia.
In the Tibetan language, the right-facing svasti sign is referred to as gyung drung nang -khor
and the left-facing svasti sign as gyung drung phyi -khor. U+0FCC tibetan symbol nor bu
bzhi -khyil, or quadruple body symbol, is a Tibetan-specific version of the left-facing
svasti sign.
The svasti signs have also been borrowed into the Han script and adapted as CJK ideo-
graphs. The CJK unified ideographs U+534D and U+5350 correspond to the left-facing
and right-facing svasti signs, respectively. These CJK unified ideographs have adopted Han
script-specific features and properties: they share metrics and type style characteristics
with other ideographs, and are given radicals and stroke counts like those for other ideo-
graphs.
Other Characters. The Wheel of Dharma, which occurs sometimes in Tibetan texts, is
encoded in the Miscellaneous Symbols block at U+2638.
The marks U+0F35 tibetan mark ngas bzung nyi zla and U+0F37 tibetan mark ngas
bzung sgor rtags conceptually attach to a tsek-bar rather than to an individual character
and function more like attributes than characters—for example, as underlining to mark or
emphasize text. In Tibetan interspersed commentaries, they may be used to tag the tsek-bar
belonging to the root text that is being commented on. The same thing is often accom-
plished by setting the tsek-bar belonging to the root text in large type and the commentary
in small type. Correct placement of these glyphs may be problematic. If they are treated as
normal combining marks, they can be entered into the text following the vowel signs in a
stack; if used, their presence will need to be accounted for by searching algorithms, among
other things.
Tibetan Half-Numbers. The half-number forms (U+0F2A..U+0F33) are peculiar to
Tibetan, though other scripts (for example, Bengali) have similar fractional concepts. The
value of each half-number is 0.5 less than the number within which it appears. These forms
are used only in some traditional contexts and appear as the last digit of a multidigit num-
ber. For example, the sequence of digits “U+0F24 U+0F2C” represents the number 42.5 or
forty-two and one-half.
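This arithmetic can be checked with Python's unicodedata module, which exposes the numeric values of the Tibetan digits and half-digits; the sketch assumes a well-formed numeral in which only the last character may be a half form.

import unicodedata

def tibetan_numeral_value(s: str) -> float:
    """Value of a Tibetan numeral; a trailing half-digit replaces the last digit."""
    value = 0.0
    for ch in s:
        value = value * 10 + unicodedata.numeric(ch)
    return value

# U+0F24 TIBETAN DIGIT FOUR followed by U+0F2C TIBETAN DIGIT HALF THREE
print(tibetan_numeral_value("\u0F24\u0F2C"))   # 42.5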
Tibetan Transliteration and Transcription of Other Languages. Tibetan traditions are in
place for transliterating other languages. Most commonly, Sanskrit has been the language
being transliterated, although Chinese has become more common in modern times. Addi-
tionally, Mongolian has a transliterated form. There are even some conventions for trans-
literating English. One feature of Tibetan script/grammar is that it allows for totally
accurate transliteration of Sanskrit. The basic Tibetan letterforms and punctuation marks
contain most of what is needed, although a few extra things are required. With these addi-
tions, Sanskrit can be transliterated perfectly into Tibetan, and the Tibetan transliteration
can be rendered backward perfectly into Sanskrit with no ambiguities or difficulties.
The six Sanskrit retroflex letters are interleaved among the other consonants.
The compound Sanskrit consonants are not used in normal Tibetan. Precomposed forms
of aspirate letters (and the conjunct “kssa”) are explicitly coded, along with their corre-
sponding subjoined forms: for example, U+0F43 tibetan letter gha, and U+0F93
tibetan subjoined letter gha, or U+0F69 tibetan letter kssa, and U+0FB9 tibetan
subjoined letter kssa. However, these characters, including the subjoined forms,
decompose to stacked sequences involving subjoined “ha” (or “reversed sha”) in all Uni-
code normalization forms.
The vowel signs of Sanskrit not included in Tibetan are encoded with other vowel signs
between U+0F70 and U+0F7D. U+0F7F tibetan sign rnam bcad (nam chay) is the vis-
arga, and U+0F7E tibetan sign rjes su nga ro (nga-ro) is the anusvara. See Section 12.1,
Devanagari, for more information on these two characters.
The characters encoded in the range U+0F88..U+0F8B are used in transliterated text and
are most commonly found in Kalachakra literature.
When the Tibetan script is used to transliterate Sanskrit, consonants are sometimes
stacked in ways that are not allowed in native Tibetan stacks. Even complex forms of this
stacking behavior are catered for properly by the method described earlier for coding
Tibetan stacks.
Other Signs. U+0F09 tibetan mark bskur yig mgo is a list enumerator used at the begin-
ning of administrative letters in Bhutan, as is the petition honorific U+0F0A tibetan mark
bka- shog yig mgo.
U+0F3A tibetan mark gug rtags gyon and U+0F3B tibetan mark gug rtags gyas are
paired punctuation marks (brackets).
The sign U+0F39 tibetan mark tsa -phru (tsa-’phru, which is a lenition mark) is the
ornamental flaglike mark that is an integral part of the three consonants U+0F59 tibetan
letter tsa, U+0F5A tibetan letter tsha, and U+0F5B tibetan letter dza. Although
those consonants are not decomposable, this mark has been abstracted and may by itself be
applied to “pha” and other consonants to make new letters for use in transliteration and
transcription of other languages. For example, in modern literary Tibetan, it is one of the
ways used to transcribe the Chinese “fa” and “va” sounds not represented by the normal
Tibetan consonants. Tsa-’phru is also used to represent tsa, tsha, and dza in abbreviations.
Traditional Text Formatting and Line Justification. Native Tibetan texts (“pecha”) are
written and printed using a justification system that is, strictly speaking, right-ragged but
with an attempt to right-justify. Each page has a margin. That margin is usually demarcated
with visible border lines required of a pecha. In modern times, when Tibetan text is pro-
duced in Western-style books, the margin lines may be dropped and an invisible margin
used. When writing the text within the margins, an attempt is made to have the lines of text
justified up to the right margin. To do so, writers keep an eye on the overall line length as
they fill lines with text and try manually to justify to the right margin. Even then, a gap at
the right margin often cannot be filled. If the gap is short, it will be left as is and the line will
be said to be justified enough, even though by machine-justification standards the line is
not truly flush on the right. If the gap is large, the intervening space will be filled with as
many tseks as are required to justify the line. Again, the justification is not done perfectly in
the way that English text might be perfectly right-justified; as long as the last tsek is more or
less at the right margin, that will do. The net result is that of a right-justified, blocklike look
to the text, but the actual lines are always a little right-ragged.
Justifying tseks are nearly always used to pad the end of a line when the preceding character
is a tsek—in other words, when the end of a line arrives in the middle of tshig-grub (see the
previous definition under “Tibetan Punctuation”). However, it is unusual for a line that
ends at the end of a tshig-grub to have justifying tseks added to the shay at the end of the
tshig-grub. That is, a sequence like that shown in the first line of Figure 13-2 is not usually
padded as in the second line of Figure 13-2, though it is allowable. In this case, instead of
justifying the line with tseks, the space between shays is enlarged and/or the whitespace fol-
lowing the final shay is usually left as is. Padding is never applied following an actual space
character. For example, given the existence of a space after a shay, a line such as the third
line of Figure 13-2 may not be written with the padding as shown because the final shay
should have a space after it, and padding is never applied after spaces. The same rule
applies where the final consonant of a tshig-grub that ends a line is a “ka” or “ga”. In that
case, the ending shay is dropped but a space is still required after the consonant and that
space must not be padded. For example, the appearance shown in the fourth line of
Figure 13-2 is not acceptable.
[Figure 13-2: four example lines illustrating where justifying tseks may and may not be
added at the end of a line.]
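A rough Python sketch of the padding rule: justifying tseks are appended only when the line already ends in a tsek, never after a shay or a space. Real pecha justification is a typesetting matter; this fragment only mirrors the plain-text constraint described above, and the helper name is ours.

TSHEG = "\u0F0B"   # interword tsek

def pad_line(line: str, width: int) -> str:
    """Append justifying tseks only if the line ends mid-tshig-grub (with a tsek);
    lines ending in a shay or a space are returned unchanged."""
    if len(line) < width and line.endswith(TSHEG):
        line += TSHEG * (width - len(line))
    return line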
Tibetan text has two rules for formatting text at the beginning of a new line. There are
severe constraints on which characters can start a new line, and the first rule is traditionally
stated as follows: A shay of any description may never start a new line. Nothing except
actual words of text can start a new line, with the only exception being a yig-go (yig-mgo) at
the head of a front page or a da-tshe (zla-tshe, meaning “crescent moon”—for example,
U+0F05) or one of its variations, which is effectively an “in-line” yig-go (yig-mgo), on any
other line. One of two or three ornamental shays is also commonly used in short pieces of
prose in place of the more formal da-tshe. This also means that a space may not start a new
line in the flow of text. If there is a major break in a text, a new line might be indented.
A syllable (tsheg-bar) that comes at the end of a tshig-grub and that starts a new line must
have the shay that would normally follow it replaced by a rin-chen-spungs-shad (U+0F11).
The reason for this second rule is that the presence of the rin-chen-spungs-shad makes the
end of tshig-grub more visible and hence makes the text easier to read.
In verse, the second shay following the first rin-chen-spungs-shad is sometimes replaced by
a rin-chen-spungs-shad, though the practice is formally incorrect. It is a writer’s trick done
to make a particular scribing of a text more elegant. Although a moderately popular device,
it does break the rule. Not only is rin-chen-spungs-shad used as the replacement for the
shay but a whole class of “ornamental shays” are used for the same purpose. All are scribal
variants on a rin-chen-spungs-shad, which is correctly written with three dots above it.
Tibetan Shorthand Abbreviations (bskungs-yig) and Limitations of the Encoding. A
consonant functioning as the word base (ming-gzhi) is allowed to take only one vowel sign
according to Tibetan grammar. The Tibetan shorthand writing technique called bskungs-
yig does allow one or more words to be contracted into a single, very unusual combination
of consonants and vowels. This construction frequently entails the application of more
than one vowel sign to a single consonant or stack, and the composition of the stacks them-
selves can break the rules of normal Tibetan grammar. For this reason, vowel signs some-
times interact typographically, which accounts for their particular combining classes (see
Section 4.3, Combining Classes).
The Unicode Standard accounts for plain text compounds of Tibetan that contain at most
one base consonant, any number of subjoined consonants, followed by any number of
vowel signs. This coverage constitutes the vast majority of Tibetan text. Rarely, stacks are
seen that contain more than one such consonant-vowel combination in a vertical arrange-
ment. These stacks are highly unusual and are considered beyond the scope of plain text
rendering. They may be handled by higher-level mechanisms.
13.5 Mongolian
Mongolian: U+1800–U+18AF
The Mongolians are key representatives of a cultural-linguistic group known as Altaic, after
the Altai mountains of central Asia. In the past, these peoples have dominated the vast
expanses of Asia and beyond, from the Baltic to the Sea of Japan. Echoes of Altaic languages
remain from Finland, Hungary, and Turkey, across central Asia, to Korea and Japan.
Today the Mongolians are represented politically in the country of Mongolia (also known
as Outer Mongolia) and Inner Mongolia (formally the Inner Mongolia Autonomous
Region, China), with Mongolian populations also living in other areas of China.
The Mongolian block unifies the traditional writing system for the Mongolian language
and the three derivative writing systems Todo, Manchu, and Sibe. The traditional writing
system is also known as “Hudum Mongolian,” and is explicitly referred to as “Hudum” in
the following text. Each of the three derivative writing systems shares some common letters
with Hudum, and these letters are encoded only once. Each derivative writing system also
has a number of modified letterforms or new letters, which are encoded separately. The let-
ters typically required by each writing system’s modern usage are encoded as shown in
Table 13-4.
Mongolian, Todo, and Manchu also have a number of special “Ali Gali” letters that are
used for transcribing Tibetan and Sanskrit in Buddhist texts.
History. The Mongolian script was created around the beginning of the thirteenth century,
during the reign of Genghis Khan. It derives from the Old Uyghur script, which was in use
from about the eighth to the fifteenth century. Old Uyghur itself was an adaptation of Sog-
dian Aramaic, a Semitic script written horizontally from right to left. Probably under the
influence of the Chinese script, the Old Uyghur script became rotated ninety degrees coun-
terclockwise so that the lines of text read vertically in columns running from left to right.
The Mongolian script inherited this directionality from the Old Uyghur script.
The Mongolian script has remained in continuous use for writing Mongolian within the
Inner Mongolia Autonomous Region of China and elsewhere in China. However, in Mon-
golia (Outer Mongolia), the traditional script was replaced by a Cyrillic orthography in the
early 1940s. The traditional script was revived in the early 1990s, so that now both the
Cyrillic and the Mongolian scripts are used. The spelling used with the traditional Mongo-
lian script represents the literary language of the seventeenth and early eighteenth centu-
ries, whereas the Cyrillic script is used to represent the modern, colloquial pronunciation
of words. As a consequence, there is no one-to-one relationship between the traditional
Mongolian orthography and Cyrillic orthography. Approximate correspondence map-
pings are indicated in the code charts, but are not necessarily unique in either direction. All
of the Cyrillic characters needed to write Mongolian are included in the Cyrillic block of
the Unicode Standard.
In addition to the traditional Mongolian script of Mongolia, several historical modifica-
tions and adaptations of the Mongolian script have emerged elsewhere. These adaptations
are often referred to as scripts in their own right, although for the purposes of character
encoding in the Unicode Standard they are treated as styles of the Mongolian script and
share encoding of their basic letters.
The Todo script is a modified and improved version of the Mongolian script, devised in
1648 by Zaya Pandita for use by the Kalmyk Mongolians, who had migrated to Russia in
the sixteenth century, and who now inhabit the Republic of Kalmykia in the Russian Feder-
ation. The name Todo means “clear” in Mongolian; it refers to the fact that the new script
eliminates the ambiguities inherent in the original Mongolian script. The orthography of
the Todo script also reflects the Oirat-Kalmyk dialects of Mongolian rather than literary
Mongolian. In Kalmykia, the Todo script was replaced by a succession of Cyrillic and Latin
orthographies from the mid-1920s and is no longer in active use. Until very recently the
Todo script was still used by speakers of the Oirat and Kalmyk dialects within Xinjiang and
Qinghai in China.
The Manchu script is an adaptation of the Mongolian script used to write Manchu, a Tun-
gusic language that is not closely related to Mongolian. The Mongolian script was first
adapted for writing Manchu in 1599 under the orders of the Manchu leader Nurhachi, but
few examples of this early form of the Manchu script survive. In 1632, the Manchu scholar
Dahai reformed the script by adding circles and dots to certain letters in an effort to distin-
guish their different sounds and by devising new letters to represent the sounds of the Chi-
nese language. When the Manchu people conquered China to rule as the Qing dynasty
(1644–1911), Manchu became the language of state. The ensuing systematic program of
translation from Chinese created a large and important corpus of books written in Man-
chu. Over time the Manchu people became completely sinified, and as a spoken language
Manchu is now almost extinct.
The Sibe (also spelled Sibo, Xibe, or Xibo) people are closely related to the Manchus, and
their language is often classified as a dialect of Manchu. The Sibe people are widely dis-
persed across northwest and northeast China due to deliberate programs of ethnic disper-
sal during the Qing dynasty. The majority have become assimilated into the local
population and no longer speak the Sibe language. However, there is a substantial Sibe
population in the Sibe Autonomous County in the Ili River valley in Western Xinjiang, the
descendants of border guards posted to Xinjiang in 1764, who still speak and write the Sibe
language. The Sibe script is based on the Manchu script, with a few modified letters.
Directionality. The Mongolian script is written vertically from top to bottom in columns
running from left to right. In modern contexts, words or phrases may be embedded in hor-
izontal scripts. In such a case, the Mongolian text will be rotated ninety degrees counter-
clockwise so that it reads from left to right.
When rendering Mongolian text in a system that does not support vertical layout, the text
should be laid out in horizontal lines running left to right. If such text is viewed sideways,
the usual Mongolian column order appears reversed, but this orientation can be workable
for short stretches of text. There are no bidirectional effects in such a layout because all text
is horizontal left to right.
Encoding Principles. The encoding model for Mongolian is somewhat different from that
for any other script within Unicode, and in many respects it is the most complicated. For
this reason, only the essential features of Mongolian shaping behavior are presented here.
The Semitic alphabet from which the Mongolian script was ultimately derived is funda-
mentally inadequate for representing the sounds of the Mongolian language. As a result,
many of the Mongolian letters are used to represent two different sounds, and the correct
pronunciation of a letter may be known only from the context. In this respect, Mongolian
orthography is similar to English spelling, in which the pronunciation of a letter such as c
may be known only from the context.
Unlike in the Latin script, in which c /k/ and c /s/ are treated as the same letter and
encoded as a single character, in the Mongolian script different phonetic values of the same
glyph may be encoded as distinct characters. Modern Mongolian grammars consider the
phonetic value of a letter to be its distinguishing feature, rather than its glyph shape. For
example, the four Mongolian vowels o, u, ö, and ü are considered four distinct letters and
are encoded as four characters (U+1823, U+1824, U+1825, and U+1826, respectively),
even though o is written identically to u in all positional forms, ö is written identically to ü
in all positional forms, and o and u are normally distinguished from ö and ü only in the first syl-
lable of a word. Likewise, the letters t (U+1832) and d (U+1833) are often indistinguish-
able. For example, pairs of Mongolian words such as urtu “long” and ordu “palace, camp,
horde” or ende “here” and ada “devil” are written identically, but are represented using dif-
ferent sequences of Unicode characters, as shown in Figure 13-3. There are many such
examples in Mongolian, but not in Todo, Manchu, or Sibe, which have largely eliminated
ambiguous letters.
Cursive Joining. The Mongolian script is cursive, and the letters constituting a word are
normally joined together. In most cases the letters join together naturally along a vertical
stem, but in the case of certain “bowed” consonants (for example, U+182A mongolian
letter ba and the feminine form of U+182C mongolian letter qa), which lack a trailing
vertical stem, they may form ligatures with a following vowel. This is illustrated in
Figure 13-4, where the letter ba combines with the letter u to form a ligature in the
Mongolian word abu "father."
[Figure 13-3: the word pairs are written identically but encoded distinctly; for example,
ende is <1821, 1828, 1833, 1821> and ada is <1820, 1833, 1820>.]
[Figure 13-4: abu is encoded as <1820, 182A, 1824>, with ba and u forming a ligature.]
The Joining_Type values for Mongolian characters are defined in ArabicShaping.txt in the
Unicode Character Database. For a discussion of the meaning of Joining_Type values in
the context of a vertically rendered script, see “Cursive Joining” in Section 14.4, Phags-pa.
Most Mongolian characters are Dual_Joining, as they may join on both top and bottom.
Many letters also have distinct glyph forms depending on their position within a word.
These positional forms are classified as initial, medial, final, or isolate. The medial form is
often the same as the initial form, but the final form is always distinct from the initial or
medial form. Figure 13-5 shows the Mongolian letters U+1823 o and U+1821 e, rendered
with distinct positional forms initially and finally in the Mongolian words odo “now” and
ene “this.”
[Figure 13-5: odo is encoded as <1823, 1833, 1823> and ene as <1821, 1828, 1821>, with
distinct initial and final forms of o and e.]
U+200C zero width non-joiner (ZWNJ) and U+200D zero width joiner (ZWJ) may
be used to select a particular positional form of a letter in isolation or to override the
expected positional form within a word. Basically, they evoke the same contextual selection
effects in neighboring letters as do non-joining or joining regular letters, but are them-
selves invisible (see Chapter 23, Special Areas and Format Characters). For example, the
various positional forms of U+1820 mongolian letter a may be selected by means of the
following character sequences:
<1820> selects the isolate form.
<1820 200D> selects the initial form.
<200D 1820> selects the final form.
<200D 1820 200D> selects the medial form.
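These sequences can be generated mechanically; the following Python fragment builds the four forms exactly as listed above.

ZWJ = "\u200D"    # ZERO WIDTH JOINER
A = "\u1820"      # MONGOLIAN LETTER A

positional_forms = {
    "isolate": A,
    "initial": A + ZWJ,
    "final": ZWJ + A,
    "medial": ZWJ + A + ZWJ,
}
for name, seq in positional_forms.items():
    print(name, [f"U+{ord(c):04X}" for c in seq])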
Some letters have additional variant forms that do not depend on their position within a
word, but instead reflect differences between modern versus traditional orthographic prac-
tice or lexical considerations—for example, special forms used for writing foreign words.
On occasion, other contextual rules may condition a variant form selection. For example, a
certain variant of a letter may be required when it occurs in the first syllable of a word or
when it occurs immediately after a particular letter.
The various positional and variant glyph forms of a letter are considered presentation
forms and are not encoded separately. It is the responsibility of the rendering system to
select the correct glyph form for a letter according to its context.
Free Variation Selectors. When a glyph form that cannot be predicted algorithmically is
required (for example, when writing a foreign word), the user needs to append an appro-
priate variation selector to the letter to indicate to the rendering system which glyph form
is required. The following free variation selectors are provided for use specifically with the
Mongolian block:
U+180B mongolian free variation selector one (FVS1)
U+180C mongolian free variation selector two (FVS2)
U+180D mongolian free variation selector three (FVS3)
These format characters normally have no visual appearance. When required, a free varia-
tion selector immediately follows the base character it modifies. This combination of base
character and variation selector is known as a standardized variant. The table of standard-
ized variants, StandardizedVariants.txt, in the Unicode Character Database exhaustively
lists all currently defined standardized variants. All combinations not listed in the table are
unspecified and are reserved for future standardization; no conformant process may inter-
pret them as standardized variants. Therefore, any free variation selector not immediately
preceded by one of its defined base characters will be ignored.
Figure 13-6 gives an example of how a free variation selector may be used to select a partic-
ular glyph variant. In modern orthography, the initial letter ga in the Mongolian word gal
“fire” is written with two dots; in traditional orthography, the letter ga is written without
any dots. By default, the dotted form of the letter ga is selected, but this behavior may be
overridden by means of FVS1, so that ga plus FVS1 selects the undotted form of the letter
ga.
[Figure 13-6: gal in modern orthography is <182D, 1820, 182F>; gal with the undotted
(traditional) ga is <182D, 180B, 1820, 182F>.]
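In code-point terms the two spellings of gal differ only by the presence of FVS1 after the initial ga; a minimal Python sketch, with the letter values taken from the figure above:

GA, A, LA = "\u182D", "\u1820", "\u182F"
FVS1 = "\u180B"   # MONGOLIAN FREE VARIATION SELECTOR ONE

gal_default = GA + A + LA            # dotted (modern) initial ga by default
gal_undotted = GA + FVS1 + A + LA    # FVS1 requests the undotted (traditional) form

print([f"U+{ord(c):04X}" for c in gal_undotted])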
Vowel Harmony. Mongolian has a system of vowel harmony, whereby the vowels in a word
are either all “masculine” and “neuter” vowels (that is, back vowels plus /i/) or all “femi-
nine” and “neuter” vowels (that is, front vowels plus /i/). Words that are written with mas-
culine/neuter vowels are considered to be masculine, and words that are written with
feminine/neuter vowels are considered to be feminine. Words with only neuter vowels
behave as feminine words (for example, take feminine suffixes). Manchu and Sibe have a
similar system of vowel harmony, although it is not so strict. Some words in these two
scripts may include both masculine and feminine vowels, and separated suffixes with mas-
culine or feminine vowels may be applied to a stem irrespective of its gender.
Vowel harmony is an important element of the encoding model, as the gender of a word
determines the glyph form of the velar series of consonant letters for Mongolian, Todo,
Sibe, and Manchu. In each script, the velar letters have both masculine and feminine
forms. For Mongolian and Todo, the masculine and feminine forms of these letters have
different pronunciations.
When one of the velar consonants precedes a vowel, it takes the masculine form before
masculine vowels, and the feminine form before feminine or neuter vowels. In the latter
case, a ligature of the consonant and vowel is required.
When one of these consonants precedes another consonant or is the final letter in a word,
it may take either a masculine or feminine glyph form, depending on its context. The ren-
dering system should automatically select the correct gender form for these letters based
on the gender of the word (in Mongolian and Todo) or the gender of the preceding vowel
(in Manchu and Sibe). This is illustrated by Figure 13-7, where U+182D mongolian let-
ter ga takes a masculine glyph form when it occurs finally in the masculine word jarlig
“order,” but takes a feminine glyph form when it occurs finally in the feminine word cherig
“soldier.” In this example, the gender form of the final letter ga depends on whether the
first vowel in the word is a back (masculine) vowel or a front (feminine or neuter) vowel.
Where the gender is ambiguous or a form not derivable from the context is required, the
user needs to specify which form is required by means of the appropriate free variation
selector.
[Figure 13-7: jarlig is encoded as <1835, 1820, 1837, 182F, 1822, 182D> and cherig as
<1834, 1821, 1837, 1822, 182D>; the final ga takes a masculine form in the first word
and a feminine form in the second.]
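A rendering engine therefore needs the gender of the word before it can shape a final velar. The following Python sketch classifies a Hudum Mongolian word by its vowels; the assignment of a, o, u as masculine, e, ö, ü as feminine, and i as neuter is a simplification assumed for this sketch, not a rule stated by the standard.

MASCULINE_VOWELS = {"\u1820", "\u1823", "\u1824"}   # a, o, u (back vowels)

def word_gender(word: str) -> str:
    """Masculine if any back vowel occurs; otherwise feminine
    (words containing only the neuter vowel i behave as feminine)."""
    return "masculine" if any(c in MASCULINE_VOWELS for c in word) else "feminine"

jarlig = "\u1835\u1820\u1837\u182F\u1822\u182D"   # sequences from Figure 13-7
cherig = "\u1834\u1821\u1837\u1822\u182D"
print(word_gender(jarlig), word_gender(cherig))   # masculine feminine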
Narrow No-Break Space. In Mongolian, Todo, Manchu, and Sibe, certain grammatical
suffixes are separated from the word stem or from other suffixes by a gap. Many separated
suffixes exhibit shapes that are distinct from ordinary words, and thus require special shap-
ing.
There are many separated suffixes in Mongolian, usually occurring in masculine and femi-
nine pairs (for example, the dative suffixes -dur and -dür), most of which require special
shaping; a stem may have multiple separated suffixes. In contrast, there are only six sepa-
rated suffixes for Manchu and Sibe, only one of which (-i) requires special shaping; stems
do not have more than one separated suffix at a time.
Because separated suffixes are usually considered an integral part of the word as a whole, a
line break opportunity does not normally occur before a separated suffix. The whitespace
preceding the suffix is often narrower than an ordinary space, although the width may
expand during justification. U+202F narrow no-break space (NNBSP) is used to repre-
sent this small whitespace; it not only prevents word breaking and line breaking, but also
triggers special shaping for the following separated suffix. The resulting shape depends on
the particular separated suffix. Note that NNBSP may be preceded by another separated
suffix, and NNBSP may also appear between non-Mongolian characters and a separated
suffix.
Normally, NNBSP does not provide a line breaking opportunity. However, in situations
where a line is broken before a separated suffix, such as in narrow columns, it is important
not to disable the special shaping triggered by NNBSP. This behavior may be achieved by
placing the break so that the NNBSP character is at the start of the new line. At the begin-
ning of the line, the NNBSP should affect only the shaping of the following Mongolian
characters, and should display with no advance width.
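A minimal Python sketch of the representation: the separated suffix is attached with U+202F, and a forced break keeps the NNBSP at the head of the new line so that suffix shaping is preserved. The stem and the spelling of the dative suffix -dur used here are assumptions for illustration only.

NNBSP = "\u202F"   # NARROW NO-BREAK SPACE

stem = "\u182C\u1820\u1828\u1820"   # an illustrative stem (assumed spelling)
dur = "\u1833\u1824\u1837"          # the suffix -dur (assumed spelling)

word = stem + NNBSP + dur           # no line break opportunity at the NNBSP

# If a break is unavoidable, keep the NNBSP with the suffix at the start of the
# next line so that it still triggers the special suffix shaping.
first_line, rest = word.split(NNBSP, 1)
second_line = NNBSP + rest
print(repr(first_line), repr(second_line))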
Mongolian Vowel Separator. In Mongolian, the letters a (U+1820) and e (U+1821) in a
word-final position may take a “forward tail” form or a “backward tail” form depending on
the preceding consonant that they are attached to. In some words, a final letter a or e is dis-
connected from the preceding consonant, in which case the vowel always takes the “for-
ward tail” form. U+180E mongolian vowel separator (MVS) is used to represent the
break between a final letter a or e and the rest of the word. MVS is similar in function to
NNBSP, as it divides a word and disconnects the two parts. Whereas NNBSP marks off a
grammatical suffix, however, the a or e following MVS is not a suffix but an integral part of
the word stem.
Whether a final letter a or e is joined or separated is purely lexical and is not a question of
varying orthography. This distinction is shown in Figure 13-8. The example on the left
shows the word qana <182C, 1820, 1828, 1820> without a break before the final letter a,
which means “the outer casing of a vein.” The example on the right shows the word qana
<182C, 1820, 1828, 180E, 1820> with a break before the final letter a, which means “the
wall of a tent.”
The MVS has a twofold effect on shaping. On the one hand, it always selects the forward
tail form of a following letter a or e. On the other hand, it may affect the form of the preced-
ing letter. The particular form that is taken by a letter preceding an MVS depends on the
particular letter and in some cases on whether traditional or modern orthography is being
used. The MVS is not needed for writing Todo, Manchu, or Sibe.
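The two readings of qana described above differ only by the MVS; a small Python sketch using the character sequences given in the text:

MVS = "\u180E"   # MONGOLIAN VOWEL SEPARATOR

qana_vein = "\u182C\u1820\u1828\u1820"              # "the outer casing of a vein"
qana_wall = "\u182C\u1820\u1828" + MVS + "\u1820"   # "the wall of a tent"

print(qana_vein == qana_wall)                       # False: distinct spellings
print([f"U+{ord(c):04X}" for c in qana_wall])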
Baluda. The two Mongolian baluda characters, U+1885 mongolian letter ali gali
baluda and U+1886 mongolian letter ali gali three baluda, are historically related
to U+0F85 tibetan mark paluta. When used in Mongolian text rendered vertically, a
baluda or three baluda character appears to the right side of the first character in a word.
To simplify rendering implementations for Mongolian Ali Gali texts, the baluda characters
have been categorized as nonspacing combining marks, rather than as letters.
Numbers. The Mongolian and Todo scripts use a set of ten digits derived from the Tibetan
digits. In vertical text, numbers are traditionally written from left to right across the width
of the column. In modern contexts, they are frequently rotated so that they follow the ver-
tical flow of the text.
The Manchu and Sibe scripts do not use any special digits, although Chinese number ideo-
graphs may be employed—for example, for page numbering in traditional books.
Punctuation. Traditional punctuation marks used for Mongolian and Todo include the
U+1800 mongolian birga (marks the start of a passage or the recto side of a folio),
U+1802 mongolian comma, U+1803 mongolian full stop, and U+1805 mongolian
four dots (marks the end of a passage). The birga occurs in several different glyph forms.
In writing Mongolian and Todo, U+1806 mongolian todo soft hyphen is used at the
beginning of the second line to indicate resumption of a broken word. It functions like
U+2010 hyphen, except that U+1806 appears at the beginning of a line rather than at the
end.
The Manchu script normally uses only two punctuation marks: U+1808 mongolian man-
chu comma and U+1809 mongolian manchu full stop.
In modern contexts, Mongolian, Todo, and Sibe may use a variety of Western punctuation
marks, such as parentheses, quotation marks, question marks, and exclamation marks.
U+2048 question exclamation mark and U+2049 exclamation question mark are
used for side-by-side display of a question mark and an exclamation mark together in ver-
tical text. Todo and Sibe may additionally use punctuation marks borrowed from Chinese,
such as U+3001 ideographic comma, U+3002 ideographic full stop, U+300A left
double angle bracket, and U+300B right double angle bracket.
Nirugu. U+180A mongolian nirugu acts as a stem extender. In traditional Mongolian
typography, it is used to physically extend the stem joining letters, so as to increase the sep-
aration between all letters in a word. This stretching behavior should preferably be carried
out in the font rather than by the user manually inserting U+180A.
The nirugu may also be used to separate two parts of a compound word. For example,
altan-agula “The Golden Mountains” may be written with the words altan, “golden,” and
agula, “mountains,” joined together using the nirugu. In this usage the nirugu is similar to
the use of hyphen in Latin scripts, but it is nonbreaking.
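A sketch of that usage in Python; the spellings of altan and agula are assumptions for illustration and should be checked against an actual Mongolian text.

NIRUGU = "\u180A"   # MONGOLIAN NIRUGU

altan = "\u1820\u182F\u1832\u1820\u1828"   # altan "golden" (assumed spelling)
agula = "\u1820\u182D\u1824\u182F\u1820"   # agula "mountains" (assumed spelling)

altan_agula = altan + NIRUGU + agula       # compound joined by the nirugu
print([f"U+{ord(c):04X}" for c in altan_agula])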
Syllable Boundary Marker. U+1807 mongolian sibe syllable boundary marker is
used to disambiguate syllable boundaries within a word. It is mainly used for writing Sibe,
but may also occur in Manchu texts. In native Manchu or Sibe words, syllable boundaries
are never ambiguous; when transcribing Chinese proper names in the Manchu or Sibe
script, however, the syllable boundary may be ambiguous. In such cases, U+1807 may be
inserted into the character sequence at the syllable boundary.
13.6 Limbu
Limbu: U+1900–U+194F
The Limbu script is a Brahmic script primarily used to write the Limbu language. Limbu is
a Tibeto-Burman language of the East Himalayish group and is spoken by about 200,000
persons mainly in eastern Nepal, but also in the neighboring Indian states of Sikkim and
West Bengal (Darjeeling district). Its close relatives are the languages of the East Himalay-
ish or “Kiranti” group in Eastern Nepal. Limbu is distantly related to the Lepcha (Róng)
language of Sikkim and to Tibetan. Limbu was recognized as an official language in Sikkim
in 1981.
The Nepali name Limbu is of uncertain origin. In Limbu, the Limbu call themselves yak-
thung. Individual Limbus often take the surname "Subba," a Nepali term of Arabic origin
meaning “headman.” The Limbu script is often called “Sirijanga” after the Limbu culture-
hero Sirijanga, who is credited with its invention. It is also sometimes called Kirat, kirāta
being a Sanskrit term probably referring to some variety of non-Aryan hill-dwellers.
The oldest known writings in the Limbu script, most of which are held in the India Office
Library, London, were collected in Darjeeling district in the 1850s. The modern script was
developed beginning in 1925 in Kalimpong (Darjeeling district) in an effort to revive writ-
ing in Limbu, which had fallen into disuse. The encoding in the Unicode Standard sup-
ports the three versions of the Limbu script: the nineteenth-century script, found in
manuscript documents; the early modern script, used in a few, mainly mimeographed,
publications between 1928 and the 1970s; and the current script, used in Nepal and India
(especially Sikkim) since the 1970s. There are significant differences, particularly between
some of the glyphs required for the nineteenth-century and modern scripts.
Virtually all Limbu speakers are bilingual in Nepali, and far more Limbus are literate in
Nepali than in Limbu. For this reason, many Limbu publications contain material both in
Nepali and in Limbu, and in some cases Limbu appears in both the Limbu script and the
Devanagari script. In some publications, literary coinages are glossed in Nepali or in
English.
Consonants. Consonant letters and clusters represent syllable initial consonants and clus-
ters followed by the inherent vowel, short open o ([ɔ]). Subjoined consonant letters are
joined to the bottom of the consonant letters, extending to the right to indicate “medials”
in syllable-initial consonant clusters. There are very few of these clusters in native Limbu
words. The script provides for subjoined -ya (U+1929), -ra (U+192A), and -wa (U+192B). Small letters are used to
indicate syllable-final consonants. (See the following information on vowel length for fur-
ther details.) The small letter consonants are found in the range U+1930..U+1938, corre-
sponding to the syllable finals of native Limbu words. These letters are independent forms
that, unlike the conjoined or half-letter forms of Indian scripts, may appear alone as word-
final consonants (where Indian scripts use full consonant letters and a virama). The sylla-
ble finals are pronounced without a following vowel.
Punctuation. The main punctuation mark used is the double vertical line, U+0965 deva-
nagari double danda. U+1945 limbu question mark and U+1944 limbu exclama-
tion mark have shapes peculiar to Limbu, especially in Sikkimese typography. They are
encoded in the Unicode Standard to facilitate the use of both Limbu and Devanagari
scripts in the same documents. U+1940 limbu sign loo is used for the exclamatory particle lo. This particle is also often simply spelled out.
Digits. Limbu digits have distinctive forms and are assigned code points because Limbu
and Devanagari (or Limbu and Arabic-Indic) numbers are often used in the same docu-
ment.
13.7 Meetei Mayek
“head,” sam designates “hair-parting,” and lai is “forehead.” The last 9 letters, gok, jham,
rai, and so forth, derive from a subset of the original 18. The ordering system employed
today differs from the Brahmi-based order, which relies on the point of articulation.
Punctuation. The modern Meetei Mayek script uses two punctuation marks in addition to
the killer. U+ABEB meetei mayek cheikhei functions as a double danda mark. U+ABEC
meetei mayek lum iyek is a heavy tone mark, used to orthographically distinguish words
which would otherwise not be differentiated.
Digits. Meetei Mayek has a unique set of ten digits for zero to nine encoded in the range at
U+ABF0..U+ABF9.
13.8 Mro
Mro: U+16A40–U+16A6F
The Mro script was invented in the 1980s. It is used to write the Mro (or Mru) language, a
language of the Mruic branch of Tibeto-Burman spoken in Southeastern Bangladesh and
neighboring areas of Myanmar. (This language is distinct from Mro-Khimi, a language of
the Kukish branch of Tibeto-Burman spoken in Myanmar.)
The Mro script is unrelated to any other script. Some of the letters of the Mro alphabet
have a visual similarity to letters from other alphabets, but such similarities are coinciden-
tal.
Structure. The Mro script is a left-to-right alphabet with no combining characters or tone
marks. Some sounds are represented by more than one letter.
Character Names. Consonant letter names are traditional, based on phonetic transcrip-
tion.
Digits. Mro has a script-specific set of digits.
Punctuation. There are two script-specific punctuation characters, U+16A6E mro danda
and U+16A6F mro double danda. Words are separated by spaces.
Two of the Mro letters are used as abbreviations. U+16A5E mro letter tek can be used
instead of the word “tek,” meaning “quote.” U+16A5C mro letter hai can be used for
various groups of letters.
13.10 Ol Chiki
Ol Chiki: U+1C50–U+1C7F
The Ol Chiki script was invented by Pandit Raghunath Murmu in the first half of the 20th
century ce to write Santali, a Munda language of India. The script is also called Ol Cemet’,
Ol Ciki, or simply Ol. Santali has also been written with the Devanagari, Bengali, and Oriya
scripts, as well as the Latin alphabet.
Various dialects of Santali are spoken by 5.8 million people, with 25% to 50% literacy rates,
mostly in India, with a few in Nepal or Bangladesh. The Ol Chiki script is used primarily
for the southern dialect of Santali as spoken in the Odishan Mayurbhañj district. The script
has received some official recognition by the Odishan government.
Ol Chiki has recently been promoted by some Santal organizations, with uncertain success,
for use in writing certain other Munda languages in the Chota Nagpur area, as well as for
the Dravidian Dhangar-Kudux language.
Structure. Ol Chiki is alphabetic and has none of the structural properties of the abugidas
typical for other Indic scripts. There are separate letters representing consonants and vow-
els. A number of modifier letters are used to indicate tone, nasalization, vowel length, and
deglottalization. There are no combining characters in the script.
Ol Chiki is written from left to right.
Digits. The Ol Chiki script has its own set of digits. These are separately encoded in the Ol
Chiki block.
Punctuation. Western-style punctuation, such as the comma, exclamation mark, question
mark, and quotation marks are used in Ol Chiki text. U+002E “.” full stop is not used,
because it is visually confusable with the modifier letter U+1C79 ol chiki gaahlaa ttud-
daag.
The danda, U+1C7E ol chiki punctuation mucaad, is used as a text delimiter in prose.
The danda and the double danda, U+1C7F ol chiki punctuation double mucaad, are
both used in poetic text.
Modifier Letters. The southern dialect of Santali has only six vowels, each represented by a
single vowel letter. The Santal Parganas dialect, on the other hand, has eight or nine vow-
els. The extra vowels for Santal Parganas are represented by a sequence of one of the vowel
letters U+1C5A, U+1C5F, or U+1C6E followed by the diacritic modifier letter, U+1C79 ol
chiki gaahlaa ttuddaag, displayed as a baseline dot.
Nasalization is indicated by the modifier letter, U+1C78 ol chiki mu ttuddag, displayed
as a raised dot. This mark can follow any vowel, long or short.
When the vowel diacritic and nasalization occur together, the combination is represented
by a separate modifier letter, U+1C7A ol chiki mu-gaahlaa ttuddaag, displayed as
both a baseline and a raised dot. The combination is treated as a separate character and is
entered using a separate key on Ol Chiki keyboards.
U+1C7B ol chiki relaa is a length mark, which can be used with any oral or nasalized
vowel.
Glottalization. U+1C7D ol chiki ahad is a special letter indicating the deglottalization of
an Ol Chiki consonant in final position. This unique feature of the writing system pre-
serves the morphophonemic relationship between the glottalized (ejective) and voiced
equivalents of consonants. For example, U+1C5C ol chiki letter ag represents an ejec-
tive [k’] when written in word-final position, but voiced [g] when written word-initially. A
voiced [g] in word-final position is written with the deglottalization mark as a sequence:
<U+1C5C ol chiki letter ag, U+1C7D ol chiki ahad>.
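As a concrete illustration of the backing-store sequence just described, the short Python sketch below builds the two-character representation of a word-final voiced [g]. Only the code points U+1C5C and U+1C7D named in the text are used; the variable names are illustrative assumptions.

    import unicodedata

    AG   = "\u1C5C"   # ol chiki letter ag: ejective [k'] word-finally, voiced [g] word-initially
    AHAD = "\u1C7D"   # ol chiki ahad: deglottalization mark

    # A word-final voiced [g] is stored as the sequence <ag, ahad>.
    final_voiced_g = AG + AHAD
    for ch in final_voiced_g:
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")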
U+1C7C ol chiki phaarkaa serves the opposite function. It is a “glottal protector.” When
it follows one of the four ejective consonants, it preserves the ejective sound, even in word-
initial position followed by a vowel.
Aspiration. Aspirated consonants are written as digraphs, with U+1C77 ol chiki letter
oh as the second element, indicating the aspiration.
Ligatures. Ligatures are not a normal feature of printed Ol Chiki. However, in handwriting
and script fonts, letters form cursive ligatures with the deglottalization mark, U+1C7D ol
chiki ahad.
13.11 Chakma
Chakma: U+11100–U+1114F
The Chakma people, who live in southeast Bangladesh near Chittagong City, as well as in
parts of India such as Mizoram, Assam, Tripura, and Arunachal Pradesh, speak an Indo-
European language also called Chakma. The language, spoken by about 500,000 people, is
related to the Assamese, Bengali, Chittagonian, and Sylheti languages.
The Chakma script is Brahmi-derived, and is sometimes also called Ajhā pāṭh or Ojhopath.
There are some efforts to adapt the Chakma script to write the closely related Tanchangya
language. One of the interesting features of Chakma writing is that candrabindu (cānaphudā) can be used together with anusvara (ekaphudā) and visarga (dviphudā).
Independent Vowels. Like other Brahmi-derived scripts, Chakma uses consonant letters
that contain an inherent vowel. Consonant clusters are written with conjunct characters,
while a visible “vowel killer” (called the maayyaa) shows the deletion of the inherent vowel
when there is no conjunct. There are four independent vowels: U+11103 chakma letter aa /ā/, U+11104 chakma letter i /i/, U+11105 chakma letter u /u/, and U+11106 chakma letter e /e/. Other vowels in the initial position are formed by adding a dependent vowel sign to the independent vowel /ā/, to form vowels such as /S/, /T/, /ai/, and /oi/.
Vowel Killer and Virama. Like the Myanmar script and the characters used to write his-
toric Meetei Mayek, Chakma is encoded with two vowel-killing characters to conform to
modern user expectations. Chakma uses the maayyaa (killer) to invoke conjoined conso-
nants. Most letters have their vowels killed with the use of the explicit maayyaa character.
In addition to the visible killer, there is an explicit conjunct-forming character (virama),
permitting the user to choose between the subjoining style and the ligating style. Whether
a conjunct is required or not is part of the spelling of a word.
In principle, nothing prevents the visible killer from appearing together with a subjoining
sequence formed with virama. However, in practice, combinations of virama and maayyaa
following a consonant are not meaningful, as both kill the inherent vowel.
In 2001, an orthographic reform was recommended in the book Cāṅmā pattham pāt, limiting the standard repertoire of conjuncts to those composed with the five letters U+11121 chakma letter yaa /yā/, U+11122 chakma letter raa /rā/, U+11123 chakma letter laa /lā/, U+11124 chakma letter waa /wā/, and U+1111A chakma letter naa /nā/.
Chakma Fonts. Chakma fonts by default should display the subjoined form of letters that
follow virama to ensure legibility.
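The contrast between the visible killer and the conjunct-forming virama can be sketched as backing-store sequences. The Python fragment below is only an illustration: the maayyaa and virama are fetched by name (which assumes a Python whose unicodedata is Unicode 11.0 or later), and the choice of raa and waa as the example letters is arbitrary.

    import unicodedata

    RAA     = "\U00011122"                          # chakma letter raa (named above)
    WAA     = "\U00011124"                          # chakma letter waa (named above)
    MAAYYAA = unicodedata.lookup("CHAKMA MAAYYAA")  # explicit, visible vowel killer
    VIRAMA  = unicodedata.lookup("CHAKMA VIRAMA")   # conjunct-forming character

    killed   = RAA + MAAYYAA        # vowelless raa with the maayyaa displayed
    conjunct = RAA + VIRAMA + WAA   # raa + waa cluster, displayed in subjoined form

    for label, s in (("killed", killed), ("conjunct", conjunct)):
        print(label, " ".join(f"U+{ord(c):04X}" for c in s))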
Punctuation. Chakma has a single and double danda. There is also a unique question
mark and a section mark, phulacihna.
Digits. A distinct set of digits is encoded for Chakma. Bengali digits are also used with
Chakma. Myanmar digits are used with the Chakma script when writing Tanchangya.
13.12 Lepcha
Lepcha: U+1C00–U+1C4F
Lepcha is a Sino-Tibetan language spoken by people in Sikkim and in the West Bengal
state of India, especially in the Darjeeling district, which borders Sikkim. The Lepcha script
is a writing system thought to have been invented around 1720 ce by the Sikkim king
Phyag-rdor rNam-rgyal (“Chakdor Namgyal,” born 1686). Both the language and the
script are also commonly known by the term Rong.
Structure. The Lepcha script was based directly on the Tibetan script. The letterforms are
obviously related to corresponding Tibetan letters. However, the dbu-med Tibetan precur-
sors to Lepcha were originally written in vertical columns, possibly influenced by Chinese
conventions. When Lepcha was invented, the dbu-med text was changed to a left-to-right,
horizontal orientation. In the process, the entire script was effectively rotated ninety
degrees counterclockwise, so that the letters resemble Tibetan letters turned on their sides.
This reorientation resulted in some letters which are nonspacing marks in Tibetan becom-
ing spacing letters in Lepcha. Lepcha also introduced its own innovations, such as the use
of diacritical marks to represent final consonants.
The Lepcha script is an abugida: the consonant letters have an inherent vowel, and depen-
dent vowels (matras) are used to modify the inherent vowel of the consonant. No virama
(or vowel killer) is used to remove the inherent vowel. Instead, the script has a separate set
of explicit final consonants which are used to represent a consonant with no inherent vowel.
Vowels. Initial vowels are represented by the neutral letter U+1C23 lepcha letter a, fol-
lowed by the appropriate dependent vowel. U+1C23 lepcha letter a thus functions as a
vowel carrier.
The dependent vowel signs in Lepcha always follow the base consonant in logical order.
However, in rendering, three of these dependent vowel signs, -i, -o, and -oo, reorder to the
left side of their base consonant. One of the dependent vowel signs, -e, is a nonspacing
mark which renders below its base consonant.
Medials. There are three medial consonants, or glides: -ya, -ra, and -la. The first two are
represented by separate characters, U+1C24 lepcha subjoined letter ya and U+1C25
lepcha subjoined letter ra. These are called “subjoined”, by analogy with the corre-
sponding letters in Tibetan, which actually do join below a Tibetan consonant, but in Lep-
cha these are spacing forms which occur to the right of a consonant letter and then ligate
with it. These two medials can also occur in sequence to form a composite medial, -rya. In
that case both medials ligate with the preceding consonant.
On the other hand, Lepcha does not have a separate character to represent the medial -la.
Phonological consonant clusters of the form kla, gla, pla, and so on simply have separate,
atomic characters encoded for them. With few exceptions, these letters for phonological
clusters with the medial -la are independent letterforms, not clearly related to the corre-
sponding consonants without -la.
Retroflex Consonants. The Lepcha language contains three retroflex consonants: [ʈ], [ʈh], and [ɖ]. Traditionally, these retroflex consonants have been written in the Lepcha script
with the syllables kra, hra, and gra, respectively. In other words, the retroflex t would be
represented as <U+1C00 lepcha letter ka, U+1C25 lepcha subjoined letter ra>. To
distinguish such a sequence representing a retroflex t from a sequence representing the
actual syllable [kra], it is common to use the nukta diacritic sign, U+1C37 lepcha sign
nukta. In that case, the retroflex t would be visually distinct, and would be represented by
the sequence <U+1C00 lepcha letter ka, U+1C37 lepcha sign nukta, U+1C25 lep-
cha subjoined letter ra>. Recently, three newly invented letters have been added to the
script to unambiguously represent the retroflex consonants: U+1C4D lepcha letter tta,
U+1C4E lepcha letter ttha, and U+1C4F lepcha letter dda.
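The two spellings discussed above can be written out explicitly. The Python sketch below builds both sequences from the code points named in the text; it is illustrative only and makes no claims about rendering.

    import unicodedata

    KA           = "\u1C00"  # lepcha letter ka
    NUKTA        = "\u1C37"  # lepcha sign nukta
    SUBJOINED_RA = "\u1C25"  # lepcha subjoined letter ra

    kra          = KA + SUBJOINED_RA          # traditional spelling, also read as retroflex [t]
    retroflex_ta = KA + NUKTA + SUBJOINED_RA  # disambiguated retroflex [t]

    for label, s in (("kra", kra), ("retroflex ta", retroflex_ta)):
        print(label, " ".join(f"U+{ord(c):04X} {unicodedata.name(c)}" for c in s))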
Ordering of Syllable Components. Dependent vowels and other signs are encoded after
the consonant to which they apply. The ordering of elements is shown in more detail in
Table 13-6.
Rendering. Most final consonants consist of nonspacing marks rendered above the base
consonant of a syllable.
The combining mark U+1C36 lepcha sign ran occurs after the inherent vowel -a or the
dependent vowel -i. When it occurs together with a final consonant sign, the ran sign ren-
ders above the sign for that final consonant.
The two final consonants representing the velar nasal occur in complementary contexts.
U+1C34 lepcha consonant sign nyin-do is only used when there is no dependent
vowel in the syllable. U+1C35 lepcha consonant sign kang is used instead when there is
a dependent vowel. These two consonant signs are rendered to the left of the base conso-
nant. If used with a left-side dependent vowel, the glyph for the kang is rendered to the left
of the dependent vowel. This behavior is understandable because these two marks are
derived from the Tibetan analogues of the Brahmic bindu and candrabindu, which nor-
mally stand above a Brahmic aksara.
Digits. The Lepcha script has its own, distinctive set of digits.
Punctuation. Currently the Lepchas use traditional punctuation marks only when copying
the old books. In everyday writing they use common Western punctuation marks such as
comma, full stop, and question mark.
The traditional punctuation marks include a script-specific danda mark, U+1C3B lepcha
punctuation ta-rol, and a double danda, U+1C3C lepcha punctuation nyet thyoom
ta-rol. Depending on style and hand, the Lepcha ta-rol may have a glyph appearance
more like its Tibetan analogue, U+0F0D tibetan mark shad.
13.13 Saurashtra
Saurashtra: U+A880–U+A8DF
Saurashtra is an Indo-European language, related to Gujarati and spoken by about 310,000
people in southern India. The Telugu, Tamil, Devanagari, and Saurashtra scripts have
been used to publish books in Saurashtra since the end of the 19th century. At present, Sau-
rashtra is most often written in the Tamil script, augmented with the use of superscript dig-
its and a colon to indicate sounds not available in the Tamil script.
The Saurashtra script is of the Brahmic type. Early Saurashtra text made use of conjuncts,
which can be handled with the usual Brahmic shaping rules. The modernized script, devel-
oped in the 1880s, has undergone some simplification. Modern Saurashtra does not use
complex consonant clusters, but instead marks a killed vowel with a visible virama,
U+A8C4 saurashtra sign virama. An exception to the non-occurrence of complex con-
sonant clusters is the conjunct ksa, formed by the sequence <U+A892, U+A8C4, U+200D,
U+A8B0>. This conjunct is sorted as a unique letter in older dictionaries. Apart from its
use to form ksa, the virama is always visible by default in modern Saurashtra. If necessary,
U+200D zero width joiner may be used to force conjunct behavior.
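The ksa sequence given above can be assembled directly. The following Python lines are a minimal sketch using only the code points listed in the text; the variable name is an illustrative assumption.

    import unicodedata

    # <ka, virama, zero width joiner, ssa>: ZWJ forces conjunct behavior,
    # so the virama is not displayed and the ksa ligature is formed.
    ksa = "\uA892" + "\uA8C4" + "\u200D" + "\uA8B0"
    for ch in ksa:
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")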
The Unicode encoding of the Saurashtra script supports both older and newer conventions
for writing Saurashtra text.
Glyph Placement. The vowel signs (matras) in Saurashtra follow the consonant to which
they are applied. The long and short -i vowels, however, are typographically joined to the
top right corner of their consonant. Vowel signs are also applied to U+A8B4 saurashtra
consonant sign haaru.
Digits. The Saurashtra script has its own set of digits. These are separately encoded in the
Saurashtra block.
Punctuation. Western-style punctuation, such as comma, full stop, and the question mark
are used in modern Saurashtra text. U+A8CE saurashtra danda is used as a text delim-
iter in traditional prose. U+A8CE saurashtra danda and U+A8CF saurashtra double
danda are used in poetic text.
Saurashtra Consonant Sign Haaru. The character U+A8B4 saurashtra consonant
sign haaru, transliterated as “H”, is unique to Saurashtra, and does not have an equivalent
in the Devanagari, Tamil, or Telugu scripts. It functions in some regards like the Tamil
aytam, modifying other letters to represent sounds not found in the basic Brahmic alpha-
bet. It is a dependent consonant and is thus classified as a consonant sign in the encoding.
13.14 Masaram Gondi
There are a few exceptions to the consonant cluster formation rule: the conjuncts kssa, jnya, and tra are atomically encoded, whereas consonant clusters with U+11D26 masaram
gondi letter ra have special contextual forms. When ra occurs as the first consonant in a
cluster and does not mark a morphological boundary, it is generally rendered with
U+11D46 masaram gondi repha. Repha is represented in logical order at the beginning
of a cluster, and does not interact with any combining signs. When ra appears first in a
cluster and marks a morphological distinction, the bare consonant appears. There is also a
cluster-final form of ra, a combining sign called U+11D47 ra-kara. The ra-kara appears in
logical order before any vowel sign. Neither repha nor ra-kara interact with the virama.
Details are shown in Figure 13-10.
Various Signs. Masaram Gondi uses various signs, as summarized in Table 13-7.
Digits and Punctuation. Masaram Gondi has a full set of decimal digits. There are no
script-specific marks of punctuation. For dandas, Masaram Gondi uses U+0964 devana-
gari danda and U+0965 devanagari double danda.
13.15 Gunjala Gondi
Conjuncts such as kka and dna are formed with the sequence <consonant, virama, consonant>.
Conjuncts composed of a consonant and the vowel signs -aa, -oo, and -au are usually writ-
ten as ligatures with a modified form of the consonant.
Digits and Punctuation. Gunjala Gondi has a full set of decimal digits in the range
U+11DA0..U+11DA9. Gunjala Gondi uses dandas and European punctuation, such as
middle dots, periods, and colons to mark word and sentence boundaries. Gunjala Gondi
uses U+0964 devanagari danda and U+0965 devanagari double danda.
Other Signs. U+11D95 gunjala gondi sign anusvara indicates nasalization. U+11D96
gunjala gondi sign visarga represents post-vocalic aspiration in words of Sanskrit ori-
gin. The om sign is encoded at U+11D98.
13.16 Wancho
Wancho: U+1E2C0–U+1E2FF
The Wancho script was devised between 2001 and 2012 by Banwang Losu, a teacher at a
government middle school in his home village in Arunachal Pradesh, India; it is taught
today in schools. The Wancho language is a Sino-Tibetan language that has some 50,000
speakers and is used chiefly in the southeast of Arunachal Pradesh, as well as in Assam and
Nagaland, and in the countries of Myanmar and Bhutan.
Structure. Wancho is a simple left-to-right alphabetic script composed of letters which
represent both consonants and vowels. Diacritical marks are used on vowel letters to indi-
cate tone.
Tones. There are four tone marks in Wancho:
• U+1E2EC wancho tone tup
• U+1E2ED wancho tone tupni
• U+1E2EE wancho tone koi
• U+1E2EF wancho tone koini
The four tone marks are in two pairs. One pair, wancho tone tup and wancho tone
tupni, is used with Southern Wancho, and the second pair, wancho tone koi and wan-
cho tone koini, is used with Northern Wancho.
Punctuation. Common Western punctuation marks such as comma, full stop, and ques-
tion mark are used in Wancho.
Currency Sign. The Wancho currency sign, U+1E2FF wancho ngun sign, is used to
indicate rupees.
Digits. Wancho uses decimal digits 0–9 encoded in the range U+1E2F0..U+1E2F9. Com-
mon operators are used for mathematical operations.
Chapter 14
South and Central Asia-III: Ancient Scripts
The oldest lengthy inscriptions of India, the edicts of Ashoka from the third century bce,
were written in two scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic
origin, probably deriving from Aramaic, which was an important administrative language
of the Middle East at that time. Kharoshthi, which was written from right to left, was sup-
planted by Brahmi and its derivatives.
The Bhaiksuki script is a Brahmi-derived script used around 1000 ce, primarily in the area
of the present-day states of Bihar and West Bengal in India and northern Bangladesh. Sur-
viving Bhaiksuki texts are limited to a few Buddhist manuscripts and inscriptions.
Phags-pa is an historical script related to Tibetan that was created as the national script of
the Mongol empire. Phags-pa was used mostly in Eastern and Central Asia for writing text
in the Mongolian and Chinese languages.
The Marchen script (Tibetan sMar-chen) is a Brahmi-derived script used in the Tibetan
Bön liturgical tradition. Marchen is used to write Tibetan and the historic Zhang-zhung
language. Although few historical examples of the script have been found, Marchen
appears in modern-day inscriptions and in modern Bön literature.
The Old Turkic script is known from eighth-century Siberian stone inscriptions, and is the
oldest known form of writing for a Turkic language. Also referred to as Turkic Runes due
to its superficial resemblance to Germanic Runes, it appears to have evolved from the Sog-
dian script, which is in turn derived from Aramaic.
Both the Soyombo script and the Zanabazar Square script are historic scripts used to write
Mongolian, Sanskrit, and Tibetan. These two scripts were both invented by Zanabazar
(1635–1723), one of the most important Buddhist leaders in Mongolia. Each script is an
abugida. Soyombo appears primarily in Buddhist texts in Central Asia. Zanabazar Square
has also been called “Horizontal Square” script, “Mongolian Horizontal Square” script and
“Xewtee Dörböljin Bicig.”
Old Sogdian and Sogdian are related scripts used in Central Asia. The Old Sogdian script
was used for a group of related writing systems dating from the third to the sixth century
ce. These writing systems were all used to write Sogdian, an eastern Iranian language. Old
Sogdian is a non-joining abjad. Its basic repertoire consists of 20 of the 22 letters of the Ara-
maic alphabet.
The Sogdian script, which derives from Old Sogdian, is also an abjad, and was used from
the seventh to the fourteenth century ce, also to write Sogdian. Its repertoire corresponds
to that of Old Sogdian, but has a number of differences in the glyphs and also has addi-
tional characters. The script was also used to write Chinese, Sanskrit, and Uyghur. Sogdian
is the ancestor of the Old Uyghur and Mongolian scripts.
14.1 Brahmi
Brahmi: U+11000–U+1106F
The Brahmi script is an historical script of India attested from the third century bce until
the late first millennium ce. Over the centuries Brahmi developed many regional varieties,
which ultimately became the modern Indian writing systems, including Devanagari, Tamil
and so on. The encoding of the Brahmi script in the Unicode Standard supports the repre-
sentation of texts in Indian languages from this historical period. For texts written in his-
torically transitional scripts—that is, between Brahmi and its modern derivatives—there
may be alternative choices to represent the text. In some cases, there may be a separate
encoding for a regional medieval script, whose use would be appropriate. In other cases,
users should consider whether the use of Brahmi or a particular modern script best suits
their needs.
Encoding Model. The Brahmi script is an abugida and is encoded using the Unicode
virama model. Consonants have an inherent vowel /a/. A separate character is encoded for
the virama: U+11046 brahmi virama. The virama is used between consonants to form
conjunct consonants. It is also used as an explicit killer to indicate a vowelless consonant.
Vowel Letters. Vowel letters are encoded atomically in Brahmi, even if they can be analyzed
visually as consisting of multiple parts. Table 14-1 shows the letters that can be analyzed,
the single code point that should be used to represent them in text, and the sequence of
code points resulting from analysis that should not be used.
Examples of conjuncts formed with the virama include <U+11032 sa, U+11046 virama, U+1102F va> for sva, <U+11013 ka, U+11046 virama, U+11031 ssa> for kṣa, and <U+1101A ja, U+11046 virama, U+1101C nya> for jña.
[o:] in Prakrit and Sanskrit. As a consequence, in Tamil Brahmi text, the virama is used not only after consonants, but also after the vowels e (U+1100F, U+11042) and o (U+11011, U+11044). This puḷḷi is represented using U+11046 brahmi virama.
Bhattiprolu Brahmi. Ten short Middle Indo-Aryan inscriptions from the second century
bce found at Bhattiprolu in Andhra Pradesh show an orthography that seems to be derived
from the Tamil Brahmi system. To avoid the phonetic ambiguity of the Tamil Brahmi
U+11038 brahmi vowel sign aa (standing for either [a] or [a:]), the Bhattiprolu inscrip-
tions introduced a separate vowel sign for long [a:] by adding a vertical stroke to the end of
the earlier sign. This is encoded as U+11039 brahmi vowel sign bhattiprolu aa.
Punctuation. There are seven punctuation marks in the encoded repertoire for Brahmi.
The single and double dandas, U+11047 brahmi danda and U+11048 brahmi double
danda, delimit clauses and verses. U+11049 brahmi punctuation dot, U+1104A
brahmi punctuation double dot, and U+1104B brahmi punctuation line delimit
smaller textual units, while U+1104C brahmi punctuation crescent bar and U+1104D
brahmi punctuation lotus separate larger textual units.
Numerals. Two sets of numbers, used for different numbering systems, are attested in
Brahmi documents. The first set is the old additive-multiplicative system that goes back to
the beginning of the Brahmi script. The second is a set of ten decimal digits that occurs side
by side with the earlier numbering system in manuscripts and inscriptions during the late
Brahmi period.
The set of additive-multiplicative numerals of the Brahmi script contains separate signs for
the digits from 1 to 9, the tens from 10 to 90, as well as signs for 100 and 1000. Numbers are
written additively, with the higher-valued signs preceding the lower-valued ones. Multiples
of 100 and of 1000 are expressed multiplicatively with character sequences consisting of the
sign for 100 or 1000, followed by U+1107F brahmi number joiner and then the multi-
plier. The component parts of additive numbers are rendered unligated, whereas multiples
are rendered in ligated form.
For example, the sequence <U+11064 brahmi number one hundred, U+11055 brahmi
number four> represents the number 100 + 4 = 104 and is rendered unligated, whereas
the sequence <U+11064 brahmi number one hundred, U+1107F brahmi number
joiner, U+11055 brahmi number four> represents the number 100 × 4 = 400 and is ren-
dered as a ligature.
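The 104 versus 400 example can be restated as code point sequences. The short Python sketch below uses only the characters named above; it simply prints the two sequences and does not attempt to render the ligature.

    ONE_HUNDRED = "\U00011064"  # brahmi number one hundred
    FOUR        = "\U00011055"  # brahmi number four
    JOINER      = "\U0001107F"  # brahmi number joiner

    additive       = ONE_HUNDRED + FOUR           # 100 + 4 = 104, rendered unligated
    multiplicative = ONE_HUNDRED + JOINER + FOUR  # 100 x 4 = 400, rendered as a ligature

    for label, s in (("104", additive), ("400", multiplicative)):
        print(label, " ".join(f"U+{ord(c):04X}" for c in s))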
U+1107F brahmi number joiner forms a ligature between the two numeral characters
surrounding it. It functions similarly to U+2D7F tifinagh consonant joiner, but is
intended to be used only with Brahmi numerals in the range U+11052 brahmi number
one through U+11065 brahmi number one thousand, and not with consonants or other
characters. Because U+1107F brahmi number joiner marks a semantic distinction
between additive numbers and multiples, it should be rendered with a visible fallback
glyph to indicate its presence in the text when it cannot be displayed by normal rendering.
In addition to the ligated forms of the multiples of 100 and 1000, other examples from the
middle and late Brahmi periods show the signs for 200, 300, and 2000 in special forms not
obviously connected with a ligature of the component parts. Such forms may be enabled in
fonts using a ligature substitution.
A special sign for zero was invented later, and the positional system came into use. This sys-
tem is the ancestor of modern decimal number systems. Due to the different systemic fea-
tures and shapes, the signs in this set are separately encoded in the range from U+11066
brahmi digit zero through U+1106F brahmi digit nine. These signs have the same
properties as the modern Indic digits. Examples are shown in Table 14-2. Brahmi decimal
digits are categorized as regular bases and can act as vowel carriers, whereas the numerals
U+11052 brahmi number one through U+11065 brahmi number one thousand and
their ligatures formed with U+1107F brahmi number joiner are not used as vowel carri-
ers.
Table 14-2. Examples of Brahmi Decimal Digits
1	U+11067
2	U+11068
3	U+11069
4	U+1106A
10	<U+11067, U+11066>
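Because the positional digits are encoded contiguously from U+11066, a decimal number can be transcribed by a simple offset, as in the Python sketch below; the function name is an illustrative assumption.

    BRAHMI_ZERO = 0x11066  # brahmi digit zero; one through nine follow contiguously

    def to_brahmi_digits(n: int) -> str:
        """Transcribe a non-negative integer with the positional Brahmi digits."""
        return "".join(chr(BRAHMI_ZERO + int(d)) for d in str(n))

    for n in (1, 2, 3, 4, 10):
        print(n, " ".join(f"U+{ord(c):04X}" for c in to_brahmi_digits(n)))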
14.2 Kharoshthi
Kharoshthi: U+10A00–U+10A5F
The Kharoshthi script, properly spelled as Kharoṣṭhī, was used historically to write Gāndhārī and Sanskrit as well as various mixed dialects. Kharoshthi is an Indic script of the
abugida type. However, unlike other Indic scripts, it is written from right to left. The Khar-
oshthi script was initially deciphered around the middle of the 19th century by James Prin-
sep and others who worked from short Greek and Kharoshthi inscriptions on the coins of
the Indo-Greek and Indo-Scythian kings. The decipherment has been refined over the last
150 years as more material has come to light.
The Kharoshthi script is one of the two ancient writing systems of India. Unlike the pan-Indian Brāhmī script, Kharoshthi was confined to the northwest of India centered on the region of Gandhāra (modern northern Pakistan and eastern Afghanistan, as shown in
Figure 14-2). Gandhara proper is shown on the map as the dark gray area near Peshawar.
The lighter gray areas represent places where the Kharoshthi script was used and where
manuscripts and inscriptions have been found.
The exact details of the origin of the Kharoshthi script remain obscure, but it is almost cer-
tainly related to Aramaic. The Kharoshthi script first appears in a fully developed form in
the Aśokan inscriptions at Shahbazgarhi and Mansehra which have been dated to around
250 bce. The script continued to be used in Gandhara and neighboring regions, sometimes
alongside Brahmi, until around the third century ce, when it disappeared from its home-
land. Kharoshthi was also used for official documents and epigraphs in the Central Asian cit-
ies of Khotan and Niya in the third and fourth centuries ce, and it appears to have survived in
Kucha and neighboring areas along the Northern Silk Road until the seventh century. The
Central Asian form of the script used during these later centuries is termed Formal Kharo-
shthi and was used to write both Gandhari and Tocharian B. Representation of Kharoshthi in
the Unicode code charts uses forms based on manuscripts of the first century ce.
Directionality. Kharoshthi can be implemented using the rules of the Unicode Bidirec-
tional Algorithm. Both letters and digits are written from right to left. Kharoshthi letters do
not have positional variants.
Diacritical Marks and Vowels. All vowels other than a are written with diacritical marks in
Kharoshthi. In addition, there are six vowel modifiers and three consonant modifiers that
are written with combining diacritics. In general, only one combining vowel sign is applied
to each syllable (aksara). However, there are some examples of two vowel signs on aksaras
in the Kharoshthi of Central Asia.
Numerals. Kharoshthi employs a set of eight numeral signs unique to the script. Like the
letters, the numerals are written from right to left. Numbers in Kharoshthi are based on an
additive system. There is no zero, nor separate signs for the numbers five through nine.
The number 1996, for example, would logically be represented as 1000 4 4 1 100 20 20 20
20 10 4 2 and would appear as shown in Figure 14-3. The numerals are encoded in the
range U+10A40..U+10A47.
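The arithmetic behind the 1996 example can be made explicit. The Python sketch below evaluates a logical-order Kharoshthi numeral string; the code points come from the range cited above, but the value assigned to each sign (1, 2, 3, 4, 10, 20, 100, 1000) and the rule that signs preceding 100 or 1000 act as multipliers are assumptions inferred from the worked example, not normative statements.

    # Assumed values for the eight numeral signs U+10A40..U+10A47.
    VALUES = {0x10A40: 1, 0x10A41: 2, 0x10A42: 3, 0x10A43: 4,
              0x10A44: 10, 0x10A45: 20, 0x10A46: 100, 0x10A47: 1000}

    def kharoshthi_value(signs: str) -> int:
        """Evaluate a logical-order sequence of Kharoshthi numeral signs."""
        total = pending = 0
        for ch in signs:
            v = VALUES[ord(ch)]
            if v >= 100:                      # 100 and 1000 are multiplied by the
                total += (pending or 1) * v   # group of signs written before them
                pending = 0
            else:
                pending += v                  # units, tens, and twenties add up
        return total + pending

    # 1000 4 4 1 100 20 20 20 20 10 4 2, as in the example above.
    signs_1996 = "".join(map(chr, [0x10A47, 0x10A43, 0x10A43, 0x10A40, 0x10A46,
                                   0x10A45, 0x10A45, 0x10A45, 0x10A45, 0x10A44,
                                   0x10A43, 0x10A41]))
    print(kharoshthi_value(signs_1996))  # 1996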
Punctuation. Nine different punctuation marks are used in manuscripts and inscriptions.
The punctuation marks are encoded in the range U+10A50..U+10A58.
Word Breaks, Line Breaks, and Hyphenation. Most Kharoshthi manuscripts are written
as continuous text with no indication of word boundaries. Only a few examples are known
where spaces have been used to separate words or verse quarters. Most scribes tried to fin-
ish a word before starting a new line. There are no examples of anything akin to hyphen-
ation in Kharoshthi manuscripts. In cases where a word would not completely fit into a
line, its continuation appears at the start of the next line. Modern scholarly practice uses
spaces and hyphenation. When necessary, hyphenation should follow Sanskrit practice.
Sorting. There is an ancient ordering connected with Kharoshthi called Arapacana, named
after the first five aksaras. However, there is no evidence that words were sorted in this
order, and there is no record of the complete Arapacana sequence. In modern scholarly
practice, Gandhari is sorted in much the same order as Sanskrit. Vowel length, even when
marked, is ignored when sorting Kharoshthi.
Rendering Kharoshthi
Rendering requirements for Kharoshthi are similar to those for Devanagari. This section
specifies a minimum set of combining rules that provide legible Kharoshthi diacritic and
ligature substitution behavior.
All unmarked consonants include the inherent vowel a. Other vowels are indicated by one
of the combining vowel diacritics. Some letters may take more than one diacritical mark. In
these cases the preferred sequence is Letter + {Consonant Modifier} + {Vowel Sign} +
{Vowel Modifier}. For example, the Sanskrit word parārdhyaiḥ might be rendered in Kharoshthi script as *parārjaiḥ, written from right to left, as shown in Figure 14-4.
Figure 14-4. Kharoshthi rendering of *parārjaiḥ as the syllables pa, rā, and rjaiḥ, written from right to left.
Combining Vowels. The various combining vowels attach to characters in different ways. A
number of groupings have been determined on the basis of their visual types, such as hori-
zontal or vertical, as shown in Table 14-3.
The virama can follow only a consonant or a consonant modifier. It cannot follow a space,
a vowel, a vowel modifier, a number, a punctuation sign, or another virama. Examples of
the use of the Kharoshthi virama are given in Table 14-6.
14.3 Bhaiksuki
Bhaiksuki: U+11C00–U+11C6F
The Bhaiksuki script is a Brahmi-derived script used around 1000 ce, primarily in the area
of the present-day states of Bihar and West Bengal in India and northern Bangladesh. The
original name of the script was Saindhavī (the Sindhu script), but it is also known as the
Arrow-Headed script. Surviving Bhaiksuki texts are limited to a few Buddhist manuscripts
and inscriptions.
Structure. The structure of Bhaiksuki script is similar to that of other Brahmi-based Indic
scripts. It is an abugida that makes use of a virama. The script is written from left to right.
Rendering. Many of the vowel signs have contextual variants when they occur with certain
consonants. The consonants U+11C22 bhaiksuki letter pa, U+11C27 bhaiksuki let-
ter ya, and U+11C28 bhaiksuki letter ra have special combining forms when they
occur with certain vowel signs.
Virama and Conjuncts. The script includes a virama, U+11C3F bhaiksuki sign virama,
which functions to suppress the inherent vowel and to form conjuncts. Consonant clusters
are generally rendered as vertically stacked ligatures, with non-initial consonants attached
below the initial letter. Above-base vowel signs and consonant letters attach to the glyph of
the initial consonant, while below-base vowel signs attach to the glyph of the final conso-
nant. The letters ka, pa, ra, and ya take special forms when they occur in conjuncts.
The Bhaiksuki dependent vowel signs in the range U+11C38..U+11C3B, e, ai, o, and au,
are simply treated as above-base vowel signs. Although the historically cognate vowel signs
may be treated as having left-side parts, or as two- or three-part vowels in many other
scripts of India, the peculiarities of rendering for these vowel signs in the Bhaiksuki script
can be handled more easily with the above-base designations. The dependent vowel signs
ai, o, and au are not given formal canonical decompositions, but are encoded instead as
atomic characters.
The sequence <C, virama> is rendered using a visible virama by default. The combinations
<ta, virama>, <na, virama>, and <ma, virama> may also be displayed with special liga-
tures; there is no apparent semantic distinction between sequences containing the visible
virama and sequences displayed as ligatures.
Various Signs. Nasalization is represented by U+11C3C bhaiksuki sign candrabindu
and U+11C3D bhaiksuki sign anusvara. Post-vocalic aspiration in Sanskrit is indicated
by U+11C3E bhaiksuki sign visarga. Use of U+11C40 bhaiksuki sign avagraha indi-
cates elision of a word-initial a in Sanskrit as a result of sandhi.
Digits and Numbers. Bhaiksuki has a script-specific set of decimal digits. Because the
glyphs for zero and three have not been yet identified in the Bhaiksuki corpus, representa-
tive glyphs for U+11C50 bhaiksuki digit zero and U+11C53 bhaiksuki digit three
are based upon corresponding digits in other scripts that are contemporaneous with
Bhaiksuki.
In addition to the decimal digits, the script has a distinct numerical notation system.
Bhaiksuki contains numbers for primary and tens units, and U+11C6C bhaiksuki hun-
dreds unit mark. The numbers are written vertically, with the largest number written
above smaller units. Control of vertical orientation is managed at the font level, but the
default rendering is horizontal left to right.
Punctuation. The script employs script-specific dandas, U+11C41 bhaiksuki danda and
U+11C42 bhaiksuki double danda. Words are separated by U+11C43 bhaiksuki word
separator. Two characters, U+11C44 bhaiksuki gap filler-1 and U+11C45 bhaiksuki
gap filler-2, are used as spacing or completion marks, especially to indicate the end of a
line. They also can indicate a deliberate elision or an otherwise missing portion of text.
14.4 Phags-pa
Phags-pa: U+A840–U+A87F
The Phags-pa script is an historic script with some limited modern use. It bears some sim-
ilarity to Tibetan and has no case distinctions. It is written vertically in columns running
from left to right, like Mongolian. Units are often composed of several syllables and may be
separated by whitespace.
The term Phags-pa is often written with an initial apostrophe: ’Phags-pa. The Unicode
Standard makes use of the alternative spelling without an initial apostrophe because apos-
trophes are not allowed in the normative character and block names.
History. The Phags-pa script was devised by the Tibetan lama Blo-gros rGyal-mtshan
[lodoi jaltsan] (1235–1280 ce), commonly known by the title Phags-pa Lama (“exalted
monk”), at the behest of Khubilai Khan (reigned 1260–1294) when he assumed leadership
of the Mongol tribes in 1260. In 1269, the “new Mongolian script,” as it was called, was pro-
mulgated by imperial edict for use as the national script of the Mongol empire, which from
1279 to 1368, as the Yuan dynasty, encompassed all of China.
The new script was not only intended to replace the Uyghur-derived script that had been
used to write Mongolian since the time of Genghis Khan (reigned 1206–1227), but was
also intended to be used to write all the diverse languages spoken throughout the empire.
Although the Phags-pa script never succeeded in replacing the earlier Mongolian script
and had only very limited usage in writing languages other than Mongolian and Chinese, it
was used quite extensively during the Yuan dynasty for a variety of purposes. There are
many monumental inscriptions and manuscript copies of imperial edicts written in Mon-
golian or Chinese using the Phags-pa script. The script can also be found on a wide range
of artifacts, including seals, official passes, coins, and banknotes. It was even used for
engraving the inscriptions on Christian tombstones. A number of books are known to have
been printed in the Phags-pa script, but all that has survived are some fragments from a
printed edition of the Mongolian translation of a religious treatise by the Phags-pa Lama’s
uncle, Sakya Pandita. Of particular interest to scholars of Chinese historical linguistics is a
rhyming dictionary of Chinese with phonetic readings for Chinese ideographs given in the
Phags-pa script.
An ornate, pseudo-archaic “seal script” version of the Phags-pa script was developed spe-
cifically for engraving inscriptions on seals. The letters of the seal script form of Phags-pa
mimic the labyrinthine strokes of Chinese seal script characters. A great many official seals
and seal impressions from the Yuan dynasty are known. The seal script was also sometimes
used for carving the title inscription on stone stelae, but never for writing ordinary running
text.
Although the vast majority of extant Phags-pa texts and inscriptions from the thirteenth
and fourteenth centuries are written in the Mongolian or Chinese languages, there are also
examples of the script being used for writing Uyghur, Tibetan, and Sanskrit, including two
long Buddhist inscriptions in Sanskrit carved in 1345.
After the fall of the Yuan dynasty in 1368, the Phags-pa script was no longer used for writ-
ing Chinese or Mongolian. However, the script continued to be used on a limited scale in
Tibet for special purposes such as engraving seals. By the late sixteenth century, a distinc-
tive, stylized variety of Phags-pa script had developed in Tibet, and this Tibetan-style
Phags-pa script, known as hor-yig, “Mongolian writing” in Tibetan, is still used today as a
decorative script. In addition to being used for engraving seals, the Tibetan-style Phags-pa
script is used for writing book titles on the covers of traditional style books, for architec-
tural inscriptions such as those found on temple columns and doorways, and for cal-
ligraphic samplers.
Basic Structure. The Phags-pa script is based on Tibetan, but unlike any other Brahmic
script Phags-pa is written vertically from top to bottom in columns advancing from left to
right across the writing surface. This unusual directionality is borrowed from Mongolian,
as is the way in which Phags-pa letters are ligated together along a vertical stem axis. In
modern contexts, when embedded in horizontally oriented scripts, short sections of Phags-
pa text may be laid out horizontally from left to right.
Despite the difference in directionality, the Phags-pa script fundamentally follows the
Tibetan model of writing, and consonant letters have an inherent /a/ vowel sound. How-
ever, Phags-pa vowels are independent letters, not vowel signs as is the case with Tibetan,
so they may start a syllable without being attached to a null consonant. Nevertheless, a null
consonant (U+A85D phags-pa letter a) is still needed to write an initial /a/ and is
orthographically required before a diphthong or the semivowel U+A867 phags-pa sub-
joined letter wa. Only when writing Tibetan in the Phags-pa script is the null consonant
required before an initial pure vowel sound.
Except for the candrabindu (which is discussed later in this section), Phags-pa letters read
from top to bottom in logical order, so the vowel letters i, e, and o are placed below the pre-
ceding consonant—unlike in Tibetan, where they are placed above the consonant they
modify.
Syllable Division. Text written in the Phags-pa script is broken into discrete syllabic units
separated by whitespace. When used for writing Chinese, each Phags-pa syllabic unit cor-
responds to a single Han ideograph. For Mongolian and other polysyllabic languages, a
single word is typically written as several syllabic units, each separated from each other by
whitespace.
For example, the Mongolian word tengri, “heaven,” which is written as a single ligated unit
in the Mongolian script, is written as two separate syllabic units, deng ri, in the Phags-pa
script. Syllable division does not necessarily correspond directly to grammatical structure.
For instance, the Mongolian word usun, “water,” is written u sun in the Phags-pa script, but
its genitive form usunu is written u su nu.
Within a single syllabic unit, the Phags-pa letters are normally ligated together. Most letters
ligate along a righthand stem axis, although reversed-form letters may instead ligate along
a lefthand stem axis. The letter U+A861 phags-pa letter o ligates along a central stem
axis.
In traditional Phags-pa texts, normally no distinction is made between the whitespace used
in between syllables belonging to the same word and the whitespace used in between sylla-
bles belonging to different words. Line breaks may occur between any syllable, regardless
of word status. In contrast, in modern contexts, influenced by practices used in the pro-
cessing of Mongolian text, U+202F narrow no-break space (NNBSP) may be used to
separate syllables within a word, whereas U+0020 space is used between words—and line
breaking would be affected accordingly.
Candrabindu. U+A873 phags-pa letter candrabindu is used in writing Sanskrit man-
tras, where it represents a final nasal sound. However, although it represents the final
sound in a syllable unit, it is always written as the first glyph in the sequence of letters,
above the initial consonant or vowel of the syllable, but not ligated to the following letter.
For example, om is written as a candrabindu followed by the letter o. To simplify cursor
placement, text selection, and so on, the candrabindu is encoded in visual order rather than
logical order. Thus om would be represented by the sequence <U+A873, U+A861>, ren-
dered as shown in Figure 14-6.
As the candrabindu is separated from the following letter, it does not take part in the shap-
ing behavior of the syllable unit. Thus, in the syllable om, the letter o (U+A861) takes the
isolate positional form.
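The visual-order convention for the candrabindu can be shown as a stored sequence. The Python lines below are a minimal sketch built from the two code points given in the text.

    import unicodedata

    CANDRABINDU = "\uA873"  # phags-pa letter candrabindu
    LETTER_O    = "\uA861"  # phags-pa letter o

    om = CANDRABINDU + LETTER_O  # stored in visual order: candrabindu first
    print(" ".join(f"U+{ord(c):04X} {unicodedata.name(c)}" for c in om))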
Alternate Letters. Four alternate forms of the letters ya, sha, ha, and fa are encoded for use
in writing Chinese under certain circumstances:
U+A86D phags-pa letter alternate ya
U+A86E phags-pa letter voiceless sha
U+A86F phags-pa letter voiced ha
U+A870 phags-pa letter aspirated fa
These letters are used in the early-fourteenth-century Phags-pa rhyming dictionary of Chi-
nese, Menggu ziyun, to represent historical phonetic differences between Chinese syllables
that were no longer reflected in the contemporary Chinese language. This dictionary fol-
lows the standard phonetic classification of Chinese syllables into 36 initials, but as these
had been defined many centuries previously, by the fourteenth century some of the initials
had merged together or diverged into separate sounds. To distinguish historical phonetic
characteristics, the dictionary uses two slightly different forms of the letters ya, sha, ha, and
fa.
The historical phonetic values that U+A86E, U+A86F, and U+A870 represent are indicated
by their character names, but this is not the case for U+A86D, so there may be some confu-
sion as to when to use U+A857 phags-pa letter ya and when to use U+A86D phags-pa
letter alternate ya. U+A857 is used to represent historic null initials, whereas U+A86D
is used to represent historic palatal initials.
Numbers. There are no special characters for numbers in the Phags-pa script, so numbers
are spelled out in full in the appropriate language.
Punctuation. The vast majority of traditional Phags-pa texts do not make use of any punc-
tuation marks. However, some Mongolian inscriptions borrow the Mongolian punctua-
tion marks U+1802 mongolian comma, U+1803 mongolian full stop, and U+1805
mongolian four dots.
Additionally, a small circle punctuation mark is used in some printed Phags-pa texts. This
mark can be represented by U+3002 ideographic full stop, but for Phags-pa the ideo-
graphic full stop should be centered, not positioned to one side of the column. This follows
traditional, historic practice for rendering the ideographic full stop in Chinese text, rather
than more modern typography.
Tibetan Phags-pa texts also use head marks, U+A874 phags-pa single head mark and U+A875 phags-pa double head mark, to mark the start of an inscription, and shad
marks, U+A876 phags-pa mark shad and U+A877 phags-pa mark double shad, to
mark the end of a section of text.
Positional Variants. The four vowel letters U+A85E phags-pa letter i, U+A85F phags-
pa letter u, U+A860 phags-pa letter e, and U+A861 phags-pa letter o have different
isolate, initial, medial, and final glyph forms depending on whether they are immediately
preceded or followed by another Phags-pa letter (other than U+A873 phags-pa letter
candrabindu, which does not affect the shaping of adjacent letters). The code charts show
these four characters in their isolate form. The various positional forms of these letters are
shown in Table 14-7.
Consonant letters and the vowel letter U+A866 phags-pa letter ee do not have distinct
positional forms, although initial, medial, final, and isolate forms of these letters may be
distinguished by the presence or absence of a stem extender that is used to ligate to the fol-
lowing letter.
The invisible format characters U+200D zero width joiner (ZWJ) and U+200C zero
width non-joiner (ZWNJ) may be used to override the expected shaping behavior, in the
same way that they do for Mongolian and other scripts (see Chapter 23, Special Areas and
Format Characters). For example, ZWJ may be used to select the initial, medial, or final
form of a letter in isolation:
<U+200D, U+A861, U+200D> selects the medial form of the letter o
<U+200D, U+A861> selects the final form of the letter o
<U+A861, U+200D> selects the initial form of the letter o
Conversely, ZWNJ may be used to inhibit expected shaping. For example, the sequence
<U+A85E, U+200C, U+A85F, U+200C, U+A860, U+200C, U+A861> selects the isolate
forms of the letters i, u, e, and o.
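These joiner conventions translate directly into backing-store sequences, as in the Python sketch below; the variable names are illustrative assumptions, and no rendering is attempted.

    ZWJ, ZWNJ = "\u200D", "\u200C"
    I, U, E, O = "\uA85E", "\uA85F", "\uA860", "\uA861"  # phags-pa letters i, u, e, o

    medial_o  = ZWJ + O + ZWJ    # joined to preceding and following letters: medial form
    final_o   = ZWJ + O          # joined to a preceding letter only: final form
    initial_o = O + ZWJ          # joined to a following letter only: initial form
    isolates  = I + ZWNJ + U + ZWNJ + E + ZWNJ + O  # ZWNJ inhibits joining: isolate forms

    for label, s in (("medial o", medial_o), ("final o", final_o),
                     ("initial o", initial_o), ("isolates", isolates)):
        print(label, " ".join(f"U+{ord(c):04X}" for c in s))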
Mirrored Variants. The four characters U+A869 phags-pa letter tta, U+A86A phags-
pa letter ttha, U+A86B phags-pa letter dda, and U+A86C phags-pa letter nna are
mirrored forms of the letters U+A848 phags-pa letter ta, U+A849 phags-pa letter
tha, U+A84A phags-pa letter da, and U+A84B phags-pa letter na, respectively, and
are used to represent the Sanskrit retroflex dental series of letters. Because these letters are
mirrored, their stem axis is on the lefthand side rather than the righthand side, as is the
case for all other consonant letters. This means that when the letters tta, ttha, dda, and nna
occur at the start of a syllable unit, to correctly ligate with them any following letters nor-
mally take a mirrored glyph form. Because only a limited number of words use these letters,
only the letters U+A856 phags-pa letter small a, U+A85C phags-pa letter ha,
U+A85E phags-pa letter i, U+A85F phags-pa letter u, U+A860 phags-pa letter e,
and U+A868 phags-pa subjoined letter ya are affected by this glyph mirroring behav-
ior. The Sanskrit syllables that exhibit glyph mirroring after tta, ttha, dda, and nna are
shown in Table 14-8.
Glyph mirroring is not consistently applied to the letters U+A856 phags-pa letter small
a and U+A85E phags-pa letter i in the extant Sanskrit Phags-pa inscriptions. The letter i
may occur both mirrored and unmirrored after the letter ttha, although it always occurs
mirrored after the letter nna. Small a is not normally mirrored after the letters tta and ttha
as its mirrored glyph is identical in shape to U+A85A phags-pa letter sha. Nevertheless,
small a does sometimes occur in a mirrored form after the letter ttha, in which case context
indicates that this is a mirrored letter small a and not the letter sha.
When any of the letters small a, i, u, e, ha, or subjoined ya immediately follow either tta,
ttha, dda, or nna directly or another mirrored letter, then a mirrored glyph form of the let-
ter should be selected automatically by the rendering system. Although small a is not nor-
mally mirrored in extant inscriptions, for consistency it is mirrored by default after tta,
ttha, dda, and nna in the rendering model for Phags-pa.
To override the default mirroring behavior of the letters small a, ha, i, u, e, and subjoined
ya, U+FE00 variation selector-1 (VS1) may be applied to the appropriate character, as
shown in Table 14-9. Note that only the variation sequences shown in Table 14-9 are valid;
any other sequence of a Phags-pa letter and VS1 is unspecified.
In Table 14-9, “reversed shaping” means that the appearance of the character is reversed
with respect to its expected appearance. Thus, if no mirroring would be expected for the
character in the given context, applying VS1 would cause the rendering engine to select a
mirrored glyph form. Similarly, if context would dictate glyph mirroring, application of
VS1 would inhibit the expected glyph mirroring. This mechanism will typically be used to
select a mirrored glyph for the letters small a, ha, i, u, e, or subjoined ya in isolation (for
example, in discussion of the Phags-pa script) or to inhibit mirroring of the letters small a
and i when they are not mirrored after the letters tta and ttha, as shown in Figure 14-7.
The first example illustrates the normal shaping for the syllable thi. The second example
shows the reversed shaping for i in that syllable and would be represented by a standard-
ized variation sequence: <U+A849, U+A85E, U+FE00>. Example 3 illustrates the normal
shaping for the Sanskrit syllable tthi, where the reversal of the glyph for the letter i is auto-
matically conditioned by the lefthand stem placement of the Sanskrit letter ttha. Example 4
shows reversed shaping for i in the syllable tthi and would be represented by a standardized
variation sequence: <U+A86A, U+A85E, U+FE00>.
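The two standardized variation sequences cited in the figure discussion can be written out as strings, as in the minimal Python sketch below; only the sequences given in the text are used.

    VS1      = "\uFE00"  # variation selector-1: request reversed shaping
    LETTER_I = "\uA85E"  # phags-pa letter i

    thi_reversed  = "\uA849" + LETTER_I + VS1  # <tha, i, VS1>, as in the second example
    tthi_reversed = "\uA86A" + LETTER_I + VS1  # <ttha, i, VS1>, as in the fourth example

    for label, s in (("thi", thi_reversed), ("tthi", tthi_reversed)):
        print(label, " ".join(f"U+{ord(c):04X}" for c in s))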
Cursive Joining. Joining types are defined for Phags-pa characters in the file ArabicShap-
ing.txt. Joining types identify the joining behavior of characters in cursive joining scripts
and were originally introduced for the Arabic script. Because the Phags-pa script is typi-
cally rendered from top to bottom, Joining_Type = L (Left_Joining) conventionally refers to bottom joining, that is, joining to a character which follows (is below) it. Joining_Type = R (Right_Joining) is not used for the Phags-pa script, but would refer to top joining, that is, joining to a character which precedes (is above) it. Most Phags-pa characters are Dual_Joining, as they may join on both top and bottom.
The L and R designations of the Joining_Type property should not be confused with the
left-hand and right-hand placement of stem axes in the Phags-pa script in vertical layout.
Whether a Phags-pa character joins on the left-hand or right-hand side in its stem axis is
not defined in ArabicShaping.txt.
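The joining types for the block can be read from a local copy of ArabicShaping.txt. The Python sketch below assumes the file has been downloaded from the Unicode Character Database and relies on its documented semicolon-separated layout (code point; schematic name; joining type; joining group); the path and function name are illustrative assumptions.

    PHAGS_PA = range(0xA840, 0xA880)  # the Phags-pa block

    def phags_pa_joining_types(path="ArabicShaping.txt"):
        """Return {code point: joining type} for Phags-pa entries in the UCD file."""
        types = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.split("#", 1)[0].strip()  # drop comments and blank lines
                if not line:
                    continue
                fields = [field.strip() for field in line.split(";")]
                cp = int(fields[0], 16)
                if cp in PHAGS_PA:
                    types[cp] = fields[2]  # e.g. "D" (Dual_Joining) or "L" (Left_Joining)
        return types

    for cp, jt in sorted(phags_pa_joining_types().items()):
        print(f"U+{cp:04X}  Joining_Type={jt}")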
14.5 Marchen
Marchen: U+11C70–U+11CBF
The Marchen script (Tibetan sMar-chen) is a Brahmi-derived script used in the Tibetan
Bön liturgical tradition. Marchen is used to write Tibetan and also the historic Zhang-
zhung language. The script is said to originate in the ancient kingdom of Zhang-zhung,
which flourished in western and northern Tibet before Buddhism was introduced in the
area in the seventh century. Although few historical examples of the script have been
found, Marchen appears in modern-day inscriptions and is widely used in modern Bön lit-
erature.
Encoding Model. The encoding model for Marchen follows that of Tibetan. Marchen con-
tains thirty base consonants and thirty subjoined consonants, which can be used to form
vertical stacks of two or more consonants. Although not all subjoined consonants have
been identified in extant texts, the full set of subjoined forms is encoded, so that all possible
stack combinations can be represented.
Vowels and Consonants. As in Tibetan, two or more Marchen consonants can stack verti-
cally. Vowel signs are placed above, below, or alongside a stack of one or more consonants.
Other Signs. Marchen includes a vowel lengthener, U+11CB0 marchen vowel sign aa,
known as a-chung. Nasalization is represented by U+11CB6 marchen sign candrabindu
and U+11CB5 marchen sign anusvara.
Punctuation. There are two script-specific punctuation marks encoded. U+11C70
marchen head mark corresponds to U+0F04 tibetan mark initial yig mgo mdun ma.
The sentence-final shad mark, U+11C71 marchen mark shad, corresponds to U+0F0D
tibetan mark shad. Marchen does not use an explicit mark to separate syllables; this dif-
fers from the use of the Tibetan tsek (tsheg) mark.
14.6 Zanabazar Square
Zanabazar Square: U+11A00–U+11A4F
The consonants ya, ra, la, va have different representations when they occur in Sanskrit
and Tibetan conjuncts. Therefore, contextual forms of these letters are encoded as separate
characters.
Virama and Subjoiner. U+11A34 zanabazar square sign virama is used to silence the
inherent vowel of a consonant for writing Sanskrit and Tibetan. The virama is used only
with a consonant and behaves as other combining marks in the script, always with a visible
display.
Vowel-silencing characters in Brahmi-based scripts often have a secondary function of
controlling conjunct formation; however, the Zanabazar Square script does not follow this
pattern. A separate character, U+11A47 zanabazar square subjoiner, is used to control
conjunct formation.
The representation of a vertical conjunct stack uses the subjoiner character between each
consonant of the cluster. For example, the syllable mstu is represented with the sequence
<ma, subjoiner, sa, subjoiner, ta, vowel sign ue>, as shown in the second line of Figure 14-8.
To suppress the visual stacking of a cluster, the virama character is used instead, which
kills the vowel and results in a visual marking of the dead consonant which does not stack.
For example, if the syllable mstu is represented with the sequence <ma, virama, sa, virama,
ta, vowel sign ue>, the rendering is as shown in the first row of Figure 14-8.
[Figure 14-8. The syllable mstu: the first row shows the unstacked rendering with visible viramas; the second row shows the vertical conjunct stack produced with the subjoiner.]
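As an illustration only, the following Python sketch writes out the two backing-store representations; the code points are those visible in Figure 14-8, and the identifier names are supplied here for readability.

    # Illustrative sketch: the two encodings of the syllable mstu discussed above.
    MA        = "\U00011A22"   # letter ma (code point as shown in Figure 14-8)
    SA        = "\U00011A30"   # letter sa
    TA        = "\U00011A19"   # letter ta
    VIRAMA    = "\U00011A34"   # zanabazar square sign virama
    SUBJOINER = "\U00011A47"   # zanabazar square subjoiner
    VOWEL_UE  = "\U00011A02"   # vowel sign ue

    # Virama between consonants: dead consonants are marked visibly and do not stack.
    mstu_unstacked = MA + VIRAMA + SA + VIRAMA + TA + VOWEL_UE
    # Subjoiner between consonants: the cluster is rendered as a vertical conjunct stack.
    mstu_stacked   = MA + SUBJOINER + SA + SUBJOINER + TA + VOWEL_UE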
Head Marks. There are four head marks in the Zanabazar Square script. These four head
marks are used in transliterations of Tibetan texts when written with the Zanabazar Square
script. They occur at the beginning of texts.
• U+11A3F zanabazar square initial head mark
• U+11A40 zanabazar square closing head mark
• U+11A45 zanabazar square initial double-lined head mark
• U+11A46 zanabazar square closing double-lined head mark
Both U+11A3F zanabazar square initial head mark and U+11A45 zanabazar
square initial double-lined head mark are used as a base for candrabindu and anus-
vara signs.
The U+11A40 zanabazar square closing head mark and U+11A46 zanabazar
square closing double-lined head mark may be used for producing extended head
marks, similar to usage in Tibetan.
Other Marks. Two vowel modifiers are used to transliterate words of Sanskrit origin:
• U+11A38 zanabazar square sign anusvara indicates nasalization
• U+11A39 zanabazar square sign visarga indicates post-vocalic aspiration
In addition, three combining signs are used as nasalization marks and ornaments for the
head mark:
• U+11A35 zanabazar square sign candrabindu
• U+11A36 zanabazar square sign candrabindu with ornament
• U+11A37 zanabazar square sign candra with ornament
The U+11A33 zanabazar square final consonant mark marks syllable-final conso-
nants when writing Mongolian.
Numerals. There are no known script-specific numerals.
Punctuation. The Zanabazar Square script includes four punctuation marks used for writ-
ing Tibetan:
• U+11A41 zanabazar square mark tsheg indicates the end of a syllable
• U+11A42 zanabazar square mark shad indicates the end of the phrase or
sentence
• U+11A43 zanabazar square mark double shad marks the end of a text sec-
tion
• U+11A44 zanabazar square mark long tsheg behaves as a comma
14.7 Soyombo
Soyombo: U+11A50–U+11AAF
The Soyombo script is an historic script used to write Mongolian, Sanskrit, and Tibetan. It
was created in 1686 by Zanabazar (1635–1723), who also developed the Zanabazar Square
script. The script appears primarily in Buddhist texts in Central Asia. Most of these texts
consist of either handwritten manuscripts or inscriptions.
Structure. Soyombo is an abugida. Consonants generally include an inherent vowel /a/, as
is the case with many other Brahmi-derived scripts. The script also includes final conso-
nant signs and four cluster-initial letters. A special subjoiner is employed to create con-
juncts.
Soyombo text is typically written horizontally left-to-right. In vertically written text, char-
acters are oriented in columns laid out left-to-right, with upright glyphs.
The graphical structure of Soyombo letters consists of two parts: a frame, made up of a ver-
tical bar with a triangle at the top, and a nucleus that represents a phoneme. Together the
frame and the nucleus represent the atomic letter. Vowel signs, final consonants, and other
phonetic features appear as dependent signs attached to the letters. The signs may appear
above or to the right of the frame, or below the nucleus.
Vowels and Diphthongs. The vowel a is represented by U+11A50 soyombo letter a.
When it occurs with a vowel sign, soyombo letter a serves as a vowel-carrier, indicating
an independent vowel. Long vowels are represented by appending U+11A5B soyombo
vowel length mark. When used to write Mongolian, U+11A57 soyombo vowel sign ai
and U+11A58 soyombo vowel sign au are used with other vowel signs to represent diph-
thongs.
Consonants. Mongolian syllable-final consonants are represented by U+11A50 soyombo
letter a followed by a final consonant sign. To indicate geminated consonants, U+11A98
soyombo gemination mark is stacked above the triangle of the frame. In the backing
store, it occurs immediately after the base letter, but before any other combining mark.
Other above-base signs are shown above the gemination mark.
Generally, consonant clusters are written as conjunct forms. Because Soyombo does not
have a native virama, a special subjoiner character, U+11A99 soyombo subjoiner, is used.
Conjuncts are represented by using a subjoiner between each pair of consonants in a clus-
ter. A conjunct is rendered as a vertical stack of the regular form of the initial letter and the
nucleus of each non-initial letter. Four cluster-initial letters have special forms: la, sha, sa
and ra. Depending upon the context, clusters involving these four letters may be rendered
using the stacked or prefixed forms. The consonant cluster kssa has the structure of an
atomic letter, and is separately encoded as U+11A83 soyombo letter kssa.
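The ordering rules above can be made concrete with a short Python sketch; it is illustrative only, and the consonant code point used as the base (shown as KA) is an assumption supplied for the example rather than a value cited in the text.

    # Illustrative sketch of the combining order and conjunct encoding described above.
    GEMINATION = "\U00011A98"   # soyombo gemination mark
    SUBJOINER  = "\U00011A99"   # soyombo subjoiner
    ANUSVARA   = "\U00011A96"   # soyombo sign anusvara
    KA         = "\U00011A5C"   # assumed code point for a Soyombo consonant (illustration only)

    # The gemination mark follows the base letter immediately, before any other
    # combining mark (here, the anusvara).
    geminated_nasalized = KA + GEMINATION + ANUSVARA

    # A conjunct is represented with the subjoiner between each pair of consonants.
    kka_conjunct = KA + SUBJOINER + KA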
Character Names. The character names are based on their values for writing Tibetan, with
the exception of the final consonant signs, which reflect their Mongolian usage. The order
of the consonant letters follows the alphabetical order of the Tibetan script. This also
matches the order of letters in the Zanabazar Square script.
Other Marks. Two vowel modifiers are used to transliterate words of Sanskrit origin,
U+11A96 soyombo sign anusvara, which indicates nasalization, and U+11A97 soyombo
sign visarga, which is used to indicate post-vocalic aspiration. Independent forms of
these modifiers are represented by combining them with U+11A50 soyombo letter a.
Numerals. There are no known script-specific numerals.
Punctuation. The Soyombo script includes a number of punctuation marks. U+11A9A
soyombo mark tsheg indicates the end of a syllable, and corresponds to U+0F0B tibetan
mark intersyllabic tsheg. To indicate the end of a phrase or sentence, U+11A9B soy-
ombo mark shad may be employed. It corresponds to U+0F0D tibetan mark shad and
U+0964 devanagari danda. The end of a section is marked by U+11A9C soyombo mark
double shad, corresponding to U+0F0E tibetan mark nyis shad and U+0965 devana-
gari double danda.
The script also contains three head marks, similar to those used in Mongolian and Tibetan.
The Soyombo marks may be followed by a shad or double shad. The U+11A9E soyombo
head mark with moon and sun and triple flame, also known as the Svayambhu or
“Soyombo” sign, is the official symbol of Mongolia. In addition, the script includes termi-
nal marks, which appear at the end of text.
14.10 Sogdian
Sogdian: U+10F30–U+10F6F
Derived from Old Sogdian, the Sogdian script was used from the seventh to the fourteenth
century ce in Central Asia to write the eastern Iranian language Sogdian. It was also used to
write Chinese, Sanskrit, and Uyghur. Sogdian is the ancestor of the Mongolian and Old
Uyghur scripts. It is attested in manuscripts and inscribed on coins, stone, pottery, and
other media. The script has two major styles: “formal,” used in Buddhist sutra manuscripts,
and a simplified, “cursive” style. The Old Uyghur script is believed to have derived from the
Sogdian cursive style in the eighth or ninth century ce.
Structure. Sogdian is an abjad that can be written horizontally from right to left, or verti-
cally from top to bottom, in columns running from left to right. When the script appears in
vertical orientation, the glyphs are rotated ninety degrees counter-clockwise. Unlike Old
Sogdian, Sogdian is a cursive joining script. Eleven combining signs in the range
U+10F46..U+10F50 are used for disambiguation and transcription.
The Sogdian repertoire corresponds to that of Old Sogdian, but has a number of differ-
ences in the glyphs and also has additional characters. Sogdian has a special form of ayin
for an Aramaic heterogram, and includes two characters not found in Old Sogdian, feth
and lesh. The letter feth is used to represent [f]. Lesh or “hooked resh” is an extension of
resh-ayin with a below-base hook that has become an intrinsic part of the letter. The reper-
toire includes one phonogram, U+10F45 sogdian independent shin, an alternate form
of isolated shin, used to transcribe one Chinese character, U+6240 所. The glyph for ayin is
identical to the glyph for resh; therefore the two letters have been unified as a single char-
acter, U+10F40 sogdian letter resh-ayin.
Glyphs. The representative glyphs are generally based on the isolated or independent form
of letters found in the formal style of Sogdian. Fonts may be used to show the formal or
cursive style of a text. As in other abjads, the letters connect and change shape based on
their position within a word. In the later Sogdian styles, some letters, such as nun, gimel
and beth, remain unconnected from a following letter to distinguish them from similar
shapes.
Numbers. The Sogdian script includes script-specific numbers encoded in the range
U+10F51..U+10F54.
Punctuation. Five script-specific punctuation characters are included in the repertoire.
The four Sogdian punctuation characters, U+10F55 sogdian punctuation two verti-
cal bars, U+10F56 sogdian punctuation two vertical bars with dots, U+10F57
sogdian punctuation circle with dot and U+10F58 sogdian punctuation two cir-
cles with dots, delimit text segments and may vary in shape. U+10F59 sogdian punc-
tuation half circle with dot generally indicates the completion of a text. Various other
punctuation marks occur in Sogdian texts, and in some cases may be represented by punc-
tuation characters from other blocks, such as General Punctuation.
Chapter 15
South and Central Asia-IV
Other Historic Scripts
This chapter documents other historic scripts of South and Central Asia, which are
described in the sections that follow.
Most of these scripts are historically related to the other scripts of India, and most are ulti-
mately derived from the Brahmi script. None of them were standardized in ISCII. The
encoding for each script is done on its own terms, and the blocks do not make use of a
common pattern for the layout of code points.
This introduction briefly identifies each script, occasionally highlighting the most salient
distinctive attributes of the script. Details are provided in the individual block descriptions
that follow.
Syloti Nagri is used to write the modern Sylheti language of northeast Bangladesh and
southeast Assam in India.
Kaithi is a historic North Indian script, closely related to the Devanagari and Gujarati
scripts. It was used in the area of the present-day states of Bihar and Uttar Pradesh in
northern India, from the 16th century until the early 20th century.
Sharada is a historical script that was used to write Sanskrit, Kashmiri, and other languages
of northern South Asia; it was the principal inscriptional and literary script of Kashmir
from the 8th century ce until the 20th century. It has limited and specialized modern use.
Takri, descended from Sharada, is used in northern India and surrounding countries. It is
the traditional writing system for the Chambeali and Dogri languages, as well as several
“Pahari” languages. In addition to popular usage for commercial and informal purposes,
Takri served as the official script of several princely states of northern and northwestern
India from the 17th century until the middle of the 20th century.
Siddham is another Brahmi-based writing system related to Sharada, and structurally sim-
ilar to Devanagari. It originated in India, and was used across South, Central, and East
Asia, and is presently predominantly used in East Asia. Originally used for writing Bud-
dhist manuscripts, the script is still used by Japanese Buddhist communities.
Mahajani is a Brahmi-based alphabet commonly used by bankers and money lenders
across northern India until the middle of the 20th century. It is a specialized commercial
script used for writing accounts and financial records. Mahajani has similarities to Landa,
Kaithi, and Devanagari.
Khojki is a writing system used by the Nizari Ismaili community of South Asia for record-
ing religious literature. It is one of two Landa scripts—the other being Gurmukhi—that
were developed into formal liturgical scripts for use by religious communities. It is still
used today.
Khudawadi is a Landa-based script that was used to write the Sindhi language spoken in
India and Pakistan. It is related to Sharada. Known as the shopkeeper and merchant script,
it was used for routine writing, accounting, and other commercial purposes.
The Multani script was used to write the Seraiki language of eastern and southeastern Paki-
stan during the 19th and 20th centuries. Multani is related to Gurmukhi and more dis-
tantly related to Khudawadi and Khojki. It was used for routine writing and commercial
activities.
Tirhuta, another Brahmi-based script, is related to the Bengali, Newari, and Oriya scripts.
Tirhuta is the traditional writing system for the Maithili language, which is spoken by more
than 35 million people in parts of India and Nepal. Maithili is an official regional language
of India and the second most spoken language in Nepal.
Modi is another Brahmi-based script mainly used to write Marathi, a language spoken in
western and central India. It emerged in the 16th century and derives from the Nagari
scripts. It is still in some use today.
Nandinagari is a Brahmi-based abugida that was used in southern India between the 11th
and 19th centuries for manuscripts and inscriptions in Sanskrit. It is related to Devanagari.
The script was also used for writing Kannada in Karnataka.
Grantha, a script with a long history, is used to write the Sanskrit language in parts of South
India, Sri Lanka and elsewhere. It is in daily use by Vedic scholars and Hindu temple priests.
Ahom is a script of northeast India that dates to about the 16th century and was used pri-
marily to write the Tai Ahom language. The script has seen a revival in the 20th century,
and continues in some use today.
Sora Sompeng is used to write the Sora language spoken by the Sora people, who live in
eastern India between the Oriya- and Telugu-speaking populations. The script was created
in 1936 and is used in religious contexts.
During the 17th century, the Brahmi-based Dogra script was used to write the Dogri lan-
guage in Jammu and Kashmir in the northern region of the Indian subcontinent. The
Dogra script was standardized in the 1860s, and is closely related to the Takri script. Dogri
is now usually written with the Devanagari script.
15.1 Syloti Nagri
Poetry Marks. Four native poetry marks are included in the Syloti Nagri block. The script
also makes use of U+2055 flower punctuation mark (in the General Punctuation
block) as a poetry mark.
15.2 Kaithi
Kaithi: U+11080–U+110CF
Kaithi, properly transliterated Kaithī, is a North Indian script, related to the Devanagari
and Gujarati scripts. It was used in the area of the present-day states of Bihar and Uttar
Pradesh in northern India.
Kaithi was employed for administrative purposes, commercial transactions, correspon-
dence, and personal records, as well as to write religious and literary materials. As a means
of administrative communication, the script was in use at least from the 16th century until
the early 20th century, when it was eventually eclipsed by Devanagari. Kaithi was used to
write Bhojpuri, Magahi, Awadhi, Maithili, Urdu, and other languages related to Hindi.
Standards. There is no preexisting character encoding standard for the Kaithi script. The
repertoire encoded in this block is based on the standard form of Kaithi developed by the
British government of Bihar and the British provinces of northwest India in the 19th cen-
tury. A few additional Kaithi characters found in manuscripts, printed books, alphabet
charts, and other inventories of the script are also included.
Styles. There are three presentation styles of the Kaithi script, each generally associated
with a different language: Bhojpuri, Magahi, or Maithili. The Magahi style was adopted for
official purposes in the state of Bihar, and is the basis for the representative glyphs in the
code charts.
Rendering Behavior. Kaithi is a Brahmi-derived script closely related to Devanagari. In
general, the rules for Devanagari rendering apply to Kaithi as well. For more information,
see Section 12.1, Devanagari.
Vowel Letters. An independent Kaithi letter for vocalic r is represented by the consonant-
vowel combination: U+110A9 kaithi letter ra and U+110B2 kaithi vowel sign ii.
In print, the distinction between short and long forms of i and u is maintained. However,
in handwritten text, there is a tendency to use the long vowels for both lengths.
Consonant Conjuncts. Consonant clusters were handled in various ways in Kaithi. Some
spoken languages that used the Kaithi script simplified clusters by inserting a vowel
between the consonants, or through metathesis. When no such simplification occurred,
conjuncts were represented in different ways: by ligatures, as the combination of the half-
form of the first consonant and the following consonant, with an explicit virama (U+110B9
kaithi sign virama) between two consonants, or as two consonants without a virama.
Consonant conjuncts in Kaithi are represented with a virama between the two consonants
in the conjunct. For example, the ordinary representation of the conjunct mba would be by
the sequence:
U+110A7 kaithi letter ma + U+110B9 kaithi sign virama +
U+110A5 kaithi letter ba
Consonant conjuncts may be rendered in distinct ways. Where there is a need to render
conjuncts in the exact form as they appear in a particular source document, U+200C zero
width non-joiner and U+200D zero width joiner can be used to request the appropri-
ate presentation by the rendering system. For example, to display the explicitly ligated
glyph for the conjunct mba, U+200D zero width joiner is inserted after the virama:
U+110A7 kaithi letter ma + U+110B9 kaithi sign virama +
U+200D zero width joiner + U+110A5 kaithi letter ba
To block use of a ligated glyph for the conjunct, and instead to display the conjunct with an
explicit virama, U+200C zero width non-joiner is inserted after the virama:
U+110A7 kaithi letter ma + U+110B9 kaithi sign virama +
U+200C zero width non-joiner + U+110A5 kaithi letter ba
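The three encodings of mba can be written out as follows; the Python sketch is illustrative only and uses just the characters cited above.

    # Illustrative sketch: the three encodings of the Kaithi conjunct mba.
    MA, BA    = "\U000110A7", "\U000110A5"   # kaithi letter ma, kaithi letter ba
    VIRAMA    = "\U000110B9"                 # kaithi sign virama
    ZWJ, ZWNJ = "\u200D", "\u200C"           # zero width joiner / non-joiner

    mba_default  = MA + VIRAMA + BA          # ordinary representation of the conjunct
    mba_ligature = MA + VIRAMA + ZWJ + BA    # request the explicitly ligated glyph
    mba_virama   = MA + VIRAMA + ZWNJ + BA   # request a visible virama, no ligature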
Conjuncts composed of a nasal and a consonant may be written either as a ligature with
the half-form of the appropriate class nasal letter, or the full form of the nasal letter with an
explicit virama (U+110B9 kaithi sign virama) and consonant. In Grierson’s Linguistic
Survey of India, however, U+110A2 kaithi letter na is used for all articulation classes,
both in ligatures and when the full form of the nasal appears with the virama.
Ruled Lines. Kaithi, unlike Devanagari, does not employ a headstroke. While several man-
uscripts and books show a headstroke similar to that of Devanagari, the line is actually a
ruled line used for emphasis, titling or sectioning, and is not broken between individual let-
ters. Some Kaithi fonts, however, were designed with a headstroke, but the line is not bro-
ken between individual letters, as would occur in Devanagari.
Nukta. Kaithi includes a nukta sign, U+110BA kaithi sign nukta, a dot which is used as
a diacritic below various consonants to form new letters. For example, the nukta is used to
distinguish the sound va from ba. The precomposed character U+110AB kaithi letter
va is separately encoded, and has a canonical decomposition into the sequence of
U+110A5 kaithi letter ba plus U+110BA kaithi sign nukta. Precomposed characters
are also encoded for two other Kaithi letters, rha and dddha.
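The decomposition can be confirmed with standard normalization machinery; the following Python sketch is illustrative only and assumes a Python build whose unicodedata tables cover the Kaithi block.

    import unicodedata

    # Illustrative check of the canonical decomposition described above:
    # U+110AB kaithi letter va -> U+110A5 kaithi letter ba + U+110BA kaithi sign nukta
    va = "\U000110AB"
    nfd = unicodedata.normalize("NFD", va)
    print([f"U+{ord(c):04X}" for c in nfd])   # expected: ['U+110A5', 'U+110BA']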
The glyph for U+110A8 kaithi letter ya may appear with or without a nukta. Because
the form without the nukta is considered a glyph variant, it is not separately encoded as a
character. The representative glyph used in the chart contains the dot. The nukta diacritic
also marks letters representing some sounds in Urdu or sounds not native to Hindi. No
precomposed characters are encoded in those cases, and such letters must be represented
by a base character followed by the nukta.
Punctuation. A number of Kaithi-specific punctuation marks are encoded. Two marks
designate the ends of text sections: U+110BE kaithi section mark, which generally indi-
cates the end of a sentence, and U+110BF kaithi double section mark, which delimits
larger blocks of text, such as paragraphs. Both section marks are generally drawn so that
their glyphs extend to the edge of the text margins, particularly in manuscripts.
The character U+110BD kaithi number sign is a format control that interacts with digits.
It occurs below a digit or sequence of digits, indicating a numerical reference. The related
character U+110CD kaithi number sign above occurs above a digit or sequence of dig-
its, and indicates a number in an itemized list, similar to U+2116 numero sign. Like
U+0600 arabic number sign and the other Arabic signs that span numbers (see
Section 9.2, Arabic), these Kaithi format controls precede the numbers they graphically
interact with, rather than following them. U+110BC kaithi enumeration sign is a stand-
alone, spacing symbol for inline usage.
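In the backing store, the format control therefore precedes the digits it spans. The following Python sketch is illustrative only; the helper kaithi_reference is not part of any library, and it uses the Devanagari digits recommended for Kaithi later in this section.

    # Illustrative sketch: Kaithi number signs precede the digits they interact with.
    NUMBER_SIGN       = "\U000110BD"   # kaithi number sign (rendered below the digits)
    NUMBER_SIGN_ABOVE = "\U000110CD"   # kaithi number sign above (rendered above the digits)

    def kaithi_reference(number, sign=NUMBER_SIGN):
        # Kaithi uses the Devanagari digits U+0966..U+096F.
        digits = "".join(chr(0x0966 + int(d)) for d in str(number))
        return sign + digits

    item_3 = kaithi_reference(3, NUMBER_SIGN_ABOVE)   # e.g. a number in an itemized list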
U+110BB kaithi abbreviation sign, shaped like a small circle, is used in Kaithi to indi-
cate abbreviations. This mark is placed at the point of elision or after a ligature to indicate
common words or phrases that are abbreviated, in a similar way to U+0970 devanagari
abbreviation sign.
Kaithi makes use of two script-specific dandas: U+110C0 kaithi danda and U+110C1
kaithi double danda.
For other punctuation marks occurring in Kaithi texts, available Unicode characters may
be used. A cross-shaped character, used to mark phrase boundaries, can be represented by
U+002B plus sign. For hyphenation, users should follow whatever is the recommended
practice found in similar Indic script traditions, which might be U+2010 hyphen or
U+002D hyphen-minus. For dot-like marks that appear as word-separators, U+2E31
word separator middle dot, or, if the word boundary is more like a dash, U+2010
hyphen can be used.
Digits. The digits in Kaithi are considered to be stylistic variants of those used in Devana-
gari. Hence the Devanagari digits located at U+0966..U+096F should be employed. To
indicate fractions and unit marks, Kaithi uses characters encoded in the Common Indic
Number Forms block, U+A830..U+A839.
15.3 Sharada
Sharada: U+11180–U+111DF
Sharada is a historical script that was used to write Sanskrit, Kashmiri, and other languages
of northern South Asia. It served as the principal inscriptional and literary script of Kash-
mir from the 8th century ce until the 20th century. In the 19th century, expanded use of the
Arabic script to write Kashmiri and the growth of Devanagari contributed to the marginal-
ization of Sharada. Today the script is employed in a limited capacity by Kashmiri pandits
for horoscopes and ritual purposes.
Rendering Behavior. Sharada is a Brahmi-based script, closely related to Devanagari. In
general, the rules for Devanagari rendering apply to Sharada as well. For more informa-
tion, see Section 12.1, Devanagari.
Ruled Lines. While the headstroke is an important structural feature of a character’s glyph
in Sharada, there is no rule governing the joining of headstrokes of characters to other
characters. The variation was probably due to scribal preference, and should be handled at
the font level.
Virama. U+111C0 sharada sign virama is a spacing mark, written to the right of the
consonant letter it modifies. Semantically, it is identical to the virama in Devanagari and
other similar Indic scripts.
Candrabindu and Avagraha. U+11180 sharada sign candrabindu indicates nasal-
ization of a vowel. It may appear in manuscripts in an inverted form, but with no semantic
difference. Such glyph variants should be handled in the font. U+111C1 sharada sign
avagraha represents the elision of a word-initial a. Unlike the usual practice in Devana-
gari in which the avagraha is written at the normal letter height and attaches to the top
stroke of the following character, the avagraha in Sharada is written at or below the base-
line and does not connect to the neighboring letter.
Jihvamuliya and Upadhmaniya. The velar and labial allophones of /h/, followed by voice-
less velar and labial stops respectively, are written in Sharada with separate signs, U+111C2
sharada sign jihvamuliya and U+111C3 sharada sign upadhmaniya. These two
signs have the properties of a letter and appear only in stacked conjuncts without the use of
virama. Jihvamuliya is used to represent the velar fricative [x] in the context of a following
voiceless velar stop:
U+111C2 jihvamuliya + U+11191 ka → jihvamuliya-ka conjunct
U+111C2 jihvamuliya + U+11192 kha → jihvamuliya-kha conjunct
Upadhmaniya is used to represent the voiceless bilabial fricative [ɸ] in the context of a following
voiceless labial stop:
U+111C3 upadhmaniya + U+111A5 pa → upadhmaniya-pa conjunct
U+111C3 upadhmaniya + U+111A6 pha → upadhmaniya-pha conjunct
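Written as code point sequences, the four conjuncts above are simply two-character strings; the following Python sketch is illustrative only.

    # Illustrative sketch: jihvamuliya and upadhmaniya form stacked conjuncts
    # directly with the following stop, with no intervening virama.
    JIHVAMULIYA = "\U000111C2"   # sharada sign jihvamuliya
    UPADHMANIYA = "\U000111C3"   # sharada sign upadhmaniya
    KA, KHA     = "\U00011191", "\U00011192"   # sharada letter ka, kha
    PA, PHA     = "\U000111A5", "\U000111A6"   # sharada letter pa, pha

    stacks = [JIHVAMULIYA + KA, JIHVAMULIYA + KHA,
              UPADHMANIYA + PA, UPADHMANIYA + PHA]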
15.4 Takri
Takri: U+11680–U+116CF
Takri is a script used in northern India and surrounding countries in South Asia, including
the areas that comprise present-day Jammu and Kashmir, Himachal Pradesh, Punjab, and
Uttarakhand. It is the traditional writing system for the Chambeali and Dogri languages, as
well as several “Pahari” languages, such as Jaunsari, Kulvi, and Mandeali. It is related to the
Gurmukhi, Landa, and Sharada scripts. Like other Brahmi-derived scripts, Takri is an
abugida, with consonants taking an inherent vowel unless accompanied by a vowel marker
or the virama (vowel killer).
Takri is descended from Sharada through an intermediate form known as Devāśeṣa, which
emerged in the 14th century. Devāśeṣa was a script used for religious and official purposes,
while its popular form, known as Takri, was used for commercial and informal purposes.
Takri became differentiated from Devāśeṣa during the 16th century. In its various regional
manifestations, Takri served as the official script of several princely states of northern and
northwestern India from the 17th century until the middle of the 20th century. Until the
late 19th century, Takri was used concurrently with Devanagari, but it was gradually
replaced by the latter.
Owing to its use as both an official and a popular script, Takri appears in numerous
records, from manuscripts to inscriptions to postage stamps. There are efforts to revive the
use of Takri for languages such as Dogri, Kishtwari, and Kulvi as a means of preserving
access to these language’s literatures.
There is no universal, standard form of Takri. Where Takri was standardized, the reformed
script was limited to a particular polity, such as a kingdom or a princely state. The repre-
sentative glyphs shown in the code charts are taken mainly from the forms used in a variant
established as the official script for writing the Chambeali language in the former Chamba
State, now in Himachal Pradesh, India. There are a number of other regional varieties of
Takri that have varying letterforms, sometimes quite different from the representative
forms shown in the code charts. Such regional forms are considered glyphic variants and
should be handled at the font level.
Vowel Letters. Vowel letters are encoded atomically in Unicode, even if they can be ana-
lyzed visually as consisting of multiple parts. Table 15-1 shows the letters that can be ana-
lyzed, the single code point that should be used to represent them in text, and the sequence
of code points resulting from analysis that should not be used.
Consonant Conjuncts. Conjuncts in Takri are infrequent and, when written, consist of two
consonants, the second of which is always ya, ra, or ha. Takri ya is written as a subjoining
form; Takri ra can be written as a ligature or a subjoining form; and Takri ha is written as
a half-form.
Nukta. A combining nukta character is encoded as U+116B7 takri sign nukta. Charac-
ters that use this sound, mainly loan words and words from other languages, may be repre-
sented using the base character plus nukta.
Headlines. Unlike Devanagari, Takri does not generally use headlines. However, head-
lines do appear in the glyph shapes of certain Takri letters. The headline is an intrinsic fea-
ture of glyph shapes in some regional varieties such as Dogra Akkhar, where it appears to
be inspired by the design of Devanagari characters. There are no fixed rules for the joining
of headlines. For example, the headlines of two sequential characters possessing headlines
are left unjoined in Chambeali, while the headlines of a letter and a vowel sign are joined in
printed Dogra Akkhar.
Punctuation. Takri uses U+0964 devanagari danda and U+0965 devanagari double
danda from Devanagari.
Fractions. Fraction signs and currency marks found in Takri documents use the characters
in the Common Indic Number Forms block (U+A830..U+A83F).
15.5 Siddham
Siddham: U+11580–U+115FF
Siddham is a Brahmi-based writing system that originated in India, and is presently used
primarily in East Asia. The script is also known as Siddhamātṛkā and Kuṭila. The name Sid-
dhamatrika has broad historic and regional usage throughout India and East Asia. How-
ever, modern usage is most strongly associated with the Shingon and Tendai Buddhist
traditions in Japan, where the script is also known as Bonji. The representative glyphs in
the code charts are based upon Japanese forms of Siddham characters.
The historical record shows the use of Siddham in Central Asia, but the predominant
examples are of its use for writing Sanskrit in China, Japan, and Korea, notably for Bud-
dhist manuscripts. Today, it is mainly used for ceremonial and ritualistic purposes associ-
ated with esoteric Buddhist practices.
Siddham is most closely related to Sharada, another Brahmi-based script that originated in
Kashmir.
Nukta. The sign U+115C0 siddham sign nukta is used for transcribing sounds that are
not native to the writing system. The nukta sign is not a traditional Siddham character, but
it is part of modern Siddham, so that it can accommodate the writing of Japanese and
English.
Vowels. The Siddham vowel signs for u and uu may appear in two forms. The regular
forms, called “cloud” forms, are represented by U+115B2 siddham vowel sign u and
U+115B3 siddham vowel sign uu. Alternate vowel sign forms, referred to as “warbler”
forms, are represented instead by U+115DC siddham vowel sign alternate u and
U+115DD siddham vowel sign alternate uu.
The combination of ra and u should be written with the sequence <U+115A8 siddham
letter ra, U+115DC siddham vowel sign alternate u>. For the combination of ra and
uu, the alternate form should likewise be employed, represented by the sequence
<U+115A8 siddham letter ra, U+115DD siddham vowel sign alternate uu>.
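The recommended sequences can be stated directly; the following Python sketch is illustrative only.

    # Illustrative sketch: ra takes the alternate ("warbler") vowel sign forms.
    RA     = "\U000115A8"   # siddham letter ra
    U_ALT  = "\U000115DC"   # siddham vowel sign alternate u
    UU_ALT = "\U000115DD"   # siddham vowel sign alternate uu

    ru  = RA + U_ALT    # used instead of U+115B2 siddham vowel sign u after ra
    ruu = RA + UU_ALT   # used instead of U+115B3 siddham vowel sign uu after ra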
Virama and Conjuncts. The virama, U+115BF siddham sign virama, is identical to the
corresponding character in Devanagari and silences the inherent vowel of a consonant.
The default rendering of the Siddham virama is as a visible sign.
Consonant clusters in Siddham are written as conjuncts and follow the same model as con-
juncts in Devanagari. Conjuncts are represented using the Siddham virama, which is writ-
ten between each consonant in the cluster. Conjuncts may be written vertically,
horizontally, or as independent ligatures. There are traditional Chinese and Japanese tabu-
lations for Siddham conjuncts.
Siddham conjuncts may represent clusters with a large number of consonants. For exam-
ple, rkṣvrya is a conjunct cluster produced by a sequence of six consonants, as shown in
Figure 15-1.
[Figure 15-1. The conjunct cluster rkṣvrya, formed from the letters ra, ka, ssa, va, ra, and ya.]
Head Marks. The mark U+115C1 siddham sign siddham is written at the beginning
of a text. Paleographically, the sign corresponds to characters used in other scripts, such as
U+0FD3 tibetan mark initial brda rnying yig mgo mdun ma. It represents the San-
skrit word siddham, “accomplished,” and the phrase siddhirastu, “may there be success.” A
vertically-oriented glyph variant is used for vertical text layout.
Repetition Marks. Three marks, U+115C6 siddham repetition mark-1, U+115C7
siddham repetition mark-2, and U+115C8 siddham repetition mark-3, are used to
indicate text repetition. They are written after the text that is to be repeated.
Section Signs. A set of fourteen section marks are used in Siddham to indicate the ends of
sentences, phrases, verses, and sections. They appear in manuscripts and script manuals.
According to the Shingon philosophy, the characters possess esoteric qualities that relay
information regarding the interpretation of the text.
Punctuation. There are five other punctuation marks encoded for Siddham, as shown in
Table 15-2. Both Siddham danda and Siddham double danda have graphical variants used
in informal Japanese writing of Siddham.
15.6 Mahajani
Mahajani: U+11150–U+1117F
Mahajani is a Brahmi-based writing system that was commonly used across northern India
until the middle of the 20th century. It is a specialized commercial script used for writing
accounts and financial records. It was used for recording several languages: Hindi, Mar-
wari, and Punjabi. Mahajani was taught and used as a medium of education in Punjab,
Rajasthan, Uttar Pradesh, Bihar, and Madhya Pradesh in schools where students from
merchant and trading communities learned the script and other writing skills required for
business. The name “Mahajani” refers to bankers and money lenders, who were the pri-
mary users of the script. The majority of Mahajani records are account books. Although the
Mahajani script is no longer in general use, it is an important key to the historical financial
records of northern India.
Mahajani has similarities to Landa, Kaithi, and Devanagari. In structure and orthography,
Mahajani resembles scripts of the Landa family used in Punjab and Sindh, which are
related to Sharada.
Structure. Mahajani is written from left to right. It is based upon the Brahmi model, but it
is structurally simpler and behaves as an alphabet. Vowel signs are not used, and there is no
virama. Consonant clusters are not written in Mahajani using half-forms or ligatures
(except for one ligature for shri), or even a visible virama. The elements of a consonant
cluster are written sequentially using regular consonant letters.
Vowel signs are not written. Consonant letters theoretically bear the inherent vowel /a/,
but the glyph for ka, for example, represents not only ka but also any one of the syllables ka,
kā, ki, kī, ke, and so on. In cases where greater precision is required, a vowel letter may be
written after a consonant to convey the intended vocalic context. In general, the value of a
consonant letter must be inferred at the morphological level.
Nasalization is not represented using special signs, such as anusvara. Instead U+11167
mahajani letter na is used in cases where nasalization is explicitly recorded. In several
cases, words are written simply with nasalization deleted.
U+11173 mahajani sign nukta is used for writing sounds that are not represented by a
unique character, such as allophonic variants and sounds that occur in local dialects or in
loanwords. It has limited use in Mahajani.
Several letters have glyphic variants. Those variants are not separately encoded.
Digits. Mahajani does not have distinctive script-specific digits. The Devanagari digits
located at U+0966..U+096F should be used.
Other Symbols. Fraction signs and unit marks are found in Mahajani documents, and may
be represented using the characters encoded in the “Common Indic Number Forms”
block.
Punctuation. Mahajani employs a dash, middle dot, and colon, which should be repre-
sented by the corresponding Latin characters. For the dandas, Mahajani employs U+0964
devanagari danda and U+0965 devanagari double danda. Mahajani also contains two
other script-specific punctuation signs, U+11174 mahajani abbreviation sign and
U+11175 mahajani section mark. There are no formal rules for punctuation, and word spacing is not generally
observed.
15.7 Khojki
Khojki: U+11200–U+1124F
Khojki is a writing system used by the Nizari Ismaili community of South Asia for record-
ing religious literature. It was developed in Sindh, now in Pakistan, for representing the
Sindhi language. The script spread to surrounding regions and was used for writing Guja-
rati, Punjabi, and Siraiki, as well as several languages related to Hindi. It was also used for
writing Arabic and Persian. Popular Nizari Ismaili tradition states that Khojki was
invented and propagated by Pir Sadruddin, an Ismaili missionary.
Khojki is one of two Landa scripts that were developed into formal liturgical scripts for use
by religious communities; the other is Gurmukhi, which was developed for writing the
sacred literature of the Sikh tradition.
Khojki is also called “Sindhi” and “Khwajah Sindhi.” Khojki was in use by the 16th century
ce, as attested by manuscript evidence. The printing of Khojki books flourished after Lal-
jibhai Devraj produced metal types for Khojki in Germany for use at his Khoja Sindhi
Printing Press in Mumbai.
While usage of Khojki has declined over the past century, it is used wherever Nizari Ismaili
Muslims of South Asian origin reside. The largest communities are found in Pakistan,
India, Canada, the United States, the United Kingdom, Kenya, Tanzania, and Uganda. Khojki
primers continue to be published in Pakistan for teaching the script. Khojki manuscripts
and books are used in Ismaili ceremonies not only in South Asia, but in east and south
Africa, where large diaspora communities formed by the 19th century. The script was also
used by communities related to the Nizari Ismailis, such as the Imamshahis of Gujarat.
Structure. The general structure of Khojki is similar to that of other Brahmi-derived Indic
scripts. It is written from left-to-right.
Khojki has a smaller repertoire of independent vowel letters than other Brahmi-derived
scripts. The letters U+11202 khojki letter i and U+11203 khojki letter u are used for
writing both short and long forms of i and u, respectively. The letters U+11205 khojki
letter ai and U+11207 khojki letter au represent diphthongs. Although they are
attested in manuscripts and books, Khojki originally did not have unique letters for these
vowels. In early Khojki records, diphthongs are generally represented as digraphs. Several
variant forms of vowel letters are also attested.
The repertoire of dependent vowel signs is larger than that of independent vowel letters.
There are separate signs for U+1122D khojki vowel sign i and U+1122E khojki vowel
sign ii, but no form for uu. Instead, the single sign U+1122F khojki vowel sign u is used
for both short and long forms. U+11232 khojki vowel sign o is often written by placing
the U+11230 khojki vowel sign e element above the consonant letter.
Geminate consonants are marked by the U+11237 khojki sign shadda, written above the
consonant letter that is doubled. The positioning may change in relation to vowel signs.
Nasalization is indicated by the sign U+11234 khojki sign anusvara. It is written to the
right of the letter or sign with which it combines.
U+11235 khojki sign virama is identical in function to corresponding characters in
other Indic scripts. It is written to the right of a consonant letter.
U+11236 khojki sign nukta is used for producing characters to represent sounds not
native to Sindhi. The sign may be written with vowel letters, vowel signs, and consonant
letters. The nukta is written above a letter.
Punctuation. Khojki separates words using U+1123A khojki word separator. U+11238
khojki danda and U+11239 khojki double danda are used to mark the end of sen-
tences. The double danda is also used to mark verse sections. Typically, double danda is
written with U+1123A khojki word separator to the left and right of verse numbers.
Section marks appear frequently in Khojki manuscripts as punctuation that delimits the
end of a section or another larger block of text. The U+1123B khojki section mark is
generally used to mark the end of a sentence, while U+1123C khojki double section mark is
used to delimit larger blocks of text, such as paragraphs. Both generally extend to the mar-
gin of the text-block.
Latin punctuation marks are also used in printed Khojki.
U+1123D khojki abbreviation sign is used for marking abbreviations.
Digits. Khojki makes use of Gujarati digits U+0AE6 through U+0AEF.
15.8 Khudawadi
Khudawadi: U+112B0–U+112FF
Khudawadi is a script used historically for writing the Sindhi language, which is spoken in
India and Pakistan. Official forms of Khudawadi are known as “Hindi Sindhi,” “Hindu
Sindhi,” and “Standard Sindhi.” Khudawadi is a Landa-based script and related to Sharada.
Like other Landa writing systems, Khudawadi is a mercantile script used for routine writ-
ing, accounting, and other commercial purposes and was known as the shopkeeper and
merchant script. It is associated with the merchant communities of Hyderabad, Sindh. In
addition to mercantile records, Khudawadi was used in education, book printing, and for
court records.
In the 1860s, Khudawadi was chosen as the basis for a written standard for education and
administration in Sindh and was developed as an official script. Official Khudawadi
possesses unique characters for each vowel and consonant sound of the Sindhi language, as
well as vowel signs. In the late 19th century, an Arabic-based script became the official writ-
ing system for Sindhi in Pakistan and India. Sindhi is also written in the Devanagari script
in India. Khudawadi is now obsolete.
Structure. The general structure of Khudawadi is similar to that of other Brahmi-based
Indic scripts. It is written from left-to-right.
Vowel Letters. Some independent vowel letters may be represented using a combination of
a base vowel letter and a dependent vowel sign. This practice is not recommended. The
atomic character for the independent vowel letter should always be used.
Consonant Conjuncts. Consonant clusters generally consist of two consonants. These are
written using a visible virama. The encoded representation is <C1 + virama + C2>. Half-
forms and ligated conjunct forms are not attested.
Nasalization. U+112DF khudawadi sign anusvara is used for indicating nasalization.
Nukta. U+112E9 khudawadi sign nukta is used for representing sounds not native to
Sindhi, such as those that may occur in Persian and Arabic loanwords. Attested Khudawadi
letters with nukta are shown in Table 15-4, along with the Arabic letters for which they sub-
stitute. ja + nukta, pronounced za, corresponds to a number of distinct Arabic letters.
In principle, the nukta may be written with any Khudawadi vowel or consonant letter. If
other combining marks, such as a dependent vowel sign or anusvara, also occur in a com-
bining sequence applied to that base character, then the convention is to represent the
nukta first in the combining sequence.
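That ordering convention can be applied mechanically; the following Python sketch is illustrative only, and the base letter is supplied by the caller so that no specific Khudawadi consonant code point is assumed.

    # Illustrative sketch: place the nukta first in a combining sequence.
    NUKTA    = "\U000112E9"   # khudawadi sign nukta
    ANUSVARA = "\U000112DF"   # khudawadi sign anusvara

    def apply_marks(base_letter, *marks):
        # Per the convention above, the nukta precedes any other combining mark
        # (dependent vowel signs, anusvara, and so on) applied to the same base.
        ordered = sorted(marks, key=lambda mark: 0 if mark == NUKTA else 1)
        return base_letter + "".join(ordered)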
Punctuation. The Khudawadi script uses dandas and European punctuation, such as periods,
dashes, colons, and semi-colons. Khudawadi dandas are unified with those of Devanagari.
Line breaking for Khudawadi characters follows the rules for Devanagari.
Digits. Khudawadi has a full set of decimal digits. Fraction signs and currency marks are
attested in Khudawadi records. These may be represented using characters in the Common
Indic Number Forms block found at U+A830..U+A83F.
15.9 Multani
Multani: U+11280–U+112AF
The Multani script was used to write the Seraiki language, an Indo-Aryan language spoken
in the Punjab in eastern Pakistan and the northern Sindh area of southeastern Pakistan.
Multani is a Landa-based script, related to Gurmukhi, and distantly related to Khudawadi
and Khojki. The script, also known as Karikki or Sarai, was used for routine writing and
commercial activities. The first book in the Multani script was published in 1819. By the
latter half of the 19th century, the British administration introduced the Arabic script as the
standard for writing the languages of the Sindh, which led to the demise of various non-
Arabic scripts, including Multani. The script continued to be used into the 20th century.
Today Seraiki is written in the Arabic script.
There is no standard form of the Multani script. The representative glyphs shown in the
code charts are based on printed forms from an 1819 version of the New Testament, with
additional characters that are found only in handwritten documents. Such variant forms
are considered glyphic variants and should be handled at the font level.
The script underwent orthographic changes in the first quarter of the 20th century, with a
reduction in the character repertoire. The repertoire encoded in this block is based on the
set of all characters that are distinctly attested.
Structure. Although Multani is based on the Brahmi model, it is closer in structure to an
abjad than an abugida. There are four independent vowel letters, a, i, u and e, and no
dependent vowel signs. Consonants theoretically possess the inherent /a/ vowel, but as
vowels are not marked, the actual syllabic vowel of a consonant in running text is ambigu-
ous and must be inferred from context. Consonant clusters are written using independent
letters, rather than with conjuncts. There is no virama. Vowels are generally not written
unless they occur in isolation, in word initial position, or in the final position of monosyl-
labic words.
The letter a is used to represent /a/, /a:/ and in some sources /e/ and /æ/. The letter i
represents /i/ and /i:/ and commonly the semivowel /j/. The letter u represents /u/, /u:/
and /o/. The letter e represents /e/, and in some sources /æ/ and /o/.
Digits. The Gurmukhi digits U+0A66..U+0A6F should be employed to represent digits in
Multani.
Punctuation. Multani has only one script-specific punctuation mark, U+112A9 multani
section mark, which indicates the end of a sentence.
15.10 Tirhuta
Tirhuta: U+11480–U+114DF
Tirhuta is the traditional writing system for the Maithili language, which is spoken by more
than 35 million people in the state of Bihar in India, and in the Narayani and Janakpur
zones of Nepal. Maithili is an official regional language of India and the second most spo-
ken language in Nepal. Tirhuta is a Brahmi-based script derived from Gauḍī, or “Proto-
Bengali,” which evolved from the Kuṭila branch of Brahmi by the 10th century. It is related
to the Bengali, Newari, and Oriya scripts, which are also descended from Gauḍī, and
became differentiated from them by the 14th century.
Tirhuta remained the primary writing system for Maithili until the late 20th century, when
it was replaced by Devanagari. The Tirhuta script forms the basis of scholarly and religious
scribal traditions that have been associated with the Maithili and Sanskrit languages since
the 14th century. Tirhuta continues to be used for writing manuscripts of religious and lit-
erary texts, as well as personal correspondence. Since the 1950s, various literary societies,
such as the Maithili Akademi and Chetna Samiti, have been publishing literary, educa-
tional, and linguistic materials in Tirhuta. The script is also used in signage in Darbhanga
and other districts of north Bihar, and as an optional script for writing the civil services
examination in Bihar.
Although several Tirhuta characters, ligatures or combined shapes bear resemblance to
those of Bengali, these similarities are superficial.
Structure. The general structure (phonetic order, matra reordering, use of virama, and so
on) of Tirhuta is similar to that of other Brahmi-based Indic scripts. The script is written
from left-to-right.
Vowels. Tirhuta uses independent vowel letters and corresponding combining vowel signs.
The signs U+114BA tirhuta vowel sign short e and U+114BD tirhuta vowel sign
short o do not have corresponding independent forms, because the sounds they represent
do not occur in word initial position.
Vowel letters are encoded atomically in Unicode, even if they can be analyzed visually as
consisting of multiple parts. Table 15-5 shows the letters that can be analyzed, the single
code point that should be used to represent them in text, and the sequence of code points
resulting from analysis that should not be used.
Consonants. Some of the 33 consonants look like Bengali consonants, but represent differ-
ent sounds. For example, U+114A9 tirhuta letter ra has the same form as U+09AC
bengali letter ba, and U+114AB tirhuta letter va has the same shape as U+09B0
bengali letter ra.
Consonants combined with vowel signs, combined in conjuncts, or appearing at the end of
a word commonly use context-dependent ligatures or glyph combinations. These shapes
also contrast with usage in Bengali. For example, the consonant-vowel combination
<U+1149E tirhuta letter ta, U+114B3 tirhuta vowel sign u> in Tirhuta produces
the same shape as the conjunct <U+09A4 bengali letter ta, U+09CD bengali sign
virama, U+09A4 bengali letter ta> in the Bengali script.
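Although the rendered shapes coincide, the encoded sequences remain distinct, as the following illustrative Python sketch makes explicit.

    # Illustrative sketch: visually similar shapes, distinct encoded sequences.
    tirhuta_tu  = "\U0001149E" + "\U000114B3"      # tirhuta ta + tirhuta vowel sign u
    bengali_tta = "\u09A4" + "\u09CD" + "\u09A4"   # bengali ta + virama + ta

    # The sequences may share a rendered shape but never compare equal,
    # and no normalization form relates them.
    assert tirhuta_tu != bengali_tta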
All variant forms for letters, character elements and conjuncts in Tirhuta should be man-
aged at the font level.
Virama. U+114C2 tirhuta sign virama is identical in function to the corresponding
character in other Indic scripts.
Nasalization. Nasalization is indicated by U+114BF tirhuta sign candrabindu and
U+114C0 tirhuta sign anusvara. These signs are written centered above the base. If
written with an above-base sign or a letter with a graphical element that extends past the
headstroke, they are placed to the right of such signs and elements.
Characters for Representing Sanskrit. Two characters are attested in Vedic and classical
Sanskrit manuscripts written in Tirhuta. U+114C1 tirhuta sign visarga represents an
allophone of ra or sa at word-final position in Sanskrit orthography. U+114C5 tirhuta
gvang represents nasalization. It belongs to the same class of characters as U+1CE9 vedic
sign anusvara antargomukha, U+1CEA vedic sign anusvara bahirgomukha, and
so on.
Tirhuta also uses U+1CF2 vedic sign ardhavisarga which can be found in the Vedic
Extensions block.
Nukta. U+114C3 tirhuta sign nukta is used for writing sounds that are not represented
by a unique character, such as allophonic variants and sounds that occur in local dialects or
in loanwords. The nukta may be written with any vowel or consonant letter. If other com-
bining marks, such as a vowel sign or anusvara, also appear with the base character, then
the nukta is written first.
U+114A5 tirhuta letter ba and U+114AB tirhuta letter va have shapes that include
a dot, but this is not semantically equivalent to a nukta. These letters do not decompose to
nukta, and are treated as atomic characters.
Punctuation. Tirhuta uses U+0964 devanagari danda and U+0965 devanagari double
danda from the Devanagari block.
Special Signs. U+114C6 tirhuta abbreviation sign denotes abbreviations. There are
also two special script-specific signs in Tirhuta. The first, U+11480 tirhuta anji, is used
in the invocations of letters, manuscripts, books, and charts of the script. The sign anji is
said to represent the tusk of the deity Ganesa, patron of learning. The second, U+114C7
tirhuta om, contrasts with the Bengali sign for om, the latter being a simple combination
of U+0993 bengali letter o plus U+0981 bengali sign candrabindu.
Digits. Tirhuta has a full set of decimal digits.
Fractions. Number forms and unit marks are also found in Tirhuta documents. The most
common of these are signs for writing fractions and currency, and they are represented
using characters in the Common Indic Number Forms block (U+A830..U+A83F). They
include U+A831 north indic fraction one half, U+A832 north indic fraction
three quarters, and so on, as well as U+A838 north indic rupee mark. Tirhuta also
uses Bengali “currency numerators,” such as U+09F4 bengali currency numerator
one.
15.11 Modi
Modi: U+11600–U+1165F
Modi is a Brahmi-based script used mainly for writing Marathi. Modi was also used to
write other regional languages such as Hindi, Gujarati, Kannada, Konkani, Persian, Tamil,
and Telugu. According to an old legend, the Modi script was brought to India from Sri
Lanka by Hemadri Pandit, known also as Hemadpant, who was the chief minister of
Ramacandra, the last king of the Yadava dynasty, who reigned from 1271 to about 1309.
Another tradition credits the creation of the script to Balaji Avaji, secretary of state to the
late 17th-century Maratha king Shivaji Raje Bhonsle, also known as Chhatrapati Shivaji
Maharaj. While the veracity of such accounts is difficult to ascertain, it is clear that Modi
derives from the Nagari family of scripts and is a modification of the Nagari model
intended for continuous writing.
Modi emerged as an administrative writing system in the 16th century before the rise of the
Maratha dynasties. It was adopted by the Marathas as an official script beginning in the
17th century and was used in such a capacity in Maharashtra until the middle of the 20th
century. In the 1950s the use of Modi was formally discontinued and the Devanagari script,
known as “Balbodh,” was promoted as the standard writing system for Marathi.
There are thousands of Modi documents preserved in South Asia and Europe. The major-
ity of these are in various archives in Maharashtra, while smaller collections are kept in
Denmark and other countries, because of European presence in Tanjore, Pondicherry, and
other regions in South Asia through the 19th century. The earliest extant Modi document
dates from the early 17th century. While the majority of Modi documents are official let-
ters, land records, and other administrative documents, the script was also used in educa-
tion, journalism, and other routine activities before the 1950s. Printing in Modi began in
the early 19th century after Charles Wilkins cut the first metal fonts for the script in Cal-
cutta. Newspapers were published in Modi; primers were produced to teach the script in
schools, and various personal papers and diaries were kept in the script.
Structure. Modi is a Brahmi-based script related to Devanagari. It is written from left-to-
right. In general, the rules for Devanagari rendering also apply to Modi (see Section 12.1,
Devanagari). However, one characteristic feature of Modi is a large number of context-
dependent forms of consonants and vowel-signs. Shaping and glyph substitutions for these
contextual forms are managed in the font.
Vowel Letters. Generally, the distinction between regular and long forms of i and u is not
preserved in Modi. The letter U+11603 modi letter ii may represent both i and ī, and
U+11604 modi letter u may be used for writing both u and ū. The same can be said of the
corresponding dependent vowel signs. Both regular and long forms appear in the Modi
block, because they are attested in documentation about Modi.
The vocalic letters in the range U+11635..U+11638 are included in the encoding, but are
not in modern use, as is the case in other Indic scripts. Modi vocalic r may alternatively be
written as the sequence <U+11628 modi letter ra, U+11632 modi vowel sign ii>, rī.
Vowel letters are encoded atomically in Unicode, even if they can be analyzed visually as
consisting of multiple parts. Table 15-6 shows the letters that can be analyzed, the single
code point that should be used to represent them in text, and the sequence of code points
resulting from analysis that should not be used.
Figure 15-2 illustrates consonant-ra ligatures in Modi:
<U+1160E ka, U+11628 ra> → ka-ra ligature
<U+1160E ka, U+11630 vowel sign aa, U+11628 ra> → kā-ra ligature
<U+1162D sa, U+11628 ra> → sa-ra ligature
<U+1162E ha, U+11639 vowel sign e, U+11628 ra> → he-ra ligature
Unusually, the shape of ra is also influenced at the word level, depending upon the charac-
ters in the preceding syllable. See the last example in Figure 15-2. This influence on the
shape of ra may even occur preceding punctuation; in certain environments, ra following a
danda or double danda is written using a special contextual form. For example:
U+11642 double danda + U+11628 ra → ra in its special contextual form
To produce this behavior, the danda and double danda characters in the Modi block
should be used instead of the ones in the Devanagari block.
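A small Python sketch (illustrative only; the helper name and the sample string are not from the standard) of preferring the script-specific dandas when preparing Modi text:

    # Map Devanagari danda/double danda to the Modi-block equivalents so that a
    # font can select the special contextual form of a following ra.
    MODI_DANDAS = {"\u0964": "\U00011641",   # danda        -> U+11641 MODI DANDA
                   "\u0965": "\U00011642"}   # double danda -> U+11642 MODI DOUBLE DANDA

    def prefer_modi_dandas(text):
        """Replace Devanagari dandas with their Modi counterparts (sketch)."""
        return "".join(MODI_DANDAS.get(ch, ch) for ch in text)

    sample = "\U00011628\u0965\U00011628"    # ra, Devanagari double danda, ra
    assert prefer_modi_dandas(sample) == "\U00011628\U00011642\U00011628"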
Punctuation and Word Boundaries. Traditionally, word boundaries are not marked in
Modi because it is an administrative script, characterized by the practice of rapid writing
without lifting the pen. Paragraph and other section boundaries are, however, indicated in
some Modi documents through the use of whitespace. Modern practice uses spaces and
various punctuation conventions, including danda and Western punctuation marks. Some
printed books use a period instead of a danda to indicate a sentence boundary.
Various Signs. Nasalization is indicated by U+1163D modi sign anusvara, and abbrevia-
tions are indicated using U+11643 modi abbreviation sign. U+1163E modi sign vis-
arga represents an allophone of ra or sa at word-final position in Sanskrit orthography.
U+11640 modi sign ardhacandra is used for transcribing sounds used in English names
and loanwords.
U+11644 modi sign huva is written as an invocation in several Modi documents. It is
derived from the Arabic huwa.
Currency values are written using U+A838 north indic rupee mark.
Numbers. Modi has a full set of decimal digits. Several number forms and unit marks are
used for writing Modi and are represented using characters in the Common Indic Number
Forms block. They include the base-16 fraction signs U+A830..U+A835. The absence of
intermediate units is indicated by U+A837 north indic placeholder mark, which is
called ali in Marathi. U+A836 north indic quarter mark is used for representing anna
values.
15.12 Nandinagari
Nandinagari: U+119A0–U+119FF
Nandinagari is a Brahmi-based script that was used in southern India between the 11th
and 19th centuries for manuscripts and inscriptions in Sanskrit in south Maharashtra, Kar-
nataka and Andhra Pradesh. It is related to Devanagari, and was the official script of the
Vijayanagara kingdom of southern India (1336–1646 ce). There are numerous manu-
scripts and inscriptions containing Nandinagari text. This script was also used for writing
Kannada in Karnataka.
Structure. With minor historical exceptions, Nandinagari is an abugida written from left to
right, in which a consonant letter carries an inherent vowel (usually the sound /a/), similar to
Devanagari. The absence of the inherent vowel is frequently marked with a virama. The
virama sign that suppresses the inherent vowel of the consonant is a combining character.
Headstrokes. These are an inherent feature of Nandinagari letters, but their behavior dif-
fers from headstrokes in modern Devanagari. Headstroke connections in Nandinagari
generally are restricted to an aksara (orthographic syllable) and do not extend to neighbor-
ing syllables. The headstroke connects vowel or consonant letters and spacing dependent
vowels of an aksara, while spaces separate individual aksaras.
Vowels. There are 12 vowel letters in the range U+119A0..U+119AD and 11 dependent
vowel signs in the range U+119D1..U+119DD. U+119D2 nandinagari vowel sign i is
positioned at the top-left edge of letters that have headstrokes. For other letters U+119D2
hangs above the top-left portion of the body. However, the style of writing the sign varies
considerably, particularly in handwriting.
Consonants. There are 35 consonant letters. U+119D0 nandinagari letter rra appears
to have been introduced in the 11th century for transcribing the Kannada letter RRA, and
is not part of the traditional repertoire of Nandinagari.
Virama. U+119E0 nandinagari sign virama has two functions, similar to the corre-
sponding Devanagari character. Used as a halanta, it marks the absence of the inherent
vowel of a consonant letter. U+119E0 is also a format character used to produce conjuncts.
Vowel Modifiers. U+119DE nandinagari sign anusvara indicates nasalization. It is
placed to the right of a base letter or right-side vowel sign. U+119DF nandinagari sign
visarga represents post-vocalic aspiration in words of Sanskrit origin.
Other Signs. U+119E1 nandinagari sign avagraha marks the elision of word-initial a in
Sanskrit as a result of sandhi. The auspicious sign U+119E2 nandinagari sign siddham
indicates an invocation at the beginning of documents.
Punctuation. U+119E3 nandinagari headstroke is used as a sign of spacing or joining a
word. It may connect a word that is broken on account of imperfections on a writing sur-
face. U+119E3 can also serve as a gap filler. Nandinagari uses the danda and double danda
marks encoded in the Devanagari block.
Digits. The Nandinagari digits are glyph variants of the Kannada digits U+0CE6..U+0CEF.
No script-specific digits are encoded for Nandinagari.
15.13 Grantha
Grantha: U+11300–U+1137F
The Grantha script descends from Brahmi. The modern form is chiefly used to write the
Sanskrit language, including Vedic Sanskrit. It is used primarily in Tamil Nadu, and to a
lesser extent in Sri Lanka and other parts of South India.
The Grantha script is frequently mixed with the Tamil script to write Sanskrit words.
Grantha has also been used to write the Sanskrit words of Tamil Manipravalam—a mixed
Sanskrit-Tamil language—though this usage has become rare. In addition, Grantha char-
acters may occasionally be employed with the Tamil script in the writing systems of
minority languages of southern India.
Historically, intermediate forms which gave rise to the Grantha script are attested as of the
fourth century ce. The earliest examples are found in inscriptions of the early Pallava kings
who ruled over parts of what is currently northern Tamil Nadu and southern Andhra
Pradesh. Modern Grantha, which this encoding represents, belongs to the period after the
thirteenth century ce.
Modern Grantha is frequently used by Tamil speakers to represent Sanskrit because
Grantha’s large set of letters can represent all the sounds of Sanskrit without the use of dia-
critical marks. The Tamil script has a smaller repertoire of letters that requires diacritical
marks to represent Sanskrit directly. This use of diacritical marks often leads to confusion
regarding the pronunciation of Sanskrit when written in the Tamil script.
Rendering Grantha
Although the Grantha script is visually similar to Tamil, its structure is similar to other
Indic scripts that are used to write Sanskrit. Written Sanskrit requires support for stacked
consonant structures.
Consonant Clusters. Some consonant clusters are stacks, some consonant structures are a
combination of ligatures and stacks, and some are just ligatures. Ligatures are often used
instead of stacks, and consonant clusters are frequently written as a combination of liga-
tures and stacking.
The typical stack height found in print in non-Vedic Sanskrit is two elements, but it is three
in Vedic Sanskrit. Stacks, like ligatures, are equivalent to single consonants for the purpose
of application of vowel signs.
Instances requiring more than three elements in a stack require special handling. In these
cases, the initial elements are pushed out of the consonant stack and may form their own
stacks. Such special cases are illustrated in Figure 15-3. In this situation, a single phonolog-
ical consonant cluster followed by a vowel may be represented by more than one
orthographic cluster.
Virama. Grantha follows the same virama model as Telugu and Kannada, in which the
sequence consonant + virama should be rendered as the vowelless form of the consonant in
the desired orthographic style. For example, in the prevalent orthographic style used in
modern printing, ta, na, and ma consistently fuse with the virama; ra and la superficially
connect with it, and the virama stands apart for all other consonants, as shown in
Table 15-7.
These visual distinctions in the rendering of explicit viramas also apply to the various
ligated conjuncts of Grantha.
Vowels. There are two forms of the au vowel sign: U+11357 grantha au length mark is
the modern one-part form, while the two-part form U+1134C grantha vowel sign au, is
somewhat archaic, but is found in manuscripts.
Only two vowel signs touch their base consonant in printed Grantha: U+1133F grantha
vowel sign i and U+11340 grantha vowel sign ii. U+11347 grantha vowel sign ee
and U+11348 grantha vowel sign ai are rendered to the left of their base. U+1134B
grantha vowel sign oo and the archaic U+1134C grantha vowel sign au are two-part
vowels with one part placed to the left of the base and one part to the right. All other vowel
signs are placed to the right of the base.
Manuscripts written in Grantha will show archaic ligatures of consonants with vowel signs.
The vowel signs U+11362 grantha vowel sign vocalic l and U+11363 grantha
vowel sign vocalic ll are sometimes placed below and sometimes placed to the right of
the base consonant. In contemporary printing practice, vowel signs are placed to the right.
Signs. Grantha uses the pluta sign to denote vowel lengthening. The pluta is not in current
use, but it is found in Vedic manuscripts. The nukta is not used to write Sanskrit, but is
used to transcribe words from other languages, such as Irula.
Cantillation Marks. Grantha uses a number of cantillation marks to represent tone, stress,
and breathing in Vedic texts. These marks include the twelve marks encoded in the
Grantha block in the range U+11366..U+11374, and many encoded in other blocks as
well, including those listed in Table 15-8.
These nonspacing marks are normally applied to independent vowels, to consonants with
an inherent vowel, and to consonants with vowel signs. Sometimes they are also applied to
dead consonants which are displayed with a visible virama.
The preferred placement of svara marks in Grantha is horizontally centered relative to the
syllable. These marks should not extend beyond the horizontal span of the base syllable.
The svara marks can be applied to either syllables or digits, and used in combination with
each other.
Punctuation. Danda and double danda marks used with Grantha are found in the Devana-
gari block; see Section 12.1, Devanagari.
Numbers. Grantha makes use of the Tamil digits U+0BE6 through U+0BEF.
15.14 Ahom
Ahom: U+11700–U+1173F
The Ahom script is used in northeast India, primarily to write the Tai Ahom language. The
oldest surviving Ahom text is the “Snake Pillar” inscription which was inscribed in the time
of King Siuw Hum Miung (1497-1539). The script also appears on other stone inscriptions,
coins, brass plates and a large corpus of manuscripts. Although the use of the Tai Ahom
language declined in the late 17th century, traditional priests used the language and the
Ahom script in their religious practices throughout the 19th century.
Modern use of the Ahom script is considered to have begun in 1920 with the publication of
an Ahom-Assamese-English dictionary. This was followed by publication of other dictio-
naries, word lists, and primers. The publication of Ahom texts has progressed more rapidly
in recent decades, thanks to the availability of computers. Today there are large numbers of
books published in Assam that contain some Ahom content.
Structure. Like most other Brahmi-derived scripts, Ahom is an abugida, for which conso-
nant letters are associated with an inherent vowel “a”. The encoding also includes three
medial consonants, in the range U+1171D..U+1171F, which follow and graphically attach
to an initial consonant letter. In addition, Ahom has a visible virama that functions as a
vowel killer, U+1172B ahom sign killer. The use of the killer is only obligatory in mod-
ern Ahom.
Vowels. Ahom has no independent vowels, but instead uses U+11712 ahom letter a fol-
lowed by the corresponding dependent vowel sign (or signs).
Syllabic Structure. Ahom has closed syllables, and optional medials may occur after initial
consonants. Vowels can occur in sequences of U+11712 ahom letter a and dependent
vowel signs, or a series of dependent vowel signs. Final consonants take U+1172B ahom
sign killer.
Numerals. The original Ahom numeral system was not a decimal radix system; however, in
modern use a digit zero has been added, and the digits can be used to express decimal radix
numerals. In traditional use, the digits may also be mixed with word spellings when writing
out numbers.
The forms of the Ahom digits are derived from several sources. U+11732 ahom digit two
is visually identical to U+11701 ahom letter kha and probably derives from it. The digits
3, 4, and 5 are usually expressed by the Ahom words for those numbers spelled out.
U+1173B ahom number twenty is also just the Ahom word for 20 spelled out.
Punctuation. Ahom uses two punctuation characters which function similarly to dandas:
U+1173C ahom sign small section and U+1173D ahom sign section. The script also
uses a paragraph mark, U+1173E ahom sign rulai, and a symbol that indicates an excla-
mation, U+1173F ahom symbol vi.
Modern Ahom uses spaces to indicate word boundaries. This convention is seen in some
early Ahom manuscripts, but is not consistent in the early material.
Variant Forms. A number of variant letterforms are found in manuscripts, but are no lon-
ger used in modern Ahom. Specific characters are encoded to represent the historic vari-
ants of ta, ga, ba, and the medial ligating ra.
15.16 Dogra
Dogra: U+11800–U+1184F
In the 17th century, the Dogra script was used to write the Dogri language in Jammu and
Kashmir in the northern region of the Indian subcontinent. Dogri is an Indo-Aryan lan-
guage now usually written with the Devanagari script. The Dogra script was standardized
in the 1860s, and is closely related to the Takri script. The official form, known as “Name
Dogra Akkar” or “New Dogra Script,” appears in administrative documents, on currency,
postcards, postage stamps, and in literary works. The unofficial, common written form of
the script is called “Old Dogra.” The glyphs in the code chart are based on New Dogra.
Structure. Dogra is an abugida, based on Brahmi. It is written left to right. The script
includes a virama, U+11838 dogra sign virama, to create conjuncts and to suppress the
inherent vowel.
Vowels. Because the glyphs for Dogra vowel letters changed over time, the phonetic value
of three vowel letters varies between New and Old Dogra. Old Dogra uses U+11802 dogra
vowel letter i for u, U+11803 dogra vowel letter ii for i, and U+11804 dogra vowel
letter u for o and au. The shapes of the vowel signs also vary between Old and New
Dogra. Distinct fonts can be used to reflect the Old Dogra vowel shapes, as opposed to the
New Dogra shapes.
A feature of Dogra is that the dependent vowel may be represented either by the indepen-
dent vowel letter, or by the dependent vowel sign. For example, the syllable ke may be rep-
resented by <ka, e> or by <ka, vowel sign e>.
Characters Used to Represent Sanskrit. U+11831 dogra vowel sign vocalic r and
U+11828 dogra letter ssa are used in New Dogra to represent sounds of Sanskrit origin.
Consonant Conjuncts. Consonant clusters in Dogra may be rendered in different ways.
The most common method is to place a virama beneath each bare consonant. For example,
pra is represented with the sequence <pa, virama, ra>. The second method is to use
half-forms if the graphical structure permits it. For example, the conjunct shra, contained
in the Sanskrit honorific shrii, appears regularly with a half-form of sha and is represented
by the sequence <sha, virama, ra>. The third way to form consonant clusters is with
ligatures. The ligature may be an atomic ligature, such as ksha, which is represented with
the sequence <ka, virama, ssa>, or a combination of two consonants in which the
individual shape of each letter remains visible. For example, sṭa is written with the
sequence <sa, virama, tta>. There is no evidence that special conjunct forms of ra occur.
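The three sequence types above can be written out with a short Python sketch (illustrative only; it looks characters up by name and therefore needs a Python build whose unicodedata tables cover Unicode 11.0, that is, Python 3.7 or later; the helper name is not part of the standard):

    # Build the Dogra conjunct sequences described in the text; each cluster is
    # encoded with U+11838 DOGRA SIGN VIRAMA between the consonants.
    import unicodedata as ud

    def dogra(*names):
        return "".join(ud.lookup("DOGRA " + n) for n in names)

    pra  = dogra("LETTER PA", "SIGN VIRAMA", "LETTER RA")    # pra  <pa, virama, ra>
    shra = dogra("LETTER SHA", "SIGN VIRAMA", "LETTER RA")   # shra <sha, virama, ra>
    ksha = dogra("LETTER KA", "SIGN VIRAMA", "LETTER SSA")   # ksha <ka, virama, ssa>
    stta = dogra("LETTER SA", "SIGN VIRAMA", "LETTER TTA")   # sṭa  <sa, virama, tta>
    print([f"U+{ord(c):04X}" for c in pra])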
Other Symbols. U+11836 dogra sign anusvara indicates nasalization, and U+11837
dogra sign visarga indicates post-vocalic aspiration in words of Sanskrit origin, while
U+11839 dogra sign nukta is used to transcribe sounds that are not native to the Dogri
language.
Chapter 16
Southeast Asia
This chapter documents the following scripts of Southeast Asia: Thai, Lao, Myanmar,
Khmer, Tai Le, New Tai Lue, Tai Tham, Tai Viet, Kayah Li, Cham, Pahawh Hmong,
Nyiakeng Puachue Hmong, Pau Cin Hau, and Hanifi Rohingya.
The scripts of Southeast Asia are written from left to right; many use no interword spacing
but use spaces or marks between phrases. They are mostly abugidas, but with various idio-
syncrasies that distinguish them from the scripts of South Asia.
Thai and Lao are the official scripts of Thailand and Laos, respectively, and are closely
related. These scripts are unusual for Brahmi-derived scripts in the Unicode Standard,
because for various implementation reasons they depart from logical order in the represen-
tation of consonant-vowel sequences. Vowels that occur to the left side of their consonant
are represented in visual order before the consonant in a string, even though they are pro-
nounced afterward.
Myanmar is the official script of Myanmar, and is used to write the Burmese language, as
well as many minority languages of Myanmar and Northern Thailand. It has a mixed
encoding model, making use of both a virama and a killer character, and having explicitly
encoded medial consonants.
The Khmer script is used for the Khmer and related languages in the Kingdom of Cambo-
dia.
The term “Tai” refers to a family of languages spoken in Southeast Asia, including Thai,
Lao, and Shan. This term is also part of the name of a number of scripts encoded in the
Unicode Standard. The Tai Le script is used to write the language of the same name, which
is spoken in south central Yunnan (China). The New Tai Lue script, also known as Xish-
uangbanna Dai, is unrelated to the Tai Le script, but is also used in south Yunnan. New Tai
Lue is a simplified form of the more traditional Tai Tham script, which is also known as
Lanna. The Tai Tham script is used for the Northern Thai, Tai Lue, and Khün languages.
The Tai Viet script is used for the Tai Dam, Tai Dón, and Thai Song languages of north-
western Vietnam, northern Laos, and central Thailand. Unlike the other Tai scripts, the
Tai Viet script makes use of a visual order model, similar to that for the Thai and Lao
scripts.
Kayah Li is a relatively recently invented script, used to write the Kayah Li languages of
Myanmar and Thailand. Although influenced by the Myanmar script, Kayah Li is basically
an alphabet in structure.
Cham is a Brahmi-derived script used by the Austronesian language Cham, spoken in the
southern part of Vietnam and in Cambodia. It does not use a virama. Instead, the encoding
makes use of medial consonant signs and explicitly encoded final consonants.
Pahawh Hmong is an alphabetic script devised for writing the Hmong language in the lat-
ter half of the 20th century. Its development includes several revisions. The script is used by
Hmong communities in several countries, including the United States and Australia.
Nyiakeng Puachue Hmong is a writing system created in the 1980s to write the White
Hmong and Green Hmong languages. It is also called the Ntawv Txawjvaag or Chervang
script, and was devised for use in the United Christians Liberty Evangelical church in the
United States. The script is written from left to right, and is reported to be used in Laos,
Thailand, Vietnam, France and Australia.
The Pau Cin Hau alphabet is a liturgical script of the Laipian religious tradition, which
emerged in the Chin Hills region of present-day Chin State, Myanmar at the turn of the
20th century.
Hanifi Rohingya is an alphabetic script used to write the Rohingya language, an Indo-
Aryan language spoken by one million people primarily in Myanmar and Bangladesh. The
script was developed in the 1980s and shows Arabic influence in its general appearance and
structure.
16.1 Thai
Thai: U+0E00–U+0E7F
The Thai script is used to write Thai and other Southeast Asian languages, such as Kuy,
Lanna Tai, and Pali. It is a member of the Indic family of scripts descended from Brahmi.
Thai modifies the original Brahmi letter shapes and extends the number of letters to
accommodate features of the Thai language, including tone marks derived from super-
script digits. At the same time, the Thai script lacks the conjunct consonant mechanism
and independent vowel letters found in most other Brahmi-derived scripts. As in all scripts
of this family, the predominant writing direction is from left to right.
Standards. Thai layout in the Unicode Standard is based on the Thai Industrial Standard
620-2529, and its updated version 620-2533.
Encoding Principles. In common with most Brahmi-derived scripts, each Thai consonant
letter represents a syllable possessing an inherent vowel sound. For Thai, that inherent
vowel is /o/ in the medial position and /a/ in the final position.
The consonants are divided into classes that historically represented distinct sounds, but in
modern Thai indicate tonal differences. The inherent vowel and tone of a syllable are then
modified by addition of vowel signs and tone marks attached to the base consonant letter.
Some of the vowel signs and all of the tone marks are rendered in the script as diacritics
attached above or below the base consonant. These combining signs and marks are
encoded after the modified consonant in the memory representation.
Most of the Thai vowel signs are rendered by full letter-sized inline glyphs placed either
before (that is, to the left of), after (to the right of), or around (on both sides of) the glyph
for the base consonant letter. In the Thai encoding, the letter-sized glyphs that are placed
before (left of) the base consonant letter, in full or partial representation of a vowel sign,
are, in fact, encoded as separate characters that are typed and stored before the base conso-
nant character. This encoding for left-side Thai vowel sign glyphs (and similarly in Lao and
in Tai Viet) differs from the conventions for all other Indic scripts, which uniformly encode
all vowels after the base consonant. The difference is necessitated by the encoding practice
commonly employed with Thai character data as represented by the Thai Industrial Stan-
dard.
The glyph positions for Thai syllables are summarized in Table 16-1.
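As an illustration (not drawn from the standard's data files), the following Python fragment spells a syllable with the left-side vowel sara e, which is stored before its consonant:

    # Thai visual-order storage: the left-side vowel is encoded first, even
    # though it is pronounced after the consonant.
    import unicodedata

    syllable = "\u0E40\u0E01"     # <U+0E40 SARA E, U+0E01 KO KAI>
    print([unicodedata.name(c) for c in syllable])
    # ['THAI CHARACTER SARA E', 'THAI CHARACTER KO KAI']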
Rendering of Thai Combining Marks. The canonical combining classes assigned to tone
marks (ccc = 107) and to other combining characters displayed above (ccc = 0) do not fully
account for their typographic interaction.
For the purpose of rendering, the Thai combining marks above (U+0E31,
U+0E34..U+0E37, U+0E47..U+0E4E) should be displayed outward from the base charac-
ter they modify, in the order in which they appear in the text. In particular, a sequence con-
taining <U+0E48 thai character mai ek, U+0E4D thai character nikhahit> should
be displayed with the nikhahit above the mai ek, and a sequence containing <U+0E4D
thai character nikhahit, U+0E48 thai character mai ek> should be displayed with
the mai ek above the nikhahit.
This does not preclude input processors from helping the user by pointing out or correct-
ing typing mistakes, perhaps taking into account the language. For example, because the
string <mai ek, nikhahit> is not useful for the Thai language and is likely a typing mistake,
an input processor could reject it or correct it to <nikhahit, mai ek>.
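The distinctness of the two orders can be verified with a small Python check (illustrative only; ko kai serves as an arbitrary base consonant):

    # The two storage orders of mai ek (ccc 107) and nikhahit (ccc 0) are not
    # canonically equivalent, so a renderer must honor the stored order and
    # stack the marks outward from the base.
    import unicodedata

    a = "\u0E01\u0E48\u0E4D"      # ko kai + mai ek + nikhahit
    b = "\u0E01\u0E4D\u0E48"      # ko kai + nikhahit + mai ek
    print(unicodedata.combining("\u0E48"), unicodedata.combining("\u0E4D"))     # 107 0
    print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))   # False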
When the character U+0E33 thai character sara am follows one or more tone marks
(U+0E48..U+0E4B), the nikhahit that is part of the sara am should be displayed below those
tone marks. In particular, a sequence containing <U+0E48 thai character mai ek,
U+0E33 thai character sara am> should be displayed with the mai ek above the nikhahit.
Thai Punctuation. Thai uses a variety of punctuation marks particular to this script.
U+0E4F thai character fongman is the Thai bullet, which is used to mark items in lists
or appears at the beginning of a verse, sentence, paragraph, or other textual segment.
U+0E46 thai character maiyamok is used to mark repetition of preceding letters.
U+0E2F thai character paiyannoi is used to indicate elision or abbreviation of letters;
it is itself viewed as a kind of letter, however, and is used with considerable frequency
because of its appearance in such words as the Thai name for Bangkok. Paiyannoi is also
used in combination (U+0E2F U+0E25 U+0E2F) to create a construct called paiyanyai,
which means “et cetera, and so forth.” The Thai paiyanyai is comparable to its analogue in
the Khmer script: U+17D8 khmer sign beyyal.
U+0E5A thai character angkhankhu is used to mark the end of a long segment of text.
It can be combined with a following U+0E30 thai character sara a to mark a larger seg-
ment of text; typically this usage can be seen at the end of a verse in poetry. U+0E5B thai
character khomut marks the end of a chapter or document, where it always follows the
angkhankhu + sara a combination. The Thai angkhankhu and its combination with sara a
to mark breaks in text have analogues in many other Brahmi-derived scripts. For example,
they are closely related to U+17D4 khmer sign khan and U+17D5 khmer sign
bariyoosan, which are themselves ultimately related to the danda and double danda of
Devanagari.
Spacing. Thai words are not separated by spaces. Instead, text is laid out with spaces intro-
duced at text segments where Western typography would typically make use of commas or
periods. However, Latin-based punctuation such as comma, period, and colon are also
used in text, particularly in conjunction with Latin letters or in formatting numbers,
addresses, and so forth. If explicit word break or line break opportunities are desired—for
example, for the use of automatic line layout algorithms—the character U+200B zero
width space should be used to place invisible marks for such breaks. The zero width
space can grow to have a visible width when justified. See Table 23-2.
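A minimal sketch, assuming the text has already been segmented into words by some external process (the helper name and the sample Thai words are illustrative, not taken from the standard):

    # Join pre-segmented Thai words with U+200B ZERO WIDTH SPACE so that
    # generic line-breaking algorithms find break opportunities.
    ZWSP = "\u200B"

    def add_break_opportunities(words):
        return ZWSP.join(words)

    print(add_break_opportunities(["\u0E44\u0E17\u0E22", "\u0E43\u0E0A\u0E49"]))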
Thai Transcription of Pali and Sanskrit. The Thai script is frequently used to write Pali
and Sanskrit. When so used, consonant clusters are represented by the explicit use of
U+0E3A thai character phinthu (virama) to mark the removal of the inherent vowel.
There is no conjoining behavior, unlike in other Indic scripts. U+0E4D thai character
nikhahit is the Pali nigghahita and Sanskrit anusvara. U+0E30 thai character sara a
is the Sanskrit visarga. U+0E24 thai character ru and U+0E26 thai character lu are
vocalic /r/ and /l/, with U+0E45 thai character lakkhangyao used to indicate their
lengthening.
Patani Malay. The Patani Malay orthography makes use of additional diacritics. A line
below a consonant indicates that its sound differs from Thai. The line below is represented
using U+0331 combining macron below. Nasalization is indicated by U+0303 combin-
ing tilde. Glottalization is marked with the character U+02BC modifier letter apos-
trophe. The character U+02D7 modifier letter minus sign indicates an elision
between two vowel sequences.
In Thai script, use of marks from the Combining Diacritical Marks block, such as U+0331
combining macron below and U+0303 combining tilde, imposes additional con-
straints for rendering systems. This is because the canonical ordering of these marks with
respect to Thai vowels and tone marks may put them in an order that requires rearrange-
ment during rendering.
In particular, when used as a consonant diacritic, U+0331 combining macron below can
occur with the vowel signs U+0E38 thai character sara u or U+0E39 thai character
sara uu. These vowel signs have a fixed-position canonical combining class of 103. A char-
acter sequence would normally be entered in the order consonant + macron below + vowel
sign. However, in normalized text, these combining marks would be re-ordered, resulting
in a sequence consonant + vowel sign + macron below. Thai rendering implementations
must ensure that the vowel signs sara u and sara uu are bound less closely to the conso-
nant letter than consonant diacritics are. In other words, even when rendering normalized
text, sara u and sara uu must appear below combining macron below, and not vice versa.
Likewise, Thai tone marks U+0E48..U+0E4B have a fixed-position canonical combining
class of 107. If a combining mark such as U+0303 combining tilde is used as a vowel sign,
then it can potentially occur with the tone marks. Characters would likely be entered in the
order consonant + tilde + tone, but in normalized text these would be reordered as conso-
nant + tone + tilde. Thai rendering implementations must ensure that the tone marks dis-
play above the combining tilde.
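Both reorderings can be observed directly with Python's unicodedata (an illustration; thai character bo baimai is used as an arbitrary base consonant):

    # Canonical ordering places the fixed-position Thai signs (ccc 103, 107)
    # before U+0331 (ccc 220) and U+0303 (ccc 230), so normalization moves the
    # consonant diacritics after the vowel sign or tone mark.
    import unicodedata

    typed = "\u0E1A\u0331\u0E38"                 # bo baimai + macron below + sara u
    print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFC", typed)])
    # ['U+0E1A', 'U+0E38', 'U+0331']

    typed = "\u0E1A\u0303\u0E48"                 # bo baimai + tilde + mai ek
    print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFC", typed)])
    # ['U+0E1A', 'U+0E48', 'U+0303']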
16.2 Lao
Lao: U+0E80–U+0EFF
The Lao language and script are closely related to Thai. The Unicode Standard encodes the
characters of the Lao script in roughly the same relative order as the Thai characters.
Encoding Principles. Lao contains fewer letters than Thai because by 1960 it was simpli-
fied to be fairly phonemic, whereas Thai maintains many etymological spellings that are
homonyms. Unlike in Thai, Lao consonant letters are conceived of as simply representing
the consonant sound, rather than a syllable with an inherent vowel. The vowel [a] is always
represented explicitly with U+0EB0 lao vowel sign a.
Punctuation. Regular word spacing is not used in Lao; spaces separate phrases or sen-
tences instead.
Glyph Placement. The glyph placements for Lao syllables are summarized in Table 16-2.
When the character U+0EB3 lao vowel sign am follows one or more tone marks
(U+0EC8..U+0ECB), the niggahita that is part of the sara am should be displayed below
those tone marks. In particular, a sequence containing <U+0EC8 lao tone mai ek,
U+0EB3 lao vowel sign am> should be displayed with the mai ek above the niggahita.
Lao Aspirated Nasals. The Unicode character encoding includes two ligatures for Lao:
U+0EDC lao ho no and U+0EDD lao ho mo. They correspond to sequences of [h] plus
[n] or [h] plus [m] without ligating. Their function in Lao is to provide versions of the [n]
and [m] consonants with a different inherent tonal implication.
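In the Unicode Character Database the two digraphs carry compatibility decompositions to ho sung plus no or mo, which can be checked with Python's unicodedata (an illustration):

    # The Lao digraphs decompose (compatibility) to ho sung + no / mo.
    import unicodedata

    for ch in ("\u0EDC", "\u0EDD"):              # LAO HO NO, LAO HO MO
        print(unicodedata.name(ch),
              [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFKD", ch)])
    # LAO HO NO ['U+0EAB', 'U+0E99']
    # LAO HO MO ['U+0EAB', 'U+0EA1']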
Transcription of Pali and Sanskrit. Traditionally the Lao script is not used to write Pali
and Sanskrit. The Lao consonant repertoire originally contained only the letters needed by
the modern Lao language. An extended writing system was designed in the 1930s by Maha
Sila Viravong to transcribe consonant clusters and additional consonants of Pali. The addi-
tional characters required by the extension are listed in Table 16-3.
U+0EBA lao sign pali virama marks the removal of the inherent vowel of a consonant
letter, and does not indicate conjoining or stacking behavior. U+0EA8 lao letter san-
skrit sha and U+0EA9 lao letter sanskrit ssa are used only in Sanskrit.
Implementations should not assume transliteration mappings or a cognate relationship
between all Lao and Thai characters based on their relative locations in the blocks. For
example, Pali nya, a cognate of U+0E0D thai character yo ying, is encoded at U+0E8E
instead of the corresponding location U+0E8D because the latter is already occupied by
Lao nyo, a phonetically related non-cognate of Thai yo ying.
16.3 Myanmar
Myanmar: U+1000–U+109F
The Myanmar script is used to write Burmese, the majority language of Myanmar (for-
merly called Burma). Variations and extensions of the script are used to write other lan-
guages of the region, such as Mon, Karen, Kayah, Shan, and Palaung, as well as Pali and
Sanskrit. The Myanmar script was formerly known as the Burmese script, but the term
“Myanmar” is now preferred.
The Myanmar writing system derives from a Brahmi-related script borrowed from South
India in about the eighth century to write the Mon language. The first inscription in the
Myanmar script dates from the eleventh century and uses an alphabet almost identical to
that of the Mon inscriptions. Aside from rounding of the originally square characters, this
script has remained largely unchanged to the present. It is said that the rounder forms were
developed to permit writing on palm leaves without tearing the writing surface of the leaf.
The Myanmar script shares structural features with other Brahmi-based scripts such as
Khmer: consonant symbols include an inherent “a” vowel; various signs are attached to a
consonant to indicate a different vowel; medial consonants are attached to the consonant;
and the overall writing direction is from left to right.
Standards. There is not yet an official national standard for the encoding of Myanmar/
Burmese. The current encoding was prepared with the consultation of experts from the
Myanmar Information Technology Standardization Committee (MITSC) in Yangon
(Rangoon). The MITSC, formed by the government in 1997, consists of experts from the
Myanmar Computer Scientists’ Association, Myanmar Language Commission, and Myan-
mar Historical Commission.
Encoding Principles. As with Indic scripts, the Myanmar encoding represents only the
basic underlying characters; multiple glyphs and rendering transformations are required
to assemble the final visual form for each syllable. Characters and combinations that may
appear visually identical in some fonts, such as U+101D myanmar letter wa and
U+1040 myanmar digit zero, are distinguished by their underlying encoding.
Composite Characters. As is the case in many other scripts, some Myanmar letters or signs
may be analyzed as composites of two or more other characters and are not encoded sepa-
rately. The following are three examples of Myanmar letters represented by combining
character sequences:
U+1000 ka + U+1031 vowel sign e + U+102C vowel sign aa → /kàw/
U+1000 ka + U+1031 vowel sign e + U+102C vowel sign aa +
U+103A asat → /kaw/
U+1000 ka + U+102D vowel sign i + U+102F vowel sign u → /ko/
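Written as Python string literals (a sketch; the glosses repeat those given above), the three sequences are:

    kaw_long  = "\u1000\u1031\u102C"        # ka + vowel sign e + vowel sign aa          /kàw/
    kaw_short = "\u1000\u1031\u102C\u103A"  # ka + vowel sign e + vowel sign aa + asat   /kaw/
    ko        = "\u1000\u102D\u102F"        # ka + vowel sign i + vowel sign u           /ko/

    for word in (kaw_long, kaw_short, ko):
        print([f"U+{ord(c):04X}" for c in word])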
Encoding Subranges. The basic consonants, medials, independent vowels, and dependent
vowel signs required for writing the Myanmar language are encoded at the beginning of
the Myanmar block. Those are followed by script-specific digits, punctuation, and various
signs. The last part of the block contains extensions for consonants, medials, vowels, and
tone marks needed to represent historic text and various other languages. These extensions
support Pali and Sanskrit, as well as letters and tone marks for Mon, Karen, Kayah, and
Shan. The extensions include two tone marks for Khamti Shan and two vowel signs for
Aiton and Phake, but the majority of the additional characters needed to support those lan-
guages are found in the Myanmar Extended-A block.
Conjuncts. As in other Indic-derived scripts, conjunction of two consonant letters is indi-
cated by the insertion of a virama, U+1039 myanmar sign virama, between them. It
causes the second consonant to be displayed in a smaller form below the first; the virama is
not visibly rendered.
Kinzi. The conjunct form of U+1004 myanmar letter nga is rendered as a superscript
sign called kinzi. That superscript sign is not encoded as a separate mark, but instead is
simply the rendering form of the nga in a conjunct context. The nga is represented in logi-
cal order first in the sequence, before the consonant which actually bears the visible kinzi
superscript sign in final rendered form. For example, kinzi applied to U+1000 myan-
mar letter ka would be written via the following sequence:
U+1004 nga + U+103A asat + U+1039 virama + U+1000 ka
→ ka bearing the kinzi superscript sign
Note that this sequence includes both U+103A asat and U+1039 virama between the nga
and the ka. Use of the virama alone would ordinarily indicate stacking of the consonants,
with a small ka appearing under the nga. Use of the asat killer in addition to the virama
gives a sequence that can be distinguished from normal stacking: the sequence <U+1004,
U+103A, U+1039> always maps unambiguously to a visible kinzi superscript sign on the
following consonant.
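As a Python sketch (code points as given above), the kinzi representation and plain stacking differ only by the asat:

    kinzi_ka = "\u1004\u103A\u1039\u1000"   # nga + asat + virama + ka: kinzi over ka
    stacked  = "\u1004\u1039\u1000"         # nga + virama + ka: small ka under nga
    print(kinzi_ka != stacked)              # distinct encoded sequences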
Medial Consonants. The Myanmar script traditionally distinguishes a set of “medial” con-
sonants: forms of ya, ra, wa, and ha that are considered to be modifiers of the syllable’s
vowel. Graphically, these medial consonants are sometimes written as subscripts, but
sometimes, as in the case of ra, they surround the base consonant instead. In the Myanmar
encoding, the medial consonants are encoded separately. For example, the word
[kjwei] (“to drop off”) would be written via the following sequence:
U+1000 ka + U+103C medial ra + U+103D medial wa + U+1031
vowel sign e → /kjwei/
In Pali and Sanskrit texts written in the Myanmar script, as well as in older orthographies
of Burmese, the consonants ya, ra, wa, and ha are sometimes rendered in subjoined form.
In those cases, U+1039 myanmar sign virama and the regular form of the consonant
are used.
Asat. The asat, or killer, is a visibly displayed sign. In some cases it indicates that the inher-
ent vowel sound of a consonant letter is suppressed. In other cases it combines with other
characters to form a vowel letter. Regardless of its function, this visible sign is always repre-
sented by the character U+103A myanmar sign asat.
Contractions. In a few Myanmar words, the repetition of a consonant sound is written
with a single occurrence of the letter for the consonant sound together with an asat sign.
This asat sign occurs immediately after the double-acting consonant in the coded repre-
sentation:
U+101A ya + U+1031 vowel sign e + U+102C vowel sign aa +
U+1000 ka + U+103A asat + U+103B medial ya + U+102C
vowel sign aa + U+1038 visarga → “man, husband”
Great sa. The great sa is encoded as U+103F myanmar letter great sa. This letter
should be represented with <U+103F>, while the sequence <U+101E, U+1039, U+101E>
should be used for the regular conjunct form of two sa, and the sequence <U+101E,
U+103A, U+101E> should be used for the form with an asat sign.
Tall aa. Two distinct shapes, a “tall” form and a regular form, are both used to write the
sound /a/. In Burmese orthography, both shapes are used, depending on the visual context.
In S’gaw Karen orthography, only the tall form is used. For this reason, two characters are
encoded: U+102B myanmar vowel sign tall aa and U+102C myanmar vowel sign aa. In
Burmese texts, the coded character appropriate to the visual context should be used.
Ordering of Syllable Components. Dependent vowels and other signs are encoded after
the consonant to which they apply, except for kinzi, which precedes the consonant. Char-
acters occur in the relative order shown in Table 16-4.
Table 16-4 (excerpt):
medial ra     U+103C
medial wa     U+103D
medial ha     U+103E
anusvara      U+1036
visarga       U+1038
U+1031 myanmar vowel sign e is encoded after its consonant (as in the earlier exam-
ple), although in visual presentation its glyph appears before (to the left of) the consonant
form.
Table 16-4 nominally refers to the character sequences used in representing the syllabic
structure of the modern Burmese language proper. Canonical normalization may result in
a different ordering, specifically with some occurrences of U+103A myanmar sign asat
reordered after U+1037 myanmar sign dot below. As such reorderings are canonically
equivalent, implementations should support both orders and treat them as fundamentally
the same text.
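The canonical equivalence of the two orders can be confirmed with Python's unicodedata (an illustration; asat has canonical combining class 9 and dot below has class 7):

    # Both storage orders normalize to the same sequence, so they are
    # canonically equivalent and should be treated as the same text.
    import unicodedata

    a = "\u1000\u103A\u1037"    # ka + asat + dot below
    b = "\u1000\u1037\u103A"    # ka + dot below + asat
    print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))   # True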
Table 16-4 would require further extensions and modifications to cover various other lan-
guages, such as Karen, Mon, Shan, Sanskrit, and Old Burmese, which also use the Myan-
mar script. For some such extensions and modifications, refer to Unicode Technical Note
#11, “Representing Myanmar in Unicode: Details and Examples,” and to Microsoft Typog-
raphy’s “Creating and Supporting OpenType Fonts for Myanmar Script.” Note that those
documents are not normative for the Unicode Standard, and they also differ from each
other in some details.
Spacing. Myanmar does not use any whitespace between words. If explicit word break or
line break opportunities are desired—for example, for the use of automatic line layout algo-
rithms—the character U+200B zero width space should be used to place invisible marks
for such breaks. The zero width space can grow to have a visible width when justified.
Spaces are used to mark phrases. Some phrases are relatively short (two or three syllables).
Khamti Shan
The Khamti Shan language has a long literary tradition which has largely been lost, for a
variety of reasons. The old script did not mark tones, and it had a scribal tradition that
encouraged restriction to a reading elite whose traditions have not been passed on. The
script has recently undergone a revival, with plans for it to be taught throughout the
Khamti-Shan-speaking regions in Myanmar. A new version of the script has been adopted
by the Khamti in Myanmar. The Khamti Shan characters in the Myanmar Extended-A
block supplement those in the Myanmar block and provide complete support for the mod-
ern Khamti Shan writing system as written in Myanmar. Another revision of the old script
was made in India under the leadership of Chau Khouk Manpoong in the 1990s. That revi-
sion has not gained significant popularity, although it enjoys some currency today.
Consonants. Approximately half of the consonants used in Khamti Shan are encoded in
the Myanmar block. Following the conventions used for Shan, Mon, and other extensions
to the Myanmar script, separate consonants are encoded specifically for Khamti Shan in
this block when they differ significantly in shape from corresponding letters conveying the
same consonant sounds in Myanmar proper. Khamti Shan also uses the three Myanmar
medial consonants encoded in the range U+103B..U+103D.
The consonants in this block are displayed in the code charts using a Burmese style, so that
glyphs for the entire Myanmar script are harmonized in a single typeface. However, the
local style preferred for Khamti Shan is slightly different, typically adding a small dot to
each character.
Vowels. The vowels and dependent vowel signs used in Khamti Shan are located in the
Myanmar block.
Tones. Khamti Shan has eight tones. Seven of these are written with explicit tone marks;
one is unmarked. All of the explicit tone marks are encoded in the Myanmar block. Khamti
Shan makes use of four of the Shan tone marks and the visarga. In addition, two Khamti
Shan-specific tone marks are separately encoded. These tone marks for Khamti Shan are
listed in Table 16-5.
The vertical positioning of the small circle in some of these tone marks is considered dis-
tinctive. U+109A myanmar sign khamti tone-1 (with a high position) is not the same as
U+108B myanmar sign shan council tone-2 (with a mid-level position). Neither of
those should be confused with U+1089 myanmar sign shan tone-5 (with a low position).
The tone mark characters in Shan fonts are typically displayed with open circles. However,
in Khamti Shan, the circles in the tone marks normally are filled in (black).
Digits. Khamti Shan uses the Shan digits in the range U+1090..U+1099.
Other Symbols. Khamti Shan uses the punctuation marks U+104A myanmar sign little
section and U+104B myanmar sign section. The repetition mark U+AA70 myanmar
modifier letter khamti reduplication is functionally equivalent to U+0E46 thai
character maiyamok.
Three logogram characters are also used. These logograms can take tone marks, and their
meaning varies according to the tone they take. They are used when transcribing speech
rather than in formal writing. For example, U+AA75 myanmar logogram khamti qn
takes three tones and means “negative,” “giving” or “yes,” according to which tone is
applied. The other two logograms are U+AA74 myanmar logogram khamti oay and
U+AA76 myanmar logogram khamti hm.
Subjoined Characters. Khamti Shan does not use subjoined characters.
Historical Khamti Shan. The characters of historical Khamti Shan are for the most part
identical to those used in the New Khamti Shan orthography. Most variation is merely sty-
listic. There were no Pali characters. The only significant character difference lies with ra—
which follows Aiton and Phake in using a la with medial ra (U+AA7A myanmar letter
aiton ra).
During the development of the New Khamti Shan orthography a few new character shapes
were introduced that were subsequently revised. Because materials have been published
using these shapes, and these shapes cannot be considered stylistic variants of other char-
acters, these characters are separately encoded in the range U+AA71..U+AA73.
16.4 Khmer
Khmer: U+1780–U+17FF
Khmer, also known as Cambodian, is the official language of the Kingdom of Cambodia.
Mutually intelligible dialects are also spoken in northeastern Thailand and in the Mekong
Delta region of Vietnam. Although Khmer is not an Indo-European language, it has bor-
rowed much vocabulary from Sanskrit and Pali, and religious texts in those languages have
been both transliterated and translated into Khmer. The Khmer script is also used to ren-
der a number of regional minority languages, such as Tampuan, Krung, and Cham.
The Khmer script, called aksaa khmae (“Khmer letters”), is also the official script of Cam-
bodia. It is descended from the Brahmi script of South India, as are Thai, Lao, Myanmar,
Old Mon, and others. The exact sources have not been determined, but there is a great sim-
ilarity between the earliest inscriptions in the region and the Pallawa script of the Coro-
mandel coast of India. Khmer has been a unique and independent script for more than
1,400 years. Modern Khmer has two basic styles of script: the aksaa crieng (“slanted
script”) and the aksaa muul (“round script”). There is no fundamental structural difference
between the two. The slanted script (in its “standing” variant) is chosen as representative in
the code charts.
The subscript consonant signs in the Khmer script can be used to denote a final consonant,
although this practice is uncommon.
Examples of subscript consonant signs for a closing consonant follow:
to + aa + nikahit + coeng + ngo “both”
ha + oe + coeng + yo “already”
While these subscript consonant signs are usually attached to a consonant character, they
can also be attached to an independent vowel character. Although this practice is relatively
rare, it is used in one very common word, meaning “to give.”
Examples of subscript consonant signs attached to an independent vowel character follow:
qoo-1 + coeng + yo “to give”
qoo-1 + coeng + mo “exclamation of solemn affirmation”
Subscript Independent Vowel Signs. Some independent vowel characters also have corre-
sponding subscript independent vowel signs, although these are rarely used today.
Examples of subscript independent vowel signs follow:
pha + coeng + qe + mo “sweet” (equivalently pha + coeng + qa + ae + mo)
Consonant Registers. The Khmer language has a richer set of vowels than the languages
for which the ancestral script was used, although it has a smaller set of consonant sounds.
The Khmer script takes advantage of this situation by assigning different characters to rep-
resent the same consonant using different inherent vowels. Khmer consonant characters
and signs are organized into two series or registers, whose inherent vowels are nominally
-a in the first register and -o in the second register, as shown in Table 16-7.
The register of a consonant character is generally reflected on the last letter of its transliter-
ated name. Some consonant characters and signs have a counterpart whose consonant
sound is the same but whose register is different, as ka and ko in the first row of the table.
For the other consonant characters and signs, two “shifter” signs are available. U+17C9
khmer sign muusikatoan converts a consonant character and sign from the second to
the first register, while U+17CA khmer sign triisap converts a consonant from the first
register to the second (rows 2–4). To represent pa, however, muusikatoan is attached not
to po but to ba, in an exceptional use (row 5). The phonetic value of a dependent vowel sign
may also change depending on the context of the consonant(s) to which it is attached (row
6).
Table 16-7 (excerpt):
Row 5: ba + muusikatoan + mo “blockhouse”; po + mo “to put into the mouth”
Row 6: ka + u + ro “to stir”; ko + u + ro “to sketch”
Encoding Principles. Like other related scripts, the Khmer encoding represents only the
basic underlying characters; multiple glyphs and rendering transformations are required
to assemble the final visual form for each orthographic syllable. Individual characters, such
as U+1789 khmer letter nyo, may assume variant forms depending on the other charac-
ters with which they combine.
Subscript Consonant Signs. In the way that many Cambodians analyze Khmer today, sub-
script consonant signs are considered to be different entities from consonant characters.
The Unicode Standard does not assign independent code points for the subscript conso-
nant signs. Instead, each of these signs is represented by the sequence of two characters: a
special control character (U+17D2 khmer sign coeng) and a corresponding consonant
character. This is analogous to the virama model employed for representing conjuncts in
other related scripts. Subscripted independent vowels are encoded in the same manner.
Because the coeng sign character does not exist as a letter or sign in the Khmer script, the
Unicode model departs from the ordinary way that Khmer is conceived of and taught to
native Khmer speakers. Consequently, the encoding may not be intuitive to a native user of
the Khmer writing system, although it is able to represent Khmer correctly.
U+17D2 khmer sign coeng is not actually a coeng but a coeng generator, because coeng
in Khmer refers to the subscript consonant sign. The glyph for U+17D2 khmer sign
coeng shown in the code charts is arbitrary and is not actually rendered directly; the dot-
ted box around the glyph indicates that special rendering is required. To aid Khmer script
users, a listing of typical Khmer subscript consonant letters has been provided in
Table 16-8 together with their descriptive names following preferred Khmer practice.
While the Unicode encoding represents both the subscripts and the combined vowel let-
ters with a pair of code points, they should be treated as a unit for most processing pur-
poses. In other words, the sequence functions as if it had been encoded as a single
character. A number of independent vowels also have subscript forms, as shown in
Table 16-8.
As noted earlier, <U+17D2, U+17A1> represents a subscript form of la that is not used in
Cambodia, although it is employed in Thailand.
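A minimal Python sketch of the coeng model (the helper name is illustrative; the letter code points U+17B1 qoo type one and U+1799 yo are taken from the Khmer code charts rather than from the text above):

    # Each subscript consonant or subscript independent vowel is represented by
    # the two-character sequence <U+17D2 KHMER SIGN COENG, letter>.
    COENG = "\u17D2"

    def coeng(letter):
        return COENG + letter

    aoy = "\u17B1" + coeng("\u1799")   # qoo-1 + coeng + yo: the word "to give"
    subscript_la = coeng("\u17A1")     # <U+17D2, U+17A1>: subscript form of la
    print([f"U+{ord(c):04X}" for c in aoy])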
Dependent Vowel Signs. Most of the Khmer dependent vowel signs are represented with a
single character that is applied after the base consonant character and optional subscript
consonant signs. Three of these Khmer vowel signs are not encoded as single characters in
the Unicode Standard. The vowel sign am is encoded as a nasalization sign, U+17C6
khmer sign nikahit. Two vowel signs, om and aam, have not been assigned independent
code points. They are represented by the sequence of a vowel (U+17BB khmer vowel
sign u and U+17B6 khmer vowel sign aa, respectively) and U+17C6 khmer sign nika-
hit.
The nikahit is superficially similar to anusvara, the nasalization sign in the Devanagari
script, although in Khmer it is usually regarded as a vowel sign am. Anusvara not only rep-
resents a special nasal sound, but also can be used in place of one of the five nasal conso-
nants homorganic to the subsequent consonant (velar, palatal, retroflex, dental, or labial,
respectively). Anusvara can be used concurrently with any vowel sign in the same
orthographic syllable. Nikahit, in contrast, functions differently. Its final sound is [m], irre-
spective of the type of the subsequent consonant. It is not used concurrently with the vow-
els ii, e, ua, oe, oo, and so on, although it is used with the vowel signs aa and u. In these
cases the combination is sometimes regarded as a unit—aam and om, respectively. The
sound that aam represents is not simply that of aa followed by [m]. The sequences used for these combinations
are shown in Table 16-9.
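Expressed as Python literals (a sketch; khmer letter ka is used as an arbitrary base consonant):

    ka  = "\u1780"                 # KHMER LETTER KA
    om  = ka + "\u17BB\u17C6"      # vowel sign u  + nikahit
    aam = ka + "\u17B6\u17C6"      # vowel sign aa + nikahit
    print([f"U+{ord(c):04X}" for c in om], [f"U+{ord(c):04X}" for c in aam])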
Other Signs as Syllabic Components. The Khmer sign robat historically corresponds to
the Devanagari repha, a representation of syllable-initial r-. However, the Khmer script can
treat the initial r- in the same way as the other initial consonants—namely, a consonant
character ro and as many subscript consonant signs as necessary. Some old loan words
from Sanskrit and Pali include robat, but in some of them the robat is not pronounced and
is preserved in a fossilized spelling. Because robat is a distinct sign from the consonant
character ro, the Unicode Standard encodes U+17CC khmer sign robat, but it treats the
Devanagari repha as a part of a ligature without encoding it. The authoritative Chuon Nath
dictionary sorts robat as if it were a base consonant character, just as the repha is sorted in
scripts that use it. The consonant over which robat resides is then sorted as if it were a sub-
script.
Examples of consonant clusters beginning with ro and robat follow:
ro + aa + co + ro + coeng + sa + ii “king hermit”
qa + aa + yo + robat “civilized” (equivalently qa + aa + ro + coeng + yo)
Consonant Shifters. U+17C9 khmer sign muusikatoan and U+17CA khmer sign tri-
isap are consonant shifters, also known as register shifters. In the presence of other super-
script glyphs, both of these signs are usually rendered with the same glyph shape as that of
U+17BB khmer vowel sign u, as shown in the last two examples of Figure 16-3.
Although the consonant shifter in handwriting may be written after the subscript, the con-
sonant shifter should always be encoded immediately following the base consonant, except
when it is preceded by U+200C zero width non-joiner. This provides Khmer with a
fixed order of character placement, making it easier to search for words in a document.
mo + muusikatoan + coeng + ngo + ai  “one day”
mo + triisap + coeng + ha + ae + ta + lek too  “bland”
If either muusikatoan or triisap needs to keep its superscript shape (as an exception to the
general rule that other superscript signs force the alternative, subscript-like glyph for
either character), U+200C zero width non-joiner should be inserted immediately before
the consonant shifter, as shown in the following examples:
ba + zero width non-joiner + triisap + ii + yo + ae + ro  “beer”
ba + coeng + ro + ta + yy + ngo + qa + zero width non-joiner + triisap + y + reahmuk  “urgent, too busy”
ba + coeng + ro + ta + yy + ngo + qa + triisap + y + reahmuk
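A hedged Python sketch of these ordering rules: the consonant shifter is stored directly after the base consonant, and U+200C is inserted before the shifter only when its superscript shape must be retained. The letter ba (U+1794) and the vowel sign ii (U+17B8) are illustrative choices, not taken from the examples above.

    BA = "\u1794"          # KHMER LETTER BA (illustrative)
    TRIISAP = "\u17CA"     # KHMER SIGN TRIISAP (consonant shifter)
    SIGN_II = "\u17B8"     # KHMER VOWEL SIGN II (illustrative above-base vowel)
    ZWNJ = "\u200C"        # ZERO WIDTH NON-JOINER

    # Default encoding: shifter immediately after the base consonant; with the
    # above-base vowel it is normally rendered with the subscript (u-like) glyph.
    default_form = BA + TRIISAP + SIGN_II

    # With ZWNJ before the shifter, the shifter keeps its superscript glyph.
    superscript_form = BA + ZWNJ + TRIISAP + SIGN_II

    print([f"U+{ord(c):04X}" for c in superscript_form])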
Ligature Control. In the aksaa muul font style, some vowel signs ligate with the consonant
characters to which they are applied. The font tables should determine whether they form
a ligature; ligature use in muul fonts does not affect the meaning. However, U+200C zero
width non-joiner may be inserted before the vowel sign to explicitly suppress such a lig-
ature, as shown in Figure 16-4 for the word “savant,” pronounced [vitu:].
Spacing. Khmer does not use whitespace between words, although it does use whitespace
between clauses and between parts of a name. If word boundary indications are desired—
for example, as part of automatic line layout algorithms—the character U+200B zero
width space should be used to place invisible marks for such breaks. The zero width
space can grow to have a visible width when justified. See Table 23-2.
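A minimal sketch, in Python, of marking word boundaries with U+200B zero width space as described above. The break positions would come from a dictionary-based Khmer word segmenter, which is assumed here and not shown.

    ZWSP = "\u200B"

    def insert_word_breaks(text, break_positions):
        """Insert ZERO WIDTH SPACE at the given character offsets."""
        out, prev = [], 0
        for pos in break_positions:
            out.append(text[prev:pos])
            out.append(ZWSP)
            prev = pos
        out.append(text[prev:])
        return "".join(out)

    sample = "\u1780\u17B6\u179A" * 2                    # illustrative Khmer text
    print(insert_word_breaks(sample, [3]).count(ZWSP))   # -> 1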
16.5 Tai Le
Tai Le: U+1950–U+197F
The Tai Le script has a history of 700–800 years, during which time several orthographic
conventions were used. The modern form of the script was developed in the years follow-
ing 1954; it rationalized the older system and added a systematic representation of tones
with the use of combining diacritics. The new system was revised again in 1988, when spac-
ing tone marks were introduced to replace the combining diacritics. The Unicode encod-
ing of Tai Le handles both the modern form of the script and its more recent revision.
The Tai Le language is also known as Tai Nüa, Dehong Dai, Tai Mau, Tai Kong, and Chi-
nese Shan. Tai Le is a transliteration of the indigenous designation, [tai2 lə6]. The
modern Tai Le orthographies are straightforward: initial
consonants precede vowels, vowels precede final consonants, and tone marks, if any, fol-
low the entire syllable. There is a one-to-one correspondence between the tone mark letters
now used and existing nonspacing marks in the Unicode Standard. The tone mark is the
last character in a syllable string in both orthographies. When one of the combining dia-
critics follows one of the tall letters, it is displayed to the right of the letter, as shown in
Table 16-11.
Digits. In China, European digits (U+0030..U+0039) are mainly used, although Myanmar
digits (U+1040..U+1049) are also used with slight glyph variants. Note the differences, in
particular, for the digits 2, 6, 8, and 9, as shown in Table 16-12.
Punctuation. Both CJK punctuation and Western punctuation are used. Typographically,
European digits are about the same height and depth as the tall characters. In some
fonts, the baseline for punctuation is the depth of those characters.
16.6 New Tai Lue
Implementers should note that the visual order model for New Tai Lue was formally intro-
duced as of Unicode 8.0. When New Tai Lue was added to the Unicode Standard in Ver-
sion 4.1, the text model for the script followed the normal Indic practice: all dependent
vowels were intended to follow their consonant, regardless of visual placement. However,
in practice, the majority of New Tai Lue text data using Unicode characters prior to Uni-
code 8.0 already uses visual ordering, and many extant New Tai Lue fonts also assume
visual ordering. As a result, the model change for New Tai Lue as of Unicode 8.0 should not
pose a substantial migration issue for data or fonts. However, implementations may have
glitches in some algorithmic behavior until underlying libraries and platform support
catch up to the character property changes for New Tai Lue as of Unicode 8.0 or later ver-
sions.
Two-Part Vowels. Some vowels in New Tai Lue are represented with two vowel letters—
one to the left of the consonant letter and one to the right. In these cases, the characters are
simply stored in visual order: first the vowel letter on the left, then the consonant letter,
and finally the vowel letter on the right. U+19B6 new tai lue vowel sign ae is considered
a single letter and is displayed to the left of its consonant letter. It is not represented by a
sequence of two characters for U+19B5 new tai lue vowel sign e. If a tone mark appears
in a syllable, it occurs last in the representation, after any right side vowel, again in visual
order. Table 16-13 shows several examples of these ordering relations.
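A sketch of this visual-order storage in Python. The consonant (U+1982) and tone mark (U+19C8) code points are assumptions chosen only to make the example concrete; the ordering itself follows the text above.

    E_LEFT = "\u19B5"     # NEW TAI LUE VOWEL SIGN E, displayed to the left of the consonant
    AA_RIGHT = "\u19B1"   # NEW TAI LUE VOWEL SIGN AA, displayed to the right
    CONSONANT = "\u1982"  # illustrative consonant letter (assumed code point)
    TONE = "\u19C8"       # illustrative tone mark (assumed code point); always stored last

    # Left-side vowel: stored before the consonant, exactly as displayed.
    left_example = E_LEFT + CONSONANT + TONE
    # Right-side vowel: stored after the consonant; the tone mark still comes last.
    right_example = CONSONANT + AA_RIGHT + TONE

    for s in (left_example, right_example):
        print([f"U+{ord(c):04X}" for c in s])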
Final Consonants. A virama or killer character is not used to create conjunct consonants in
New Tai Lue, because clusters of consonants do not regularly occur. New Tai Lue has a
limited set of final consonants, which are modified with a hook showing that the inherent
vowel is killed.
Tones. As in the Thai and Lao scripts, New Tai Lue consonant letters come in pairs
that denote two tonal registers. The tone of a syllable is indicated by the combination of the
tonal register of the consonant letter plus a tone mark written at the end of the syllable, as
shown in Table 16-14.
Digits. The New Tai Lue script adapted its digits from the Tai Tham (or Lanna) script. Tai
Tham used two separate sets of digits, one known as the hora set, and one known as the
tham set. The New Tai Lue digits are adapted from the hora set.
The one exception is the additional New Tai Lue digit for one: U+19DA new tai lue
tham digit one. The regular hora form for the digit, U+19D1 new tai lue digit one,
has the exact same glyph shape as a common New Tai Lue vowel, U+19B1 new tai lue
vowel sign aa. For this reason, U+19DA is often substituted for U+19D1 in contexts
which are not obviously numeric, to avoid visual ambiguity. Implementations of New Tai
Lue digits need to be aware of this usage, as U+19DA may occur frequently in text.
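A minimal Python sketch of the caution above: numeric processing should accept U+19DA as well as U+19D1 when interpreting New Tai Lue digits, because U+19D1 is visually identical to the vowel sign U+19B1.

    THAM_ONE = "\u19DA"   # NEW TAI LUE THAM DIGIT ONE
    HORA_ONE = "\u19D1"   # NEW TAI LUE DIGIT ONE

    def new_tai_lue_digit_value(ch):
        """Return the numeric value of a New Tai Lue digit, accepting both forms of one."""
        if ch == THAM_ONE:
            return 1
        if "\u19D0" <= ch <= "\u19D9":      # hora digits zero through nine
            return ord(ch) - 0x19D0
        return None

    print(new_tai_lue_digit_value(THAM_ONE), new_tai_lue_digit_value(HORA_ONE))  # 1 1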
16.7 Tai Tham
U+1A5B tai tham consonant sign high ratha or low pa represents high ratha in
santhān “shape” and low pa in sappa “omniscience”.
Dependent Vowel Signs. Dependent vowel signs are used in a manner similar to that
employed by other Brahmi-derived scripts, although Tai Tham uses many of them in com-
bination.
U+1A63 tai tham vowel sign aa and U+1A64 tai tham vowel sign tall aa are sepa-
rately encoded because the choice of which form to use cannot be reliably predicted from
context.
The Khün character U+1A6D tai tham vowel sign oy is not used in Northern Thai.
Khün vowel order is quite different from that of Northern Thai.
Tone Marks. Tai Tham has two combining tone marks, U+1A75 tai tham sign tone-1
and U+1A76 tai tham sign tone-2, which are used in Tai Lue and in Northern Thai.
These are rendered above the vowel over the base consonant. Three additional tone marks
are used in Khün: U+1A77 tai tham sign khuen tone-3, U+1A78 tai tham sign khuen
tone-4, and U+1A79 tai tham sign khuen tone-5, which are rendered above and to the
right of the vowel over the base consonant. Tone marks are represented in logical order fol-
lowing the vowel over the base consonant or consonant stack. If there is no vowel over a
base consonant, then the tone is rendered directly over the consonant; this is the same way
tones are treated in the Thai script.
Other Combining Marks. U+1A7A tai tham sign ra haam is used in Northern Thai to
indicate that the character or characters it follows are not sounded. The precise range of
characters not to be sounded is indeterminate; it is defined instead by reading rules. In Tai
Lue, ra haam is used as a final -n.
The mark U+1A7B tai tham sign mai sam has a range of uses in Northern Thai:
• It is used as a repetition mark, stored as the last character in the word to be
repeated: tang “be different”, tangtang “be different in my view”.
• It is used to disambiguate the use of a subjoined letter. A subjoined letter may
be a medial or final, or it may be the start of a new syllable.
• It is used to mark “double-acting” consonants. It is stored where the consonant
would be stored if there were a separate consonant used.
U+1A7F tai tham combining cryptogrammic dot is used singly or multiply beneath
letters to give each letter a different value according to some hidden agreement between
reader and writer.
Digits. Two sets of digits are in common use: a secular set (Hora) and an ecclesiastical set
(Tham). European digits are also found in books.
Punctuation. The four signs U+1AA8 tai tham sign kaan, U+1AA9 tai tham sign
kaankuu, U+1AAA tai tham sign satkaan, and U+1AAB tai tham sign satkaankuu,
are used in a variety of ways, with progressive values of finality. U+1AAB tai tham sign
satkaankuu is similar to U+0E5A thai character angkhankhu.
At the end of a section, U+1AA9 tai tham sign kaankuu and U+1AAC tai tham sign
hang may be combined with U+1AA6 tai tham sign reversed rotated rana in a num-
ber of ways. The symbols U+1AA1 tai tham sign wiangwaak, U+1AA0 tai tham sign
wiang, and U+1AA2 tai tham sign sawan are logographs for “village,” “city,” and
“heaven,” respectively.
The three signs U+1AA3 tai tham sign keow, “courtyard,” U+1AA4 tai tham sign hoy,
“oyster,” and U+1AA5 tai tham sign dokmai, “flower” are used as dingbats and as section
starters. The mark U+1AA7 tai tham sign mai yamok is used in the same way as its Thai
counterpart, U+0E46 thai character maiyamok.
European punctuation like question mark, exclamation mark, parentheses, and quotation
marks is also used.
Collating Order. There is no firmly established sorting order for the Tai Tham script. The
order in the code charts is based on Northern Thai and Thai. U+1A60 tai tham sign
sakot is ignored for sorting purposes.
Line Breaking. Opportunities for line breaking are lexical, but a line break may not be
inserted between a base letter and a combining diacritic. There is no insertion of visible
hyphens at line boundaries.
16.8 Tai Viet
to define the tone of closed syllables (those ending /p/, /t/, /k/, or /ʔ/), in that these sylla-
bles are restricted to tones 2 and 5.
Traditionally, the Tai Viet script did not use any further marking for tone. The reader had
to determine the tone of unchecked syllables from the context. Recently, several groups
have introduced tone marks into Tai Viet writing. Tai Dam speakers in the United States
began using Lao tone marks with their script in the 1970s, and those marks are included in
SIL’s Tai Heritage font. These symbols are written as combining marks above the initial
consonant, or above a combining vowel, and are identified by their Laotian names, mai ek
and mai tho. These marks are also used by the Song Petburi font (developed for the Thai
Song language), although they were probably borrowed from the Thai alphabet rather than
the Lao.
The Tai community in Vietnam invented their own tone marks written on the base line at
the end of the syllable, which they call mai nueng and mai song.
When combined with the consonant class, two tone marks are sufficient to unambiguously
mark the tone. No tone is written on loan words or on the unstressed initial syllable of a
native word.
Final Consonants. U+AA9A tai viet letter low bo and U+AA92 tai viet letter low
do are used to write syllable-final /p/ and /t/, respectively, as is the practice in many Tai
scripts. U+AA80 tai viet letter low ko is used for both final /k/ and final /ʔ/. The high-
tone class symbols are used for writing final /j/ and the final nasals, /m/, /n/, and /ŋ/.
U+AAAB tai viet letter high vo is used for final /w/.
There are a number of exceptions to the above rules in the form of vowels which carry an
inherent final consonant. These vary from region to region. The ones included in the Tai
Viet block are the ones with the broadest usage: /-aj/, /-am/, /-an/, and /-əw/.
Symbols and Punctuation. There are five special symbols in Tai Viet. The meaning and
use of these symbols is summarized in Table 16-15.
U+AADB tai viet symbol kon and U+AADC tai viet symbol nueng may be regarded as
word ligatures. They are, however, encoded as atomic symbols, without decompositions. In
the case of kon, the word ligature symbol is used to distinguish the common word “person”
from otherwise homophonous words.
Word Spacing. Traditionally, the Tai Viet script was written without spaces between
words. In the last thirty years, users in both Vietnam and the United States have started
writing spaces between words, in both handwritten and machine produced texts. Most
users now use interword spacing. Polysyllabic words may be written without space
between the syllables.
Collating Order. The Tai Viet script does not have an established standard for sorting.
Sequences have sometimes been borrowed from neighboring languages. Some sources use
the Lao order, adjusted for differences between the Tai Dam and Lao character repertoires.
Other sources prefer an order based on the Vietnamese alphabet. It is possible that com-
munities in different countries will want to use different orders.
16.9 Kayah Li
Kayah Li: U+A900–U+A92F
The Kayah Li script was invented in 1962 by Htae Bu Phae (also written Hteh Bu Phe), and
is used to write the Eastern and Western Kayah Li languages of Myanmar and Thailand.
The Kayah Li languages are members of the Karenic branch of the Sino-Tibetan family,
and are tonal and mostly monosyllabic. There is no mutual intelligibility with other
Karenic languages.
The term Kayah Li is an ethnonym referring to a particular Karen people who speak these
languages. Kayah means “person” and li means “red,” so Kayah Li literally means “red
Karen.” This use of color terms in ethnonyms and names for languages is a common pat-
tern in this part of Southeast Asia.
Structure. Although Kayah Li is a relatively recently invented script, its structure was
clearly influenced by Brahmi-derived scripts, and in particular the Myanmar script, which
is used to write other Karenic languages. The order of letters is a variant of the general
Brahmic pattern, and the shapes and names of some letters are Brahmi-derived. Other let-
ters are innovations or relate more specifically to Myanmar-based orthographies.
The Kayah Li script resembles an abugida such as the Myanmar script, in terms of the der-
ivation of some vowel forms, but otherwise Kayah Li is closer to a true alphabet. Its conso-
nants have no inherent vowel, and thus no virama is needed to remove an inherent vowel.
Vowels. Four of the Kayah Li vowels (a, ơ, i, ô) are written as independent spacing letters.
Five others (ư, e, u, ê, o) are written by means of diacritics applied above the base letter
U+A922 kayah li letter a, which thus serves as a vowel-carrier. The same vowel diacrit-
ics are also written above the base letter U+A923 kayah li letter oe to represent sounds
found in loanwords.
Tones. Tone marks are indicated by combining marks which subjoin to the four indepen-
dent vowel letters. The vowel diacritic U+A92A kayah li vowel o and the mid-tone mark,
U+A92D kayah li tone calya plophu, are each analyzable as composite signs, but encod-
ing of each as a single character in the standard reflects usage in didactic materials pro-
duced by the Kayah Li user community.
Digits. The Kayah Li script has its own set of distinctive digits.
Punctuation. Kayah Li text makes use of modern Western punctuation conventions, but
the script also has two unique punctuation marks: U+A92E kayah li sign cwi and
U+A92F kayah li sign shya. The shya is a script-specific form of a danda mark.
16.10 Cham
Cham: U+AA00–U+AA5F
Cham is an Austronesian language of the Malayo-Polynesian family. The Cham language
has two major dialects: Eastern Cham and Western Cham. Eastern Cham speakers live pri-
marily in the southern part of Vietnam and number about 73,000. Western Cham is spoken
mostly in Cambodia, with about 220,000 speakers there and about 25,000 in Vietnam. The
Cham script is used more by the Eastern Cham community.
Structure. Cham is a Brahmi-derived script. Consonants have an inherent vowel. The
inherent vowel is -a in the case of most consonants, but differs in the case of nasal conso-
nants. There is no virama and hence no killing of the inherent vowel. Dependent vowels
(matras) are used to modify the inherent vowel and separately encoded, explicit final con-
sonants are used where there is no inherent vowel. The script does not have productive for-
mation of consonant conjuncts.
Independent Vowel Letters. Six of the initial vowels in Cham are represented with unique,
independent vowels. These separately-encoded characters always indicate a syllable-initial
vowel, but they may occur word-internally at a syllable break. Other Cham vowels which
do not have independent forms are instead represented by dependent vowels (matras)
applied to U+AA00 cham letter a. Four of the other independent vowel letters are also
attested bearing matras.
Consonants. Cham consonants can be followed by consonant signs to represent the glides:
-ya, -ra, -la, or -wa. U+AA33 cham consonant sign ya, in particular, normally ligates
with the base consonant it modifies. When it does so, any dependent vowel is graphically
applied to it, rather than to the base consonant.
The independent vowel U+AA00 cham letter a can cooccur with two of the medial con-
sonant signs: -ya or -wa. The writing system distinguishes these sequences from single let-
ters which are pronounced the same. Thus, <a, -ya> [ja] contrasts with U+AA22 cham
letter ya, also pronounced [ja], and <a, -wa> [wa] contrasts with U+AA25 cham letter
va, also pronounced [wa].
Four medial clusters of two consonant signs in a row occur: <-ra, -ya> [-rja], <-ra, -wa>
[-rwa], <-la, -ya> [-lja], and <-la, -wa> [-lwa].
There are three types of final consonants. The majority are simply encoded as separate base
characters. Graphically, those final forms appear similar to the corresponding non-final
consonants, but typically have a lengthened stroke at the right side of their glyphs. The sec-
ond type consists of combining marks to represent final -ng, -m, and -h. Finally, U+AA25
cham letter va occurs unchanged either in initial or final positions. Final consonants
may occur word-internally, in which case they indicate the presence of a syllable boundary.
Ordering of Syllable Components. Dependent vowels and other signs are encoded after
the consonant to which they apply. The ordering of elements is shown in more detail in
Table 16-16.
The left-side dependent vowels U+AA2F cham vowel sign o and U+AA30 cham vowel
sign ai occur in logical order after the consonant (and any medial consonant signs), but in
visual presentation their glyphs appear before (to the left of) the consonant. U+AA2F cham
vowel sign o, in particular, may occur together in a sequence with another dependent
vowel, the vowel lengthener, or both. In such cases, the glyph for U+AA2F appears to the
left of the consonant, but the glyphs for the second dependent vowel and the vowel length-
ener are rendered above or to the right of the consonant.
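A sketch, in Python, of this logical-versus-visual distinction. The consonant code point U+AA06 is assumed here only for illustration; the left-side vowel signs are the ones named above.

    CONSONANT = "\uAA06"   # assumed Cham consonant letter (illustrative code point)
    SIGN_O = "\uAA2F"      # CHAM VOWEL SIGN O  (glyph appears to the left)
    SIGN_AI = "\uAA30"     # CHAM VOWEL SIGN AI (glyph appears to the left)

    # Logical (encoded) order: consonant first, then the dependent vowel,
    # even though the vowel glyph is rendered to the left of the consonant.
    syllable = CONSONANT + SIGN_O
    print([f"U+{ord(c):04X}" for c in syllable])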
Digits. The Cham script has its own set of digits, which are encoded in this block. How-
ever, European digits are also known and occur in Cham texts because of the influence of
Vietnamese.
Punctuation. Cham uses danda marks to indicate text units. Three levels are recognized,
marked respectively with danda, double danda, and triple danda.
U+AA5C cham punctuation spiral often begins a section of text. It can be compared to
the usage of Tibetan head marks. The spiral may also occur in combination with a danda.
Modern Cham text also makes use of European punctuation marks, such as the question
mark, hyphen and colon.
Line Breaking. Opportunities for line breaks occur after any full orthographic syllable in
Cham. Modern Cham text makes use of spaces between words, and those are also line
break opportunities. Line breaks occur after dandas.
16.11 Pahawh Hmong
U+16B16 + U+16B30 + U+16B1D + U+16B35 → [phâ]
The example in Figure 16-5 uses Second Stage Reduced Version conventions. The repre-
sentation of the syllable is in straightforward visual order. U+16B16 pahawh hmong
vowel kab is the base character representing the [a] vowel of the syllable. The combining
mark U+16B30 represents the tone mark for the vowel. U+16B1D pahawh hmong con-
sonant ntsau is the base character representing the initial consonant of the syllable. The
combining mark U+16B35 is a diacritical mark which changes the sound of the consonant
from [nts] to [ph]. Altogether, the sequence represents the syllable [phâ].
Because the order of characters in memory matches the visual written order of the text, dis-
play rendering does not require any reordering of glyphs. However, implementations such
as text-to-speech need to be aware that Pahawh Hmong has unusual reading rules, because
initial consonants for syllables graphically follow the vowels which they precede in pro-
nunciation.
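A short Python sketch of the encoding order described for Figure 16-5: the characters are stored in visual order (vowel, tone mark, consonant, consonant diacritic), even though the consonant is read first.

    syllable_pha = "\U00016B16\U00016B30\U00016B1D\U00016B35"

    # Memory order equals written order, so rendering needs no reordering...
    print([f"U+{ord(c):04X}" for c in syllable_pha])

    # ...but a reading-order process (for example, text-to-speech) must treat the
    # consonant-plus-diacritic pair as the onset and the vowel-plus-tone pair as the rhyme.
    vowel_part, consonant_part = syllable_pha[:2], syllable_pha[2:]
    reading_order = consonant_part + vowel_part
    print([f"U+{ord(c):04X}" for c in reading_order])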
Vowels. The characters in the range U+16B00..U+16B1B represent vowels. The addition of
a diacritic alters the tone of the vowel. The special characters U+16B1A pahawh hmong
vowel kaab and U+16B1B pahawh hmong vowel kaav are atomic characters and do
not decompose.
Consonants. The characters in the range U+16B1C..U+16B2F represent consonants. These are phonologically initial
in a syllable, but occur after the vowel in written order.
Combining Marks. The combining marks in the range U+16B30..U+16B36 are used as
tone marks. They combine with the vowel letters to indicate particular tones for the sylla-
ble. The use for representation of particular tones differs for the two different stages.
U+16B30 pahawh hmong mark cim tub and U+16B35 pahawh hmong mark cim hom
also combine with initial consonant letters. When used this way, these marks function as
diacritics and indicate a different sound for the consonant letter. Usually the resultant
sound is unrelated to that of the unmodified base letter—the particular modification by the
diacritic is not predictable.
Punctuation and Other Symbols. Pahawh Hmong makes use of common European punc-
tuation marks as well as script-specific punctuation marks (U+16B37..U+16B3B and
U+16B44..U+16B45). The script employs script-specific mathematical operators
(U+16B3C..U+16B3F). It also includes a set of modifiers that have various uses:
U+16B42..U+16B43 indicate reduplication, U+16B40 identifies the chanting nature of a
text, and U+16B41 indicates the following syllable has a non-Hmong pronunciation.
Digits and Numbers. The decimal digits 0–9 are encoded from U+16B50..U+16B59. The
representative glyph for U+16B50 pahawh hmong digit zero resembles an “I”, and is
found in the Second Stage Reduced Version orthography. In contrast, the Third Stage
Reduced Version orthography has a circular glyph.
A non-decimal system also exists in Pahawh Hmong and is taught today; however, it is not
used for arithmetic calculation. The non-decimal numbers are encoded in the range from
U+16B5B..U+16B61. The Second Stage Reduced Version glyph for U+16B5B pahawh
hmong number tens resembles a “W”. The Third Stage Reduced Version glyph looks like
an “I”, and should be distinguished in fonts from U+16B50 pahawh hmong digit zero.
Logographs. Characters encoded from U+16B63..U+16B8F are logographs. These include
a grammatical classifier (U+16B63). Also included are characters designating periods of
time (U+16B64..U+16B6C), correspondence (U+16B6D..U+16B77), and clan names
(U+16B7E..U+16B8F). The clan names are encoded for historical reasons, and are not in
widespread current use.
Chapter 17
Indonesia and Oceania 17
The scripts described in this chapter are the Philippine scripts (Tagalog, Hanunóo, Buhid, and Tagbanwa), Buginese, Balinese, Javanese, Sundanese, Rejang, Batak, and Makasar.
Four traditional Philippine scripts are described here: Tagalog, Hanunóo, Buhid, and Tag-
banwa. They have limited current use. Each is a very simplified abugida which makes use
of two nonspacing vowel signs.
Although the official language of Indonesia, Bahasa Indonesia, is written in the Latin
script, Indonesia has many local, traditional scripts, most of which are ultimately derived
from Brahmi. Six of these scripts are documented in this chapter. Buginese is used for sev-
eral different languages on the island of Sulawesi. Balinese and Javanese are closely related,
highly ornate scripts; Balinese is used for the Balinese language on the island of Bali, and
Javanese for the Javanese language on the island of Java. Sundanese is used to write the
Sundanese language on the island of Java. The Rejang script is used to write the Rejang lan-
guage in southwest Sumatra, and the Batak script is used to write several Batak dialects,
also on the island of Sumatra.
Like the other scripts in this chapter, the Makasar script is a Brahmi-derived abugida.
Makasar is thought to have evolved from Rejang, and was used in South Sulawesi, Indone-
sia for writing the Makasar language. It has some similarities to the Buginese script, which
superseded it in the 19th century.
17.1 Philippine Scripts
ered the script adequate without it (they preferred kakapi to kakampi). A
similar reform for the Hanunóo script seems to have been better received. The Hanunóo
pamudpod was devised by Antoon Postma, who went to the Philippines from the Nether-
lands in the mid-1950s. In traditional orthography, si apu ba upada is, with the pamudpod,
rendered more accurately as si aypud bay upadan; the Hanunóo pronunciation is
si aypod bay upadan. The Tagalog virama and
Hanunóo pamudpod cancel only the inherent -a. No conjunct consonants are employed in
the Philippine scripts.
Directionality. The Philippine scripts are read from left to right in horizontal lines run-
ning from top to bottom. They may be written or carved either in that manner or in vertical
lines running from bottom to top, moving from left to right. In the latter case, the letters
are written sideways so they may be read horizontally. This method of writing is probably
due to the medium and writing implements used. Text is often scratched with a sharp
instrument onto beaten strips of bamboo, which are held pointing away from the body and
worked from the proximal to distal ends, in columns from left to right.
Rendering. In Tagalog and Tagbanwa, the vowel signs simply rest over or under the conso-
nants. In Hanunóo and Buhid, ligatures are often formed, as shown in Table 17-1.
Punctuation. Punctuation has been unified for the Philippine scripts. In the Hanunóo
block, U+1735 philippine single punctuation and U+1736 philippine double punc-
tuation are encoded. Tagalog makes use of only the latter; Hanunóo, Buhid, and Tag-
banwa make use of both marks.
17.2 Buginese
Buginese: U+1A00–U+1A1F
The Buginese script is used on the island of Sulawesi, mainly in the southwest. A variety of
traditional literature has been printed in it. As of 1971, as many as 2.3 million speakers of
Buginese were reported in the southern part of Sulawesi. The Buginese script is one of the
easternmost of the Brahmi scripts and is perhaps related to Javanese. It is attested as early
as the fourteenth century ce. Buginese bears some affinity to Tagalog and, like Tagalog,
does not traditionally record final consonants. The Buginese language, an Austronesian
language with a rich traditional literature, is one of the foremost languages of Indonesia.
The script was previously also used to write the Makassar, Bimanese, and Madurese lan-
guages.
Repertoire. The repertoire contained in the Buginese block is intended to represent the
core set of Buginese characters in standard printing fonts developed in the mid 19th cen-
tury for the Bugis and Makassarese languages. Variant letterforms and other extensions
seen in palm leaf manuscripts or additional letters used in some languages are not yet
encoded in this block. A visible virama symbol has also been attested, but is not needed for
this core repertoire for Buginese.
Structure. Buginese vowel signs are used in a manner similar to that seen in other Brahmi-
derived scripts. Consonants have an inherent /a/ vowel sound. Consonant conjuncts are
not formed.
Ligature. One ligature is found in the Buginese script. It is formed by the ligation of <a, -i>
+ ya to represent îya, as shown in the first line of Figure 17-1. The ligature takes the shape
of the Buginese letter ya, but with a dot applied at the far left side. Contrast that with the
normal representation of the syllable yi, in which the dot indicating the vowel sign occurs
in a centered position, as shown in the second line of Figure 17-1. The ligature for îya is not
obligatory; it would be requested by inserting a zero width joiner.
1A15 a + 1A17 vowel sign i + 200D zero width joiner + 1A10 ya → îya ligature
1A10 ya + 1A17 vowel sign i → yi
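A Python sketch of the ligature request shown in Figure 17-1: inserting U+200D zero width joiner asks the font to form the optional îya ligature.

    A = "\u1A15"        # BUGINESE LETTER A
    SIGN_I = "\u1A17"   # BUGINESE VOWEL SIGN I
    YA = "\u1A10"       # BUGINESE LETTER YA
    ZWJ = "\u200D"      # ZERO WIDTH JOINER

    iya_ligated = A + SIGN_I + ZWJ + YA   # ligature requested (îya)
    iya_plain = A + SIGN_I + YA           # same text, no ligature requested
    yi = YA + SIGN_I                      # ordinary syllable yi, for contrast

    print([f"U+{ord(c):04X}" for c in iya_ligated])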
Order. Several orderings are possible for Buginese. The Unicode Standard encodes the
Buginese characters in the Matthes order.
Punctuation. Buginese uses spaces between certain units. One punctuation symbol,
U+1A1E buginese pallawa, is functionally similar to the full stop and comma of the Latin
script. There is also another separation mark, U+1A1F buginese end of section.
U+A9CF javanese pangrangkep or a doubling of the vowel sign (especially U+1A19
buginese vowel sign e and U+1A1A buginese vowel sign o) is sometimes used to
denote word reduplication. The shape of the Buginese reduplication sign is based on the
Arabic digit two. The functionally similar U+A9CF javanese pangrangkep, which has the
same shape, is recommended for this sign in Buginese, rather than U+0662 arabic-indic
digit two, to avoid potential problems for text layout.
Numerals. There are no known digits specific to the Buginese script.
17.3 Balinese
Balinese: U+1B00–U+1B7F
The Balinese script, or aksara Bali, is used for writing the Balinese language, the native lan-
guage of the people of Bali, known locally as basa Bali. It is a descendant of the ancient
Brahmi script of India, and therefore it has many similarities with modern scripts of South
Asia and Southeast Asia, which are also members of that family. The Balinese script is used
to write Kawi, or Old Javanese, which strongly influenced the Balinese language in the elev-
enth century ce. A slightly modified version of the script is used to write the Sasak lan-
guage, which is spoken on the island of Lombok to the east of Bali. Some Balinese words
have been borrowed from Sanskrit, which may also be written in the Balinese script.
Structure. Balinese consonants have an inherent -a vowel sound. Consonants combine
with following consonants in the usual Brahmic fashion: the inherent vowel is “killed” by
U+1B44 balinese adeg adeg (virama), and the following consonant is subjoined or post-
fixed, often with a change in shape. Table 17-2 shows the base consonants and their con-
junct forms.
dha + ra + adeg adeg + ma → dha-rma (Kawi form)
dha + ra + adeg adeg + ma → dha-rma (Balinese form)
dha (1B25) + ma (1B2B) + surang (1B03) → dha-mar
dha (1B25) + surang (1B03) + ma (1B2B) → dhar-ma
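A hedged Python sketch of the alternative spellings above, using the code points given there (dha U+1B25, ma U+1B2B, surang U+1B03) together with U+1B44 balinese adeg adeg and U+1B2D balinese letter ra, which are assumed here for the conjunct spelling.

    DHA, RA, MA = "\u1B25", "\u1B2D", "\u1B2B"
    ADEG_ADEG = "\u1B44"   # virama
    SURANG = "\u1B03"      # sign surang (postfixed -r)

    dha_rma = DHA + RA + ADEG_ADEG + MA    # conjunct spelling: dha-rma
    dha_mar = DHA + MA + SURANG            # dha-mar, with surang applied to ma
    dhar_ma = DHA + SURANG + MA            # dhar-ma, with surang applied to dha

    for s in (dha_rma, dha_mar, dhar_ma):
        print([f"U+{ord(c):04X}" for c in s])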
Behavior of ra repa. The unique behavior of balinese letter ra repa (vocalic r) results
from a reanalysis of the independent vowel letter as a consonant. In a compound word in
which the first element ends in a consonant and the second element begins with an original
ra + pepet, such as Pak Rërëh “Mr Rërëh”, the postfixed form of ra repa is
used; this particular sequence is encoded ka + adeg adeg + ra repa. However, in other con-
texts where the ra repa represents the original Sanskrit vowel, U+1B3A balinese vowel
sign ra repa is used, as in Krësna.
Rendering. The vowel signs /u/ and /u:/ take different forms when combined with sub-
scripted consonant clusters, as shown in Table 17-4. The upper limit of consonant clusters
is three, the last of which can be -ya, -wa, or -ra.
Nukta. The combining mark U+1B34 balinese sign rerekan (nukta) and a similar sign
in Javanese are used to extend the character repertoire for foreign sounds. In recent times,
Sasak users have abandoned the Javanese-influenced rerekan in favor of the series of mod-
ified letters shown in Table 17-3, also making use of some unused Kawi letters for these
Arabic sounds.
tems use other spacing letters, such as U+1B09 balinese letter ukara and U+1B27 bali-
nese letter pa, which are not separately encoded for musical use. The U+1B01 balinese
sign ulu candra (candrabindu) can also be used with U+1B62 balinese musical sym-
bol deng and U+1B68 balinese musical symbol deung, and possibly others. balinese
sign ulu candra can be used to indicate modre symbols as well.
A range of diacritical marks is used with these musical notation base characters to indicate
metrical information. Some additional combining marks indicate the instruments used;
this set is encoded at U+1B6B..U+1B73. A set of symbols describing certain features of per-
formance are encoded at U+1B74..U+1B7C. These symbols describe the use of the right or
left hand, the open or closed hand position, the “male” or “female” drum (of the pair)
which is struck, and the quality of the striking.
Modre Symbols. The Balinese script also includes a range of “holy letters” called modre
symbols. Most of these letters can be composed from the constituent parts currently
encoded, including U+1B01 balinese sign ulu candra.
17.4 Javanese
Javanese: U+A980–U+A9DF
The Javanese script, or aksara Jawa, is used for writing the Javanese language, known
locally as basa Jawa. The script is a descendant of the ancient Brahmi script of India, and so
has many similarities with the modern scripts of South Asia and Southeast Asia which are
also members of that family. The Javanese script is also used for writing Sanskrit, Jawa
Kuna (a kind of Sanskritized Javanese), and transcriptions of Kawi, as well as the Sun-
danese language, also spoken on the island of Java, and the Sasak language, spoken on the
island of Lombok.
The Javanese script was in current use in Java until about 1945; in 1928 Bahasa Indonesia
was made the national language of Indonesia and its influence eclipsed that of other lan-
guages and their scripts. Traditional Javanese texts are written on palm leaves; books of
these bound together are called lontar, a word which derives from ron “leaf” and tal “palm”.
Consonants. Consonants have an inherent -a vowel sound. Consonants combine with fol-
lowing consonants in the usual Brahmic fashion: the inherent vowel is “killed” by U+A9C0
javanese pangkon, and the following consonant is subjoined or postfixed, often with a
change in shape.
Vocalic liquids (vocalic r and vocalic l) are treated as consonant letters in Javanese; they are not indepen-
dent vowels with dependent vowel equivalents, as is the case in Balinese or Devanagari.
Short and long versions of the vocalic l are separately encoded, as U+A98A javanese let-
ter nga lelet and U+A98B javanese letter nga lelet raswadi. In contrast, the long
version of the vocalic r is represented by a sequence of the short vowel U+A989 javanese
letter pa cerek followed by the dependent vowel sign -aa, U+A9B4 javanese vowel
sign tarung, serving as a length mark in this case.
U+A9B3 javanese sign cecak telu is a diacritic used with various consonantal base let-
ters to represent foreign sounds. Typically these diacritic-marked consonants are used for
sounds borrowed from Arabic.
Independent Vowels. Independent vowel letters are used essentially as in other Brahmic
scripts. Modern Javanese uses U+A986 javanese letter i and U+A987 javanese letter ii
for short and long i, but the Kawi orthography instead uses U+A985 javanese letter i
kawi and U+A986 javanese letter i for short and long i, respectively.
The long versions of the u and o vowels are written as sequences, using U+A9B4 javanese
vowel sign tarung as a length mark.
Dependent Vowels. Javanese—unlike Balinese—represents multi-part dependent vowels
with sequences of characters, in a manner similar to the Myanmar script. The Balinese
community considers it important to be able to directly transliterate Sanskrit into Balinese,
so multi-part dependent vowels are encoded as single, composite forms in Balinese, as is
done in Devanagari. In contrast, for the Javanese script, the correspondence with Sanskrit
letters is not so critical, and a different approach to the encoding has been taken. Similar to
the treatment of long versions of Javanese independent vowels, the two-part dependent
vowels are explicitly represented with a sequence of two characters, using U+A9B4 java-
nese vowel sign tarung, as shown in Figure 17-3.
ka (A98F) + pepet (A9BC) + tarung (A9B4) → keu
ka (A98F) + taling (A9BA) + tarung (A9B4) → ko
ka (A98F) + dirga mure (A9BB) + tarung (A9B4) → kau
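A short Python sketch of these two-character sequences, using only the code points shown in Figure 17-3.

    KA = "\uA98F"          # JAVANESE LETTER KA
    PEPET = "\uA9BC"       # JAVANESE VOWEL SIGN PEPET
    TALING = "\uA9BA"      # JAVANESE VOWEL SIGN TALING
    DIRGA_MURE = "\uA9BB"  # JAVANESE VOWEL SIGN DIRGA MURE
    TARUNG = "\uA9B4"      # JAVANESE VOWEL SIGN TARUNG (second part of the vowel)

    keu = KA + PEPET + TARUNG
    ko = KA + TALING + TARUNG
    kau = KA + DIRGA_MURE + TARUNG

    for s in (keu, ko, kau):
        print([f"U+{ord(c):04X}" for c in s])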
Consonant Signs. The characters U+A980 javanese sign panyangga, U+A981 javanese
sign cecak, and U+A983 javanese sign wignyan are analogous to U+0901 devanagari
sign candrabindu, U+0902 devanagari sign anusvara, and U+0903 devanagari sign
visarga and behave in much the same way.
There are two medial consonant signs, U+A9BE javanese consonant sign pengkal and
U+A9BF javanese consonant sign cakra, which represent -y- and -r- respectively. These
medial consonant signs contrast with the subjoined forms of the letters ya and ra. The sub-
joined forms may indicate a syllabic boundary, whereas pengkal and cakra are used in ordi-
nary consonant clusters.
Rendering. There are many conjunct forms in Javanese, though most are fairly regular and
easy to identify. Subjoined consonants and vowel signs rendered below them usually inter-
act typographically. For example, the vowel signs [u] and [u:] take different forms when
combined with subscripted consonant clusters. Consonant clusters may have up to three
elements. In three-element clusters, the last element is always one of the medial glides: -ya,
-wa, or -ra.
Digits. The Javanese script has its own set of digits, seven of which (1, 2, 3, 6, 7, 8, 9) look
just like letters of the alphabet. Implementations with concerns about security issues need
to take this into account. The punctuation mark U+A9C7 javanese pada pangkat is often
used with digits to help distinguish numbers from sequences of letters.
Punctuation. A large number of punctuation marks are used in Javanese. Titles may be
flanked by the pair of ornamental characters, U+A9C1 javanese left rerenggan and
U+A9C2 javanese right rerenggan; glyphs used for these may vary widely.
U+A9C8 javanese pada lingsa is a danda mark that corresponds functionally to the use
of a comma. The doubled form, U+A9C9 javanese pada lungsi, corresponds functionally
to the use of a full stop. It is also used as a “ditto” mark in vertical lists. U+A9C7 javanese
pada pangkat is used much like the European colon.
U+A9C7 javanese pada pangkat is used to abbreviate personal names and is placed at the
end of the abbreviation.
The doubled U+A9CB javanese pada adeg adeg typically begins a paragraph or section,
while the simple U+A9CA javanese pada adeg is used as a common divider, though it can
be used in pairs marking text for attention. The two characters, U+A9CC javanese pada
piseleh and U+A9CD javanese turned pada piseleh, are used similarly, either both
together or with U+A9CC javanese pada piseleh simply repeated.
The punctuation ring, U+A9C6 javanese pada windu, is not used alone, a situation simi-
lar to the pattern of use for its Balinese counterpart U+1B5C balinese windu. When used
with U+A9CB javanese pada adeg adeg this windu sign is called pada guru, pada bab, or
uger-uger, and is used to begin correspondence where the writer does not desire to indicate
a rank distinction as compared to his audience. More formal letters may begin with one of
the three signs: U+A9C3 javanese pada andap (for addressing a higher-ranked person),
U+A9C4 javanese pada madya (for addressing an equally-ranked person), or U+A9C5
javanese pada luhur (for addressing a lower-ranked person).
Reduplication. U+A9CF javanese pangrangkep is used to show the reduplication of a
syllable. The character derives from U+0662 arabic-indic digit two but in Javanese it
does not have a numeric use. The Javanese reduplication mark is encoded as a separate
character from the Arabic digit, because it differs in its Bidi_Class property value.
Ordering of Syllable Components. The order of components in an orthographic syllable
as expressed in BNF is:
{C F} C {{R}Y} {V{A}} {Z}
where
C is a letter (consonant or independent vowel), or a consonant followed
by the diacritic U+A9B3 javanese sign cecak telu
F is the virama, U+A9C0 javanese pangkon
R is the medial -ra, U+A9BF javanese consonant sign cakra
Y is the medial -ya, U+A9BE javanese consonant sign pengkal
V is a dependent vowel sign
A is the dependent vowel sign -aa, U+A9B4 javanese vowel sign
tarung
Z is a consonant sign: U+A980, U+A981, U+A982, or U+A983
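A hedged sketch of this syllable pattern as a Python regular expression. The code points for F, R, Y, A, Z, and cecak telu are the ones listed above; the ranges assumed for the letters (C) and the dependent vowel signs (V) are approximations of the Javanese block layout, not definitions taken from the text.

    import re

    LETTER = "[\uA984-\uA9B2]"            # assumed range of letters (consonants and independent vowels)
    CECAK_TELU = "\uA9B3"
    C = f"(?:{LETTER}{CECAK_TELU}?)"      # letter, optionally marked with cecak telu
    F = "\uA9C0"                          # pangkon (virama)
    R = "\uA9BF"                          # consonant sign cakra (medial -ra)
    Y = "\uA9BE"                          # consonant sign pengkal (medial -ya)
    V = "[\uA9B5-\uA9BD]"                 # assumed dependent vowel signs (excluding tarung)
    A = "\uA9B4"                          # vowel sign tarung
    Z = "[\uA980-\uA983]"                 # panyangga, cecak, layar, wignyan

    SYLLABLE = re.compile(f"(?:{C}{F})*{C}(?:{R}?{Y})?(?:{V}{A}?)?{Z}?")

    # Example: ka + taling + tarung (ko) matches the {V{A}} part of the pattern.
    print(bool(SYLLABLE.fullmatch("\uA98F\uA9BA\uA9B4")))  # True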
Line Breaking. Opportunities for line breaking occur after any full orthographic syllable.
Hyphens are not used.
In some printed texts, an epenthetic spacing U+A9BA javanese vowel sign taling is
placed at the end of a line when the next line begins with the glyph for U+A9BA javanese
vowel sign taling, which is reminiscent of a specialized hyphenation (or of quire mark-
ing). This practice is nearly impossible to implement in a free-flowing text environment.
Typographers wishing to duplicate a printed page may manually insert U+00A0 no-break
space before U+A9BA javanese vowel sign taling at the end of a line, but this would not
be orthographically correct.
17.5 Rejang
Rejang: U+A930–U+A95F
The Rejang language is spoken by about 200,000 people living on the Indonesian island of
Sumatra, mainly in the southwest. There are five major dialects: Lebong, Musi, Kebana-
gun, Pesisir (all in Bengkulu Province), and Rawas (in South Sumatra Province). Most
Rejang speakers live in fairly remote rural areas, and slightly less than half of them are liter-
ate.
The Rejang script was in use prior to the introduction of Islam to the Rejang area. The ear-
liest attested document appears to date from the mid-18th century ce. The traditional
Rejang corpus consists chiefly of ritual texts, medical incantations, and poetry.
Structure. Rejang is a Brahmi-derived script. It is related to other scripts of the Indonesian
region, such as Batak and Buginese.
Consonants in Rejang have an inherent /a/ vowel sound. Vowel signs are used in a manner
similar to that employed by other Brahmi-derived scripts. There are no consonant con-
juncts. The basic syllabic structure is C(V)(F): a consonant, followed by an optional vowel
sign and an optional final consonant sign or virama.
Rendering. Rejang texts tend to have a slanted appearance, typified by the shape of
U+A937 rejang letter ba. This sense that the script is tilted to the right affects the place-
ment of the combining marks for vowel signs. Vowel signs above a letter are offset to the
right, and vowel signs below a letter are offset to the left, as the “above” and “below” posi-
tions for letters are perceived in terms of the overall slant of the letters.
Ordering. The ordering of the consonants and vowel signs for Rejang in the code charts
follows a generic Brahmic script pattern. The Brahmic ordering of Rejang consonants is
attested in numerous sources. There is little evidence one way or the other for preferences
in the relative order of Rejang vowel signs and consonant signs.
Digits. There are no known script-specific digits for the Rejang script.
Punctuation. European punctuation marks such as comma, full stop, and colon, are used
in modern writing. U+A95F rejang section mark may be used at the beginning and end
of paragraphs.
Traditional Rejang texts tend not to use spaces between words, but their use does occur in
more recent texts. There is no known use of hyphenation.
17.6 Batak
Batak: U+1BC0–U+1BFF
The Batak script is used on the island of Sumatra to write the five Batak dialects: Karo,
Mandailing, Pakpak, Simalungun, and Toba. The script is called si-sia-sia or surat na sam-
pulu sia, which means “the nineteen letters.” The script is taught in schools mainly for cul-
tural purposes, and is used on some signs for shops and government offices.
Structure. Batak is a Brahmi-derived script. It is written left to right.
Consonants in Batak have an inherent /a/ vowel sound. Batak uses a vowel killer which is
called pangolat in Mandailing, Pakpak, and Toba. In Karo the killer is called penengen, and
in Simalungun it is known as panongonan. The appearance of the killer differs between
some of the dialects.
Batak has three independent vowels and makes use of a number of vowel signs and two
consonant signs. Some vowel signs are only used by certain language communities. There
are no consonant conjuncts. The basic syllabic structure is C(V)(Cs|Cd): a consonant, fol-
lowed by an optional vowel sign, which may be followed either by a consonant sign Cs (-ng
or -h) or a killed final consonant Cd.
Rendering. Most vowel signs and the two killers, U+1BF2 batak pangolat and U+1BF3
batak panongonan, are spacing marks. U+1BEE batak vowel sign u can ligate with its
base consonant.
The two consonant signs, U+1BF0 batak consonant sign ng and U+1BF1 batak conso-
nant sign h, are nonspacing marks, usually rendered above the spacing vowel signs.
When U+1BF0 batak consonant sign ng occurs together with the nonspacing mark,
U+1BE9 batak vowel sign ee, both are rendered above the base consonant, with the
glyph for the ee at the top left and the glyph for the ng at the top right.
The main peculiarity of Batak rendering concerns the reordering of the glyphs for vowel
signs when one of the two killers, pangolat or panongonan, is used to close the syllable by
killing the inherent vowel of a final consonant. This reordering for display is entirely regu-
lar. So, while the representation of the syllable /tip/ is done in logical order: <ta, vowel sign
i, pa, pangolat>, when rendered for display the glyph for the vowel sign is visually applied
to the final consonant, pa, rather than to the ta. The glyph for the pangolat always stays at
the end of the syllable.
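A sketch of this display reordering in Python, using abstract token names rather than code points (the specific Batak letters involved are not listed in the text above).

    def batak_display_order(logical):
        """Move the vowel-sign glyph after the final consonant when a killer closes the syllable.

        A logical-order syllable such as ['ta', 'vowel sign i', 'pa', 'pangolat']
        is displayed in the glyph order ['ta', 'pa', 'vowel sign i', 'pangolat'].
        """
        if len(logical) == 4 and logical[3] in ("pangolat", "panongonan"):
            onset, vowel, final, killer = logical
            return [onset, final, vowel, killer]
        return list(logical)

    print(batak_display_order(["ta", "vowel sign i", "pa", "pangolat"]))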
Punctuation. Punctuation is not normally used; instead all letters simply run together.
However, a number of bindu characters are occasionally used to disambiguate similar
words or phrases. U+1BFF batak symbol bindu pangolat is trailing punctuation, follow-
ing a word, surrounding the previous character somewhat.
The minor mark used to begin paragraphs and stanzas is U+1BFC batak symbol bindu
na metek, which means “small bindu.” It has a shape-based variant, U+1BFD batak sym-
bol bindu pinarboras (“rice-shaped bindu”), which is likewise used to separate sections
of text. U+1BFE batak symbol bindu judul (“title bindu”) is sometimes used to separate
a title from the main text, which normally begins on the same line.
Line Breaking. Opportunities for a line break occur after any full orthographic syllable.
17.7 Sundanese
Sundanese: U+1B80–U+1BBF
The Sundanese script, or aksara Sunda, is used for writing the Sundanese language, one of
the languages of the island of Java in Indonesia. It is a descendant of the ancient Brahmi
script of India, and so has similarities with the modern scripts of South Asia and Southeast
Asia which are also members of that family. The script has official support. It is taught in
schools and used on road signs.
The Sundanese language has been written using a number of different scripts over the
years. Pallawa or Pra-Nagari was first used in West Java to write Sanskrit from the fifth to
the eighth centuries ce. Sunda Kuna or Old Sundanese was derived from Pallawa and was
used in the Sunda Kingdom from the 14th to the 18th centuries. The earliest example of
Old Sundanese is the Prasasti Kawali stone. The Javanese script was used to write Sun-
danese from the 17th to the 19th centuries, and the Arabic script was used from the 17th to
the 20th centuries. The Latin script has been in wide use since the 20th century. The mod-
ern Sundanese script, called Sunda Baku or Official Sundanese, became official in 1996.
This modern script was derived from Old Sundanese.
Structure. Sundanese consonants have an inherent vowel /a/. This inherent vowel can be
modified by the addition of dependent vowel signs (matras). The script also has indepen-
dent vowels.
In the modern orthography, an explicit vowel killer character, U+1BAA sundanese sign
pamaaeh, is used to indicate the absence, or “killing,” of the inherent vowel, but does not
build consonant conjuncts. In Old Sundanese, however, consonant conjuncts do appear,
and are formed with U+1BAB sundanese sign virama.
Medials. In the modern orthography, initial Sundanese consonants can be followed by one
of the three consonant signs for medial consonants, -ya, -ra, or -la. These medial conso-
nants are graphically displayed as subjoined elements to their base consonants, and are not
considered conjuncts proper, because they are not formed using a virama. In Old Sun-
danese, a subjoined ma, U+1BAC sundanese consonant sign pasangan ma, and a sub-
joined wa, U+1BAD sundanese consonant sign pasangan wa, occur. They contrast
with the conjunct forms created with the virama.
Final Consonants. Sundanese historical texts employ two final consonants, U+1BBE sun-
danese letter final k and U+1BBF sundanese letter final m, which are distinct from
the modern representation of these final consonants with the explicit killer U+1BAA sun-
danese sign pamaaeh.
Combining Marks. Three final consonants are separately encoded as combining marks:
-ng, -r, -h. These are analogues of Brahmic anusvara, repha, and visarga, respectively.
Historic Characters. Additional historic consonants appear only in old texts: reu, leu, and
bha. Another historic character, U+1BBA sundanese avagraha, kills the vowel of the pre-
ceding consonant, and causes hiatus before an initial a.
Additional Consonants. Two supplemental consonant letters are used in the modern
script: U+1BAE sundanese letter kha and U+1BAF sundanese letter sya. These are
used to represent the borrowed sounds denoted by the Arabic letters kha and sheen, respec-
tively.
Digits. Sundanese has its own script-specific digits, which are separately encoded in this
block.
Punctuation. Sundanese uses European punctuation marks, such as comma, full stop,
question mark, and quotation marks. Spaces are used in text. Opportunities for hyphen-
ation occur after any full orthographic syllable.
Ordering. The order of characters in the code charts follows the Brahmic ordering. The
ha-na-ca-ra-ka order found in Javanese and Balinese does not seem to be used in Sun-
danese.
Ordering of Syllable Components. Dependent vowels and other signs are encoded after
the consonant to which they apply. The ordering of elements for the modern Sundanese
orthography is shown in more detail in Table 17-5.
The killer (pamaaeh) occupies the same logical position as a dependent vowel, but indi-
cates the absence, rather than the presence of a vowel. It cannot be followed by a combin-
ing mark for a final consonant, nor can it be preceded by a consonant sign.
The left-side dependent vowel U+1BA6 sundanese vowel sign panaelaeng occurs in
logical order after the consonant (and any medial consonant sign), but in visual presenta-
tion its glyph appears before (to the left of) the consonant.
Rendering. When more than one sign appears above or below a consonant, the two are
rendered side-by-side, rather than being stacked vertically.
17.8 Makasar
Makasar: U+11EE0–U+11EFF
The Makasar script was used historically in South Sulawesi, Indonesia for writing the
Makasar language. It is sometimes spelled “Makassar,” and is also referred to as “Old
Makassarese” or “Makassarese bird script.” The script was maintained for official purposes
in the kingdoms of Makasar in the 17th century, and it was used for writing a number of
historical accounts, such as the “Chronicles of Gowa and Tallo’,” but it was superseded by
the Buginese script in the 19th century and is no longer used. Although Makasar is thought
to have evolved from Rejang, it shares several similarities with Buginese.
Structure. Makasar is a Brahmi-derived abugida. It is written horizontally, from left to
right. Consonant letters carry an inherent /a/ vowel. Alternative vowel sounds are
expressed by applying one of four combining characters to a consonant. Each vowel sign
appears on a different side of the base consonant: right, left, top, and bottom. They are all
encoded as combining characters following the consonant.
As in Buginese, geminated and clustered consonants are not indicated, nor are syllable-final
consonants. However, Makasar differs from the Buginese script in that it does not have the
pre-nasalized clusters, such as /ŋka/, that occur in Buginese, and it includes special fea-
tures for consonant repetition.
There is only one independent vowel sign, U+11EF1 makasar letter a. Vowel signs
can be attached to this character to produce other vowel sounds when a syllable has no
consonant, such as at the beginning of a word.
Consonant Repetition. Adjacent syllables that use the same consonant can be written by
appending two vowel signs to a single consonant, as shown in the following example. Usu-
ally both vowels are the same in this case, and a consonant can take a maximum of two
vowel signs.
U+11EE7 da + U+11EF4 vowel sign u + U+11EF4 vowel sign u → [dudu]
U+11EF2 makasar angka can also be used to repeat the consonant used in the previous
syllable. This is particularly useful when one or both syllables use the inherent vowel, but
angka may also be followed by a different vowel sound from that of the previous syllable.
Angka is associated with the inherent vowel or a vowel sign in the same way as any normal
consonant character. For example:
U+11EED ra + U+11EF4 vowel sign u + U+11EF2 angka → [rura]
U+11EE5 ma + U+11EF2 angka + U+11EF3 vowel sign i → [mami]
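A short Python sketch of these repetition devices, using the code points given in the examples above.

    DA, MA, RA = "\U00011EE7", "\U00011EE5", "\U00011EED"
    ANGKA = "\U00011EF2"
    SIGN_I, SIGN_U = "\U00011EF3", "\U00011EF4"

    dudu = DA + SIGN_U + SIGN_U        # two vowel signs on one consonant -> [dudu]
    rura = RA + SIGN_U + ANGKA         # angka repeats the previous consonant -> [rura]
    mami = MA + ANGKA + SIGN_I         # angka itself takes a vowel sign -> [mami]

    for s in (dudu, rura, mami):
        print([f"U+{ord(c):06X}" for c in s])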
Letter va. U+11EEF makasar letter va is named “VA” even though the consonant is
pronounced /w/ in the Makasar language. The name for this character aligns with the
name for the related letter U+1A13 buginese letter va.
Digits. The available Makasar manuscript sources show two distinct sets of digits. The first
set strongly resembles European digits and can be represented with U+0030..U+0039. The
second set strongly resembles Arabic-Indic digits, and can be represented with
U+0660..U+0669. Therefore, script-specific digits for Makasar are not separately encoded.
Digits are frequently used, and both sets occur concurrently in the sources.
The Arabic-Indic digits are restricted to Arabic-language environments—particularly for
expressing dates of the Hijri era. The European digits are used for general purposes, but
occur within Arabic-language contexts for writing non-Hijri dates, specifically those of the
Gregorian calendar.
Digits may occur above U+0600 arabic number sign or U+0601 arabic sign
sanah; see Figure 9-6 for an example.
Punctuation. Sentences are delimited with U+11EF7 makasar passimbang, and sections
are terminated with U+11EF8 makasar end of section. Words are often, but not
always, separated by spaces. Line breaks normally appear after syllable boundaries.
Hyphens or other marks indicating continuance are not used.
The end of a text is often marked using a stylized rendering of the Arabic word tammat,
meaning "it is complete." There is no atomic character encoded for this symbol, so
the sequence should be represented using Arabic letters <ta + meem + shadda + ta>, where
the shadda is optional.
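As a small illustrative sketch (not part of the standard), the recommended <ta + meem +
shadda + ta> sequence corresponds to the Arabic code points U+062A, U+0645, U+0651, and
U+062A:

# Sketch: the stylized tammat symbol represented with Arabic letters, as
# recommended above; the shadda (U+0651) is optional.
tammat = "\u062A\u0645\u0651\u062A"   # teh + meem + shadda + teh
print(" ".join("U+%04X" % ord(ch) for ch in tammat))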
Chapter 18
East Asia
This chapter presents modern-day scripts used in East Asia, including the major writing
systems associated with Chinese, Japanese, and Korean, as well as several scripts for
minority languages spoken in southern China. The scripts discussed in this chapter are Han,
Bopomofo, Hiragana, Katakana, Hangul, Yi, Nüshu, Lisu, Miao, and the historic Tangut script.
The characters that are now called East Asian ideographs, and known as Han ideographs in
the Unicode Standard, were developed in China in the second millennium bce. The basic
system of writing Chinese using ideographs has not changed since that time, although the
set of ideographs used, their specific shapes, and the technologies involved have developed
over the centuries. The encoding of Chinese ideographs in the Unicode Standard is
described in Section 18.1, Han. For more on usage of the term ideograph, see “Logosylla-
baries” in Section 6.1, Writing Systems.
As civilizations developed surrounding China, they frequently adapted China’s ideographs
for writing their own languages. Japan, Korea, and Vietnam all borrowed and modified
Chinese ideographs for their own languages. Chinese is an isolating language, monosyl-
labic and noninflecting, and ideographic writing suits it well. As Han ideographs were
adopted for unrelated languages, however, extensive modifications were required.
Chinese ideographs were originally used to write Japanese, for which they are, in fact, ill
suited. As an adaptation, the Japanese developed two syllabaries, Hiragana and Katakana,
whose shapes are simplified or stylized versions of certain ideographs. (See Section 18.4,
Hiragana and Katakana.) Chinese ideographs are called kanji in Japanese and are still
used, in combination with Hiragana and Katakana, in modern Japanese.
In Korea, Chinese ideographs were originally used to write Korean, for which they are also
ill suited. The Koreans developed an alphabetic system, Hangul, discussed in Section 18.6,
Hangul. The shapes of Hangul syllables or the letter-like jamos from which they are com-
posed are not directly influenced by Chinese ideographs. However, the individual jamos
are grouped into syllabic blocks that resemble ideographs both visually and in the relation-
ship they have to the spoken language (one syllable per block). Chinese ideographs are
called hanja in Korean and are still used together with Hangul in South Korea for modern
Korean. The Unicode Standard includes a complete set of Korean Hangul syllables as well
as the individual jamos, which can also be used to write Korean. Section 3.12, Conjoining
Jamo Behavior, describes how to use the conjoining jamos and how to convert between the
two methods for representing Korean.
In Vietnam, a set of native ideographs was created for Vietnamese based on the same prin-
ciples used to create new ideographs for Chinese. These Vietnamese ideographs were used
through the beginning of the 20th century and are occasionally used in more recent sig-
nage and other limited contexts.
Yi was originally written using a set of ideographs invented in imitation of the Chinese.
Modern Yi as encoded in the Unicode Standard is a syllabary derived from these ideo-
graphs and is discussed in Section 18.7, Yi.
Bopomofo, discussed in Section 18.3, Bopomofo, is another recently invented syllabic sys-
tem, used to represent Chinese phonetics.
In all these East Asian scripts, the characters (Chinese ideographs, Japanese kana, Korean
Hangul syllables, and Yi syllables) are written within uniformly sized rectangles, usually
squares. Traditionally, the basic writing direction followed the conventions of Chinese
handwriting, in top-down vertical lines arranged from right to left across the page. Under
the influence of Western printing technologies, a horizontal, left-to-right directionality has
become common, and proportional fonts are seeing increased use, particularly in Japan.
Horizontal, right-to-left text is also found on occasion, usually for shorter texts such as
inscriptions or store signs. Diacritical marks are rarely used, although phonetic annota-
tions are not uncommon. Older editions of the Chinese classics sometimes use the ideo-
graphic tone marks (U+302A..U+302D) to indicate unusual pronunciations of characters.
Many older character sets include characters intended to simplify the implementation of
East Asian scripts, such as variant punctuation forms for text written vertically, halfwidth
forms (which occupy only half a rectangle), and fullwidth forms (which allow Latin letters
to occupy a full rectangle). These characters are included in the Unicode Standard for com-
patibility with older standards.
Appendix E, Han Unification History, describes how the diverse typographic traditions of
mainland China, Taiwan, Japan, Korea, and Vietnam have been reconciled to provide a
common set of ideographs in the Unicode Standard for all these languages and regions.
Nüshu is a siniform script devised by and for women to write the local Chinese dialect of
southeastern Hunan province, China. Nüshu is based on Chinese Han characters. Unlike
Chinese characters, however, Nüshu characters typically denote the phonetic values of syllables. Less often, Nüshu
characters are used as ideographs. Although very few fluent Nüshu users were alive in the
late twentieth century, the script has drawn national and international attention, leading to
the study and preservation of the script.
The Lisu script was developed in the early 20th century by using a combination of Latin
letters, rotated Latin letters, and Latin punctuation repurposed as tone letters, to create a
writing system for the Lisu language, spoken by large communities, mostly in Yunnan
province in China. It sees considerable use in China, where it has been an official script
since 1992.
The Miao script was created in 1904 by adapting Latin letter variants, English shorthand
characters, Miao pictographs, and Cree syllable forms. The script was originally developed
to write the Northeast Yunnan Miao language of southern China. Today it is also used to
write other Miao dialects and the languages of the Yi and Lisu nationalities of southern
China.
Tangut is a large, historic siniform ideographic script used to write the Tangut language, a
Tibeto-Burman language spoken from about the 11th century ce until the 16th century in
the area of present-day northwestern China. Tangut was re-discovered in the late 19th cen-
tury, and has been largely deciphered. Today the script is of interest to students and schol-
ars.
18.1 Han
CJK Unified Ideographs
The Unicode Standard contains a set of unified Han ideographic characters used in the
written Chinese, Japanese, and Korean languages. The term Han, derived from the Chi-
nese Han Dynasty, refers generally to Chinese traditional culture. The Han ideographic
characters make up a coherent script, which was traditionally written vertically, with the
vertical lines ordered from right to left. In modern usage, especially in technical works and
in computer-rendered text, the Han script is written horizontally from left to right and is
freely mixed with Latin or other scripts. When used in writing Japanese or Korean, the Han
characters are interspersed with other scripts unique to those languages (Hiragana and
Katakana for Japanese; Hangul syllables for Korean).
Although the term “CJK”—Chinese, Japanese, and Korean—is used throughout this text to
describe the languages that currently use Han ideographic characters, it should be noted
that earlier Vietnamese writing systems were based on Han ideographs. Consequently, the
term “CJKV” would be more accurate in a historical sense. Han ideographs are still used for
historical, religious, and pedagogical purposes in Vietnam. For more on usage of the term
ideograph, see “Logosyllabaries” in Section 6.1, Writing Systems.
The term "Han ideographic characters" is used within the Unicode Standard because it is a
common term traditionally used in Western texts, although "sinogram" is preferred by professional
linguists. Taken literally, the word “ideograph” applies only to some of the ancient original
character forms, which indeed arose as ideographic depictions. The vast majority of Han
characters were developed later via composition, borrowing, and other non-ideographic
principles, but the term “Han ideographs” remains in English usage as a conventional
cover term for the script as a whole.
The Han ideographic characters constitute a very large set, numbering in the tens of thou-
sands. They have a long history of use in East Asia. Enormous compendia of Han ideo-
graphic characters exist because of a continuous, millennia-long scholarly tradition of
collecting all Han character citations, including variant, mistaken, and nonce forms, into
annotated character dictionaries.
The Unicode Standard draws its unified Han character repertoire from a number of differ-
ent character set standards. These standards are grouped into a number of sources listed in
tables in Section E.3, CJK Sources.
Because of the large size of the Han ideographic character repertoire, and because of the
particular problems that the characters pose for standardizing their encoding, this charac-
ter block description is more extended than that for other scripts and is divided into several
subsections. The first subsection, “Blocks Containing Han Ideographs,” describes the way
in which the Unicode Standard divides Han ideographs into blocks. This subsection is fol-
lowed by an extended discussion of the characteristics of Han characters, with particular
attention being paid to the problem of unification of encoding for characters used for dif-
ferent languages. There is a formal statement of the principles behind the Unified Han
character encoding adopted in the Unicode Standard and the order of its arrangement. For
a detailed account of the background and history of development of the Unified Han char-
acter encoding, see Appendix E, Han Unification History.
Characters in the unified ideograph blocks are defined by the IRG, based on Han unifica-
tion principles explained later in this section.
The two compatibility ideographs blocks contain various duplicate or unifiable variant
characters encoded for round-trip compatibility with various legacy standards. For historic
reasons, the CJK Compatibility Ideographs block also contains twelve CJK unified ideo-
graphs. Those twelve ideographs are clearly labeled in the code charts for that block.
Extensions to the URO. The initial repertoire of the CJK Unified Ideographs block
included characters submitted to the IRG prior to 1992, consisting of commonly used
characters. That initial repertoire, also known as the Unified Repertoire and Ordering, or
URO, was derived entirely from the G, T, J, and K sources. It has subsequently been
extended with small sets of unified ideographs or ideographic components needed for
interoperability with various standards, or for other reasons, as shown in Table 18-2.
Han Ideographs for Slavonic Transcription. The URO includes twenty CJK Unified Ideo-
graphs, U+9FD6 through U+9FE9, which are used for transcribing Slavonic literary docu-
ments into Chinese. Renewed contact between the Russian and Chinese Empires from the
18th to the 20th centuries led to the translation of Slavonic literary documents into both
classical and vernacular Chinese. The Russian Mission in Beijing was a driving force
behind this effort, and many of these characters were coined by Archimandrite Gurias,
who was the head of the 14th Russian Mission (1858–1864). Although some existing CJK
Unified Ideographs can be used for transcribing Slavonic, these twenty characters are dis-
tinct. Many of these characters are unusual in that they represent syllables not usually
found in Chinese.
Other Large CJK Extensions. Characters in the CJK Unified Ideographs Extension A
block are rare and are not unifiable with characters in the CJK Unified Ideographs block.
They were submitted to the IRG during 1992–1998 and are derived entirely from the G, T,
J, K, and V sources.
The CJK Unified Ideographs Extension B block contains rare and historic characters that
are also not unifiable with characters in the CJK Unified Ideographs block. They were sub-
mitted to the IRG during 1998–2000.
The CJK Unified Ideographs Extension C through F blocks mostly contain rare, historic,
or uncommon characters that are not unifiable with characters in any previously encoded
CJK Unified Ideographs block. Extension D is distinctive in that it is made up of
urgently needed characters from various regions. Extension C ideographs were submitted
to the IRG during 2002–2006. Extension D ideographs were submitted to the IRG during
2006–2009. Extension E ideographs were submitted to the IRG during 2006–2013. Exten-
sion F ideographs were submitted during 2012–2015.
Principle for Extensions. The only principled difference in the unification work done by
the IRG on the unified ideograph blocks is that the Source Separation Rule (rule R1) was
applied only to the original CJK Unified Ideographs block and not to the extension blocks.
The Source Separation Rule states that ideographs that are distinctly encoded in a source
must not be unified. (For further discussion, see “Principles of Han Unification” later in
this section.)
The seven unified ideograph blocks are not closed repertoires. Each contains a small range
of reserved code points at the end of the block. Additional unified ideographs may eventu-
ally be encoded in those ranges—as has already occurred in the CJK Unified Ideographs
block itself. There is no guarantee that any such Han ideographic additions would be of the
same types or from the same sources as preexisting characters in the block, and implemen-
tations should be careful not to make hard-coded assumptions regarding the range of
assignments within the Han ideographic blocks in general.
Several Han characters that are unique to the U source and not unifiable with other char-
acters in the CJK Unified Ideographs block are found in the CJK Compatibility Ideographs
block. There are 12 of these characters: U+FA0E, U+FA0F, U+FA11, U+FA13, U+FA14,
U+FA1F, U+FA21, U+FA23, U+FA24, U+FA27, U+FA28, and U+FA29. The remaining
characters in the CJK Compatibility Ideographs block and the CJK Compatibility Ideo-
graphs Supplement block are either duplicates or unifiable variants of a character in one of
the blocks of unified ideographs.
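Implementations that need to distinguish those twelve code points from the true compatibility
ideographs in the same block can use a simple lookup. The following sketch is illustrative; the
function name is not part of the standard, and the check ignores unassigned code points within
the block.

# The twelve code points in the CJK Compatibility Ideographs block that are
# actually CJK unified ideographs (listed in the text above).
UNIFIED_IN_COMPAT_BLOCK = {
    0xFA0E, 0xFA0F, 0xFA11, 0xFA13, 0xFA14, 0xFA1F,
    0xFA21, 0xFA23, 0xFA24, 0xFA27, 0xFA28, 0xFA29,
}

def is_true_compatibility_ideograph(cp):
    # True for code points in the CJK Compatibility Ideographs block
    # (U+F900..U+FAFF) other than the twelve unified exceptions.
    # Unassigned code points in the block are not filtered out here.
    return 0xF900 <= cp <= 0xFAFF and cp not in UNIFIED_IN_COMPAT_BLOCK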
IICore. IICore (International Ideograph Core) is a set of important Han ideographs,
incorporating characters from all the defined blocks. This set of nearly 10,000 characters
has been developed by the IRG and represents the set of characters in everyday use
throughout East Asia. By covering the characters in IICore, developers guarantee that they
can handle all the needs of almost all of their customers. This coverage is of particular use
on devices such as cell phones or PDAs, which have relatively stringent resource limita-
tions. Characters in IICore are explicitly tagged as such in the Unihan Database (see Uni-
code Standard Annex #38, “Unicode Han Database (Unihan)”).
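IICore membership can be read from the Unihan data, in which each tagged character carries a
kIICore entry. The sketch below assumes the standard Unihan tab-separated line format and a
data file containing the kIICore field; the default file name and the function name are
assumptions, not part of the standard.

def load_iicore(path="Unihan_IRGSources.txt"):
    # Collect code points tagged with kIICore. Lines in the Unihan data files
    # look like:  U+4E00<TAB>kIICore<TAB>AGTJHKMP
    iicore = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            cp, field, _value = line.rstrip("\n").split("\t", 2)
            if field == "kIICore":
                iicore.add(int(cp[2:], 16))   # strip the "U+" prefix
    return iicore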
and native phonetic scripts (kana in Japan, hangul in Korea) as now used in the orthogra-
phies of Japan and Korea (see Table 18-3).
The evolution of character shapes and semantic drift over the centuries has resulted in
changes to the original forms and meanings. For example, the Chinese character 湯 tāng
(Japanese tou or yu, Korean thang), which originally meant “hot water,” has come to mean
“soup” in Chinese. “Hot water” remains the primary meaning in Japanese and Korean,
whereas “soup” appears in more recent borrowings from Chinese, such as “soup noodles”
(Japanese tanmen; Korean thangmyen). Still, the identical appearance and similarities in
meaning are dramatic and more than justify the concept of a unified Han script that tran-
scends language.
The “nationality” of the Han characters became an issue only when each country began to
create coded character sets (for example, China’s GB 2312-80, Japan’s JIS X 0208-1978, and
Korea’s KS C 5601-87) based on purely local needs. This problem appears to have arisen
more from the priority placed on local requirements and lack of coordination with other
countries, rather than out of conscious design. Nevertheless, the identity of the Han char-
acters is fundamentally independent of language, as shown by dictionary definitions,
vocabulary lists, and encoding standards.
Terminology. Several standard romanizations of the term used to refer to East Asian ideo-
graphic characters are commonly used. They include hànzì (Chinese), kanzi (Japanese),
kanji (colloquial Japanese), hanja (Korean), and chữ Hán (Vietnamese). The standard
English translations for these terms are interchangeable: Han character, Han ideographic
character, East Asian ideographic character, or CJK ideographic character. For clarity, the
Unicode Standard uses some subset of the English terms when referring to these charac-
ters. The term Kanzi is used in reference to a specific Japanese government publication.
The unrelated term KangXi (which is a Chinese reign name, rather than another romaniza-
tion of “Han character”) is used only when referring to the primary dictionary used for
determining Han character arrangement in the Unicode Standard. (See Table 18-7.)
Distinguishing Han Character Usage Between Languages. There is some concern that
unifying the Han characters may lead to confusion because they are sometimes used differ-
ently by the various East Asian languages. Computationally, Han character unification
presents no more difficulty than employing a single Latin character set that is used to write
languages as different as English and French. Programmers do not expect the characters
“c”, “h”, “a”, and “t” alone to tell us whether chat is a French word for cat or an English
word meaning “informal talk.” Likewise, we depend on context to identify the American
hood (of a car) with the British bonnet. Few computer users are confused by the fact that
ASCII can also be used to represent such words as the Welsh word ynghyd, which are
strange looking to English eyes. Although it would be convenient to identify words by lan-
guage for programs such as spell-checkers, it is neither practical nor productive to encode
a separate Latin character set for every language that uses it.
Similarly, the Han characters are often combined to “spell” words whose meaning may not
be evident from the constituent characters. For example, the two characters “to cut” and
“hand” mean “postage stamp” in Japanese, but the compound may appear to be nonsense
to a speaker of Chinese or Korean (see Figure 18-1).
Even within one language, a computer requires context to distinguish the meanings of
words represented by coded characters. The word chuugoku in Japanese, for example, may
refer to China or to a district in central west Honshuu (see Figure 18-2).
Coding these two characters as four so as to capture this distinction would probably cause
more confusion and still not provide a general solution. The Unicode Standard leaves the
issues of language tagging and word recognition up to a higher level of software and does
not attempt to encode the language of the Han characters.
Simplified and Traditional Chinese. There are currently two main varieties of written
Chinese: "simplified Chinese" (jiǎntǐzì), used in most parts of the People's Republic of
China (PRC) and Singapore, and "traditional Chinese" (fántǐzì), used predominantly in
the Hong Kong and Macao SARs, Taiwan, and overseas Chinese communities. The process
of interconverting between the two is a complex one. This complexity arises largely because
a single simplified form may correspond to multiple traditional forms, such as U+53F0 台,
which is a traditional character in its own right and the simplified form for U+6AAF 檯,
U+81FA 臺, and U+98B1 颱. Moreover, vocabulary differences have arisen between Man-
darin as spoken in Taiwan and Mandarin as spoken in the PRC, the most notable of which
is the usual name of the language itself: guóyǔ (the National Language) in Taiwan and
pǔtōnghuà (the Common Speech) in the PRC. Merely converting the character content of
a text from simplified Chinese to the appropriate traditional counterpart is insufficient to
change a simplified Chinese document to traditional Chinese, or vice versa. (The vast
majority of Chinese characters are the same in both simplified and traditional Chinese.)
There are two PRC national standards, GB 2312-80 and GB 12345-90, which are intended
to represent simplified and traditional Chinese, respectively. The character repertoires of
the two are the same, but the simplified forms occur in GB 2312-80 and the traditional
ones in GB 12345-90. These are both part of the IRG G source, with traditional forms and
simplified forms separated where they differ. As a result, the Unicode Standard contains a
number of distinct simplifications for characters, such as U+8AAC 説 and U+8BF4 说.
While there are lists of official simplifications published by the PRC, most of these are
obtained by applying a few general principles to specific areas. In particular, there is a set of
radicals (such as U+2F94 言 kangxi radical speech, U+2F99 貝 kangxi radical shell,
U+2FA8 門 kangxi radical gate, and U+2FC3 鳥 kangxi radical bird) for which sim-
plifications exist (U+2EC8 讠 cjk radical c-simplified speech, U+2EC9 贝 cjk radi-
cal c-simplified shell, U+2ED4 门 cjk radical c-simplified gate, and U+2EE6 鸟
cjk radical c-simplified bird). The basic technique for simplifying a character contain-
ing one of these radicals is to substitute the simplified radical, as in the previous example.
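The four radical pairs cited above can be captured in a small mapping. The sketch below covers
only those four examples and is in no way a complete simplification table; the names are
illustrative.

# KangXi radicals named above mapped to their C-simplified CJK radicals.
KANGXI_TO_C_SIMPLIFIED = {
    0x2F94: 0x2EC8,  # kangxi radical speech -> cjk radical c-simplified speech
    0x2F99: 0x2EC9,  # kangxi radical shell  -> cjk radical c-simplified shell
    0x2FA8: 0x2ED4,  # kangxi radical gate   -> cjk radical c-simplified gate
    0x2FC3: 0x2EE6,  # kangxi radical bird   -> cjk radical c-simplified bird
}

def simplify_radical(cp):
    # Return the C-simplified radical for a known KangXi radical, else the input.
    return KANGXI_TO_C_SIMPLIFIED.get(cp, cp)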
The Unicode Standard does not explicitly encode all simplified forms for traditional Chi-
nese characters. Where the simplified and traditional forms exist as different encoded char-
acters, each should be used as appropriate. The Unicode Standard does not specify how to
represent a new simplified form (or, more rarely, a new traditional form) that can be
derived algorithmically from an encoded traditional form (simplified form).
Dialects and Early Forms of Chinese. Chinese is not a single language, but a complex of
spoken forms that share a single written form. Although these spoken forms are referred to
as dialects, they are actually mutually unintelligible and distinct languages. Virtually all
modern written Chinese is Mandarin, the dominant language in both the PRC and Tai-
wan. Speakers of other Chinese languages learn to read and write Mandarin, although they
pronounce it using the rules of their own language. (This would be like having Spanish
children read and write only French, but pronouncing it as if it were Spanish.) The major
non-Mandarin Chinese languages are Cantonese (spoken in the Hong Kong and Macao
SARs, in many overseas Chinese communities, and in much of Guangdong province), Wu,
Min, Hakka, Gan, and Xiang.
Prior to the 20th century, the standard form of written Chinese was literary Chinese, a
form derived from the classical Chinese written, but probably not spoken, by Confucius in
the sixth century bce.
The ideographic repertoire of the Unicode Standard is sufficient for all but the most spe-
cialized texts of modern Chinese, literary Chinese, and classical Chinese. Preclassical Chi-
nese, written using seal forms or oracle bone forms, has not been systematically
incorporated into the Unicode Standard, because those very early, historic forms differed
substantially from the classic and modern forms of Han characters. They require investiga-
tion and encoding as distinct historic scripts.
Among modern Chinese languages, Cantonese is occasionally found in printed materials;
the others are almost never seen in printed form. There is less standardization for the ideo-
graphic repertoires of these languages, and no fully systematic effort has been undertaken
to catalog the nonstandard ideographs they use. Because of efforts on the part of the gov-
ernment of the Hong Kong SAR, however, the current ideographic repertoire of the Uni-
code Standard should be adequate for many—but not all—written Cantonese texts.
Sorting Han Ideographs. The Unicode Standard does not define a method by which ideo-
graphic characters are sorted; the requirements for sorting differ by locale and application.
Possible collating sequences include phonetic, radical-stroke (KangXi, Xinhua Zidian, and
so on), four-corner, and total stroke count. Raw character codes alone are seldom sufficient
to achieve a usable ordering in any of these schemes; ancillary data are usually required.
(See Table 18-7 for a summary of the authoritative sources used to determine the order of
Han ideographs in the code charts.)
Character Glyphs. In form, Han characters are monospaced. Every character takes the
same vertical and horizontal space, regardless of how simple or complex its particular form
is. This practice follows from the long history of printing and typographical practice in
China, which traditionally placed each character in a square cell. When Han characters are
written vertically, there are also a number of named cursive styles, but the cursive forms of
the characters tend to be quite idiosyncratic and are not implemented in general-purpose
Han character fonts for computers.
There may be a wide variation in the glyphs used in different countries and for different
applications. The most commonly used typefaces in one country may not be used in others.
The types of glyphs used to depict characters in the Han ideographic repertoire of the Uni-
code Standard have been constrained by available fonts. Users are advised to consult
authoritative sources for the appropriate glyphs for individual markets and applications. It
is assumed that most Unicode implementations will provide users with the ability to select
the font (or mixture of fonts) that is most appropriate for a given locale.
Figure 18-3 illustrates the three-dimensional conceptual model for Han characters, with the
X axis representing the semantic attribute, the Y axis the abstract shape, and the Z axis the
actual shape (typeface).
The semantic attribute (represented along the X axis) distinguishes characters by meaning
and usage. Distinctions are made between entirely unrelated characters, such as the character
meaning "marsh" and 機 (machine), as well as extensions or borrowings beyond the original
semantic cluster, such as 机 used as a phonetic borrowing (the simplified form of 機) and 机
in its original meaning, "table."
The abstract shape attribute (the Y axis) distinguishes the variant forms of a single charac-
ter with a single semantic attribute (that is, a character with a single position on the X axis).
The actual shape (typeface) attribute (the Z axis) is for differences of type design (the actual
shape used in imaging) of each variant form.
Z-axis typeface and stylistic differences are generally ignored for the purpose of encoding
Han ideographs, but can be represented in text by the use of variation sequences; see
Section 23.4, Variation Selectors.
Unification Rules
The following rules were applied during the process of merging Han characters from the
different source character sets.
R1 Source Separation Rule. If two ideographs are distinct in a primary source stan-
dard, then they are not unified.
• This rule is sometimes called the round-trip rule because its goal is to facilitate a
round-trip conversion of character data between an IRG source standard and
the Unicode Standard without loss of information.
• This rule was applied only for the work on the original CJK Unified Ideographs
block [also known as the Unified Repertoire and Ordering (URO)]. The IRG
dropped this rule in 1992 and will not use it in future work.
Figure 18-4 illustrates six variants of the CJK ideograph meaning “sword.”
Each of the six variants in Figure 18-4 is separately encoded in one of the primary source
standards—in this case, J0 (JIS X 0208-1990), as shown in Table 18-4.
Because the six sword characters are historically related, they are not subject to disunifica-
tion by the Noncognate Rule (R2) and thus would ordinarily have been considered for pos-
sible abstract shape-based unification by R3. Under that rule, the fourth and fifth variants
would probably have been unified for encoding. However, the Source Separation Rule
required that all six variants be separately encoded, precluding them from any consider-
ation of shape-based unification. Further variants of the “sword” ideograph, U+5251 and
U+528E, are also separately encoded because of application of the Source Separation
Rule—in that case applied to one or more Chinese primary source standards, rather than
to the J0 Japanese primary source standard.
R2 Noncognate Rule. In general, if two ideographs are unrelated in historical deriva-
tion (noncognate characters), then they are not unified.
For example, the ideographs in Figure 18-5, although visually quite similar, are neverthe-
less not unified because they are historically unrelated and have distinct meanings.
土 (earth) ≠ 士 (warrior, scholar)
Abstract Shape
Two-Level Classification. Within the three-dimensional model, characters are analyzed in a
two-level classification. The two-level classification distinguishes characters by abstract
shape (Y axis) and actual shape of a particular typeface (Z axis). Variant forms are identified
based on the difference of abstract shapes.
To determine differences in abstract shape and actual shape, the structure and features of
each component of an ideograph are analyzed as follows.
Ideographic Component Structure. The component structure of each ideograph is exam-
ined. A component is a geometrical combination of primitive elements. Various ideographs
can be configured from these components, used alone or in conjunction with other components.
Components can themselves be combined to form more complicated components. There-
fore, an ideograph can be defined as a component tree with the entire ideograph
as the root node and with the bottom nodes consisting of primitive elements (see
Figure 18-6 and Figure 18-7).
Ideograph Features. The following features of each ideograph to be compared are exam-
ined:
• Number of components
• Relative positions of components in each complete ideograph
• Structure of a corresponding component
Examples of ideograph pairs that are not unified include pairs with the same number and
same relative position of components but with a corresponding component structured
differently, and pairs in which a component contains a different radical.
Differences in the actual shapes of ideographs that have been unified are illustrated in
Table 18-6.
A note to Table 18-6 indicates that some of the ideographs shown there, which have the same
abstract shape, would have been unified except for the Source Separation Rule.
When a character is found in the KangXi Zidian, it follows the KangXi Zidian order. When
it is not found in the KangXi Zidian and it is found in Dai Kan-Wa Jiten, it is given a posi-
tion extrapolated from the KangXi position of the preceding character in Dai Kan-Wa Jiten.
When it is not found in either KangXi or Dai Kan-Wa, then the Hanyu Da Zidian and Dae
Jaweon dictionaries are consulted in a similar manner.
Ideographs with simplified KangXi radicals are placed in a group following the traditional
KangXi radical from which the simplified radical is derived. For example, characters with
the simplified radical 讠 corresponding to KangXi radical 言 follow the last nonsimplified
character having 言 as a radical. The arrangement for these simplified characters is that of
the Hanyu Da Zidian.
The few characters that are not found in any of the four dictionaries are placed following
characters with the same KangXi radical and stroke count. The radical-stroke order that
results is a culturally neutral order. It does not exactly match the order found in common
dictionaries.
Information for sorting all CJK ideographs by the radical-stroke method is found in the
Unihan Database (see Unicode Standard Annex #38, “Unicode Han Database (Unihan)”).
It should be used if characters from the various blocks containing ideographs (see
Table 18-1) are to be properly interleaved. Note, however, that there is no standard way of
ordering characters with the same radical-stroke count; for most purposes, Unicode code
point order would be as acceptable as any other way.
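As a hedged sketch of such interleaving, the following code sorts ideographs by radical number,
simplified-radical group, and residual stroke count using the kRSUnicode field from the Unihan
data, falling back to code point order. The default file name and the exact key composition are
assumptions, not a normative collation.

import re

def load_radical_stroke(path="Unihan_IRGSources.txt"):
    # Read kRSUnicode values such as "U+8BF4<TAB>kRSUnicode<TAB>149'.6" and
    # return {code point: (radical, simplified_flag, residual_strokes)}.
    rs = {}
    pattern = re.compile(r"(\d+)('{0,2})\.(\d+)")
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            cp, field, value = line.rstrip("\n").split("\t", 2)
            if field == "kRSUnicode":
                m = pattern.match(value.split()[0])   # first value is primary
                if m:
                    rs[int(cp[2:], 16)] = (int(m.group(1)), len(m.group(2)), int(m.group(3)))
    return rs

def radical_stroke_sorted(chars, rs):
    # Sort by radical, then simplified-radical group, then residual strokes,
    # with code point order as the final tiebreaker.
    return sorted(chars, key=lambda ch: (*rs.get(ord(ch), (999, 0, 0)), ord(ch)))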
Details regarding the form of the online charts for the CJK unified ideographs are dis-
cussed in Section 24.2, CJK Ideographs.
Radical-Stroke Indices
To expedite locating specific Han ideographic characters in the code charts, radical-stroke
indices are provided on the Unicode website. An interactive radical-stroke index page
enables queries by specific radical numbers and stroke counts. Two fully formatted tradi-
tional radical-stroke indices are also posted in PDF format. The larger of those provides a
radical-stroke index for all of the Han ideographic characters in the Unicode Standard,
including CJK compatibility ideographs. There is also a more compact radical-stroke index
limited to the IICore set of 9,810 CJK unified ideographs in common usage. The following
text describes how radical-stroke indices work for Han ideographic characters and explains
the particular adaptations which have been made for the Unicode radical-stroke indices.
Under the traditional radical-stroke system, each Han ideograph is considered to be writ-
ten with one of a number of different character elements or radicals and a number of addi-
tional strokes. For example, the character 頭 has the radical 頁 and seven additional
strokes. To find the character 頭 within a dictionary, one would first locate the section for
its radical, 頁, and then find the subsection for characters with seven additional strokes.
This method is complicated by the fact that there are occasional ambiguities in the count-
ing of strokes. Even worse, some characters are considered by different authorities to be
written with different radicals; there is not, in fact, universal agreement about which set of
radicals to use for certain characters, particularly with the increased use of simplified
characters.
The most influential authority for radical-stroke information is the eighteenth-century
KangXi dictionary, which contains 214 radicals. The main problem in using KangXi radi-
cals today is that many simplified characters are difficult to classify under any of the 214
KangXi radicals. As a result, various modern radical sets have been introduced. None, how-
ever, is in general use, and the 214 KangXi radicals remain the best known. See “CJK and
KangXi Radicals” in the following text.
The Unicode radical-stroke charts are based on the KangXi radicals. The Unicode Stan-
dard follows a number of different sources for radical-stroke classification. Where two
sources are at odds as to radical or stroke count for a given character, the character is
shown in both positions in the radical-stroke charts.
Simplified characters are, as a rule, considered to have the same radical as their traditional
forms and are found under the appropriate radical. For example, the simplified character 说
is found under the same radical, 言, as its traditional form (說).
CJK Compatibility Ideographs: U+F900–U+FAFF
Twelve characters in the CJK Compatibility Ideographs block (U+FA0E, U+FA0F, U+FA11,
U+FA13, U+FA14, U+FA1F, U+FA21, U+FA23, U+FA24, U+FA27, U+FA28, and U+FA29) are
actually CJK unified ideographs. These 12 characters are not duplicates and should be treated
as a small extension to the set of unified ideographs.
Except for the 12 unified ideographs just enumerated, CJK compatibility ideographs from
this block are not used in Ideographic Description Sequences.
An additional 59 compatibility ideographs are found from U+FA30 to U+FA6A. They are
included in the Unicode Standard to provide full round-trip compatibility with the ideo-
graphic repertoire of JIS X 0213:2000 and should not be used for any other purpose.
An additional three compatibility ideographs are encoded at the range U+FA6B to
U+FA6D. They are included in the Unicode Standard to provide full round-trip compati-
bility with the ideographic repertoire of the Japanese television standard, ARIB STD-B24,
and should not be used for any other purpose.
An additional 106 compatibility ideographs are encoded at the range U+FA70 to U+FAD9.
They are included in the Unicode Standard to provide full round-trip compatibility with
the ideographic repertoire of KPS 10721-2000. They should not be used for any other pur-
pose.
The names for the compatibility ideographs are also algorithmically derived. Thus the
name for the compatibility ideograph U+F900 is cjk compatibility ideograph-f900. See
the formal definition of the Name property in Section 4.8, Name.
All of the compatibility ideographs in this block, except for the 12 unified ideographs, have
standardized variation sequences defined in StandardizedVariants.txt. See the discussion
in Section 23.4, Variation Selectors for more details.
Kanbun: U+3190–U+319F
This block contains a set of Kanbun marks used in Japanese texts to indicate the Japanese
reading order of classical Chinese texts. These marks are not encoded in any other current
character encoding standards but are widely used in literature. They are typically written in
an annotation style to the left of each line of vertically rendered Chinese text. For more
details, see JIS X 4051.
Semantics. Characters in the CJK and KangXi Radicals blocks should never be used as
ideographs. They have different properties and meanings. U+2F00 kangxi radical one is
not equivalent to U+4E00 cjk unified ideograph-4e00, for example. The former is to be
treated as a symbol, the latter as a word or part of a word.
The characters in the CJK and KangXi Radicals blocks are compatibility characters. Except
in cases where it is necessary to make a semantic distinction between a Chinese character
in its role as a radical and the same Chinese character in its role as an ideograph, the char-
acters from the Unified Ideographs blocks should be used instead of the compatibility rad-
icals. To emphasize this difference, radicals may be given a distinct font style from their
ideographic counterparts.
18.2 Ideographic Description Characters
Several siniform historic scripts, including Tangut, Jurchen, and Khitan, were developed by
cultures influenced by the Han script. Like the Han script, they are logographic in nature.
Furthermore, they built up
characters using radicals and components, and with side-by-side and top-to-bottom stack-
ing very similar in structure to the way CJK ideographs are composed.
The general usefulness of Ideographic Description Sequences for describing unencoded
characters and the applicability of the characters in the Ideographic Description block to
description of siniform logographs mean that the syntax for Ideographic Description
Sequences can be generalized to extend to additional East Asian logographic scripts.
Ideographic Description Sequences. Ideographic Description Sequences are defined by
the following grammar. The list of characters associated with the Ideographic and Radical
properties can be found in the Unicode Character Database. In particular, the Ideographic
property is intended to apply to other siniform ideographic systems, in addition to CJK
ideographs. Nüshu ideographs, Tangut ideographs, and Tangut components can also be
used as elements of an Ideographic Description Sequence.
IDS := Ideographic | Radical | CJK_Stroke | Private Use | U+FF1F
| IDS_BinaryOperator IDS IDS
| IDS_TrinaryOperator IDS IDS IDS
CJK_Stroke := U+31C0 | U+31C1 | ... | U+31E3
IDS_BinaryOperator := U+2FF0 | U+2FF1 | U+2FF4 | U+2FF5 | U+2FF6 | U+2FF7 |
U+2FF8 | U+2FF9 | U+2FFA | U+2FFB
IDS_TrinaryOperator := U+2FF2 | U+2FF3
Previous versions of the Unicode Standard imposed various limits on the length of a
sequence or parts of it, and restricted the use of IDSes to CJK Unified Ideographs. Those
limits and restrictions are no longer imposed by the standard. Although not formally pro-
scribed by the syntax, it is not a good idea to mix scripts in any given Ideographic Descrip-
tion Sequence. For example, it is not meaningful to mix CJK ideographs or CJK radicals
with Tangut ideographs or components in a single description.
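A minimal sketch of a recursive checker for the IDS grammar given above follows. The
Ideographic and Radical classes are only approximated here by code point ranges; a real
implementation should take those classes from the Unicode Character Database, and the function
names are illustrative.

IDS_BINARY = {0x2FF0, 0x2FF1, 0x2FF4, 0x2FF5, 0x2FF6,
              0x2FF7, 0x2FF8, 0x2FF9, 0x2FFA, 0x2FFB}
IDS_TRINARY = {0x2FF2, 0x2FF3}

def _is_operand(cp):
    return (0x3400 <= cp <= 0x9FFF        # CJK unified ideographs and Extension A (approximation)
            or 0x20000 <= cp <= 0x2FA1F   # Extensions B..F and supplement (approximation)
            or 0x2E80 <= cp <= 0x2FD5     # CJK and KangXi radicals (approximation)
            or 0x31C0 <= cp <= 0x31E3     # CJK strokes
            or 0xE000 <= cp <= 0xF8FF     # private use (BMP only, for brevity)
            or cp == 0xFF1F)              # U+FF1F FULLWIDTH QUESTION MARK

def _parse(cps, i):
    # Return the index just past one well-formed IDS starting at cps[i].
    if i >= len(cps):
        raise ValueError("unexpected end of sequence")
    cp = cps[i]
    if cp in IDS_BINARY or cp in IDS_TRINARY:
        operands = 3 if cp in IDS_TRINARY else 2
        i += 1
        for _ in range(operands):
            i = _parse(cps, i)
        return i
    if _is_operand(cp):
        return i + 1
    raise ValueError("U+%04X cannot start an IDS" % cp)

def is_valid_ids(text):
    cps = [ord(ch) for ch in text]
    try:
        return bool(cps) and _parse(cps, 0) == len(cps)
    except ValueError:
        return False

For example, is_valid_ids("\u2FF1\u4E95\u86D9") returns True for the sequence
<U+2FF1, U+4E95, U+86D9> used in the rendering example later in this section.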
The operators indicate the relative graphic positions of the operands running from left to
right, from top to bottom, or from enclosure to enclosed. A user wishing to represent an
unencoded ideograph will need to analyze its structure to determine how to describe it
using an Ideographic Description Sequence. As a rule, it is best to use the natural radical-
phonetic division for an ideograph if it has one; beyond that, the shortest possible Ideographic
Description Sequence is preferred. There is no requirement, however, that these guidelines be
followed.
Figure 18-8 provides an example IDS for each of the IDCs, along with annotated versions
of the IDCs that indicate the order of their operands.
Figure 18-9 illustrates the use of this grammar to provide descriptions of encoded or unen-
coded ideographs. Examples 9–13 illustrate more complex Ideographic Description
Sequences showing the use of some of the less common operators.
Equivalence. Many unencoded ideographs can be described in more than one way using
this algorithm, either because the pieces of a description can themselves be broken down
further (examples 1–3 in Figure 18-9) or because duplications appear within the Unicode
Standard (examples 5–8 in Figure 18-9).
The Unicode Standard does not define equivalence for two Ideographic Description
Sequences that are not identical. Figure 18-9 contains numerous examples illustrating how
different Ideographic Description Sequences might be used to describe the same ideo-
graph.
In particular, Ideographic Description Sequences should not be used to provide alternative
graphic representations of encoded ideographs in data interchange. Searching, collation,
and other content-based text operations would then fail.
Interaction with the Ideographic Variation Indicator. As with ideographs proper, U+303E
ideographic variation indicator may be placed before an Ideographic Description
Sequence to indicate that the description is merely an approximation of the original ideo-
graph desired. A sequence of characters that includes an Ideographic Variation Mark is not
an Ideographic Description Sequence.
Rendering. Ideographic Description characters are visible characters and are not to be
treated as control characters. Thus the sequence U+2FF1 U+4E95 U+86D9 must have a
distinct appearance from U+4E95 U+86D9.
Figure 18-9 shows encoded and unencoded ideographs together with Ideographic Description
Sequences that describe them, for example <U+2FF1, U+4E95, U+86D9>,
<U+2FF0, U+6C34, U+2FF1, U+53E3, U+5DDB>, and <U+2FF8, U+5382, U+2FF1, U+4ECA, U+6B62>.
18.3 Bopomofo
Bopomofo: U+3100–U+312F, U+31A0–U+31BF
Bopomofo constitute a set of characters used to annotate or teach the phonetics of Chinese,
primarily the standard Mandarin language. These characters are used in dictionaries and
teaching materials, but not in the actual writing of Chinese text. The formal Chinese names
for this alphabet are Zhuyin-Zimu (“phonetic alphabet”) and Zhuyin-Fuhao (“phonetic
symbols”), but the informal term “Bopomofo” (analogous to “ABCs”) provides a more ser-
viceable English name and is also used in China. The Bopomofo were developed as part of
a populist literacy campaign following the 1911 revolution; thus they are acceptable to all
branches of modern Chinese culture, although in the People’s Republic of China their
function has been largely taken over by the Pinyin romanization system.
Bopomofo is a hybrid writing system—part alphabet and part syllabary. The letters of
Bopomofo are used to represent either the initial parts or the final parts of a Chinese sylla-
ble. The initials are just consonants, as for an alphabet. The finals constitute either simple
vowels, vocalic diphthongs, or vowels plus nasal consonant combinations. Because a num-
ber of Chinese syllables have no initial consonant, the Bopomofo letters for finals may con-
stitute an entire syllable by themselves. More typically, a Chinese syllable is represented by
one initial consonant letter, followed by one final letter. In some instances, a third letter is
used to indicate a complex vowel nucleus for the syllable. For example, the syllable that
would be written luan in Pinyin is segmented l-u-an in Bopomofo—that is, <U+310C,
U+3128, U+3122>.
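As a small illustration (a sketch, not part of the standard), the Bopomofo spelling of luan
cited above can be built directly from those three code points:

# Sketch: the syllable written "luan" in Pinyin, segmented l-u-an in Bopomofo.
luan = "\u310C\u3128\u3122"   # bopomofo letters l, u, an
print(luan, " ".join("U+%04X" % ord(ch) for ch in luan))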
Standards. The standard Mandarin set of Bopomofo is included in the People’s Republic
of China standards GB 2312 and GB 18030, and in the Republic of China (Taiwan) stan-
dard CNS 11643.
Mandarin Tone Marks. Small modifier letters used to indicate the five Mandarin tones are
part of the Bopomofo system. In the Unicode Standard they have been unified into the
Modifier Letter range, as shown in Table 18-8.
Standard Mandarin Bopomofo. The order of the Mandarin Bopomofo letters U+3105..
U+3129 is standard worldwide. The code offset of the first letter U+3105 bopomofo let-
ter b from a multiple of 16 is included to match the offset in the ISO-registered standard
GB 2312.
Extended Bopomofo. To represent the sounds of Chinese dialects other than Mandarin,
the basic Bopomofo set U+3105..U+3129 has been augmented by additional phonetic
characters. These extensions are much less broadly recognized than the basic Mandarin
set. The three extended Bopomofo characters U+312A..U+312C are cited in some standard
reference works, such as the encyclopedia Xin Ci Hai. Another set of 24 extended Bopo-
mofo, encoded at U+31A0..U+31B7, was designed in 1948 to cover additional sounds of
the Minnan and Hakka dialects. The extensions are used together with the main set of
Bopomofo characters to provide a complete phonetic orthography for those dialects. There
are no standard Bopomofo letters for the phonetics of Cantonese or several other Southern
Chinese dialects.
The small characters encoded at U+31B4..U+31B7 represent syllable-final consonants not
present in standard Mandarin or in Mandarin dialects. They have the same shapes as
Bopomofo “b”, “d”, “k”, and “h”, respectively, but are rendered in a smaller form than the
initial consonants; they are also generally shown close to the syllable medial vowel charac-
ter. These final letters are encoded separately so that the Minnan and Hakka dialects can be
represented unambiguously in plain text without having to resort to subscripting or other
fancy text mechanisms to represent the final consonants.
Three Bopomofo letters for sounds found in non-Chinese languages are encoded in the
range U+31B8..U+31BA. These characters are used in the Hmu and Ge languages, mem-
bers of the Hmong-Mien (or Miao-Yao) language family, spoken primarily in southeastern
Guizhou. The characters are part of an obsolete orthography for Hmu and Ge devised by
the missionary Maurice Hutton in the 1920s and 1930s. A small group of Hmu Christians
are still using a hymnal text written by Hutton that contains these characters.
U+312E bopomofo letter o with dot above, which was initially thought to be a CJK
Unified Ideograph because it appears in Japan’s Dai Kan-Wa Jiten as a kanji, is the original
form of U+311C ㄜ bopomofo letter e. The Mandarin sound “e” was originally written
as U+311B ㄛ bopomofo letter o with a dot above. This dotted form was later replaced
by a new character that uses a vertical stroke instead of a dot, which is U+311C ㄜ bopo-
mofo letter e.
Extended Bopomofo Tone Marks. In addition to the Mandarin tone marks enumerated in
Table 18-8, other tone marks appropriate for use with the extended Bopomofo transcrip-
tions of Minnan and Hakka can be found in the Modifier Letter range, as shown in
Table 18-9. The “departing tone” refers to the qusheng in traditional Chinese tonal analysis,
with the yin variant historically derived from voiceless initials and the yang variant from
voiced initials. Southern Chinese dialects in general maintain more tonal distinctions than
Mandarin does.
Rendering of Bopomofo. Bopomofo is rendered from left to right in horizontal text, but
also commonly appears in vertical text. It may be used by itself in either orientation, but
typically appears in interlinear annotation of Chinese (Han character) text. Children’s
books are often completely annotated with Bopomofo pronunciations for every character.
This interlinear annotation is structurally quite similar to the system of Japanese ruby
annotation, but it has additional complications that result from the explicit usage of tone
marks with the Bopomofo letters.
U+3127 bopomofo letter i has notable variation in rendering in horizontal and vertical
layout contexts. In traditional typesetting, the stroke of the glyph was chosen to stand per-
pendicular to the writing direction. In that practice, the glyph is shown as a horizontal
stroke in vertically set text, and as a vertical stroke in horizontally set text. However, mod-
ern digital typography has changed this practice. All modern fonts use a horizontal stroke
glyph for U+3127, and that form is generally used in both horizontal and vertical layout
contexts. In the Unicode Standard, the form in the charts follows the modern practice,
showing a horizontal stroke for the glyph; the vertical stroke form is considered to be an
occasionally occurring variant. Earlier versions of the standard followed traditional typo-
graphic practice, and showed a vertical stroke glyph in the charts.
In horizontal interlineation, the Bopomofo is generally placed above the corresponding
Han character(s); tone marks, if present, appear at the end of each syllabic group of Bopo-
mofo letters. In vertical interlineation, the Bopomofo is generally placed on the right side
of the corresponding Han character(s); tone marks, if present, appear in a separate inter-
linear row to the right side of the vowel letter. When using extended Bopomofo for Minnan
and Hakka, the tone marks may also be mixed with European digits 0–9 to express changes
in actual tonetic values resulting from juxtaposition of basic tones.
Katakana: U+30A0–U+30FF
Katakana is the noncursive syllabary used to write non-Japanese (usually Western) words
phonetically in Japanese. It is also used to write Japanese words with visual emphasis.
Katakana syllables are phonetically equivalent to corresponding Hiragana syllables.
Katakana contains two characters, U+30F5 katakana letter small ka and U+30F6
katakana letter small ke, that are used in special Japanese spelling conventions (for
example, the spelling of place names that include archaic Japanese connective particles).
Standards. The Katakana block is based on the JIS X 0208-1990 standard. Some additions
are based on the JIS X 0213:2000 standard.
Punctuation-like Characters. U+30FB katakana middle dot is used to separate words
when writing non-Japanese phrases. U+30A0 katakana-hiragana double hyphen is a
delimiter occasionally used in analyzed Katakana or Hiragana textual material.
U+1B000 katakana letter archaic e and U+1B001 hiragana letter archaic ye are each
derived from a kanji source; their pronunciations are e and ye, respectively.
The hentaigana that would have been named hentaigana letter e-1 has been uni-
fied with the existing U+1B001 hiragana letter archaic ye and is aliased accordingly.
When sorting, U+1B001 hiragana letter archaic ye should appear between U+1B00E
hentaigana letter u-5 and U+1B00F hentaigana letter e-2.
The 285 remaining characters in these blocks are additional hentaigana that represent
obsolete or nonstandard hiragana that were in use in Japan up until the script reform of
1900 that standardized the use of a single character for each syllable. Hentaigana are still in
use today in Japan, but are limited to Japan’s family registry (koseki in Japanese) and spe-
cialized uses, such as business signage and other decor that are specifically designed to con-
vey a feeling of nostalgia or traditional charm.
Each hentaigana is associated with a single parent unified ideograph, a cursive form of
which served as the basis for its shape, and generally corresponds to a single syllable. Hen-
taigana that correspond to the same syllable, but that do not share the same parent unified
ideograph have different shapes and are therefore encoded separately. For example,
U+1B006 hentaigana letter i-1 through U+1B009 hentaigana letter i-4 all corre-
spond to the same syllable i (U+3044 hiragana letter i), but have parent unified ideo-
graphs U+4EE5 以, U+4F0A 伊, U+610F 意, and U+79FB 移, respectively, as shown in
Figure 18-11.
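The relationship just described, in which each hentaigana has a single parent unified ideograph
and corresponds to a syllable, can be modeled as a small mapping. The sketch below covers only
the i-series examples cited above; the structure and names are illustrative.

# Hentaigana -> (corresponding hiragana syllable, parent unified ideograph),
# limited to the examples in the text above.
HENTAIGANA_INFO = {
    0x1B006: (0x3044, 0x4EE5),  # hentaigana letter i-1: syllable i, parent 以
    0x1B007: (0x3044, 0x4F0A),  # hentaigana letter i-2: syllable i, parent 伊
    0x1B008: (0x3044, 0x610F),  # hentaigana letter i-3: syllable i, parent 意
    0x1B009: (0x3044, 0x79FB),  # hentaigana letter i-4: syllable i, parent 移
}

def modern_hiragana(cp):
    # Map a hentaigana code point to its modern hiragana, if listed here.
    syllable, _parent = HENTAIGANA_INFO.get(cp, (cp, None))
    return chr(syllable)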
Some hentaigana that correspond to the same syllable and share the same parent unified
ideograph are also encoded separately because they have different shapes. For example,
U+1B080 hentaigana letter na-3 through U+1B082 hentaigana letter na-5 corre-
spond to the same syllable na (U+306A hiragana letter na) and share the same parent
unified ideograph U+5948 奈, as shown in Figure 18-12.
Figure 18-12 shows the hentaigana U+1B080, U+1B081, U+1B082, U+1B07D, and U+1B11D
together with their parent ideographs U+5948, U+7B49, and U+65E0.
18.6 Hangul
Korean Hangul may be considered a featural syllabic script. As opposed to many other syl-
labic scripts, the syllables are formed from a set of alphabetic components in a regular fash-
ion. These alphabetic components are called jamo.
The name Hangul itself is just one of several terms that may be used to refer to the script. In
some contexts, the preferred term is simply the generic Korean characters. Hangul is used
more frequently in South Korea, whereas a basically synonymous term Choseongul is pre-
ferred in North Korea. A politically neutral term, Jeongum, may also be used.
The Unicode Standard contains both the complete set of precomposed modern Hangul
syllable blocks and a set of conjoining Hangul jamo. The conjoining Hangul jamo can be
used to represent all of the modern Hangul syllable blocks, as well as the ancient syllable
blocks used in Old Korean. For a description of conjoining jamo behavior and precom-
posed Hangul syllables, see Section 3.12, Conjoining Jamo Behavior. For a discussion of the
interaction of combining marks with jamo and Hangul syllables, see “Combining Marks
and Korean Syllables” in Section 3.6, Combination.
For other blocks containing characters related to Hangul, see “Enclosed CJK Letters and
Months: U+3200–U+32FF” and “CJK Compatibility: U+3300–U+33FF” in Section 22.10,
Enclosed and Square, as well as Section 18.5, Halfwidth and Fullwidth Forms.
Because the precomposed Hangul syllables are encoded in Korean alphabetical order, sequences
of Hangul syllables for modern Korean may be collated with a simple binary comparison.
When Korean text includes sequences of conjoining jamo, as for Old Korean, or mixtures
of precomposed syllable blocks and conjoining jamo, the easiest approach for collation is
to decompose the precomposed syllable blocks into conjoining jamo before comparing.
Additional steps must be taken to ensure that comparison is then done for sequences of
conjoining jamo that comprise complete syllables. See Unicode Technical Standard #10,
“Unicode Collation Algorithm,” for more discussion about the collation of Korean.
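A hedged sketch of the decomposition step described above follows, using the arithmetic of
Section 3.12, Conjoining Jamo Behavior. Grouping the resulting jamo into complete syllables and
assigning collation weights are left to a full implementation.

# Decompose precomposed Hangul syllables (U+AC00..U+D7A3) into conjoining jamo.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
S_COUNT = 19 * V_COUNT * T_COUNT   # 11,172 precomposed syllables

def decompose_hangul(text):
    out = []
    for ch in text:
        s = ord(ch) - S_BASE
        if 0 <= s < S_COUNT:
            l, rem = divmod(s, V_COUNT * T_COUNT)
            v, t = divmod(rem, T_COUNT)
            out.append(chr(L_BASE + l))
            out.append(chr(V_BASE + v))
            if t:
                out.append(chr(T_BASE + t))
        else:
            out.append(ch)
    return "".join(out)

With both strings decomposed in this way, mixed precomposed and conjoining-jamo text can be
compared jamo by jamo; for example, decompose_hangul("\uAC01") yields the three jamo
U+1100, U+1161, U+11A8.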
18.7 Yi
Yi: U+A000–U+A4CF
The Yi syllabary encoded in Unicode is used to write the Liangshan dialect of the Yi lan-
guage, a member of the Sino-Tibetan language family.
Yi is the Chinese name for one of the largest ethnic minorities in the People’s Republic of
China. The Yi, also known historically and in English as the Lolo, do not have a single eth-
nonym, but refer to themselves variously as Nuosu, Sani, Axi or Misapo. According to the
1990 census, more than 6.5 million Yi live in southwestern China in the provinces of Sich-
uan, Guizhou, Yunnan, and Guangxi. Smaller populations of Yi are also to be found in
Myanmar, Laos, and Vietnam. Yi is one of the official languages of the PRC, with between
4 and 5 million speakers.
The Yi language is divided into six major dialects. The Northern dialect, which is also
known as the Liangshan dialect because it is spoken throughout the region of the Greater
and Lesser Liangshan Mountains, is the largest and linguistically most coherent of these
dialects. In 1991, there were about 1.6 million speakers of the Liangshan Yi dialect. The eth-
nonym of speakers of the Liangshan dialect is Nuosu.
Traditional Yi Script. The traditional Yi script, historically known as Cuan or Wei, is an
ideographic script. Unlike in other Chinese-influenced siniform scripts, however, the ideo-
graphs of Yi appear not to be derived from Han ideographs. One of the more widespread
traditions relates that the script, comprising about 1,840 ideographs, was devised by some-
one named Aki during the Tang dynasty (618–907 ce). The earliest surviving examples of
the Yi script are monumental inscriptions dating from about 500 years ago; the earliest
example is an inscription on a bronze bell dated 1485.
There is no single unified Yi script, but rather many local script traditions that vary consid-
erably with regard to the repertoire, shapes, and orientations of individual glyphs and the
overall writing direction. The profusion of local script variants occurred largely because
until modern times the Yi script was mainly used for writing religious, magical, medical, or
genealogical texts that were handed down from generation to generation by the priests of
individual villages, and not as a means of communication between different communities
or for the general dissemination of knowledge. Although a vast number of manuscripts
written in the traditional Yi script have survived to the present day, the Yi script was not
widely used in printing before the 20th century.
Because the traditional Yi script is not standardized, a considerable number of glyphs are
used in the various script traditions. According to one authority, there are more than
14,200 glyphs used in Yunnan, more than 8,000 in Sichuan, more than 7,000 in Guizhou,
and more than 600 in Guangxi. However, these figures are misleading—most of the glyphs
are simple variants of the same abstract character. For example, a 1989 dictionary of the
Guizhou Yi script contains about 8,000 individual glyphs, but excluding glyph variants
reduces this count to about 1,700 basic characters, which is quite close to the figure of 1,840
characters that Aki is reputed to have devised.
Standardized Yi Script. There has never been a high level of literacy in the traditional Yi
script. Usage of the traditional script has remained limited even in modern times because
the traditional script does not accurately reflect the phonetic characteristics of the modern
Yi language, and because it has numerous variant glyphs and differences from locality to
locality.
To improve literacy in Yi, a scheme for representing the Liangshan dialect using the Latin
alphabet was introduced in 1956. A standardized form of the traditional script used for
writing the Liangshan Yi dialect was devised in 1974 and officially promulgated in 1980.
The standardized Liangshan Yi script encoded in Unicode is suitable for writing only the
Liangshan Yi dialect; it is not intended as a unified script for writing all Yi dialects. Stan-
dardized versions of other local variants of traditional Yi scripts do not yet exist.
The standardized Yi syllabary comprises 1,164 signs representing each of the allowable syl-
lables in the Liangshan Yi dialect. There are 819 unique signs representing syllables pro-
nounced in the high level, low falling, and midlevel tones, and 345 composite signs
representing syllables pronounced in the secondary high tone. The signs for syllables in the
secondary high tone consist of the sign for the corresponding syllable in the midlevel tone
(or in three cases the low falling tone), plus a diacritical mark shaped like an inverted breve.
For example, U+A001 yi syllable ix is the same as U+A002 yi syllable i plus a diacritical
mark. In addition to the 1,164 signs representing specific syllables, a syllable iteration mark
is used to indicate reduplication of the preceding syllable, which is frequently used in inter-
rogative constructs.
Standards. In 1991, a national standard for Yi was adopted by China as GB 13134-91. This
encoding includes all 1,164 Yi syllables as well as the syllable iteration mark, and is the basis
for the encoding in the Unicode Standard. The syllables in the secondary high tone, which
are differentiated from the corresponding syllable in the midlevel tone or the low falling
tone by a diacritical mark, are not decomposable.
Naming Conventions and Order. The Yi syllables are named on the basis of the spelling of
the syllable in the standard Liangshan Yi romanization introduced in 1956. The tone of the
syllable is indicated by the final letter: “t” indicates the high level tone, “p” indicates the low
falling tone, “x” indicates the secondary high tone, and an absence of final “t”, “p”, or “x”
indicates the midlevel tone.
With the exception of U+A015, the Yi syllables are ordered according to their phonetic
order in the Liangshan Yi romanization—that is, by initial consonant, then by vowel, and
finally by tone (t, x, unmarked, and p). This is the order used in dictionaries of Liangshan
Yi that are ordered phonetically.
Yi Syllable Iteration Mark. U+A015 yi syllable wu does not represent a specific syllable
in the Yi language, but rather is used as a syllable iteration mark. Its character properties
therefore differ from those for the rest of the Yi syllable characters. The misnomer of
U+A015 as yi syllable wu derives from the fact that it is represented by the letter w in the
romanized Yi alphabet, and from some confusion about the meaning of the gap in tradi-
tional Yi syllable charts for the hypothetical syllable “wu”.
The Yi syllable iteration mark is used to replace the second occurrence of a reduplicated
syllable under all circumstances. It is very common in both formal and informal Yi texts.
Punctuation. The standardized Yi script does not have any special punctuation marks, but
relies on the same set of punctuation marks used for writing modern Chinese in the PRC,
including U+3001 ideographic comma and U+3002 ideographic full stop.
Rendering. The traditional Yi script used a variety of writing directions—for example, right
to left in the Liangshan region of Sichuan, and top to bottom in columns running from left
to right in Guizhou and Yunnan. The standardized Yi script follows the writing rules for
Han ideographs, so characters are generally written from left to right or occasionally from
top to bottom. There is no typographic interaction between individual characters of the Yi
script.
Yi Radicals. To facilitate the lookup of Yi characters in dictionaries, sets of radicals mod-
eled on Han radicals have been devised for the various Yi scripts. (For information on Han
radicals, see “CJK and KangXi Radicals” in Section 18.1, Han). The traditional Guizhou Yi
script has 119 radicals; the traditional Liangshan Yi script has 170 radicals; and the tradi-
tional Yunnan Sani Yi script has 25 radicals. The standardized Liangshan Yi script encoded
in Unicode has a set of 55 radical characters, which are encoded in the Yi Radicals block
(U+A490..U+A4C5). Each radical represents a distinctive stroke element that is common
to a subset of the characters encoded in the Yi Syllables block. The name used for each rad-
ical character is that of the corresponding Yi syllable closest to it in shape.
18.8 Nüshu
Nüshu: U+1B170–U+1B2FF
Nüshu is a siniform script devised by women to write the local Chinese dialect of
Jiangyong county in the Xiaoshui Valley of southeastern Hunan province in China. Nüshu
means “women’s writing,” and was originally used only by women, many of whom could
not write Chinese Han characters. The script appeared in handwritten cloth-bound book-
lets of poems and songs, called San Chao Shu (三朝書), that were passed down from one
“sworn sister” to another upon marriage. Nüshu also was used for other purposes, and on
different media. By the late twentieth century, very few women fluent in the script were still
alive. National and international attention to Nüshu has led to active efforts to study and
preserve the script.
Structure. Nüshu is written vertically in columns which are laid out from right to left.
Although largely based on Chinese Han characters, Nüshu characters typically represent
the phonetic values of syllables, with many characters representing several homophonous
words. Some signs are used as ideographs.
Names. Nüshu characters are named sequentially by prefixing the string “nushu charac-
ter-” to the code point. The diaeresis is not included in this prefix because of the con-
straints on letters that can be used in character names.
Order. The Nüshu characters are ordered by stroke count, then by vowel, consonant, and
tone.
Punctuation. Nüshu has one punctuation mark, U+16FE1 nushu iteration mark,
located in the Ideographic Symbols and Punctuation block.
Sources. The Unicode Character Database contains a source data file for Nüshu called
NushuSources.txt. This data file contains normative information on the source references
for each Nüshu character. NushuSources.txt also contains an informative reading value for
each character.
18.9 Lisu
Lisu: U+A4D0–U+A4FF
Sometime between 1908 and 1914, a Karen evangelist from Myanmar by the name of Ba
Thaw modified the shapes of Latin characters and created the Lisu script. Afterwards, Brit-
ish missionary James Outram Fraser and some Lisu pastors revised and improved the
script. The script is commonly known in the West as the Fraser script. It is also sometimes
called the Old Lisu script, to distinguish it from newer, Latin-based orthographies for the
Lisu language.
There are 630,000 Lisu people in China, mainly in the regions of Nujiang, Diqing, Lijiang,
Dehong, Baoshan, Kunming and Chuxiong in the Yunnan Province. Another 350,000 Lisu
live in Myanmar, Thailand and India. Other user communities are mostly Christians from
the Dulong, the Nu and the Bai nationalities in China.
At present, about 200,000 Lisu in China use the Lisu script and about 160,000 in the other
countries are literate in it. The Lisu script is widely used in China in education, publishing,
the media and religion. Various schools and universities at the national, provincial and
prefectural levels have been offering Lisu courses for many years. Globally, the script is also
widely used in a variety of Lisu literature.
Structure. There are 40 letters in the Lisu alphabet. These consist of 30 consonants and 10
vowels. Each letter was originally derived from the capital letters of the Latin alphabet.
Twenty-five of them look like sans-serif Latin capital letters (all but “Q”) in upright posi-
tions; the other 15 are derived from sans-serif Latin capital letters rotated 180 degrees.
Although the letters of the Lisu script clearly derived originally from the Latin alphabet, the
Lisu script is distinguished from the Latin script. The Latin script is bicameral, with case
mappings between uppercase and lowercase letters. The Lisu script is unicameral; it has no
casing, and the letters do not change form. Furthermore, typography for the Lisu script is
rather sharply distinguished from typography for the Latin script. There is not the same
range of font faces as for the Latin script, and Lisu typography is typically monospaced and
heavily influenced by the conventions of Chinese typography.
Consonant letters have an inherent [O] vowel unless followed by an explicit vowel letter.
Three letters sometimes represent a vowel and sometimes a consonant: U+A4EA lisu let-
ter wa, U+A4EC lisu letter ya, and U+A4ED lisu letter gha.
Tone Letters. The Lisu script has six tone letters which are placed after the syllable to mark
tones. These tone letters are listed in Table 18-12, with the tones identified in terms of their
pitch contours.
Each of the six tone letters represents one simple tone. Although the tone letters clearly
derive from Western punctuation marks (full stop, comma, semicolon, and colon), they do
not function as punctuation at all. Rather, they are word-forming modifier letters. Further-
more, each tone letter is typeset on an em-square, including those whose visual appearance
consists of two marks.
The first four tone letters can be used in combination with the last two to represent certain
combination tones. Of the various possibilities, only “,;” is still in use; the rest are now
rarely seen in China.
Other Modifier Letters. Nasalized vowels are denoted by a nasalization mark following the
vowel. This word-forming character is not encoded separately in the Lisu script, but is rep-
resented by U+02BC modifier letter apostrophe, which has the requisite shape and
properties (General_Category = Lm) and is used in similar contexts.
A glide based on the vowel A, pronounced as [O] without an initial glottal stop (and nor-
mally bearing a 31 low falling pitch), is written after a verbal form to mark various aspects.
macron. This word-forming modifier letter is represented by U+02CD modifier letter low
macron. In a Lisu font, this modifier letter should be rendered on the baseline, to harmo-
nize with the position of the tone letters.
Digits and Separators. There are no unique Lisu digits. The Lisu use European digits for
counting. The thousands separator and the decimal point are represented with U+002C
comma and U+002E full stop, respectively. To separate chapter and verse numbers,
U+003A colon and U+003B semicolon are used. These can be readily distinguished from
the similar-appearing tone letters by their numerical context.
Punctuation. U+A4FE “-.” lisu punctuation comma and U+A4FF “=” lisu punctua-
tion full stop are punctuation marks used respectively to denote a lesser and a greater
degree of finality. These characters are similar in appearance to sequences of Latin punctu-
ation marks, but are not unified with them.
Over time various other punctuation marks from European or Chinese traditions have
been adopted into Lisu orthography. Table 18-13 lists all known adopted punctuation,
along with the respective contexts of use.
U+2010 hyphen may be preferred to U+002D hyphen-minus for the dash used to sepa-
rate syllables in names, as its semantics are less ambiguous than U+002D.
The use of the U+003F “?” question mark replaced the older Lisu tradition of using a
tone letter combination to represent the question prosody, followed by a Lisu full stop:
“..:=”
Line Breaking. A line break is not allowed within an orthographic syllable in Lisu. A line
break is also prohibited before a punctuation mark, even if it is preceded by a space. In gen-
eral there is no hyphenation of words across line breaks, except for proper nouns, where a
break is allowed after the hyphen used as a syllable separator.
Word Separation. The Lisu script separates syllables using a space or, for proper names, a
hyphen. In the case of polysyllabic words, it can be ambiguous as to which syllables join
together to form a word. Thus for most text processing at the character level, a syllable
(starting after a space or punctuation and ending before another space or punctuation) is
treated as a word except for proper names—where the occurrence of a hyphen holds the
word together.
18.10 Miao
Miao: U+16F00–U+16F9F
The Miao script, also called Lao Miaowen (“Old Miao Script”) in Chinese, was created in
1904 by Samuel Pollard and others, to write the Northeast Yunnan Miao language of
southern China. The script has also been referred to as the Pollard script, but that usage is
no longer preferred. The Miao script was created by an adaptation of Latin letter variants,
English shorthand characters, Miao pictographs, and Cree syllable forms. (See Section 20.2,
Canadian Aboriginal Syllabics.) Today, the script is used to write various Miao dialects, as
well as languages of the Yi and Lisu nationalities in southern China.
The script was reformed in the 1950s by Yang Rongxin and others, and was later adopted as
the “Normalized” writing system of Kunming City and Chuxiong Prefecture. The main dif-
ference between the pre-reformed and the reformed orthographies is in how they mark
tones. Both orthographies can be correctly represented using the Miao characters encoded
in the Unicode Standard.
Encoding Principles. The script is written left to right. The basic syllabic structure contains
an initial consonant or consonant cluster and a final. The final consists of either a vowel or
vowel cluster, an optional final nasal, plus a tone mark. The initial consonant may be pre-
ceded by U+16F50 miao letter nasalization, and can be followed by combining marks
for voicing (U+16F52 miao sign reformed voicing) or aspiration (U+16F51 miao sign
aspiration and U+16F53 miao sign reformed aspiration).
The Gan Yi variety of Miao has an additional combining mark, U+16F4F miao sign con-
sonant modifier bar. That mark is only applied to two consonants, U+16F0E miao let-
ter tta or U+16F10 miao letter na, indicating a distinct place of articulation. The mark
follows the consonant in logical order, as for all combining marks, but is rendered with a
small vertical bar at the lower left-hand side of the modified consonant.
Tone Marks. In the Chuxiong reformed orthography, vowels and final nasals appear on the
baseline. If no explicit tone mark is present, this indicates the default tone 3. An additional
tone mark, encoded in the range U+16F93..U+16F99, may follow the vowel to indicate
other tones. A set of archaic tone marks used in the reformed orthography is encoded in
the range U+16F9A..U+16F9F.
In the pre-reformed orthography, such as that used for the language Ahmao (Northern
Hmong), the tone marks are represented in a different manner, using one of five shifter
characters. These are represented in sequence following the vowel or vowel sequence and
indicate where the vowel letter is to be rendered in relation to the consonant. If more than
one vowel letter appears before the shifter, all of the vowel glyphs are moved together to the
appropriate position.
Rendering of “wart”. Several Miao consonants appear in the code charts with a “wart”
attached to the glyph, usually on the left-hand side. In the Chuxiong orthography, a dot
appears instead of the wart on these consonants. Because the user communities consider
the appearance of the wart or dot to be a different way to write the same characters and not
a difference of the character’s identity, the differences in appearance are a matter of font
style.
Ordering. The order of Miao characters in the code charts derives from a reference order-
ing widely employed in China, based in part on the order of Bopomofo phonetic charac-
ters. The expected collation order for Miao strings varies by language and user
communities, and requires tailoring. See Unicode Technical Standard #10, “Unicode Col-
lation Algorithm.”
Digits. Miao uses European digits.
Punctuation. The Miao script employs a variety of punctuation marks, both from the East
Asian typographical tradition and from the Western typographical tradition. There are no
script-specific punctuation marks.
18.11 Tangut
Tangut: U+17000–U+187FF
Tangut, also known as Xixia, is a large, historic siniform ideographic script used to write
the Tangut language, a Tibeto-Burman language spoken from about the 11th century ce
until the 16th century in the area of present-day northwestern China. The Tangut script
was created under the first emperor of Western Xia about 1036 ce. After the fall of the
Western Xia to the Mongols, the script continued to be used during the Yuan and Ming
dynasties, but it had become obsolete by the end of the Ming dynasty. Tangut was re-discov-
ered in the late 19th century, and has been largely deciphered, thanks to the ground-break-
ing work done in the early 20th century by N. A. Nevskij. Tangut is found in thousands of
official, private, and religious texts, including books and sutras, inscriptions, and manu-
scripts. Today the study of Tangut is a separate discipline, with scholars in China, Japan,
Russia, and other countries publishing works on Tangut language and culture.
Structure. Tangut characters superficially resemble Chinese ideographs; however, the
script is unique and unrelated to Chinese ideographs. Tangut was originally written top to
bottom, with columns laid out right to left, in the same manner as Chinese was tradition-
ally written. In current practice, the script is written horizontally left to right. Most Tangut
characters are made up of 8 to 15 strokes. The script has no combining characters.
Encoding Principles. The repertoire of Tangut characters is intended to cover all Tangut
characters used as head entries or index entries in the major works of modern Tangut lexi-
cography and scholarship. A number of principles have been adopted to handle variant
glyph shapes, because Tangut characters are often written with different glyph shapes in
the primary sources. When character variants are not used contrastively in a single source
reference, they are unified as a single character, typically using the glyph found in Li Fan-
wen 2008. However, if a single source includes two or more variants as separate head or
index entries, then the variants are encoded as separate characters. In cases where two
characters with the same shape are cataloged separately in a single source, but have differ-
ent pronunciations or meanings, only one character is encoded. Also, a few erroneous or
“ghost” characters in modern dictionaries are separately encoded.
Character Names. The names for the Tangut characters are algorithmically derived by
prefixing the code point with the string “tangut ideograph-”. Hence the name for
U+17000 is tangut ideograph-17000.
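Because the names are derived mechanically from the code point, they can be computed rather than stored; the same pattern applies to the Nüshu names described in Section 18.8. A minimal Python sketch:

    # Deriving a Tangut character name from its code point, per the rule above.
    # Formal character names are conventionally cited in uppercase.
    def tangut_name(cp):
        return "TANGUT IDEOGRAPH-%04X" % cp

    assert tangut_name(0x17000) == "TANGUT IDEOGRAPH-17000"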
Punctuation. Contemporary sources use U+16FE0 tangut iteration mark, located in
the Ideographic Symbols and Punctuation block. There are no other script-specific punc-
tuation marks.
Sources. The Unicode Character Database contains a source data file for Tangut called
TangutSources.txt. This data file contains normative information on the source references
for each Tangut character. TangutSources.txt also contains the informative radical-stroke
values for each character. The data in TangutSources.txt shares the same format as the Uni-
han data files in the UCD. The Tangut code chart also indicates the source reference and
the radical-stroke value for each character.
Sorting. No universally accepted or standard character sort order exists for Tangut. All
extant Tangut dictionaries dating to the Western Xia period (1038–1227) base their order-
ing on phonetic principles, which do not help in locating specific characters. Almost all
modern Tangut dictionaries and glossaries order characters by radical and stroke count.
However, the radical/stroke indices in modern handbooks all differ from one another. The
radical system adopted in the Tangut block is based on that of Han Xiaomang 2004, with
some modifications. In the Tangut block, signs are grouped by radical, and radicals are
ordered by stroke count and stroke order. Within each radical, signs are ordered by stroke
count and stroke order.
Stroke Order. Because present-day Tangut dictionaries do not provide information on
how Tangut characters should be written or on their stroke count, modern scholars have
reconstructed stroke count and stroke order by analogy with Chinese characters.
The stroke order used by scholars may not reflect the actual stroke order used by Tangut
scribes.
Chapter 19
Africa
This chapter covers the following scripts of Africa:
Ethiopic, Osmanya, Tifinagh, N’Ko, Vai, Bamum, Bassa Vah, Mende Kikakui, Adlam, and Medefaidrin
Ethiopic and Tifinagh are scripts with long histories. Although their roots can be traced
back to the original Semitic and North African writing systems, they would not be classi-
fied as Middle Eastern scripts today.
The remaining scripts in this chapter have been developed relatively recently. Some of
them show roots in Latin and other letterforms. They are all original creative contributions
intended specifically to serve the linguistic communities that use them.
Osmanya is an alphabetic script developed in the early 20th century to write the Somali
language. N’Ko is a right-to-left alphabetic script devised in 1949 as a writing system for
Manden languages in West Africa. Vai is a syllabic script used for the Vai language in Libe-
ria and Sierra Leone; it was developed in the 1830s, but the standard syllabary was pub-
lished in 1962. Bamum is a syllabary developed between 1896 and 1910, used for writing
the Bamum language in western Cameroon. Modern Bassa Vah is an alphabetic script
developed early in the 20th century. Mende Kikakui is a right-to-left script used for writing
Mende. It was also created in the early 20th century.
Adlam is an alphabetic script used to write Fulani and other African languages. The Fulani
are a widespread ethnic group in Africa, and the Fulani language is spoken by more than 40
million people. The script was developed in the late 1980s, and was subsequently widely
adopted among Fulani communities, where it is taught in schools.
The Medefaidrin script is used to write the liturgical language Medefaidrin by members of
an indigenous Christian church in Nigeria. According to community tradition, the lan-
guage was revealed to one of the founders of the community in 1927 by divine inspiration.
It is presently used for Sunday school lessons and prayers or meditation.
19.1 Ethiopic
Ethiopic: U+1200–U+137F
The Ethiopic syllabary originally evolved for writing the Semitic language Ge’ez. Indeed,
the English noun “Ethiopic” simply means “the Ge’ez language.” Ge’ez itself is now limited
to liturgical usage, but its script has been adopted for modern use in writing several lan-
guages of central east Africa, including Amharic, Tigre, and Oromo.
Basic and Extended Ethiopic. The Ethiopic characters encoded here include the basic set
that has become established in common usage for writing major languages. As with other
productive scripts, the basic Ethiopic forms are sometimes modified to produce an
extended range of characters for writing additional languages.
Encoding Principles. The syllables of the Ethiopic script are traditionally presented as a
two-dimensional matrix of consonant-vowel combinations. The encoding follows this
structure; in particular, the codespace range U+1200..U+1357 is interpreted as a matrix of
43 consonants crossed with 8 vowels, making 344 conceptual syllables. Most of these con-
sonant-vowel syllables are represented by characters in the script, but some of them hap-
pen to be unused, accounting for the blank cells in the matrix.
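Given this interpretation, a code point in the range U+1200..U+1357 can be converted to and from its consonant-row and vowel-column indices arithmetically. The following Python sketch is illustrative only; the zero-based indices and helper names are not part of the standard.

    # The U+1200..U+1357 range viewed as a 43-consonant by 8-vowel matrix.
    ETHIOPIC_BASE = 0x1200

    def matrix_position(cp):
        """Return (consonant_row, vowel_column) for a code point in U+1200..U+1357."""
        assert 0x1200 <= cp <= 0x1357
        offset = cp - ETHIOPIC_BASE
        return offset // 8, offset % 8

    def syllable_at(row, column):
        """Inverse mapping; the resulting cell may be one of the unused (blank) cells."""
        return ETHIOPIC_BASE + 8 * row + column

    # U+1208 ethiopic syllable la occupies row 1, column 0.
    assert matrix_position(0x1208) == (1, 0)
    assert syllable_at(1, 0) == 0x1208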
Variant Glyph Forms. A given Ethiopic syllable may be represented by different glyph
forms, analogous to the glyph variants of Latin lowercase “a” or “g”, which do not coexist in
the same font. Thus the particular glyph shown in the code chart for each position in the
matrix is merely one representation of that conceptual syllable, and the glyph itself is not
the object that is encoded.
Labialized Subseries. A few Ethiopic consonants have labialized (“W”) forms that are tra-
ditionally allotted their own consonant series in the syllable matrix, although only a subset
of the possible vowel forms are realized. Each of these derivative series is encoded immedi-
ately after the corresponding main consonant series. Because the standard vowel series
includes both “AA” and “WAA”, two different cells of the syllable matrix might represent
the “consonant + W + AA” syllable. For example:
U+1257 = QH + WAA: potential but unused version of qhwaa
U+125B = QHW + AA: ethiopic syllable qhwaa
In these cases, where the two conceptual syllables are equivalent, the entry in the labialized
subseries is encoded and not the “consonant + WAA” entry in the main syllable series. The
six specific cases are enumerated in Table 19-1. In three of these cases, the -WAA position
in the syllable matrix has been reanalyzed and used for encoding a syllable in -OA for
extended Ethiopic.
Also, within the labialized subseries, the sixth vowel (“-E”) forms are sometimes considered
to be second vowel (“-U”) forms. For example:
U+1249 = QW + U: unused version of qwe
U+124D = QW + E: ethiopic syllable qwe
In these cases, where the two syllables are nearly equivalent, the “-E” entry is encoded and
not the “-U” entry. The six specific cases are enumerated in Table 19-2.
Keyboard Input. Because the Ethiopic script includes more than 300 characters, the units
of keyboard input must constitute some smaller set of entities, typically 43+8 codes inter-
preted as the coordinates of the syllable matrix. Because these keyboard input codes are
expected to be transient entities that are resolved into syllabic characters before they enter
stored text, keyboard input codes are not specified in this standard.
Syllable Names. The Ethiopic script often has multiple syllables corresponding to the same
Latin letter, making it difficult to assign unique Latin names. Therefore the names list
makes use of certain devices (such as doubling a Latin letter in the name) merely to create
uniqueness; this device has no relation to the phonetics of these syllables in any particular
language.
Encoding Order and Sorting. The order of the consonants in the encoding is based on the
traditional alphabetical order. It may differ from the sort order used for one or another lan-
guage, if only because in many languages various pairs or triplets of syllables are treated as
equivalent in the first sorting pass. For example, an Amharic dictionary may start out with
a section headed by three H-like syllables.
Ethiopic Extensions
The Ethiopic script is used for a large number of languages and dialects in Ethiopia and in
some instances has been extended significantly beyond the set of characters used for major
languages such as Amharic and Tigre. There are three blocks of extensions to the Ethiopic
script: Ethiopic Supplement, Ethiopic Extended, and Ethiopic Extended-A.
19.2 Osmanya
Osmanya: U+10480–U+104AF
The Osmanya script, which in Somali is called far Soomaali “Somali writing” or
Cismaanya, was devised in 1920–1922 by Cismaan Yuusuf Keenadiid to represent the
Somali language. It replaced an attempt by
Sheikh Uweys of the Confraternity Qadiriyyah (died 1909) to devise an Arabic-based
orthography for Somali. It has, in turn, been replaced by the Latin orthography of Muuse
Xaaji Ismaaciil Galaal (1914–1980). In 1961, both the Latin and the Osmanya scripts were
adopted for use in Somalia, but in 1969 there was a coup, with one of its stated aims being
the resolution of the debate over the country’s writing system. A Latin orthography was
finally adopted in 1973. Gregersen (1977) states that some 20,000 or more people use
Osmanya in private correspondence and bookkeeping, and that several books and a
biweekly journal Horseed (“Vanguard”) were published in cyclostyled format.
Structure. Osmanya is an alphabetic script, read from left to right in horizontal lines run-
ning from top to bottom. It has 22 consonants and 8 vowels. Unique long vowels are writ-
ten for U+1049B osmanya letter aa, U+1049C osmanya letter ee, and U+1049D
osmanya letter oo; long uu and ii are written with the consonants U+10493
osmanya letter waw and U+10495 osmanya letter ya, respectively.
Ordering. Alphabetical ordering is based on the order of the Arabic alphabet, as specified
by Osman Abdihalim Yuusuf Osman Keenadiid. This ordering is similar to the ordering
given in Diringer (1996).
Character Names and Glyphs. The character names used in the Unicode Standard are as
given by Osman. The glyphs shown in the code charts are taken from Afkeenna iyo fartysa
(“Our language and its handwriting”) 1971.
19.3 Tifinagh
Tifinagh: U+2D30–U+2D7F
The Tifinagh script is used by approximately 20 million people who speak varieties of lan-
guages commonly called Berber or Amazigh. The three main varieties in Morocco are
known as Tarifite, Tamazighe, and Tachelhite. In Morocco, more than 40% of the popula-
tion speaks Berber. The Berber language, written in the Tifinagh script, is currently taught
to approximately 300,000 pupils in 10,000 schools—mostly primary schools—in Morocco.
Three Moroccan universities offer Berber courses in the Tifinagh script leading to a Mas-
ter’s degree.
Tifinagh is an alphabetic writing system. It uses spaces to separate words and makes use of
Western punctuation.
History. The earliest variety of the Berber alphabet is Libyan. Two forms exist: a Western
form and an Eastern form. The Western variety was used along the Mediterranean coast
from Kabylia to Morocco and most probably to the Canary Islands. The Eastern variety,
Old Tifinagh, is also called Libyan-Berber or Old Tuareg. It contains signs not found in the
Libyan variety and was used to transcribe Old Tuareg. The word tifinagh is a feminine plu-
ral noun whose singular would be tafniqt; it means “the Phoenician (letters).”
Neo-Tifinagh refers to the writing systems that were developed to represent the Maghreb
Berber dialects. A number of variants of Neo-Tifinagh exist, the first of which was pro-
posed in the 1960s by the Académie Berbère. That variant has spread in Morocco and Alge-
ria, especially in Kabylia. Other Neo-Tifinagh systems are nearly identical to the Académie
Berbère system. The encoding in the Tifinagh block is based on the Neo-Tifinagh systems.
Source Standards. The encoding consists of four Tifinagh character subsets: the basic set
of the Institut Royal de la Culture Amazighe (IRCAM), the extended IRCAM set, other
Neo-Tifinagh letters in use, and modern Tuareg letters. The first subset represents the set
of characters chosen by IRCAM to unify the orthography of the different Moroccan mod-
ern-day Berber dialects while using the historical Tifinagh script.
Ordering. The letters are arranged according to the order specified by IRCAM. Other Neo-
Tifinagh and Tuareg letters are interspersed according to their pronunciation.
Directionality. Historically, Berber texts did not have a fixed direction. Early inscriptions
were written horizontally from left to right, from right to left, vertically (bottom to top, top
to bottom); boustrophedon directionality was also known. Modern-day Berber script is
most frequently written in horizontal lines from left to right; therefore the bidirectional
class for Tifinagh letters is specified as strong left to right. Displaying Berber texts in other
directions can be accomplished by the use of directional overrides or by the use of higher-
level protocols.
Diacritical Marks. Modern Tifinagh variants tend to use combining diacritical marks to
complement the Tifinagh block. The Hawad notation, for example, uses diacritical marks
from the Combining Diacritical Marks block (U+0300–U+036F). These marks are used to
represent vowels and foreign consonants. In this notation, <U+2D35, U+0307> represents
“a”, <U+2D49, U+0309> represents a long “i” /i:/, and <U+2D31, U+0302> represents a
“p”. Some long vowels are represented using two diacritical marks above. A long “e” /e:/ is
thus written <U+2D49, U+0307, U+0304>. These marks are displayed side by side above
their base letter in the order in which they are encoded, instead of being stacked.
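In encoded text these notations are ordinary base-plus-combining-mark sequences. For illustration, the examples above written as Python string literals:

    vowel_a = "\u2D35\u0307"        # <U+2D35, U+0307 combining dot above> = "a"
    long_i  = "\u2D49\u0309"        # <U+2D49, U+0309 combining hook above> = long "i" /i:/
    p_cons  = "\u2D31\u0302"        # <U+2D31, U+0302 combining circumflex accent> = "p"
    long_e  = "\u2D49\u0307\u0304"  # <U+2D49, U+0307, U+0304> = long "e" /e:/, marks side by side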
Yal and Yan. While the Neo-Tifinagh glyph for U+2D4D tifinagh letter yal in Morocco
is typically rendered with two bars linked by a small slanted stroke ⵍ, traditional texts from
all areas and modern-day materials from areas outside Morocco often represent yal with
two vertical strokes ⵏⵏ. However, the two-vertical-bar shape can cause visual ambiguity in
words with consonant clusters, because yal may be mistaken for two instances of U+2D4F
tifinagh letter yan, whose glyph is a single vertical stroke ⵏ. Individual font designers,
local traditions, and national preferences employ various means to prevent confusion,
including varying the spacing between the bars, and slanting or lowering the bars.
Figure 19-1 shows examples that illustrate contextual shaping by slanting the bars of yal
and yan.
[Figure 19-1 shows the letter sequences <U+2D4D, U+2D4D>, <U+2D4D, U+2D4F>, <U+2D4F, U+2D4D>, and <U+2D4F, U+2D4F>, each with its contextually shaped rendering.]
Bi-Consonants. Bi-consonants are additional letterforms used in the Tifinagh script, par-
ticularly for Tuareg, to represent a consonant cluster—a sequence of two consonants with-
out an intervening vowel. These bi-consonants, sometimes also referred to as bigraphs, are
not directly encoded as single characters in the Unicode Standard. Instead, they are repre-
sented as a sequence of the two consonant letters, separated either by U+200D zero
width joiner or by U+2D7F tifinagh consonant joiner.
When a bi-consonant is considered obligatory in text, it is represented by the two conso-
nant letters, with U+2D7F tifinagh consonant joiner inserted between them. This use
of U+2D7F is comparable in function to the use of U+0652 arabic sukun to indicate the
absence of a vowel after a consonant, when Tuareg is written in the Arabic script. However,
instead of appearing as a visible mark in the text, U+2D7F tifinagh consonant joiner
indicates the presence of a bi-consonant, which should then be rendered with a preformed
glyph for the sequence. Examples of common Tifinagh bi-consonants and their representa-
tion are shown in Figure 19-2.
[Figure 19-2 shows the sequences <U+2D4E, U+2D7F, U+2D5C>, <U+2D4F, U+2D7F, U+2D3E>, <U+2D4F, U+2D7F, U+2D5C>, <U+2D54, U+2D7F, U+2D5C>, and <U+2D59, U+2D7F, U+2D5C>, each rendered with a preformed bi-consonant glyph.]
If a rendering system cannot display obligatory bi-consonants with the correct, fully-
formed bi-consonant glyphs, a fallback rendering should be used which displays the tifi-
nagh consonant joiner visibly, so that the correct textual distinctions are maintained,
even if they cannot be properly displayed.
When a bi-consonant is considered merely an optional, ligated form of two consonant let-
ters, the bi-consonant can be represented by the two consonant letters, with U+200D zero
width joiner inserted between them, as a hint that the ligated form is preferred. If a ren-
dering system cannot display the optional, ligated form, the fallback display should simply
be the sequence of consonants, with no visible display of the ZWJ.
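The two representations differ only in which joining character separates the consonants. A minimal Python sketch (the helper name is ours, not part of the standard):

    ZWJ = "\u200D"   # ZERO WIDTH JOINER: optional, ligated bi-consonant preferred
    TCJ = "\u2D7F"   # TIFINAGH CONSONANT JOINER: obligatory bi-consonant

    def biconsonant(first, second, obligatory=True):
        """Return the encoded sequence for the bi-consonant <first, second>."""
        return first + (TCJ if obligatory else ZWJ) + second

    # The obligatory bi-consonant <U+2D4F, U+2D7F, U+2D5C> from Figure 19-2.
    assert biconsonant("\u2D4F", "\u2D5C") == "\u2D4F\u2D7F\u2D5C"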
Bi-consonants often have regional glyph variants, so fonts may need to be designed differ-
ently for different regional uses of the Tifinagh script.
19.4 N’Ko
N’Ko: U+07C0–U+07FF
N’Ko is a literary dialect used by the Manden (or Manding) people, who live primarily in
West Africa. The script was devised by Solomana Kante in 1949 as a writing system for the
Manden languages. The Manden language group is known as Mandenkan, where the suffix
-kan means “language of.” In addition to the substantial number of Mandens, some non-
Mandens speak Mandenkan as a second language. There are an estimated 20 million Man-
denkan speakers.
The major dialects of the Manden language are Bamanan, Jula, Maninka, and Mandinka.
There are a number of other related dialects. When Mandens from different subgroups talk
to each other, it is common practice for them to switch—consciously or subconsciously—
from their own dialect to the conventional, literary dialect commonly known as Kangbe,
“the clear language,” also known as N’Ko. This dialect switching can occur in conversa-
tions between the Bamanan of Mali, the Maninka of Guinea, the Jula of the Ivory Coast,
and the Mandinka of Gambia or Senegal, for example. Although there are great similarities
between their dialects, speakers sometimes find it necessary to switch to Kangbe (N’Ko) by
using a common word or phrase, similar to the accommodations Danes, Swedes, and Nor-
wegians sometimes make when speaking to one another. For example, the word for
“name” in Bamanan is togo, while it is tooh in Maninka. Speakers of both dialects will write
it with the same N’Ko spelling, although each may pronounce it differently.
Character Names and Block Name. Although the traditional name of the N’Ko language
and script includes an apostrophe, apostrophes are disallowed in Unicode character and
block names. Because of this, the formal block name is “NKo” and the script portion of the
Unicode character names is “nko”.
Structure. The N’Ko script is written from right to left. It is phonetic in nature (one sym-
bol, one sound). N’Ko has seven vowels, each of which can bear one of seven diacritical
marks that modify the tone of the vowel as well as an optional diacritical mark that indi-
cates nasalization. N’Ko has 19 consonants and two “abstract” consonants, U+07E0 nko
letter na woloso and U+07E7 nko letter nya woloso, which indicate original conso-
nants mutated by a preceding nasal, either word-internally or across word boundaries.
Some consonants can bear one of three diacritical marks to transcribe foreign sounds or to
transliterate foreign letters.
U+07D2 nko letter n is considered neither a vowel nor a consonant; it indicates a syl-
labic alveolar or velar nasal. It can bear a diacritical mark, but cannot bear the nasal dia-
critic. The letter U+07D1 nko letter dagbasinna has a special function in N’Ko
orthography. The standard spelling rule is that when two successive syllables have the same
vowel, the vowel is written only after the second of the two syllables. For example, <ba,
la, oo> is pronounced [bolo], but in a foreign syllable to be pronounced [blo], the dag-
basinna is inserted, as in <ba, dagbasinna, la, oo>, to show that a consonant cluster is
intended.
Diacritical Marks. N’Ko diacritical marks are script-specific, despite superficial resem-
blances to other diacritical marks encoded for more general use. Some N’Ko diacritics have
a wider range of glyph representation than the generic marks do, and are typically drawn
rather higher and bolder than the generic marks.
Two of the tone diacritics, when applied to consonants, indicate specific sounds from other
languages—in particular, Arabic or French language sounds. U+07F3 nko combining
double dot above is also used as a diacritic to represent sounds from other languages.
The combinations used are as shown in Table 19-3.
Table 19-4 shows the use of the tone diacritics when applied to vowels.
When applied to a vowel, U+07F2 nko combining nasalization mark indicates the
nasalization of that vowel. In the text stream, this mark is applied before any of the tone
marks because combining marks below precede combining marks above in canonical
order.
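For example, a nasalized vowel that also carries a tone mark is encoded with the nasalization mark first. A small illustrative Python check (the particular vowel and tone mark chosen here are arbitrary):

    import unicodedata

    # U+07CA nko letter a, followed by the nasalization mark (below) and then
    # U+07ED nko combining short rising tone (above).
    nasalized_a = "\u07CA\u07F2\u07ED"
    # The sequence is already in canonical order, so normalization leaves it unchanged.
    assert unicodedata.normalize("NFC", nasalized_a) == nasalized_a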
Digits. N’Ko uses decimal digits specific to the script. These digits have strong right-to-left
directionality. Numbers are stored in text in logical order with most significant digit first;
when displayed, numerals are then laid out in right-to-left order, with the most significant
digit at the rightmost side, as illustrated for the numeral 144 in Figure 19-3. This situation
differs from how numerals are handled in Hebrew and Arabic, where numerals are laid out
in left-to-right order, even though the overall text direction is right to left.
Ordinal Numbers. Diacritical marks are also used to mark ordinal numbers. The first ordi-
nal is indicated by applying U+07ED nko combining short rising tone (a dot above) to
U+07C1 nko digit one. All other ordinal numbers are indicated by applying U+07F2 nko
combining nasalization mark (an oval dot below) to the last digit in any sequence of
digits composing the number. Thus the nasalization mark under the digit two would indi-
cate the ordinal value 2nd, while the nasalization mark under the final digit four in the
numeral 144 would indicate the ordinal value 144th, as shown in Figure 19-3.
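In encoded terms, the N’Ko digits run from U+07C0 (zero) through U+07C9 (nine). The following Python sketch of the numeral and ordinal forms described above is illustrative only; the helper names are ours:

    NKO_DIGIT_ZERO = 0x07C0
    NASALIZATION = "\u07F2"    # nko combining nasalization mark
    SHORT_RISING = "\u07ED"    # nko combining short rising tone

    def nko_numeral(n):
        """Encode a non-negative integer with N'Ko digits, most significant digit first."""
        return "".join(chr(NKO_DIGIT_ZERO + int(d)) for d in str(n))

    def nko_ordinal(n):
        """1st uses the rising tone dot on digit one; other ordinals mark the last digit."""
        digits = nko_numeral(n)
        if n == 1:
            return digits + SHORT_RISING
        return digits + NASALIZATION

    assert nko_numeral(144) == "\u07C1\u07C4\u07C4"        # the numeral 144
    assert nko_ordinal(144) == "\u07C1\u07C4\u07C4\u07F2"  # 144th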
Punctuation. N’Ko uses a number of punctuation marks in common with other scripts.
U+061F arabic question mark, U+060C arabic comma, U+061B arabic semicolon,
and the paired U+FD3E ornate left parenthesis and U+FD3F ornate right paren-
thesis are used, often with different shapes than are used in Arabic. A script-specific
U+07F8 nko comma and U+07F9 nko exclamation mark are encoded. The nko comma
differs in shape from the arabic comma, and the two are sometimes used distinctively in
the same N’Ko text.
The character U+07F6 nko symbol oo dennen is used as an addition to phrases to indi-
cate remote future placement of the topic under discussion. The decorative U+07F7
nko symbol gbakurunen represents the three stones that hold a cooking pot over the fire
and is used to end major sections of text.
The two tonal apostrophes, U+07F4 nko high tone apostrophe and U+07F5 nko low
tone apostrophe, are used to show the elision of a vowel while preserving the tonal infor-
mation of the syllable. Their glyph representations can vary in height relative to the base-
line. N’Ko also uses a set of paired punctuation, U+2E1C left low paraphrase bracket
and U+2E1D right low paraphrase bracket, to indicate indirect quotations.
Ordering. The order of N’Ko characters in the code charts reflects the traditional ordering
of N’Ko. However, in collation, the three archaic letters U+07E8 nko letter jona ja,
U+07E9 nko letter jona cha, and U+07EA nko letter jona ra should be weighted as
variants of U+07D6 nko letter ja, U+07D7 nko letter cha, and U+07D9 nko letter
ra, respectively.
Rendering. N’Ko letters have shaping behavior similar to that of Arabic. Each letter can
take one of four possible forms, as shown in Table 19-5.
[Table 19-5, N’Ko Letter Shaping, lists the Xn, Xr, Xm, and Xl contextual forms for each of the letters a, ee, i, e, u, oo, o, dagbasinna, n, ba, pa, ta, ja, cha, da, ra, rra, sa, gba, fa, ka, la, na woloso, ma, nya, na, ha, wa, ya, nya woloso, jona ja, jona cha, and jona ra.]
A noncursive style of N’Ko writing exists where no joining line is used between the letters
in a word. This is a font convention, not a dynamic style like bold or italic, both of which
are also valid dynamic styles for N’Ko. Noncursive fonts are mostly used as display fonts
for the titles of books and articles. U+07FA nko lajanyalan is sometimes used like
U+0640 arabic tatweel to justify lines, although Latin-style justification where space is
increased tends to be more common.
19.5 Vai
Vai: U+A500–U+A63F
The Vai script is used for the Vai language, spoken in coastal areas of western Liberia and
eastern Sierra Leone. It was developed in the early 1830s primarily by Mɔmɔlu Duwalu
Bukɛlɛ of Jondu, Liberia, who later stated that the inspiration had come to him in a dream.
He may have also been aware of, and influenced by, other scripts including Latin, Arabic,
and possibly Cherokee, or he may have phoneticized and regularized an earlier picto-
graphic script. In the years afterward, the Vai built an educational infrastructure that
enabled the script to flourish; by the late 1800s European traders reported that most Vai
were literate in the script. Although there were standardization efforts in 1899 and again at
a 1962 conference at the University of Liberia, nowadays the script is learned informally
and there is no means to ensure adherence to a standardized version; most Vai literates
know only a subset of the standardized characters. The script is primarily used for corre-
spondence and record-keeping, mainly among merchants and traders. Literacy in Vai
coexists with literacy in English and Arabic.
Sources. The primary sources for the Vai characters in Unicode are the 1962 Vai Standard
Syllabary, modern primers and texts which use the Standard Syllabary (including a few
glyph modifications reflecting modern preferences), the 1911 additions of Momolu Massa-
quoi, and the characters found in The Book of Ndole, the longest surviving text from the
early period of Vai script usage.
Basic Structure. Vai is a syllabic script written left to right. The Vai language has seven oral
vowels, five of which also occur in nasal form. The standard syllabary
includes standalone vowel characters for the oral vowels and three of the nasal ones, char-
acters for most of the consonant-vowel combinations formed from each of thirty conso-
nants or consonant clusters, and a character for the final velar nasal consonant [ŋ].
The writing system has a moraic structure: the weight (or duration) of a syllable deter-
mines the number of characters used to write it (as with Japanese kana). A short syllable is
written with any single character in the range U+A500..U+A60B. Long syllables are written
with two characters, and involve a long vowel, a diphthong, or a syllable ending with
U+A60B vai syllable ng. Note that the only closed syllables in Vai—that is, those that end
with a consonant—are those ending with vai syllable ng. The long vowel is generally
written using either an additional standalone vowel to double the vowel sound of the pre-
ceding character, or using U+A60C vai syllable lengthener, while the diphthong is
generally written using an additional standalone vowel. In some cases, the second charac-
ter for a long vowel or diphthong may be written using characters such as U+A54C vai
syllable ha or U+A54E vai syllable wa instead of standalone vowels.
Historic Syllables. In The Book of Ndole, more than one character may be used to represent
the same pronounced syllable; these characters have been separately encoded.
Logograms. The oldest Vai texts used an additional set of symbols called “logograms,” rep-
resenting complete syllables with an associated meaning or range of meanings; these sym-
bols may be remnants from a precursor pictographic script. At least two of these symbols
are still used: U+A618 vai symbol faa represents the word meaning “die, kill” and is used
alongside a person’s date of death (the glyph is said to represent a wilting tree); U+A613
vai symbol feeng represents the word meaning “thing.”
Digits. In the 1920s ten decimal digits were devised for Vai; these digits were “Vai-style”
glyph variants of European digits. They never became popular with Vai people, but are
encoded in the standard for historical purposes. Modern literature uses European digits.
Punctuation. Vai makes use of European punctuation, although a small number of script-
specific punctuation marks commonly occur. U+A60D vai comma rests on or slightly
below the baseline; U+A60E vai full stop rests on the baseline and can be doubled for use
as an exclamation mark. U+A60F vai question mark also rests on the baseline; it is rarely
used. Some modern primers prefer these Vai punctuation marks; some prefer the Euro-
pean equivalents. Some Vai writers mark the end of a sentence by using U+A502 vai syl-
lable hee instead of punctuation.
Segmentation. Vai is written without spaces between words. Line breaking opportunities
can occur between most characters except that line breaks should not occur before
U+A60B vai syllable ng used as a syllable final, or before U+A60C vai syllable
lengthener (which is always a syllable final). Line breaks also should not occur before
one of the “h-” characters (U+A502, U+A526, U+A54C, U+A573, U+A597, U+A5BD,
U+A5E4) when it is used to extend the vowel of the preceding character (that is, when it is
a syllable final), and line breaks should not occur before the punctuation characters
U+A60D vai comma, U+A60E vai full stop, and U+A60F vai question mark.
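These prohibitions can be summarized as a simple check on the character that would begin the next line. The following Python sketch is a simplification: deciding whether one of the listed characters is actually acting as a syllable final requires syllable analysis that is not modeled here, so the caller supplies that judgment.

    VAI_LENGTHENER = 0xA60C
    VAI_NG = 0xA60B
    VAI_H_CHARACTERS = {0xA502, 0xA526, 0xA54C, 0xA573, 0xA597, 0xA5BD, 0xA5E4}
    VAI_PUNCTUATION = {0xA60D, 0xA60E, 0xA60F}   # comma, full stop, question mark

    def break_allowed_before(cp, acts_as_syllable_final=False):
        """Return False if a line break is prohibited before the character cp."""
        if cp == VAI_LENGTHENER or cp in VAI_PUNCTUATION:
            return False
        if acts_as_syllable_final and (cp == VAI_NG or cp in VAI_H_CHARACTERS):
            return False
        return True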
Ordering. There is no evidence of traditional conventions on ordering apart from the
order of listings found in syllabary charts. The syllables in the Vai block are arranged in the
order recommended by a panel of Vai script experts. Logograms should be sorted by their
phonetic values.
19.6 Bamum
Bamum: U+A6A0–U+A6FF
The Bamum script is used for the Bamum language, spoken primarily in western Camer-
oon. It was developed between 1896 and 1910, mostly by King Ibrahim Njoya of the
Bamum Kingdom. Apparently inspired by a dream and by awareness of other writing, his
original idea for the script was to collect and provide approximately 500 logographic sym-
bols (denoting objects and actions) to serve more as a memory aid than as a representation
of language.
Using the rebus principle, the script was rapidly simplified through six stages, known as
Phase A, Phase B, and so on, into a syllabary known as A-ka-u-ku, consisting of 80 syllable
characters or letters. These letters are used with two combining diacritics and six punctua-
tion marks. The repertoire in this block covers the A-ka-u-ku syllabary, or Phase G form,
which remains in modern use.
Structure. Modern Bamum is written left-to-right. One interesting feature is that some-
times more letters than necessary are used to write a given syllable. For example, the word
lam “wedding” is written using the sequence of syllabic characters la + a + m. This feature
is known as pleonastic syllable representation.
Diacritical Marks. U+A6F0 bamum combining mark koqndon may be applied to any of
the 80 letters. It usually functions to glottalize the final vowel of a syllable. U+A6F1 bamum
combining mark tukwentis is only known to be used with 13 letters—usually to truncate
a full syllable to its final consonant.
Punctuation. U+A6F2 bamum njaemli was a character used in the original set of logo-
graphic symbols to introduce proper names or to change the meaning of a word. The shape
of the glyph for njaemli has changed, but the character is still in use. The other punctuation
marks correspond in function to the similarly-named punctuation marks used in Euro-
pean typography.
Digits. The last ten letters in the syllabary are also used to represent digits. Historically, the
last of these was used for 10, but its meaning was changed to represent zero when decimal-
based mathematics was introduced.
The Bamum Supplement block covers distinct characters from the earlier phases
which are no longer part of the modern Bamum script.
The character names in this block include a reference to the last phase in which they
appear. So, for example, U+16867 bamum letter phase-b pit was last used during Phase
B, while U+168EE bamum letter phase-c pin continued in use and is attested through
Phase C.
Traditional Bamum texts using these historical characters do not use punctuation or digits.
Numerical values for digits are written out as words instead.
19.9 Adlam
Adlam: U+1E900–U+1E95F
Adlam is a script used to write Fulani and other African languages. The Fulani are a large,
historically nomadic tribe of Africa numbering more than 45 million and spread across
Senegambia (Senegal) to the banks of the Nile and the Red Sea. Depending on the lan-
guage, they are called by different names, including Fulani, Fula, Peul, Pul, Fut, Fellata,
Tekruri, Toucouleur, Peulh, Wasolonka, and Kourte.
The Fulani are today a widespread ethnic group in Africa, and the Fulani language is spo-
ken by more than 40 million.
During the late 1980s, brothers Ibrahima and Abdoulaye Barry devised this alphabetic
script to represent the Fulani language. After several years of development it was widely
adopted among Fulani communities and is currently taught at schools in Guinea, Nigeria,
Liberia and other nearby countries. The name Adlam is derived from the first four letters of
the alphabet (A, D, L, M), standing for Alkule Dandaydhe Leñol Mulugol (“the alphabet
that protects the peoples from vanishing”).
Structure. Adlam is a casing script with right-to-left directionality. Its letters can be written
separately or can be cursively joined in the same way that Arabic and N’Ko are. Joining is
optional, not obligatory.
Diacritical Marks. A range of diacritical marks is used. The lengthener U+1E944 adlam
alif lengthener is used only on the letters U+1E900 adlam capital letter alif and
U+1E922 adlam small letter alif. The lengthener U+1E945 adlam vowel length-
ener is used with other vowels. The U+1E946 adlam gemination mark marks long con-
sonants. These diacritical marks are typically high with capital letters, and high with small
letters with ascenders, but low with other small letters.
The diacritical mark U+1E947 adlam hamza is used atop a consonant when a glottal stop
occurs between it and the following vowel. The hamza has high and low variants. The mark
U+1E948 adlam consonant modifier is used to indicate foreign sounds, primarily in
Arabic transcription.
The U+1E94A adlam nukta is used to indicate both native and borrowed sounds. When
vowels are lengthened, however, the nukta is drawn below the vowels to indicate the
change. When drawn above a letter, the nukta is called hoortobbhere (“dot above”) in
Fulani; when drawn below, it is called lestobbhere (“dot below”).
This varied rendering of the Adlam nukta is similar to the behavior of some accents in
Latin typography, for which the rendering often depends on the availability of fonts, cul-
tural preferences, or the geographical area. A Latin example is the preference in Latvian and in Romanian for a comma-below diacritic shape for some letters, while a cedilla shape is preferred for the same letters in Turkish and in Marshallese.
Line Breaking. Letters have the same line breaking behavior as N’Ko.
Numbers. Adlam uses ten digits with a right-to-left directionality like the digits in N’Ko.
Punctuation. Adlam uses European punctuation and the U+061F arabic question
mark.
Cursive Joining. Cursive joining is used in some contexts. In a cursive context, all letters
are dual-joining with a base form, a left-joining form, a dual-joining form, and a right-join-
ing form. Diacritics do not break cursive connections.
Digits and punctuation do not participate in shaping. In a cursive context, U+0640 arabic
tatweel can be used for elongation.
19.10 Medefaidrin
Medefaidrin: U+16E40–U+16E9F
The Medefaidrin script is used to write the liturgical language Medefaidrin by members of
an indigenous Christian church, Oberi Okaime (“Church freely given”), which was active
in the Nigerian province of Calabar in the 1930s near the western bank of the Cross River. The
main spoken language for this group is Ibibio-Efik, which belongs to the Atlantic family of
the Niger-Congo languages.
The Medefaidrin script shows the strong influence of English orthography with the use of
capital and small letters, and a special sign for the pronoun “I”, which has both an uppercase and a lowercase form (U+16E44 medefaidrin capital letter atiu and U+16E64 mede-
faidrin small letter atiu). The community tradition is that this spirit language was
revealed to Bishop Ukpong, one of the founders of the community, in 1927 by divine inspi-
ration. The secretary of the group, Jakeld Udofia, transcribed the language to writing. Pres-
ently, the religious community counts about 4,000 members. The Medefaidrin language is
used for teaching Sunday school lessons and for saying prayers or meditation on the scrip-
tures.
Structure. Medefaidrin is written left to right. There is a close relationship between the
phonological analysis and the writing system: the letters are pronounced mostly as written.
Ordering. The order of Medefaidrin characters in the code charts reflects the traditional
ordering of Medefaidrin found in instruction materials.
Punctuation, Digits, and Other Marks. Medefaidrin uses a vigesimal (base-20) number
system that requires twenty digits. Script-specific punctuation marks are U+16E97 mede-
faidrin comma, U+16E98 medefaidrin full stop, and U+16E9A medefaidrin excla-
mation mark. Another unique mark is a symbol for the conjunction “or,” represented by
the Medefaidrin aiva, U+16E99 medefaidrin symbol aiva.
Chapter 20
Americas 20
The following scripts from the Americas are discussed in this chapter:
Cherokee
Canadian Aboriginal Syllabics
Osage
Deseret
The Cherokee script is a syllabary developed between 1815 and 1821 to write the Cherokee
language. The Cherokee script is still used by small communities in Oklahoma and North
Carolina.
Canadian Aboriginal Syllabics were invented in the 1830s for Algonquian languages in
Canada. The system has been extended many times, and is now actively used by other com-
munities, including speakers of Inuktitut and Athapascan languages.
The Osage script is an alphabet used to write the Osage language spoken by a Native Amer-
ican tribe in the United States. The language was written with a variety of ad-hoc orthographies and transcriptions for two centuries, until the Osage Nation developed its standard orthography in 2014.
Deseret is a phonemic alphabet devised in the 1850s to write English. It saw limited use for
a few decades by members of The Church of Jesus Christ of Latter-day Saints.
20.1 Cherokee
Cherokee: U+13A0–U+13FF
Cherokee Supplement: U+AB70–U+ABBF
The Cherokee script is used to write the Cherokee language. Cherokee is a member of the
Iroquoian language family. It is related to Cayuga, Seneca, Onondaga, Wyandot-Huron,
Tuscarora, Oneida, and Mohawk. The relationship is not close because roughly 3,000 years
ago the Cherokees migrated southeastward from the Great Lakes region of North America
to what is now North Carolina, Tennessee, and Georgia. Cherokee is the native tongue of
approximately 20,000 people, although most speakers today use it as a second language.
The Cherokee word for both the language and the people is ᏣᎳᎩ Tsalagi.
The Cherokee syllabary, as invented by Sequoyah between 1815 and 1821, contained 6
vowels and 17 consonants. Sequoyah avoided copying from other alphabets, but his origi-
nal letters were modified to make them easier to print. Samuel Worcester worked in con-
junction with Sequoyah, Chief Charles Hicks, and Charles Thompson (first cousin of
Sequoyah) in the design of the Cherokee type which was finalized in 1827. Using fonts
available to him, Worcester assigned a number of Latin letters to the Cherokee syllables. At
this time the Cherokee letter “MV” was dropped, and the Cherokee syllabary reached the
size of 85 letters. Worcester’s press printed 13,980,000 pages of Native American-language
text, most of it in Cherokee.
Structure. Cherokee is a left-to-right script. It has no Cherokee-specific combining charac-
ters.
Casing. Most existing Cherokee text is caseless. Traditionally, the forms of the syllable letters were designed at cap height—and in fact, a number of the Cherokee syllables are visu-
ally indistinguishable from Latin uppercase letters. As a result, most Cherokee text has the
visual appearance of all caps. The characters used for representing such unicameral Cher-
okee text are the basic syllables in the Cherokee block: U+13A0 cherokee letter a, and
so forth.
In some old printed material, such as the Cherokee New Testament, case conventions
adapted from the Latin script were used. Sentence-initial letters and initial letters for per-
sonal and place names, for example, were typeset using a larger size font. Furthermore, sys-
tematic distinction in casing has become more prevalent in modern typeset materials.
Starting with Version 8.0, the Unicode Standard includes a set of lowercase Cherokee sylla-
bles to accommodate the need to represent casing distinctions in Cherokee text. The Cher-
okee script is now encoded as a fully bicameral script, with case mapping.
The lowercase syllable letters are mostly encoded in the Cherokee Supplement block. A few
are encoded at the end of the Cherokee block, after the basic Cherokee syllable letters,
which are now treated as the uppercase of the case pairs.
The usual way for a script originally encoded in the Unicode Standard as a unicameral
script to later gain casing is by adding a new set of uppercase letters for it. The Cherokee
script is an important exception because the previously encoded Cherokee unicameral set
is treated as the uppercase as of Version 8.0, and the new set of letters is the lowercase.
The reason for this exception has to do with Cherokee typography and the status of exist-
ing fonts. Because all existing fonts already treated Cherokee syllable letters as cap height,
attempting to extend them by changing the existing letters to less than cap height and add-
ing new uppercase letters to the fonts would have destabilized the layout of all existing
Cherokee text. On the other hand, innovating in the fonts by adding new lowercase forms
with a smaller size and less than cap height allows a graceful introduction of casing without
invalidating the layout of existing text.
This exceptional introduction of a lowercase set to change a unicameral encoding into a
bicameral encoding has important implications that implementers of the Cherokee script
need to keep in mind. First, in order to preserve case folding stability, Cherokee case folds
to the previously encoded uppercase letters, rather than to the newly encoded lowercase
letters. This exceptional case folding behavior impacts identifiers, and so can trip up imple-
mentations if they are not prepared for it. Second, representation of cased Cherokee text
requires using the new lowercase letters for most of the body text, instead of just changing
a few initial letters to uppercase. That means that representation of traditional text such as
the Cherokee New Testament requires substantial re-encoding of the text. Third, the fact
that uppercase Cherokee still represents the default and is most widely supported in fonts
means that input systems which are extended to support the new lowercase letters face
unusual design choices.
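The following sketch illustrates this exceptional behavior using Python's built-in Unicode support (an informal illustration, not a normative interface; it assumes a Python build with Unicode 8.0 or later character data):

    # Cherokee casing: case mapping pairs the two sets, but case folding
    # targets the previously encoded uppercase letters.
    upper = "\u13A0"   # CHEROKEE LETTER A (originally encoded; now uppercase)
    lower = "\uAB70"   # CHEROKEE SMALL LETTER A (added in Version 8.0)

    print(upper.lower() == lower)     # True: ordinary case mapping
    print(lower.upper() == upper)     # True
    print(upper.casefold() == upper)  # True: folding leaves the uppercase letter unchanged
    print(lower.casefold() == upper)  # True: the new lowercase letter folds to the uppercase one

Identifier matching, caseless comparison, and similar processes therefore see the uppercase Cherokee letters as the folded form.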
Tones. Each Cherokee syllable can be spoken on one of four pitch or tone levels, or can
slide from one pitch to one or two others within the same syllable. However, only in certain
words does the tone of a syllable change the meaning. Tones are unmarked.
Input. Several keyboarding conventions exist for inputting Cherokee. Some involve dead-
key input based on Latin transliterations; some are based on sound-mnemonics related to
Latin letters on keyboards; and some are ergonomic systems based on frequency of the syl-
lables in the Cherokee language.
Numbers. Although Sequoyah invented a Cherokee number system, it was not adopted
and is not encoded in the Unicode Standard. The Cherokee Nation uses European num-
bers. Cherokee speakers pay careful attention to the use of ordinal and cardinal numbers.
When speaking of a numbered series, they will use ordinals. For example, when numbering
chapters in a book, Cherokee headings would use First Chapter, Second Chapter, and so
on, instead of Chapter One, Chapter Two, and so on.
Punctuation. Cherokee uses standard Latin punctuation.
Standards. There are no other encoding standards for Cherokee. Cherokee spelling is not
standardized: each person spells as the word sounds to him or her.
20.2 Canadian Aboriginal Syllabics
(Glyph example: the P-series syllables PE, PI, PO, PA in their four orientations, and the final P.)
Some variations in vowels also occur. For example, in Inuktitut usage, the syllable U+1450
ᑐ canadian syllabics to is transcribed into Latin letters as “TU” rather than “TO”, but
the structure of the syllabary is generally the same regardless of language.
Arrangement. The arrangement of signs follows the Algonquian ordering (down-pointing,
up-pointing, right-pointing, left-pointing), as in the previous example.
Sorted within each series are the variant forms for that series. Algonquian variants appear
first, then Inuktitut variants, then Athapascan variants. This arrangement is convenient and
consistent with the historical diffusion of Syllabics writing; it does not imply any hierarchy.
Some glyphs do not show the same down/up/right/left directions in the typical fashion—
for example, beginning with U+146B ᑫ canadian syllabics ke. These glyphs vary from the general pattern because of the shape of the basic glyph; they do not affect the convention.
Vowel length and labialization modify the character series through the addition of various
marks (for example, U+143E canadian syllabics pwii). Such modified characters are
considered unique syllables. They are not decomposed into base characters and one or
more diacritics. Some language families have different conventions for placement of the
modifying mark. For the sake of consistency and simplicity, and to support multiple North
American languages in the same document, each of these variants is assigned a unique
code point.
Extensions. A few additional syllables in the range U+166F..U+167F at the end of this
block have been added for Inuktitut, Woods Cree, and Blackfoot. Because these extensions
were encoded well after the main repertoire in the block, their arrangement in the code
charts is outside the framework for the rest of the characters in the block.
Punctuation and Symbols. Languages written using the Canadian Aboriginal Syllabics
make use of the common punctuation marks of Western typography. However, a few
punctuation marks are specific in form and are separately encoded as script-specific marks
for syllabics. These include: U+166E canadian syllabics full stop and U+1400 cana-
dian syllabics hyphen.
There is also a special symbol, U+166D canadian syllabics chi sign, used in religious
texts as a symbol to denote Christ.
20.3 Osage
Osage: U+104B0–U+104FF
The Osage script is used to write the Osage language. This language is spoken by a Native
American tribe of the Great Plains that originated in the Ohio River valley area of the pres-
ent-day United States. By the 17th century, the Osage people had migrated to their current
locations in Missouri, Kansas, Arkansas, Oklahoma, and Texas. The term “Osage” roughly
translates to “mid-waters.”
For two centuries, the Osage language was written with a variety of ad-hoc Latin orthogra-
phies and transcription systems. In 2004, the Osage Nation initiated a program to develop
a standard orthography to write the language. By 2006, a practical orthography had been
designed based on modifications or fusions of the shapes of Latin letters. Use of the Osage
orthography led to further improvements, culminating in the adoption of the current set of
letters in 2014.
Structure. Osage is a left-to-right alphabetic script. It has no Osage-specific combining
characters, but makes use of common diacritical marks.
Casing. Casing is used in the standard Osage orthography.
Vowels. Diacritical marks are used in Osage to distinguish length, nasalization, and
accents. The particular diacritical marks used to make these distinctions are shown in
Table 20-1.
Numbers and Punctuation. Osage uses European numbers and standard Latin punctua-
tion.
20.4 Deseret
Deseret: U+10400–U+1044F
Deseret is a phonemic alphabet devised to write the English language. It was originally
developed in the 1850s by the regents of the University of Deseret, now the University of
Utah. It was promoted by The Church of Jesus Christ of Latter-day Saints, also known as
the “Mormon” or LDS Church, under Church President Brigham Young (1801–1877). The
name Deseret is taken from a word in the Book of Mormon defined to mean “honeybee”
and reflects the LDS use of the beehive as a symbol of cooperative industry. Most literature
about the script treats the term Deseret Alphabet as a proper noun and capitalizes it as such.
Among the designers of the Deseret Alphabet was George D. Watt, who had been trained in
shorthand and served as Brigham Young’s secretary. It is possible that, under Watt’s influ-
ence, Sir Isaac Pitman’s 1847 English Phonotypic Alphabet was used as the model for the
Deseret Alphabet.
The Deseret Alphabet was a work in progress through most of the 1850s, with the set of let-
ters and their shapes changing from time to time. The final version was used for the printed
material of the late 1860s, but earlier versions are found in handwritten manuscripts.
The Church commissioned two typefaces and published four books using the Deseret
Alphabet. The Church-owned Deseret News also published passages of scripture using the
alphabet on occasion. In addition, some historical records, diaries, and other materials
were handwritten using this script, and it had limited use on coins and signs. There is also
one tombstone in Cedar City, Utah, written in the Deseret Alphabet. However, the script
failed to gain wide acceptance and was not actively promoted after 1869. Today, the
Deseret Alphabet remains of interest primarily to historians and hobbyists.
Letter Names and Shapes. Pedagogical materials produced by the LDS Church gave
names to all of the non-vowel letters and indicated the vowel sounds with English exam-
ples. In the Unicode Standard, the spelling of the non-vowel letter names has been modi-
fied to clarify their pronunciations, and the vowels have been given names that emphasize
the parallel structure of the two vowel runs.
The glyphs used in the Unicode Standard are derived from the second typeface commis-
sioned by the LDS Church and represent the shapes most commonly encountered. Alter-
nate glyphs are found in the first typeface and in some instructional material.
Structure. The final version of the script consists of 38 letters, long i through eng. Two
additional letters, oi and ew, found only in handwritten materials, are encoded after the
first 38. The alphabet is bicameral; capital and small letters differ only in size and not in
shape. The order of the letters is phonetic: letters for similar classes of sound are grouped
together. In particular, most consonants come in unvoiced/voiced pairs. Forty-letter ver-
sions of the alphabet inserted oi after ay and ew after ow.
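A brief, non-normative illustration of the bicameral encoding: the capital letters at U+10400..U+10427 have case mappings to the small letters at U+10428..U+1044F, which ordinary case-conversion functions can use (sketched here with Python's built-in case mappings):

    cap_long_i   = "\U00010400"   # DESERET CAPITAL LETTER LONG I
    small_long_i = "\U00010428"   # DESERET SMALL LETTER LONG I

    print(cap_long_i.lower() == small_long_i)   # True
    print(small_long_i.upper() == cap_long_i)   # True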
Sorting. The order of the letters in the Unicode Standard is the one used in all but one of
the nineteenth-century descriptions of the alphabet. The exception is one in which the let-
ters wu and yee are inverted. The order yee-wu follows the order of the “coalescents” in
Pitman’s work; the order wu-yee appears in a greater number of Deseret materials, how-
ever, and has been followed here.
Alphabetized material followed the standard order of the Deseret Alphabet shown in the code charts, except that the short and long vowel pairs were grouped together, with the long vowel first and then the short vowel.
Typographic Conventions. The Deseret Alphabet is written from left to right. Punctuation,
capitalization, and digits are the same as in English. All words are written phonemically
with the exception of short words that have pronunciations equivalent to letter names, as
shown in Figure 20-1.
Chapter 21
Notational Systems 21
This chapter discusses various notational systems.
Braille consists of a related set of notational systems, using raised dots embossed on paper
or other mediums to provide a tactile writing system for the blind. The patterns of dots are
associated with the letters or syllables of other writing systems, but the particular rules of
association vary from language to language. The Unicode Standard encodes a complete set
of symbols for the shapes of Braille patterns; however, the association of letters with the pat-
terns is left to other standards. Text should normally be represented using the regular Uni-
code characters of the script. Only when the intent is to convey a particular binding of text
to a Braille pattern sequence should it be represented using the symbols for the Braille pat-
terns.
Musical notation—particularly Western musical notation—is different from ordinary text
in the way it is laid out, especially the representation of pitch and duration in Western
musical notation. However, ordinary text commonly refers to the basic graphical elements
that are used in musical notation, so such symbols are encoded in the Unicode Standard.
Additional sets of symbols are encoded to support historical systems of musical notation.
Duployan is an uncased, alphabetic stenographic writing system invented by Emile
Duployé, and published in 1860. It was one of the two most commonly used French short-
hands. The Duployan shorthands are used as a secondary shorthand for writing French,
English, German, Spanish, and Romanian. An adaptation and augmentation of Duployan
was used as an alternate primary script for several First Nations’ languages in interior Brit-
ish Columbia, Canada.
Sutton SignWriting is a notational system developed in 1974 by Valerie Sutton and used for
the transcription of many sign languages. It is a featural writing system, in which visually
iconic basic symbols are arranged in two-dimensional layout to form snapshots of the indi-
vidual signs of a sign language, which are roughly equivalent to words. The Unicode Stan-
dard encodes the basic symbols as atomic characters or combining character sequences.
21.1 Braille
Braille Patterns: U+2800–U+28FF
Braille is a writing system used by blind people worldwide. It uses a system of six or eight
raised dots, arranged in two columns of three or four dots, respectively. Eight-dot sys-
tems build on six-dot systems by adding two extra dots above or below the core matrix. Six-
dot Braille allows 64 possible combinations, and eight-dot Braille allows 256 possible pat-
terns of dot combinations. There is no fixed correspondence between a dot pattern and a
character or symbol of any given script. Dot pattern assignments are dependent on context
and user community. A single pattern can represent an abbreviation or a frequently occur-
ring short word. For a number of contexts and user communities, the series of ISO technical reports starting with ISO/TR 11548-1 provides standardized correspondence tables as well as invocation sequences to indicate a context switch.
The Unicode Standard encodes a single complete set of 256 eight-dot patterns. This set
includes the 64 dot patterns needed for six-dot Braille.
The character names for Braille patterns are based on the assignments of the dots of the
Braille pattern to digits 1 to 8 as follows:
1 4
2 5
3 6
7 8
The designation of dots 1 to 6 corresponds to that of six-dot Braille. The additional dots 7
and 8 are added beneath. The character name for a Braille pattern consists of braille pat-
tern dots-12345678, where only those digits corresponding to dots in the pattern are
included. The name for the empty pattern is braille pattern blank.
The 256 Braille patterns are arranged in the same sequence as in ISO/TR 11548-1, which is
based on an octal number generated from the pattern arrangement. Octal numbers are
associated with each dot of a Braille pattern in the following way:
1 10
2 20
4 40
100 200
The octal number is obtained by adding the values corresponding to the dots present in the
pattern. Octal numbers smaller than 100 are expanded to three digits by inserting leading
zeroes. For example, the dots of braille pattern dots-1247 are assigned to the octal values of 1₈, 2₈, 10₈, and 100₈. The octal number representing the sum of these values is 113₈.
The assignment of meanings to Braille patterns is outside the scope of this standard.
Example. According to ISO/TR 11548-2, the character latin capital letter f can be rep-
resented in eight-dot Braille by the combination of the dots 1, 2, 4, and 7 (braille pattern
dots-1247). A full circle corresponds to a tangible (set) dot, and empty circles serve as posi-
tion indicators for dots not set within the dot matrix:
● ●
● ○
○ ○
● ○
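Because the Unicode block is arranged in the same sequence, starting from U+2800 braille pattern blank, the sum of the dot weights gives both the ISO/TR 11548-1 octal number and the offset of the code point within the block. The following minimal Python sketch (an illustration only; the helper name is arbitrary) reproduces the example above:

    import unicodedata

    # Octal weights of dots 1-8, as given above.
    WEIGHTS = {1: 0o1, 2: 0o2, 3: 0o4, 4: 0o10, 5: 0o20, 6: 0o40, 7: 0o100, 8: 0o200}

    def braille_pattern(dots):
        value = sum(WEIGHTS[d] for d in dots)
        return value, chr(0x2800 + value)

    value, ch = braille_pattern({1, 2, 4, 7})
    print(oct(value))            # 0o113, matching the ISO/TR 11548-1 value
    print(f"U+{ord(ch):04X}")    # U+284B
    print(unicodedata.name(ch))  # BRAILLE PATTERN DOTS-1247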
Usage Model. The eight-dot Braille patterns in the Unicode Standard are intended to be
used with either style of eight-dot Braille system, whether the additional two dots are con-
sidered to be in the top row or in the bottom row. These two systems are never intermixed
in the same context, so their distinction is a matter of convention. The intent of encoding
the 256 Braille patterns in the Unicode Standard is to allow input and output devices to be
implemented that can interchange Braille data without having to go through a context-
dependent conversion from semantic values to patterns, or vice versa. In this manner,
final-form documents can be exchanged and faithfully rendered. At the same time, process-
ing of textual data that require semantic support is intended to take place using the regular
character assignments in the Unicode Standard.
Imaging. When output on a Braille device, dots shown as black are intended to be ren-
dered as tangible. Dots shown in the standard as open circles are blank (not rendered as
tangible). The Unicode Standard does not specify any physical dimension of Braille char-
acters.
In the absence of a higher-level protocol, Braille patterns are output from left to right.
When used to render final form (tangible) documents, Braille patterns are normally not
intermixed with any other Unicode characters except control codes.
Script. Unlike other sets of symbols, the Braille Patterns are given their own, unique value
of the Script property in the Unicode Standard. This follows both from the behavior of Braille in forming a consistent writing system on its own terms and from the independent bibliographic status of books and other documents printed in Braille. For more
information on the Script property, see Unicode Standard Annex #24, “Unicode Script
Property.”
21.2 Western Musical Symbols
In some texts—particularly for applications in early music—note heads, stems, flags, and other
associated symbols may need to be rendered in different colors—for example, red.
Symbols in Other Blocks. U+266D music flat sign, U+266E music natural sign, and
U+266F music sharp sign—three characters that occur frequently in musical notation—
are encoded in the Miscellaneous Symbols block (U+2600..U+267F). However, four char-
acters also encoded in that block are to be interpreted merely as dingbats or miscellaneous
symbols, not as representing actual musical notes:
U+2669 quarter note
U+266A eighth note
U+266B beamed eighth notes
U+266C beamed sixteenth notes
Processing. Most musical symbols can be thought of as simple spacing characters when
used inline within texts and examples, even though they behave in a more complex manner
in full musical layout. Some characters are meant only to be combined with others to pro-
duce combined character sequences, representing musical notes and their particular artic-
ulations. Musical symbols can be input, processed, and displayed in a manner similar to
mathematical symbols. When embedded in text, most of the symbols are simple spacing
characters with no special properties. A few characters have format control functions, as
described later in this section.
Input Methods. Musical symbols can be entered via standard alphanumeric keyboard, via
piano keyboard or other device, or by a graphical method. Keyboard input of the musical
symbols may make use of techniques similar to those used for Chinese, Japanese, and
Korean. In addition, input methods utilizing pointing devices or piano keyboards could be
developed similar to those in existing musical layout systems. For example, within a graph-
ical user interface, the user could choose symbols from a palette-style menu.
Directionality. When combined with right-to-left texts—in Hebrew or Arabic, for exam-
ple—the musical notation is usually written from left to right in the normal manner. The
words are divided into syllables and placed under or above the notes in the same fashion as
for Latin and other left-to-right scripts. The individual words or syllables corresponding to
each note, however, are written in the dominant direction of the script.
The opposite approach is also known: in some traditions, the musical notation is actually
written from right to left. In that case, some of the symbols, such as clef signs, are mirrored;
other symbols, such as notes, flags, and accidentals, are not mirrored. All responsibility for
such details of bidirectional layout lies with higher-level protocols and is not reflected in
any character properties. Figure 21-1 exemplifies this principle with two musical passages.
The first example shows Turkish lyrics in Arabic script with ordinary left-to-right musical
notation; the second shows right-to-left musical notation. Note the partial mirroring.
Format Characters. Extensive ligature-like beams are used frequently in musical notation
between groups of notes having short values. The practice is widespread and very predict-
able and is therefore amenable to algorithmic handling. The format characters U+1D173
musical symbol begin beam and U+1D174 musical symbol end beam can be used to
indicate the extents of beam groupings. In some exceptional cases, beams are left unclosed
on one end. This status can be indicated with a U+1D159 musical symbol null note-
head character if no stem is to appear at the end of the beam.
Similarly, format characters have been provided for other connecting structures. The char-
acters U+1D175 musical symbol begin tie, U+1D176 musical symbol end tie,
U+1D177 musical symbol begin slur, U+1D178 musical symbol end slur, U+1D179
musical symbol begin phrase, and U+1D17A musical symbol end phrase indicate the
extent of these features. Like beaming, these features are easily handled in an algorithmic
fashion.
These pairs of characters modify the layout and grouping of notes and phrases in full musi-
cal notation. When musical examples are written or rendered in plain text without special
software, the start/end format characters may be rendered as brackets or left uninterpreted.
To the extent possible, more sophisticated software that renders musical examples inline
with natural-language text might interpret them in their actual format control capacity,
rendering slurs, beams, and so forth, as appropriate.
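For example, a beamed pair of eighth notes could be represented in plain text by bracketing the notes with the begin and end beam characters; whether a renderer actually draws the beam is up to the implementation, as noted above. A small sketch:

    # begin beam, eighth note, eighth note, end beam
    beamed_pair = "\U0001D173\U0001D160\U0001D160\U0001D174"
    print([f"U+{ord(c):04X}" for c in beamed_pair])
    # ['U+1D173', 'U+1D160', 'U+1D160', 'U+1D174']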
Precomposed Note Characters. For maximum flexibility, the character set includes both
precomposed note values and primitives from which complete notes may be constructed.
The precomposed versions are provided mainly for convenience. However, if any normal-
ization form is applied, including NFC, the characters will be decomposed. For further
information, see Section 3.11, Normalization Forms. The canonical equivalents for these
characters are given in the Unicode Character Database and are illustrated in Figure 21-2.
(Figure 21-2. Canonical equivalents of the precomposed notes:
U+1D15E = U+1D157 + U+1D165
U+1D15F = U+1D158 + U+1D165
U+1D160 = U+1D158 + U+1D165 + U+1D16E
U+1D161 = U+1D158 + U+1D165 + U+1D16F
U+1D162 = U+1D158 + U+1D165 + U+1D170
U+1D163 = U+1D158 + U+1D165 + U+1D171
U+1D164 = U+1D158 + U+1D165 + U+1D172)
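The practical consequence can be sketched with Python's unicodedata module (an illustration, not a normative statement): the precomposed notes carry canonical decompositions and are not recomposed, so even NFC leaves the built-up sequence.

    import unicodedata

    half_note = "\U0001D15E"   # MUSICAL SYMBOL HALF NOTE

    nfd = unicodedata.normalize("NFD", half_note)
    nfc = unicodedata.normalize("NFC", half_note)

    print([f"U+{ord(c):04X}" for c in nfd])   # ['U+1D157', 'U+1D165']
    print(nfc == nfd)                         # True: NFC does not restore the precomposed note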
Alternative Noteheads. More complex notes built up from alternative noteheads, stems,
flags, and articulation symbols are necessary for complete implementations and complex
scores. Examples of their use include American shape-note and modern percussion nota-
tions, as shown in the first line of Figure 21-3.
(Figure 21-3. First line: notes built from alternative noteheads, U+1D147 + U+1D165 and U+1D143 + U+1D165. Second line: a beamed sequence extended over a null notehead, U+1D173 + U+1D160 + U+1D160 + U+1D159 + U+1D16E + U+1D174.)
U+1D159 musical symbol null notehead is a special notehead that has no distinct
visual appearance of its own. It can be used as an anchor for a combining flag in compli-
cated musical scoring. For example, in a beamed sequence of notes, the beam might be
extended beyond visible notes, as shown in the second line of Figure 21-3. Even though the
null notehead has no visual appearance of its own, it is not a default ignorable code point;
some indication of its presence, as for instance a dotted box glyph, should be shown if dis-
played outside of a context that supports full musical rendering.
Augmentation Dots and Articulation Symbols. Augmentation dots and articulation sym-
bols may be appended to either the precomposed or built-up notes. In addition, augmenta-
tion dots and articulation symbols may be repeated as necessary to build a complete note
symbol. Examples of the use of augmentation dots and articulation symbols are shown in
Figure 21-4.
(Figure 21-4. Built-up notes with augmentation dots and articulation symbols:
U+1D158 + U+1D165 + U+1D16E + U+1D16D
U+1D158 + U+1D165 + U+1D17C
U+1D158 + U+1D165 + U+1D16E + U+1D17B + U+1D16D + U+1D16D)
Gregorian. The punctum, or Gregorian brevis, a square shape, is unified with U+1D147
musical symbol square notehead black. The Gregorian semibrevis, a diamond or loz-
enge shape, is unified with U+1D1BA musical symbol semibrevis black. Thus Grego-
rian notation, medieval notation, and modern notation either require separate fonts in
practice or need font features to make subtle differentiations between shapes where
required.
Kievan. Kievan musical notation is a form of linear musical notation found in religious
chant books of the Russian Orthodox Church, among others. It is also referred to as East
Slavic musical notation. The notation originated in the 1500s, and the first books using
Kievan notation were published in 1772. The notation is still used today.
Unlike Western plainchant, Kievan is written on a five-line staff (encoded at U+1D11A)
with uniquely shaped notes, and several distinct symbols, including its own C clef and flat
signs. U+1D1DF musical symbol kievan end of piece is analogous to the Western
U+1D102 musical symbol final barline.
Beaming is used in Kievan notation occasionally, and the existing musical format charac-
ters encoded between U+1D173 and U+1D17A may be used in implementations of beam-
ing in higher-level protocols.
21.4 Ancient Greek Musical Notation
Naming Conventions. The character names are based on the standard names widely used
by modern scholars. There is no standardized ancient system for naming these characters.
Apparent gaps in the numbering sequence are due to the unification with standard letters
and between vocal and instrumental notations.
If a symbol is used in both the vocal notation system and the instrumental notation system,
its Unicode character name is based on the vocal notation system catalog number. Thus
U+1D20D greek vocal notation symbol-14 has a glyph based on an inverted capital
lambda. In the vocal notation system, it represents the first sharp of B; in the instrumental
notation system, it represents the first sharp of d’. Because it is used in both systems, its
name is based on its sequence in the vocal notation system, rather than its sequence in the
instrumental notation system. The character names list in the Unicode Character Database
is fully annotated with the functions of the symbols for each system.
Font. Scholars usually typeset musical characters in sans-serif fonts to distinguish them
from standard letters, which are usually represented with a serifed font. However, this is
not required. The code charts use a font without serifs for reasons of clarity.
Combining Marks. The combining marks encoded in the range U+1D242..U+1D244 are
placed over the vocal or instrumental notation symbols. They are used to indicate metrical
qualities.
21.5 Duployan
Duployan: U+1BC00–U+1BC9F
The Duployan shorthands are used to write French, English, German, Spanish, and Roma-
nian. The original Duployan shorthand was invented by Emile Duployé, and published in
1860 as a stenographic shorthand for French. It was one of the two most commonly used
French shorthands. There are three main English adaptations from the late 19th and early
20th centuries based on Duployan: Pernin, Sloan, and Perrault. None were as popular as
the Gregg and Pitman shorthands.
An adaptation and augmentation of Duployan by Father Jean Marie Raphael LeJeune was
used as an alternate primary script for several First Nations’ languages in interior British
Columbia, including Chinook Jargon, Okanagan, Lilooet, Shushwap, and North Thomp-
son. Its original use and greatest surviving attestation is from the Kamloops Wawa, a Chi-
nook Jargon newsletter of the Catholic diocese of Kamloops, British Columbia, published
1891–1923. Chinook Jargon was a trade language widely spoken from southeast Alaska to
northern California, from the Pacific to the Rockies, and sporadically outside this area. The
Chinook script uses the basic Duployan inventory, with the addition of several derived let-
terforms and compound letters.
Structure. Duployan is an uncased, alphabetic stenographic writing system. The model let-
terforms are generally based on circles and lines. It is a left-to-right script.
The basic inventory of consonant and vowel signs has been augmented over the years to
provide more efficient shorthands and has been adapted to the phonologies of languages
other than the original French. The Romanian, Pernin, Perrault, and Sloan stenographic orthographies add a few letters or letterforms, ideographs, and several combined letters.
The core repertoire of Duployan contains several classes of letters, differentiated primarily
by visual form and stroke direction, and nominally by phonetic value. Letter classes
include the line consonants (P, T, F, K, and L-type), arc consonants (M, N, J, and S-type),
circle vowels (A and O vowels), nasal vowels, and orienting vowels (U/EU, I/E). In addi-
tion, the Chinook writing contains spacing letters, compound consonants, and a logo-
graph. The extended Duployan shorthand includes four other letter classes—the complex
letters (multisyllabic symbols with consonant forms), and high, low, and connecting termi-
nals for common word endings.
The repertoire also includes U+1BC9D duployan thick letter
selector, which modifies a preceding Duployan character by causing it to be rendered
bold.
21.6 Sutton SignWriting
<U+1D9FF signwriting head, U+1DA16 signwriting eyes closed, fill>, the fill modi-
fier selects between one eye and both eyes closed.
The rotation modifiers turn a base character by 45 degree increments. In combination with
handshape characters, the rotation modifiers also distinguish between the right and left
hand characters. U+1DAA4 signwriting rotation modifier-5 turns a base character by
180 degrees. For a handshape that distinguishes between right and left hand shapes,
U+1DAAC signwriting rotation modifier-13 turns the left hand shape 180 degrees.
Punctuation. Sutton SignWriting uses five script-specific punctuation marks. These
include U+1DA8B signwriting parenthesis, which represents an opening parenthesis.
A closing parenthesis is represented with the sequence <U+1DA8B signwriting paren-
thesis, U+1DAA4 signwriting rotation modifier-5>.
Chapter 22
Symbols 22
The universe of symbols is rich and open-ended. The collection of encoded symbols in the Unicode Standard encompasses many different kinds of symbols.
Pictorial or graphic items for which there is no demonstrated need or strong desire to
exchange in plain text are not encoded in the standard.
Combining marks may be used with symbols, particularly the set encoded at U+20D0..
U+20FF (see Section 7.9, Combining Marks).
Letterlike and currency symbols, as well as numerals, superscripts, and subscripts, are typ-
ically subject to the same font and style changes as the surrounding text. Where square and
enclosed symbols occur in East Asian contexts, they generally follow the prevailing type
styles.
Other symbols have an appearance that is independent of type style, or a more limited or
altogether different range of type style variation than the regular text surrounding them.
For example, mathematical alphanumeric symbols are typically used for mathematical
variables; those letterlike symbols that are part of this set carry semantic information in
their type style. This fact restricts—but does not completely eliminate—possible style vari-
ations. However, symbols such as mathematical operators can be used with any script or
independent of any script.
Special invisible operator characters can be used to explicitly encode some mathematical
operations, such as multiplication, which are normally implied by juxtaposition. This aids
in automatic interpretation of mathematical notation.
In a bidirectional context (see Unicode Standard Annex #9, “Unicode Bidirectional Algo-
rithm”), most symbol characters have no inherent directionality but resolve their direc-
tionality for display according to the Unicode Bidirectional Algorithm. For some symbols,
such as brackets and mathematical operators whose image is not bilaterally symmetric, the
mirror image is used when the character is part of the right-to-left text stream (see
Section 4.7, Bidi Mirrored).
Dingbats and optical character recognition characters are different from all other charac-
ters in the standard, in that they are encoded based primarily on their precise appearance.
Many symbols encoded in the Unicode Standard are intended to support legacy imple-
mentations and obsolescent practices, such as terminal emulation or other character mode
user interfaces. Examples include box drawing components and control pictures.
A number of symbols are also encoded for compatibility with the emoji (“picture charac-
ter,” or pictograph) sets encoded by several Japanese cell phone carriers as extensions of
the JIS X 0208 character set. Those symbols are interchanged as plain text, and are encoded
in the Unicode Standard to support interoperability. Other symbols—many of which are
also pictographic—are encoded for compatibility with Webdings and Wingdings sets, or
various e-mail systems, and to address other interchange requirements.
Many of the symbols encoded in Unicode can be used as operators or given some other
syntactical function in a formal language syntax. For more information, see Unicode Stan-
dard Annex #31, “Unicode Identifier and Pattern Syntax.”
22.1 Currency Symbols
Claims that glyph variants of a certain currency symbol are used consistently to indicate a
particular currency could not be substantiated upon further research. Therefore, the Uni-
code Standard considers these variants to be typographical and provides a single encoding
for them. See ISO/IEC 10367, Annex B (informative), for an example of multiple render-
ings for U+00A3 pound sign.
Fonts. Currency symbols are commonly designed to display at the same width as a digit
(most often a European digit, U+0030..U+0039) to assist in alignment of monetary values
in tabular displays. Like letters, they tend to follow the stylistic design features of particular
fonts because they are used often and need to harmonize with body text. In particular, even
though there may be more or less normative designs for the currency sign per se, as for the
euro sign, type designers freely adapt such designs to make them fit the logic of the rest of
their fonts. This partly explains why currency signs show more glyph variation than other
types of symbols.
Lira Sign. A separate currency sign U+20A4 lira sign is encoded for compatibility with
the HP Roman-8 character set, which is still widely implemented in printers. In general,
U+00A3 pound sign may be used for both the various currencies known as pound (or
punt) and the currencies known as lira. Examples include the British pound sterling, the
historic Irish punt, and the former lira currency of Italy. Until 2012, the lira sign was also
used for the Turkish lira, but for current Turkish usage, see U+20BA turkish lira sign.
As in the case of the dollar sign, the glyphic distinction between single- and double-bar ver-
sions of the sign is not indicative of a systematic difference in the currency.
Dollar and Peso. The dollar sign (U+0024) is used for many currencies in Latin America
and elsewhere. In particular, this use includes current and discontinued Latin American
peso currencies, such as the Mexican, Chilean, Colombian and Dominican pesos. How-
ever, the Philippine peso uses a different symbol found at U+20B1.
Yen and Yuan. Like the dollar sign and the pound sign, U+00A5 yen sign has been used as
the currency sign for more than one currency. The double-crossbar glyph is the official
form for both the yen currency of Japan (JPY ) and for the yuan (renminbi) currency of
China (CNY ). This is the case, despite the fact that some glyph standards historically spec-
ified a single-crossbar form, notably the OCR-A standard ISO 1073-1:1976, which influ-
enced the representative glyph in various character set standards from China. In the
Unicode Standard, U+00A5 yen sign is intended to be the character for the currency sign
for both the yen and the yuan, independent of the details of glyphic presentation.
As listed in Table 22-1, there are also a number of CJK ideographs to represent the words
yen (or en) and yuan, as well as the Korean word won, and these also tend to overlap in use
as currency symbols.
Euro Sign. The single currency for member countries of the European Economic and
Monetary Union is the euro (EUR). The euro character is encoded in the Unicode Stan-
dard as U+20AC euro sign.
Indian Rupee Sign. U+20B9 ₹ indian rupee sign is the character encoded to represent
the Indian rupee currency symbol introduced by the Government of India in 2010 as the
official currency symbol for the Indian rupee (INR). It is distinguished from U+20A8
rupee sign, which is an older symbol not formally tied to any particular currency. There
are also a number of script-specific rupee symbols encoded for historic usage by various
scripts of India. See Table 22-1 for a listing.
Rupee is also the common name for a number of currencies for other countries of South
Asia and of Indonesia, as well as several historic currencies. It is often abbreviated using
Latin letters, or may be spelled out or abbreviated in the Arabic script, depending on local
conventions.
Turkish Lira Sign. The Turkish lira sign, encoded as U+20BA ₺ turkish lira sign, is a
symbol representing the lira currency of Turkey. Prior to the introduction of the new sym-
bol in 2012, the currency was typically abbreviated with the letters “TL”. The new symbol
was selected by the Central Bank of Turkey from entries in a public contest and is quickly
gaining common use, but the old abbreviation is also still in use.
Ruble Sign. The ruble sign, encoded as U+20BD ₽ ruble sign, was adopted as the official symbol for the currency of the Russian Federation in 2013. Ruble is also used as the name of
various currencies in Eastern Europe. In English, both spellings “ruble” and “rouble” are
used.
Lari Sign. The lari sign, encoded as U+20BE ₾ lari sign, was adopted as the official sym-
bol for the currency of Georgia in 2014. The name lari is an old Georgian word denoting a
hoard or property. The image for the lari sign is based on the letter U+10DA ლ georgian
letter las. The lari currency was established on October 2, 1995.
Bitcoin Sign. U+20BF bitcoin sign represents the bitcoin, a cryptocurrency and payment
system invented by programmers. A cryptocurrency such as the bitcoin works as a medium
of exchange that uses cryptography to secure transactions and to control the creation of
additional units of currency. It is categorized as a decentralized virtual or digital currency.
Other Currency Symbols. Additional forms of currency symbols are found in the Small
Form Variants (U+FE50..U+FE6F) and the Halfwidth and Fullwidth Forms
(U+FF00..U+FFEF) blocks. Those symbols have the General_Category property value
Currency_Symbol (gc = Sc).
Ancient Greek and Roman monetary symbols, for such coins and values as the Greek obol
or the Roman denarius and as, are encoded in the Ancient Greek Numbers
(U+10140..U+1018F) and Ancient Symbols (U+10190..U+101CF) blocks. Those symbols
denote values of weights and currencies, but are not used as regular currency symbols. As
such, their General_Category property value is Other_Symbol (gc = So).
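These property values can be checked programmatically; for example, with Python's unicodedata module (a sketch, not a normative interface):

    import unicodedata

    print(unicodedata.category("\u20AC"))      # Sc -- EURO SIGN
    print(unicodedata.category("\uFF04"))      # Sc -- FULLWIDTH DOLLAR SIGN
    print(unicodedata.category("\U00010190"))  # So -- first character of the Ancient Symbols block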
22.2 Letterlike Symbols
Unit Symbols. Several letterlike symbols are used to indicate units. In most cases, however,
such as for SI units (Système International), the use of regular letters or other symbols is
preferred. U+2113 script small l is commonly used as a non-SI symbol for the liter. Offi-
cial SI usage prefers the regular lowercase letter l.
Three letterlike symbols have been given canonical equivalence to regular letters: U+2126
ohm sign, U+212A kelvin sign, and U+212B angstrom sign. In all three instances, the
regular letter should be used. If text is normalized according to Unicode Standard Annex
#15, “Unicode Normalization Forms,” these three characters will be replaced by their reg-
ular equivalents.
In normal use, it is better to represent degrees Celsius “°C” with a sequence of U+00B0
degree sign + U+0043 latin capital letter c, rather than U+2103 degree celsius. For
searching, treat these two sequences as identical. Similarly, the sequence U+00B0 degree
sign + U+0046 latin capital letter f is preferred over U+2109 degree fahrenheit,
and those two sequences should be treated as identical for searching.
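Both equivalences can be seen with Python's unicodedata module (a brief, non-normative sketch): canonical normalization replaces the three duplicate signs with the regular letters, and compatibility normalization maps the degree characters to the preferred two-character sequences.

    import unicodedata

    # Canonical equivalences: NFC replaces the duplicate signs with regular letters.
    print(unicodedata.normalize("NFC", "\u2126"))   # Ω  (U+03A9 GREEK CAPITAL LETTER OMEGA)
    print(unicodedata.normalize("NFC", "\u212A"))   # K  (U+004B)
    print(unicodedata.normalize("NFC", "\u212B"))   # Å  (U+00C5)

    # Compatibility equivalences: NFKC maps U+2103 and U+2109 to the preferred sequences.
    print(unicodedata.normalize("NFKC", "\u2103") == "\u00B0C")   # True: degree sign + C
    print(unicodedata.normalize("NFKC", "\u2109") == "\u00B0F")   # True: degree sign + F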
Compatibility. Some symbols are composites of several letters. Many of these composite
symbols are encoded for compatibility with Asian and other legacy encodings. (See also
“CJK Compatibility Ideographs” in Section 18.1, Han.) The use of these composite symbols
is discouraged where their presence is not required by compatibility. For example, in nor-
mal use, the symbols U+2121 TEL telephone sign and U+213B FAX facsimile sign are
simply spelled out.
In the context of East Asian typography, many letterlike symbols, and in particular com-
posites, form part of a collection of compatibility symbols, the larger part of which is
located in the CJK Compatibility block (see Section 22.10, Enclosed and Square). When
used in this way, these symbols are rendered as “wide” characters occupying a full cell.
They remain upright in vertical layout, contrary to the rotated rendering of their regular
letter equivalents. See Unicode Standard Annex #11, “East Asian Width,” for more infor-
mation.
Where the letterlike symbols have alphabetic equivalents, they collate in alphabetic
sequence; otherwise, they should be treated as symbols. The letterlike symbols may have
different directional properties than normal letters. For example, the four transfinite cardi-
nal symbols (U+2135..U+2138) are used in ordinary mathematical text and do not share
the strong right-to-left directionality of the Hebrew letters from which they are derived.
Styles. The letterlike symbols include some of the few instances in which the Unicode Stan-
dard encodes stylistic variants of letters as distinct characters. For example, there are
instances of blackletter (Fraktur), double-struck, italic, and script styles for certain Latin
letters used as mathematical symbols. The choice of these stylistic variants for encoding
reflects their common use as distinct symbols. They form part of the larger set of mathe-
matical alphanumeric symbols. For the complete set and more information on its use, see
“Mathematical Alphanumeric Symbols” in this section. These symbols should not be used
in ordinary, nonscientific texts.
Despite its name, U+2118 script capital p is neither script nor capital—it is uniquely the
Weierstrass elliptic function symbol derived from a calligraphic lowercase p. U+2113
script small l is derived from a special italic form of the lowercase letter l and, when it
occurs in mathematical notation, is known as the symbol ell. Use U+1D4C1 mathemati-
cal script small l as the lowercase script l for mathematical notation.
Standards. The Unicode Standard encodes letterlike symbols from many different
national standards and corporate collections.
tion as in ordinary text. Markup not only provides the necessary scoping in these cases, but
also allows the use of a more extended alphabet.
Mathematical Alphabets
Basic Set of Alphanumeric Characters. Mathematical notation uses a basic set of mathe-
matical alphanumeric characters, which consists of the following:
• The set of basic Latin digits (0–9) (U+0030..U+0039)
• The set of basic uppercase and lowercase Latin letters (a–z, A–Z)
• The uppercase Greek letters Α–Ω (U+0391..U+03A9), plus the nabla ∇ (U+2207) and the variant of theta ϴ given by U+03F4
• The lowercase Greek letters α–ω (U+03B1..U+03C9), plus the partial differential sign ∂ (U+2202), and the six glyph variants ϵ, ϑ, ϰ, ϕ, ϱ, and ϖ, given by U+03F5, U+03D1, U+03F0, U+03D5, U+03F1, and U+03D6, respectively
Only unaccented forms of the letters are used for mathematical notation, because general
accents such as the acute accent would interfere with common mathematical diacritics.
Examples of common mathematical diacritics that can interfere with general accents are
the circumflex, macron, or the single or double dot above, the latter two of which are used
in physics to denote derivatives with respect to the time variable. Mathematical symbols
with diacritics are always represented by combining character sequences.
For some characters in the basic set of Greek characters, two variants of the same character
are included. This is because they can appear in the same mathematical document with dif-
ferent meanings, even though they would have the same meaning in Greek text. (See “Vari-
ant Letterforms” in Section 7.2, Greek.)
Additional Characters. In addition to this basic set, mathematical notation uses the upper-
case and lowercase digamma, in regular (U+03DC and U+03DD) and bold (U+1D7CA
and U+1D7CB), and the four Hebrew-derived characters (U+2135..U+2138). Occasional
uses of other alphabetic and numeric characters are known. Examples include U+0428
cyrillic capital letter sha, U+306E hiragana letter no, and Eastern Arabic-Indic
digits (U+06F0..U+06F9). However, these characters are used only in their basic forms,
rather than in multiple mathematical styles.
Dotless Characters. In the Unicode Standard, the characters “i” and “j”, including their
variations in the mathematical alphabets, have the Soft_Dotted property. Any conformant
renderer will remove the dot when the character is followed by a nonspacing combining
mark above. Therefore, using an individual mathematical italic i or j with math accents
would result in the intended display. However, in mathematical equations an entire sub-
expression can be placed underneath a math accent—for example, when a “wide hat” is
placed on top of i+j, as shown in Figure 22-3.
In such a situation, a renderer can no longer rely simply on the presence of an adjacent
combining character to substitute for the un-dotted glyph, and whether the dots should be
removed in such a situation is no longer predictable. Authors differ in whether they expect
the dotted or dotless forms in that case.
In some documents mathematical italic dotless i or j is used explicitly without any combin-
ing marks, or even in contrast to the dotted versions. Therefore, the Unicode Standard pro-
vides the explicitly dotless characters U+1D6A4 mathematical italic small dotless i
and U+1D6A5 mathematical italic small dotless j. These two characters map to the
ISOAMSO entities imath and jmath or the TeX macros \imath and \jmath. These entities
are, by default, always italic. The appearance of these two characters in the code charts is
similar to the shapes of the entities documented in the ISO 9573-13 entity sets and used by
TeX. The mathematical dotless characters do not have case mappings.
Semantic Distinctions. Mathematical notation requires a number of Latin and Greek
alphabets that initially appear to be mere font variations of one another. The letter H can
appear as plain or upright (H), bold (H), italic (H), as well as script, Fraktur, and other
styles. However, in any given document, these characters have distinct, and usually unre-
lated, mathematical semantics. For example, a normal H represents a different variable
from a bold H, and so on. If these attributes are dropped in plain text, the distinctions are
lost and the meaning of the text is altered. Without the distinctions, the well-known Ham-
iltonian formula turns into the integral equation in the variable H as shown in Figure 22-4.
Mathematicians will object that a properly formatted integral equation requires all the let-
ters in this example (except for the “d”) to be in italics. However, because the distinction
between ℋ and H has been lost, they would recognize it as a fallback representation of an
integral equation, and not as a fallback representation of the Hamiltonian. By encoding a
separate set of alphabets, it is possible to preserve such distinctions in plain text.
Mathematical Alphabets. The alphanumeric symbols are listed in Table 22-2.
The math styles in Table 22-2 represent those encountered in mathematical use. The plain
letters have been unified with the existing characters in the Basic Latin and Greek blocks.
There are 24 double-struck, italic, Fraktur, and script characters that already exist in the
Letterlike Symbols block (U+2100..U+214F). These are explicitly unified with the charac-
ters in this block, and corresponding holes have been left in the mathematical alphabets.
The alphabets in this block encode only the semantic distinction, not which specific font will be used to supply the actual plain, script, Fraktur, double-struck, sans-serif, or monospace glyphs. The script and double-struck styles, in particular, can show considerable variation across fonts. Characters from the Mathematical Alphanumeric Symbols block are not
to be used for nonmathematical styled text.
Compatibility Decompositions. All mathematical alphanumeric symbols have compatibil-
ity decompositions to the base Latin and Greek letters. This does not imply that the use of
these characters is discouraged for mathematical use. Folding away such distinctions by
applying the compatibility mappings is usually not desirable, as it loses the semantic dis-
tinctions for which these characters were encoded. See Unicode Standard Annex #15,
“Unicode Normalization Forms.”
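The effect of such folding can be seen with Python's unicodedata module, which reflects the Unicode Character Database (an illustrative sketch, not part of this standard):

    import unicodedata

    # NFKC folds a mathematical alphanumeric symbol to its base letter,
    # losing the semantic distinction discussed above.
    script_h = "\u210B"   # SCRIPT CAPITAL H, used for example for the Hamiltonian
    print(unicodedata.normalize("NFKC", script_h))          # H
    print(unicodedata.normalize("NFKC", script_h) == "H")   # True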
the lowercase italic letter z because it clashes with subscripts. In common text fonts, the
italic letter v and Greek letter nu are not very distinct. A rounded italic letter v is therefore
preferred in a mathematical font. There are other characters that sometimes have similar
shapes and require special attention to avoid ambiguity. Examples are shown in
Figure 22-5.
Figure 22-5 contrasts such easily confused pairs: italic a and alpha, pointed italic v and nu, rounded italic v and upsilon, script X and chi, and plain Y and Upsilon.
Hard-to-Distinguish Letters. Not all sans-serif fonts allow an easy distinction between
lowercase l and uppercase I, and not all monospaced (monowidth) fonts allow a distinction
between the letter l and the digit one. Such fonts are not usable for mathematics. In Fraktur,
the letters I and J, in particular, must be made distinguishable. Overburdened blackletter
forms are inappropriate for mathematical notation. Similarly, the digit zero must be dis-
tinct from the uppercase letter O for all mathematical alphanumeric sets. Some characters
are so similar that even mathematical fonts do not attempt to provide distinct glyphs for
them. Their use is normally avoided in mathematical notation unless no confusion is pos-
sible in a given context—for example, uppercase A and uppercase Alpha.
Font Support for Combining Diacritics. Mathematical equations require that characters
be combined with diacritics (dots, tilde, circumflex, or arrows above are common), as well
as followed or preceded by superscripted or subscripted letters or numbers. This require-
ment leads to designs for italic styles that are less inclined and script styles that have smaller
overhangs and less slant than equivalent styles commonly used for text such as wedding
invitations.
Type Style for Script Characters. In some instances, a deliberate unification with a non-
mathematical symbol has been undertaken; for example, U+2133 is unified with the pre-
1949 symbol for the German currency unit Mark. This unification restricts the range of
glyphs that can be used for this character in the charts. Therefore the font used for the rep-
resentative glyphs in the code charts is based on a simplified “English Script” style, as per
recommendation by the American Mathematical Society. For consistency, other script
characters in the Letterlike Symbols block are now shown in the same type style.
Double-Struck Characters. The double-struck glyphs shown in earlier editions of the stan-
dard attempted to match the design used for all the other Latin characters in the standard,
which is based on Times. The current set of fonts was prepared in consultation with the
American Mathematical Society and leading mathematical publishers; it shows much sim-
pler forms that are derived from the forms written on a blackboard. However, both serifed
and non-serifed forms can be used in mathematical texts, and inline fonts are found in
works published by certain publishers.
22.3 Numerals
Many characters in the Unicode Standard are used to represent numbers or numeric
expressions. Some characters are used exclusively in a numeric context; other characters
can be used both as letters and numerically, depending on context. The notational systems
for numbers are equally varied. They range from the familiar decimal notation to non-dec-
imal systems, such as Roman numerals.
Encoding Principles. The Unicode Standard encodes sets of digit characters (or non-digit
characters, as appropriate) for each script which has significantly distinct forms for numer-
als. As in the case of encoding of letters (and other units) for writing systems, the emphasis
is on encoding the units of the written forms for numeric systems.
Sets of digits which differ by mathematical style are separately encoded, for use in mathe-
matics. Such mathematically styled digits may carry distinct semantics which is maintained
as a plain text distinction in the representation of mathematical expressions. This treat-
ment of styled digits for mathematics parallels the treatment of styled alphabets for mathe-
matics. See “Mathematical Alphabets” in Section 22.2, Letterlike Symbols.
Other font face distinctions for digits which do not have mathematical significance, such as
the use of old style digits in running text, are not separately encoded. Other glyphic varia-
tions in digits and numeric characters are likewise not separately encoded. There are a few
documented exceptions to this general rule. See “Glyph Variants of Decimal Digits” later in
this section.
Decimal Digits
A decimal digit is a digit that is used in decimal (radix 10) place value notation. The most
widely used decimal digits are the European digits, encoded in the range from U+0030
digit zero to U+0039 digit nine. Because of their early encoding history, these digits are
also commonly known as ASCII digits. They are also known as Western digits or Latin digits.
The European digits are used with a large variety of writing systems, including those whose
own number systems are not decimal radix systems.
Many scripts also have their own decimal digits, which are separately encoded. Examples
are the digits used with the Arabic script or those of the Indic scripts. Table 22-3 lists scripts
for which separate decimal digits are encoded, together with the section in the Unicode
Standard which describes that script. The scripts marked with an asterisk (Arabic, Myan-
mar, and Tai Tham) have two or more sets of digits.
In the Unicode Standard, a character is formally classified as a decimal digit if it meets the
conditions set out in “Decimal Digits” in Section 4.6, Numeric Value and has been assigned
the property Numeric_Type = Decimal. The Numeric_Type property can be used to get
the complete list of all decimal digits for any version of the Unicode Standard. (See Deriv-
edNumericType.txt in the Unicode Character Database.)
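In Python, for example, the unicodedata module (generated from the Unicode Character Database) exposes this classification; the following sketch is illustrative only.

    import unicodedata

    # General_Category Nd marks the decimal digits; unicodedata.decimal()
    # returns the digit value defined in the UCD.
    for ch in "7\u0667\u096D\u0E57":   # ASCII, Arabic-Indic, Devanagari, Thai seven
        print("U+%04X" % ord(ch), unicodedata.category(ch), unicodedata.decimal(ch))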
When characters classified as decimal digits are used in sequences to represent decimal
radix numerals, they are always stored most significant digit first. This convention includes
decimal digits associated with scripts whose predominant layout direction is right-to-left.
The visual layout of decimal radix numerals in bidirectional contexts depends on the inter-
action of their Bidi_Class values with the Unicode Bidirectional Algorithm (UBA). In
many cases, decimal digits share the same strong Bidi_Class values with the letters of their
script (“L” or “R”). A few common-use decimal digits, such as the ASCII digits and the Ara-
bic script digits have special Bidi_Class values that interact with dedicated rules for resolv-
ing the direction of numbers in the UBA. (See Unicode Standard Annex #9, “Unicode
Bidirectional Algorithm.”)
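Because of this logical ordering, a numeric value can be recovered with a single left-to-right pass over the backing store, independent of script and of how the bidirectional algorithm displays the digits. A minimal Python sketch (the function name is hypothetical):

    import unicodedata

    def parse_decimal(text: str) -> int:
        # Digits are stored most significant first, so accumulate left to right.
        value = 0
        for ch in text:
            value = value * 10 + unicodedata.decimal(ch)
        return value

    print(parse_decimal("\u0661\u0662\u0663"))   # Arabic-Indic digits -> 123
    print(parse_decimal("\u0967\u0968\u0969"))   # Devanagari digits -> 123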
The Unicode Standard does not specify which sets of decimal digits can or should be used
with any particular writing system, language, or locale. However, the Unicode Common Locale Data Repository (CLDR) does provide information about
which set or sets of digits are used with particular locales defined in CLDR. Numeral sys-
tems for a given locale require additional information, such as the appropriate decimal and
grouping separators, the type of digit grouping used, and so on; that information is also
supplied in CLDR.
Exceptions. There are several scripts with exceptional encodings for characters that are
used as decimal digits. For the Arabic script, there are two sets of decimal digits encoded
which have somewhat different glyphs and different directional properties. See “Arabic-
Indic Digits” in Section 9.2, Arabic for a discussion of these two sets and their use in Arabic
text. For the Myanmar script a second set of digits is encoded for the Shan language, and a
third set of digits is encoded for the Tai Laing language. The Tai Tham script also has two
sets of digits, which are used in different contexts.
CJK Ideographs Used as Decimal Digits. The CJK ideographs listed in Table 4-5, with
numeric values in the range one through nine, can be used in decimal notations (with 0
represented by U+3007 ideographic number zero). These ideographic digits are not
coded in a contiguous sequence, nor do they occur in numeric order. Unlike other script-
specific digits, they are not uniquely used as decimal digits. The same characters may be
used in the traditional Chinese system for writing numbers, which is not a decimal radix
system, but which instead uses numeric symbols for tens, hundreds, thousands, ten thou-
sands, and so forth. See Figure 22-6, which illustrates two different ways the number 1,234
can be written with CJK ideographs.
(Figure 22-6 shows 1,234 written both positionally, as 一二三四, and in the traditional counter-based form, as 一千二百三十四.)
CJK numeric ideographs are also used in word compounds which are not interpreted as
numbers. Parsing CJK ideographs as decimal numbers therefore requires information
about the context of their use.
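As an illustration (a sketch only, not a complete parser), the purely positional reading can be computed directly from the digit values of the ideographs, whereas the traditional counter-based reading requires the additional context just described.

    # Values of the CJK numeric ideographs used below (an excerpt of
    # Table 4-5, plus U+3007 ideographic number zero).
    CJK_DIGITS = {"〇": 0, "一": 1, "二": 2, "三": 3, "四": 4,
                  "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}

    def cjk_positional(text: str) -> int:
        # Positional (decimal) reading, e.g. 一二三四 -> 1234.
        value = 0
        for ch in text:
            value = value * 10 + CJK_DIGITS[ch]
        return value

    print(cjk_positional("一二三四"))   # 1234
    print(cjk_positional("一〇二"))     # 102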
Other Digits
Hexadecimal Digits. Conventionally, the letters “A” through “F”, or their lowercase equiv-
alents, are used with the ASCII decimal digits to form a set of hexadecimal digits. These
characters have been assigned the Hex_Digit property. Although overlapping the letters
and digits this way is not ideal from the point of view of numerical parsing, the practice is
long standing; nothing would be gained by encoding a new, parallel, separate set of hexa-
decimal digits.
Compatibility Digits. There are several sets of compatibility digits in the Unicode Stan-
dard. Table 22-4 provides a full list of compatibility digits.
The fullwidth digits are simply wide presentation forms of ASCII digits, occurring in East
Asian typographical contexts. They have compatibility decompositions to ASCII digits,
have Numeric_Type = Decimal, and should be processed as regular decimal digits.
The various mathematically styled digits in the range U+1D7CE..U+1D7F5 are specifically
intended for mathematical use. They also have compatibility decompositions to ASCII dig-
its and meet the criteria for Numeric_Type = Decimal. Although they may have particular
mathematical meanings attached to them, in most cases it would be safe for generic parsers
to simply treat them as additional sets of decimal digits.
Parsing of Superscript and Subscript Digits. In the Unicode Character Database, super-
script and subscript digits have not been given the General_Category property value Deci-
mal_Number (gc = Nd); correspondingly, they have the Numeric_Type property value
Digit, rather than Decimal. This is to prevent superscripted expressions like 2³ from being
interpreted as 23 by simplistic parsers. More sophisticated numeric parsers, such as general
mathematical expression parsers, should correctly identify these compatibility superscript
and subscript characters as digits and interpret them appropriately. Note that the compat-
ibility superscript digits are not encoded in a single, contiguous range.
For mathematical notation, the use of superscript or subscript styling of ASCII digits is
preferred over the use of compatibility superscript or subscript digits. See Unicode Techni-
cal Report #25, “Unicode Support for Mathematics,” for more discussion of this topic.
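The distinction is visible in the character properties; the following Python sketch (illustrative only) contrasts the plain, superscript, and subscript forms of the digit three:

    import unicodedata

    # The compatibility superscript and subscript digits are not Decimal,
    # so a parser keyed on unicodedata.decimal() will not run 2³ into 23.
    for ch in "3\u00B3\u2083":   # DIGIT THREE, SUPERSCRIPT THREE, SUBSCRIPT THREE
        print("U+%04X" % ord(ch),
              unicodedata.category(ch),        # Nd, No, No
              unicodedata.decimal(ch, None),   # 3, None, None
              unicodedata.digit(ch, None))     # 3, 3, 3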
Numeric Bullets. The other sets of compatibility digits listed in Table 22-4 are typically
derived from East Asian legacy character sets, where their most common use is as num-
bered text bullets. Most occur as part of sets which extend beyond the value 9 up to 10, 12,
or even 50. Most are also defective as sets of digits because they lack a value for 0. None is
given the Numeric_Type of Decimal. Only the basic set of simple circled digits is given
compatibility decompositions to ASCII digits. The rest either have compatibility decompo-
sitions to digits plus punctuation marks or have no decompositions at all. Effectively, all of
these numeric bullets should be treated as dingbat symbols with numbers printed on them;
they should not be parsed as representations of numerals.
Glyph Variants of Decimal Digits. Some variations of decimal digits are considered glyph
variants and are not separately encoded. These include the old style variants of digits, as
shown in Figure 22-7. Glyph variants of the digit zero with a centered dot or a diagonal
slash to distinguish it from the uppercase letter “O”, or of the digit seven with a horizontal
bar to distinguish it from handwritten forms for the digit one, are likewise not separately
encoded.
In a few cases, such as for a small number of mathematical symbols, there may be a strong
rationale for the unambiguous representation of a certain glyph variant of a decimal digit.
In particular, the glyph variant of the digit zero with a short diagonal stroke can be
unambiguously represented with the standardized variation sequence <U+0030,
U+FE00>.
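In encoded text, this variation sequence is simply the digit followed by U+FE00 variation selector-1, as in this Python sketch (illustrative only):

    # The standardized variation sequence for a zero with a short diagonal
    # stroke: U+0030 followed by U+FE00 VARIATION SELECTOR-1.
    slashed_zero = "\u0030\uFE00"
    print(["U+%04X" % ord(c) for c in slashed_zero])   # ['U+0030', 'U+FE00']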
Significant regional glyph variants for the Eastern-Arabic Digits U+06F0..U+06F9 also
occur, but are not separately encoded. See Table 9-2 for illustrations of those variants.
Accounting Numbers. Accounting numbers are variant forms of digits or other numbers
designed to deter fraud. They are used in accounting systems or on various financial
instruments such as checks. These numbers often take shapes which cannot be confused
with other digits or letters, and which are difficult to convert into another digit or number
by adding on to the written form. When such numbers are clearly distinct characters, as
opposed to merely glyph variants, they are separately encoded in the standard. The use of
accounting numbers is particularly widespread in Chinese and Japanese, because the Han
ideographs for one, two, and three have simple shapes that are easy to convert into other
numbers by forgery. See Table 4-6, for a list of the most common alternate ideographs used
as accounting numbers for the traditional Chinese numbering system.
Characters for accounting numbers are occasionally encoded separately for other scripts as
well. For example, U+19DA new tai lue tham digit one is an accounting form for the
digit one which cannot be confused with the vowel sign -aa and which cannot easily be
converted into the digit for three.
Ethiopic Numerals. The Ethiopic script contains digits and other numbers for a traditional
number system which is not a decimal place-value notation. This traditional system does
not use a zero. It is further described in Section 19.1, Ethiopic.
Mende Kikakui Numerals. The Mende Kikakui script has a unique set of numerals, consti-
tuting a set of digits one through nine, used with a set of multiplier subscripts for powers of
ten from 10 through 1,000,000. For more details on the structure of this numeral system,
including examples, see Section 19.8, Mende Kikakui.
Medefaidrin Numerals. The numerals used with the Medefaidrin script (see Section 19.10,
Medefaidrin) constitute a novel, vigesimal radix system, with “digits” in the range 0 to 19.
The Medefaidrin script is used only by a small community for religious purposes, so little is
known about the practical use of these numerals.
Mayan Numerals. Mayan writing used a set of vigesimal numerals, including a sign for
zero. These signs are very well-known from Mayan calendrical inscriptions. They are strik-
ing in form, consisting of a series of horizontal bars with varying numbers of large dots
above the bars, and so are easy to spot in inscriptions, amidst all the other hieroglyphic
signs based on heads, animals, and so forth. The Mayan numerals are so well known, in
fact, that they have gained a degree of modern re-use, appearing, for example, in page num-
bering of small documents published in Guatemala or the Yucatan. To accommodate such
modern use of Mayan numerals, the full set has been encoded in the range
U+1D2E0..U+1D2F3 in a dedicated Mayan Numerals block.
Until the analysis and encoding of the complex Mayan hieroglyphic script can be com-
pleted, these Mayan numerals stand by themselves. They are not given a Mayan Script
property value, but are instead just treated as numeric symbols with the Script property
Common.
Cuneiform Numerals. Sumero-Akkadian numerals were used for sexagesimal systems.
There was no symbol for zero, but by Babylonian times, a place value system was in use.
Thus the exact value of a digit depended on its position in a number. There was also ambi-
guity in numerical representation, because a symbol such as U+12079 cuneiform sign
dish could represent either 1 or 1 × 60 or 1 × (60 × 60), depending on the context. A
numerical expression might also be interpreted as a sexagesimal fraction. So the sequence
<1, 10, 5> might be evaluated as 1 × 60 + 10 + 5 = 75 or 1 × 60 × 60 + 10 + 5 = 3615 or 1 +
(10 + 5)/60 = 1.25. Many other complications arise in Cuneiform numeral systems, and
they clearly require special processing distinct from that used for modern decimal radix
systems. For more information, see Section 11.1, Sumero-Akkadian.
Other Ancient Numeral Systems. A number of other ancient numeral systems have char-
acters encoded for them. Many of these ancient systems are variations on tallying systems.
In numerous cases, the data regarding ancient systems and their use is incomplete, because
of the fragmentary nature of the ancient text corpuses. Characters for numbers are
encoded, however, to enable complete representation of the text which does exist.
Ancient Aegean numbers were used with the Linear A and Linear B scripts, as well as the
Cypriot syllabary. They are described in Section 8.2, Linear B.
Many of the ancient Semitic scripts had very similar numeral systems which used tally-
shaped numbers for one, two, and three, and which then grouped those, along with some
signs for tens and hundreds, to form larger numbers. See the discussion of these systems in
Section 10.3, Phoenician and, in particular, the discussion with examples of number forma-
tion in Section 10.4, Imperial Aramaic.
Greek Numerals. The ancient Greeks used a set of acrophonic numerals, also known as
Attic numerals. These are represented in the Unicode Standard using capital Greek letters.
A number of extensions for the Greek acrophonic numerals, which combine letterforms in
odd ways, or which represent local regional variants, are separately encoded in the Ancient
Greek Numbers block, U+10140..U+1018A.
Greek also has an alphabetic numeral system, called Milesian or Alexandrian numerals.
These use the first third of the Greek alphabet to represent 1 through 9, the middle third
for 10 through 90, and the last third for 100 through 900. U+0374 greek numeral sign
(the dexia keraia) marks letters as having numeric values in modern typography. U+0375
greek lower numeral sign (the aristeri keraia) is placed on the left side of a letter to
indicate a value in the thousands.
In Byzantine and other Greek manuscript traditions, numbers were often indicated by a
horizontal line drawn above the letters being used as numbers. The Coptic script uses sim-
ilar conventions. See Section 7.3, Coptic.
Ordinary Coptic numbers are often distinguished from Coptic letters by marking them
with a line above. (See Section 7.3, Coptic.) A visually similar convention is also seen for
Coptic epact numbers, where an entire numeric sequence may be marked with a wavy line
above. This mark is represented by U+0605 arabic number mark above. As when used
with Arabic digits, arabic number mark above precedes the sequence of Coptic epact
numbers in the underlying representation, and is rendered across the top of the entire
sequence for display.
Ottoman Siyaq. The Ottoman, or Turkish, Siyaq numbers are encoded in the Ottoman
Siyaq Numbers block (U+1ED00..U+1ED4F). These are also known as Siyakat numbers.
The system contains several alternate forms for numbers, which may be historical reten-
tions. These alternate forms are encoded as distinct characters for the numbers two
through ten and for a few other numbers of higher orders.
The Ottoman Siyaq system includes a specialized multiplier character, U+1ED2E otto-
man siyaq marratan (from the Arabic word marratan, “multiplier”). The multiplier is
used in combination with one hundred and one thousand for expressing the millions and
larger orders.
Ottoman Siyaq also uses a number of fractions. These fractions may be written in sequence
after the number, or may be rendered beneath the number. Because of their distinctive
shapes, two of the fractions are encoded as separate numeric symbols: U+1ED3C otto-
man siyaq fraction one half and U+1ED3D ottoman siyaq fraction one sixth.
In some Ottoman Siyaq sources, a baseline dot indicates the end of a numerical sequence,
and is placed after the last number. The dot can be represented either by U+002E full
stop or U+06D4 arabic full stop, depending on the desired shape of the numerical ter-
minator.
Indic Siyaq. The Indic Siyaq tradition is known in India and other parts of South Asia as
raqm or rakam, from the Arabic word raqm, meaning “account.” Indic Siyaq is encoded in
the Indic Siyaq Numbers block (U+1EC70..U+1ECBF). Like other Siyaq traditions, Indic
Siyaq uses stylized monograms of the Arabic names for numbers, but the numbers for large
decimal orders are derived from words of Indic languages. The period during which Siyaq
was introduced in India is difficult to determine. The system was in common use under the
Mughals by the 17th century, and remained in use into the middle of the 20th century.
There are two major styles of Siyaq used in India: the northern style and the “Deccani” or
southern style. In general, the number forms and notation system of the two are identical.
Minor points of difference lie in the orthography for the thousands, ten thousands, and
lakhs.
The Indic Siyaq numbers are generally used within an Arabic script environment and
within Urdu and Persian linguistic contexts. They may also occur in multilingual environ-
ments alongside other scripts. Arabic-Indic digits occasionally occur within Siyaq
sequences, particularly for the representation of small currency units.
CJK Numerals
CJK Ideographic Traditional Numerals. The traditional Chinese system for writing
numerals is not a decimal radix system. It is decimal-based, but uses a series of decimal
counter symbols that function somewhat like tallies. So for example, the representation of
the number 12,346 in the traditional system would be by a sequence of CJK ideographs
with numeric values as follows: <one, ten-thousand, two, thousand, three, hundred, four,
ten, six>. See Table 4-5 for a list of all the CJK ideographs for digits and decimal counters
used in this system. The traditional system is still in widespread use, not only in China and
other countries where Chinese is used, but also in countries whose writing adopted Chi-
nese characters—most notably, in Japan. In both China and Japan the traditional system
now coexists with very common use of the European digits.
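A minimal Python sketch (illustrative only; it assumes well-formed input such as the 12,346 example above, and ignores larger counters and zero-related orthographic rules) shows how such a counter-based numeral can be evaluated. The function and table names are hypothetical.

    # Digit and decimal-counter values for the traditional system.
    DIGITS = {"零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
              "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
    COUNTERS = {"十": 10, "百": 100, "千": 1000,
                "万": 10000, "萬": 10000}

    def parse_traditional(text: str) -> int:
        total = section = current = 0
        for ch in text:
            if ch in COUNTERS:
                unit = COUNTERS[ch]
                if unit == 10000:
                    total += (section + current) * unit
                    section = current = 0
                else:
                    section += (current if current else 1) * unit
                    current = 0
            else:
                current = DIGITS[ch]
        return total + section + current

    print(parse_traditional("一万二千三百四十六"))   # 12346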
Chinese Counting-Rod Numerals. Counting-rod numerals were used in pre-modern East
Asian mathematical texts in conjunction with counting rods used to represent and manip-
ulate numbers. The counting rods were a set of small sticks, several centimeters long that
were arranged in patterns on a gridded counting board. Counting rods and the counting
board provided a flexible system for mathematicians to manipulate numbers, allowing for
considerable sophistication in mathematics.
The specifics of the patterns used to represent various numbers using counting rods varied,
but there are two main constants: Two sets of numbers were used for alternate columns;
one set was used for the ones, hundreds, and ten-thousands columns in the grid, while the
other set was used for the tens and thousands. The shapes used for the counting-rod
numerals in the Unicode Standard follow conventions from the Song dynasty in China,
when traditional Chinese mathematics had reached its peak. Fragmentary material from
many early Han dynasty texts shows different orientation conventions for the numerals,
with horizontal and vertical marks swapped for the digits and tens places.
Zero was indicated by a blank square on the counting board and was either avoided in
written texts or was represented with U+3007 ideographic number zero. (Historically,
U+3007 ideographic number zero originated as a dot; as time passed, it increased in size
until it became the same size as an ideograph. The actual size of U+3007 ideographic
number zero in mathematical texts varies, but this variation should be considered a font
difference.) Written texts could also take advantage of the alternating shapes for the
numerals to avoid having to explicitly represent zero. Thus 6,708 can be distinguished from
678, because the former would be written with the digit six in the tens-style form and seven and eight in the units-style form, whereas the latter would use the units-style form for six and eight and the tens-style form for seven.
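The alternation can be sketched in Python as follows (illustrative only; here zero is written explicitly with U+3007 rather than being omitted, and the function name is hypothetical):

    # Unit-style forms (U+1D360..U+1D368) for the ones, hundreds, and
    # ten-thousands places; tens-style forms (U+1D369..U+1D371) for the
    # tens and thousands places; U+3007 ideographic number zero for zero.
    def counting_rods(n: int) -> str:
        out = []
        for place, digit in enumerate(int(d) for d in reversed(str(n))):
            if digit == 0:
                out.append("\u3007")
            else:
                base = 0x1D360 if place % 2 == 0 else 0x1D369
                out.append(chr(base + digit - 1))
        return "".join(reversed(out))

    print(counting_rods(678))    # unit-form 6, tens-form 7, unit-form 8
    print(counting_rods(6708))   # tens-form 6, unit-form 7, zero, unit-form 8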
Negative numbers were originally indicated on the counting board by using rods of a dif-
ferent color. In written texts, a diagonal slash from lower right to upper left is overlaid
upon the rightmost digit. On occasion, the slash might not be actually overlaid. U+20E5
combining reverse solidus overlay should be used for this negative sign.
The predominant use of counting-rod numerals in texts was as part of diagrams of count-
ing boards. They are, however, occasionally used in other contexts, and they may even
occur within the body of modern texts.
Suzhou-Style Numerals. The Suzhou-style numerals are CJK ideographic number forms
encoded in the CJK Symbols and Punctuation block in the ranges U+3021..U+3029 and
U+3038..U+303A.
The Suzhou-style numerals are modified forms of CJK ideographic numerals that are used
by shopkeepers in China to mark prices. They are also known as “commercial forms,”
“shop units,” or “grass numbers.” They are encoded for compatibility with the CNS 11643-
1992 and Big Five standards. The forms for ten, twenty, and thirty, encoded at
U+3038..U+303A, are also encoded as CJK unified ideographs: U+5341, U+5344, and
U+5345, respectively. (For twenty, see also U+5EFE and U+5EFF.)
These commercial forms of Chinese numerals should be distinguished from the use of
other CJK unified ideographs as accounting numbers to deter fraud. See Table 4-6 in
Section 4.6, Numeric Value, for a list of ideographs used as accounting numbers.
Why are the Suzhou numbers called Hangzhou numerals in the Unicode names? No one
has been able to trace this back. Hangzhou is a district in China that is near the Suzhou dis-
trict, but the name “Hangzhou” does not occur in other sources that discuss these number
forms.
Fractions
The Number Forms block (U+2150..U+218F) contains a series of vulgar fraction charac-
ters, encoded for compatibility with legacy character encoding standards. These characters
are intended to represent both of the common forms of vulgar fractions: forms with a
right-slanted division slash, such as G, as shown in the code charts, and forms with a hori-
zontal division line, such as H, which are considered to be alternative glyphs for the same
fractions, as shown in Figure 22-8. A few other vulgar fraction characters are located in the
Latin-1 block in the range U+00BC..U+00BE.
The unusual fraction character, U+2189 vulgar fraction zero thirds, is in origin a
baseball scoring symbol from the Japanese television standard, ARIB STD B24. For base-
ball scoring, this character and the related fractions, U+2153 vulgar fraction one third
and U+2154 vulgar fraction two thirds, use the glyph form with the slanted division
slash, and do not use the alternate stacked glyph form.
The vulgar fraction characters are given compatibility decompositions using U+2044 “/”
fraction slash. Use of the fraction slash is the more generic way to represent fractions in
text; it can be used to construct fractional number forms that are not included in the collec-
tions of vulgar fraction characters. For more information on the fraction slash, see “Other
Punctuation” in Section 6.2, General Punctuation.
North Indic Fraction Signs. The Common Indic Number Forms block (U+A830..U+A83F) encodes fraction signs used with several North Indic scripts for the values 1/16, 1/8, 3/16, 1/4, 1/2, and 3/4. Fraction values such as 5/16 are written additively by using two of the atomic symbols: 5/16 = 1/4 + 1/
16, and so on. Some regional variation is found in the exact shape of the fraction signs used.
For example, in Kannada, the fraction signs in the U+A833..U+A835 range are displayed
with horizontal bars, instead of bars slanting upward to the right.
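The additive decomposition can be sketched in Python (illustrative only; it operates on the fraction values rather than on the encoded characters, and the names are hypothetical):

    from fractions import Fraction

    # Express a value as a sum of the atomic fraction signs, largest first.
    ATOMS = [Fraction(3, 4), Fraction(1, 2), Fraction(1, 4),
             Fraction(3, 16), Fraction(1, 8), Fraction(1, 16)]

    def additive_parts(value: Fraction) -> list:
        parts = []
        for atom in ATOMS:
            while value >= atom:
                parts.append(atom)
                value -= atom
        return parts

    print(additive_parts(Fraction(5, 16)))   # [Fraction(1, 4), Fraction(1, 16)]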
The signs for the fractions 1/4, 1/2, and 3/4 sometimes take different forms when they are
written independently, without a currency or quantity mark. These independent forms
were used more generally in Maharashtra and Gujarat, and they appear in materials writ-
ten and printed in the Devanagari and Gujarati scripts. The independent fraction signs are
represented by using middle dots to the left and right of the regular fraction signs.
U+A836 north indic quarter mark is used in some regional orthographies to explicitly
indicate fraction signs for 1/4, 1/2, and 3/4 in cases where sequences of other marks could
be ambiguous in reading.
This block also contains several other symbols that are not strictly number forms. They are
used in traditional representation of numeric amounts for currency, weights, and other
measures in the North Indic orthographies which use the fraction signs. U+A837 north
indic placeholder mark is a symbol used in currency representations to indicate the
absence of an intermediate value. U+A839 north indic quantity mark is a unit mark for
various weights and measures.
The North Indic fraction signs are related to fraction signs that have specific forms and are
separately encoded in some North Indic scripts. See, for example, U+09F4 bengali cur-
rency numerator one. Similar forms are attested for the Oriya script.
22.4 Superscript and Subscript Symbols
The Latin letters in the Superscripts and Subscripts block, U+2071 superscript latin small letter i and U+207F superscript latin small letter n, are considered part of the set
of modifier letters; the difference in the naming conventions for them is an historical arti-
fact, and is not intended to convey a functional distinction in the use of those characters in
the Unicode Standard.
There are also a number of superscript or subscript symbols encoded in the Spacing Mod-
ifier Letters block (U+02B0..U+02FF). These symbols also often have the words “modifier
letter” in their names, but are distinguished from most modifier letters by having the Gen-
eral_Category property value Sk. Like most modifier letters, the usual function of these
superscript or subscript symbols is to indicate particular modifications of sound values in
phonetic transcriptional systems. Characters such as U+02C2 modifier letter left
arrowhead or U+02F1 modifier letter low left arrowhead should not be used to
represent normal mathematical relational symbols such as U+003C “<” less-than sign in
superscripted or subscripted expressions.
Finally, a small set of superscripted CJK ideographs, used for the Japanese system of syn-
tactic markup of Classical Chinese text for reading, is located in the Kanbun block
(U+3190..U+319F).
22.5 Mathematical Symbols
Unifications. Mathematical operators such as implies ⇒ and if and only if ↔ have been
unified with the corresponding arrows (U+21D2 rightwards double arrow and
U+2194 left right arrow, respectively) in the Arrows block.
The operator U+2208 element of is occasionally rendered with a taller shape than shown
in the code charts. Mathematical handbooks and standards consulted treat these charac-
ters as variants of the same glyph. U+220A small element of is a distinctively small ver-
sion of the element of that originates in mathematical pi fonts.
The operators U+226B much greater-than and U+226A much less-than are some-
times rendered in a nested shape. The nested shapes are encoded separately as U+2AA2
double nested greater-than and U+2AA1 double nested less-than.
A large class of unifications applies to variants of relation symbols involving negation. Vari-
ants involving vertical or slanted negation slashes and negation slashes of different lengths
are not separately encoded. For example, U+2288 neither a subset of nor equal to is
the archetype for several different glyph variants noted in various collections.
In two instances in this block, essentially stylistic variants are separately encoded: U+2265
greater-than or equal to is distinguished from U+2267 greater-than over equal
to; the same distinction applies to U+2264 less-than or equal to and U+2266 less-
than over equal to. Further instances of the encoding of such stylistic variants can be
found in the supplemental blocks of mathematical operators. The primary reason for such
duplication is for compatibility with existing standards.
Disunifications. A number of mathematical operators have been disunified from related or
similar punctuation characters, as shown in Table 22-5.
In general, the mathematical operators render on the math centerline, rather than the text baseline. Additionally, the angle or
length of the operator counterparts of certain slashes or bars may differ from the corre-
sponding punctuation marks. For certain pairs, such as colon and ratio, there are dis-
tinctions in the behavior of inter-character spacing; ratio is rendered as a relational
operator which takes visible space on both sides, whereas the punctuation mark colon
does not require such additional space in rendering.
The distinction between middle dot and dot operator deserves special consideration.
dot operator is preferred for mathematical use, where it signifies multiplication. This
allows for rendering consistent with other mathematical operators, with unambiguous
character properties and mathematical semantics. middle dot is a legacy punctuation
mark, with multiple uses, and with quite variable layout in different fonts. For the typo-
graphical convention of a raised decimal point, in contexts where simple layout is the prior-
ity and where automated parsing of decimal expressions is not required, middle dot is the
preferred representation.
In cases where there ordinarily is no rendering distinction between a punctuation mark
and its use in mathematics, such as for U+0021 ! exclamation mark used for factorial or
for U+002E full stop used for a baseline decimal point, there is no disunification, and
only a single character has been encoded.
Greek-Derived Symbols. Several mathematical operators derived from Greek characters
have been given separate encodings because they are used differently from the correspond-
ing letters. These operators may occasionally occur in context with Greek-letter variables.
They include U+2206 Δ increment, U+220F ∏ n-ary product, and U+2211 ∑ n-ary
summation. The latter two are large operators that take limits.
Other duplicated Greek characters are those for U+00B5 μ micro sign in the Latin-1 Sup-
plement block, U+2126 Ω ohm sign in Letterlike Symbols, and several characters among
the APL functional symbols in the Miscellaneous Technical block. Most other Greek char-
acters with special mathematical semantics are found in the Greek block because dupli-
cates were not required for compatibility. Additional sets of mathematical-style Greek
alphabets are found in the Mathematical Alphanumeric Symbols block.
N-ary Operators. N-ary operators are distinguished from binary operators by their larger
size and by the fact that in mathematical layout, they take limit expressions.
Invisible Operators. In mathematics, some operators or punctuation are often implied but
not displayed. For a set of invisible operators that can be used to mark these implied oper-
ators in the text, see Section 22.6, Invisible Mathematical Operators.
Minus Sign. U+2212 “−” minus sign is a mathematical operator, to be distinguished from
the ASCII-derived U+002D “-” hyphen-minus, which may look the same as a minus sign
or be shorter in length. (For a complete list of dashes in the Unicode Standard, see
Table 6-3.) U+22EE..U+22F1 are a set of ellipses used in matrix notation. U+2052 “⁒” com-
mercial minus sign is a specialized form of the minus sign. Its use is described in
Section 6.2, General Punctuation.
Delimiters. Many mathematical delimiters are unified with punctuation characters. See
Section 6.2, General Punctuation, for more information. Some of the set of ornamental
brackets in the range U+2768..U+2775 are also used as mathematical delimiters. See
Section 22.9, Miscellaneous Symbols. See also Section 22.7, Technical Symbols, for specialized
characters used for large vertical or horizontal delimiters.
Bidirectional Layout. In a bidirectional context, with the exception of arrows, the glyphs
for mathematical operators and delimiters are adjusted as described in Unicode Standard
Annex #9, “Unicode Bidirectional Algorithm.” See Section 4.7, Bidi Mirrored, and “Paired
Punctuation” in Section 6.2, General Punctuation.
Other Elements of Mathematical Notation. In addition to the symbols in these blocks,
mathematical and scientific notation makes frequent use of arrows, punctuation charac-
ters, letterlike symbols, geometrical shapes, and miscellaneous and technical symbols.
For an extensive discussion of mathematical alphanumeric symbols, see Section 22.2, Let-
terlike Symbols. For additional information on all the mathematical operators and other
symbols, see Unicode Technical Report #25, “Unicode Support for Mathematics.”
Mathematical Brackets. The mathematical white square brackets, angle brackets, and double angle brackets encoded at U+27E6..U+27EB are narrow, for use in mathematical and scientific notation, and should be distinguished from the
corresponding wide forms of white square brackets, angle brackets, and double angle
brackets used in CJK typography. (See the discussion of the CJK Symbols and Punctuation
block in Section 6.2, General Punctuation.) Note especially that the “bra” and “ket” angle
brackets (U+2329 left-pointing angle bracket and U+232A right-pointing angle
bracket, respectively) are deprecated. Their use is strongly discouraged, because of their
canonical equivalence to CJK angle brackets. This canonical equivalence is likely to result
in unintended spacing problems if these characters are used in mathematical formulae.
The flattened parentheses encoded at U+27EE..U+27EF are additional, specifically-styled
mathematical parentheses. Unlike the mathematical and CJK brackets just discussed, the
flattened parentheses do not have corresponding wide CJK versions which they would
need to be contrasted with.
Long Division. U+27CC long division is an operator intended for the representation of
long division expressions, as may be seen in elementary and secondary school mathemati-
cal textbooks, for example. In use and rendering it shares some characteristics with
U+221A square root; in rendering, the top bar may be stretched to extend over the top of
the denominator of the division expression. Full support of such rendering may, however,
require specialized mathematical software.
Fractional Slash and Other Diagonals. U+27CB mathematical rising diagonal and
U+27CD mathematical falling diagonal are limited-use mathematical symbols, to be
distinguished from the more widely used solidi and reverse solidi operators encoded in the
Basic Latin, Mathematical Operators, and Miscellaneous Mathematical Symbols-B blocks.
Their glyphs are invariably drawn at a 45 degree angle, instead of the more upright slants
typical for the solidi operators. The box drawing characters U+2571 and U+2572, whose
glyphs may also be found at a 45 degree angle in some fonts, are not intended to be used as
mathematical symbols. One usage recorded for U+27CB and U+27CD is in the notation
for spaces of double cosets. The former corresponds to the LaTeX entity \diagup and the
latter to \diagdown.
Tiny and Miny. U+29FE tiny and U+29FF miny are symbols used in combinatorial game theory, where tiny represents an infinitesimal positive value and miny an infinitesimal negative value. The glyphs for tiny and miny resemble the plus sign and
minus sign, respectively, but should be shown distinctly, with thickened ends to their bars.
Arrows: U+2190–U+21FF
Arrows are used for a variety of purposes: to imply directional relation, to show logical der-
ivation or implication, and to represent the cursor control keys.
Accordingly, the Unicode Standard includes a fairly extensive set of generic arrow shapes,
especially those for which there are established usages with well-defined semantics. It does
not attempt to encode every possible stylistic variant of arrows separately, especially where
their use is mainly decorative. For most arrow variants, the Unicode Standard provides
encodings in the two horizontal directions, often in the four cardinal directions. For the
single and double arrows, the Unicode Standard provides encodings in eight directions.
Bidirectional Layout. In bidirectional layout, arrows are not automatically mirrored,
because the direction of the arrow could be relative to the text direction or relative to an
absolute direction. Therefore, if text is copied from a left-to-right to a right-to-left context,
or vice versa, the character code for the desired arrow direction in the new context must be
used. For example, it might be necessary to change U+21D2 rightwards double arrow
to U+21D0 leftwards double arrow to maintain the semantics of “implies” in a right-
to-left context. For more information on bidirectional layout, see Unicode Standard Annex
#9, “Unicode Bidirectional Algorithm.”
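A minimal sketch of such a retargeting step in Python (illustrative only; a real implementation would cover the full set of directional arrow pairs, and the names used here are hypothetical):

    # Arrows are not mirrored by the bidi algorithm, so swap code points
    # when moving text into a context of the opposite direction.
    SWAP = {
        "\u2192": "\u2190", "\u2190": "\u2192",   # rightwards/leftwards arrow
        "\u21D2": "\u21D0", "\u21D0": "\u21D2",   # rightwards/leftwards double arrow
    }

    def retarget_arrows(text: str) -> str:
        return "".join(SWAP.get(ch, ch) for ch in text)

    print(retarget_arrows("A \u21D2 B"))   # A ⇐ B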
Standards. The Unicode Standard encodes arrows from many different international and
national standards as well as corporate collections.
Unifications. Arrows expressing mathematical relations have been encoded in the Arrows
block as well as in the supplemental arrows blocks. An example is U+21D2 right-
wards double arrow, which may be used to denote implies. Where available, such usage
information is indicated in the annotations to individual characters in the code charts.
However, because the arrows have such a wide variety of applications, there may be several
semantic values for the same Unicode character value.
Supplemental Arrows
The Supplemental Arrows-A (U+27F0..U+27FF), Supplemental Arrows-B (U+2900..
U+297F), Miscellaneous Symbols and Arrows (U+2B00..U+2BFF), and Supplemental
Arrows-C (U+1F800..U+1F8FF) blocks contain a large repertoire of arrows to supplement
the main set in the Arrows block. Many of the supplemental arrows in the Miscellaneous
Symbols and Arrows block, particularly in the range U+2B30..U+2B4C, are encoded to
ensure the availability of left-right symmetric pairs of less common arrows, for use in bidi-
rectional layout of mathematical text.
Long Arrows. The long arrows encoded in the range U+27F5..U+27FF map to standard
SGML entity sets supported by MathML. Long arrows represent distinct semantics from
their short counterparts, rather than mere stylistic glyph differences. For example, the
shorter forms of arrows are often used in connection with limits, whereas the longer ones
are associated with mappings. The use of the long arrows is so common that they were
assigned entity names in the ISOAMSA entity set, one of the suite of mathematical symbol
entity sets covered by the Unicode Standard.
22.7 Technical Symbols
Crops and Quine Corners. Crops and quine corners are most properly used in two-dimen-
sional layout but may be referred to in plain text. This usage is shown in Figure 22-9.
The bracket, brace, and parenthesis pieces encoded in the Miscellaneous Technical block are compatibility characters. They should not be used in stored mathematical text, although they are often used
in the data stream created by display and print drivers.
Table 22-6 shows which pieces are intended to be used together to create specific symbols.
For example, an instance of U+239B can be positioned relative to instances of U+239C and
U+239D to form an extra-tall (three or more line) left parenthesis. The center sections
encoded here are meant to be used only with the top and bottom pieces encoded adjacent
to them because the segments are usually graphically constructed within the fonts so that
they match perfectly when positioned at the same x coordinates.
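For example, a three-line-tall left parenthesis could be assembled from the pieces named above, as in this Python sketch (illustrative only; actual alignment is a matter for the font and layout engine):

    # U+239B upper hook, U+239C extension, U+239D lower hook: stacked top
    # to bottom they suggest an extra-tall left parenthesis.
    tall_left_paren = "\u239B\n\u239C\n\u239D"
    print(tall_left_paren)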
Decimal Exponent Symbol. U+23E8 decimal exponent symbol is for compatibility with
the Russian standard GOST 10859-64, as well as the paper tape and punch card standard,
Alcor (DIN 66006). It represents a fixed token introducing the exponent of a real number
in scientific notation, comparable to the more common usage of “e” in similar notations:
1.621e5. It was used in the early computer language ALGOL-60, and appeared in some
Soviet-manufactured computers, such as the BESM-6 and its emulators. In the Unicode
Standard it is treated simply as an atomic symbol; it is not considered to be equivalent to a
generic subscripted form of the numeral “10” and is not given a decomposition. The verti-
cal alignment of this symbol is slightly lower than the baseline, as shown in Figure 22-10.
Dental Symbols. The set of symbols from U+23BE to U+23CC form a set of symbols from
JIS X 0213 for use in dental notation.
Metrical Symbols. The symbols in the range U+23D1..U+23D9 are a set of spacing sym-
bols used in the metrical analysis of poetry and lyrics.
Electrotechnical Symbols. The Miscellaneous Technical block also contains a smattering
of electrotechnical symbols. These characters are not intended to constitute a complete
encoding of all symbols used in electrical diagrams, but rather are compatibility characters
encoded primarily for mapping to other standards. The symbols in the range
U+238D..U+2394 are from the character set with the International Registry number 181.
U+23DA earth ground and U+23DB fuse are from HKSCS-2001.
User Interface Symbols. The characters U+231A, U+231B, and U+23E9 through U+23FA
are often found in user interfaces for media players, clocks, alarms, and timers, as well as in
text discussing those user interfaces. The black medium triangles (U+23F4..U+23F7) are
the preferred shapes for User Interface purposes, rather than the similar geometric shapes
located in the Geometric Shapes block: U+25A0..U+25FF. The Miscellaneous Symbols and
Pictographs block also contains many user interface symbols in the ranges
U+1F500..U+1F518, U+1F53A..U+1F53D and U+1F5BF..U+1F5DD, as well as clock face
symbols in the range U+1F550..U+1F567.
Standards. This block contains a large number of symbols from ISO/IEC 9995-7:1994,
Information technology—Keyboard layouts for text and office systems—Part 7: Symbols used
to represent functions.
ISO/IEC 9995-7 contains many symbols that have been unified with existing and closely
related symbols in Unicode. These symbols are shown with their ordinary shapes in the
code charts, not with the particular glyph variation required by conformance to ISO/IEC
9995-7. Implementations wishing to be conformant to ISO/IEC 9995-7 in the depiction of
these symbols should make use of a suitable font.
22.8 Geometrical Symbols
Half-block fill characters are included for each half of a display cell, plus a graduated series
of vertical and horizontal fractional fills based on one-eighth parts. The fractional fills do
not form a logically complete set but are intended only for backward compatibility. There
is also a set of quadrant fill characters, U+2596..U+259F, which are designed to comple-
ment the half-block fill characters and U+2588 full block. When emulating terminal
applications, fonts that implement the block element characters should be designed so that
adjacent glyphs for characters such as U+2588 full block create solid patterns with no
gaps between them.
Standards. The box drawing and block element characters were derived from GB 2312, KS X
1001, a variety of industry standards, and several terminal graphics sets. The Videotex
Mosaic characters, which have similar appearances and functions, are unified against these
sets.
For more details on the use of geometrical shapes in mathematics, see Unicode Technical
Report #25, “Unicode Support for Mathematics.”
Standards. The Geometric Shapes are derived from a large range of national and vendor
character standards. The squares and triangles at U+25E7..U+25EE are derived from the
Linotype font collection. U+25EF large circle is included for compatibility with the JIS X
0208-1990 Japanese standard.
This block also contains a set of colored circles and squares in the range
U+1F7E0..U+1F7EB. Those colored circles and squares are intended for use with emoji, to
augment the colored circles and other colored sets for emoji. Table 22-7 shows these sets,
including white and black circles and squares, and red and blue circles from other blocks.
Those sets are listed in the order: white, black, red, blue, orange, yellow, green, purple,
brown. Unlike emoji modifiers for skin tone (see Unicode Technical Standard #51, “Uni-
code Emoji”), the symbols for colored circles and squares are simply graphical symbols
which may convey the concepts of colors, but with no immediate implications for render-
ing of glyphs with those particular colors. For example, a user could specify a yellow circle
symbol together with a ribbon emoji symbol to convey the notion of a “yellow ribbon,” but
there would be no expectation that the font would combine the two characters and draw an
actual yellow ribbon. These colored circles and squares are often used decoratively in emoji
text, with no other semantic intent.
22.9 Miscellaneous Symbols
Characters presented in an emoji context may differ from other symbols in several aspects of their presentation:
• Glyph shape: Emoji symbols may have a great deal of flexibility in the choice of
glyph shape used to render them.
• Color: Many characters in an emoji context (such as cell phone e-mail or text
messages) are displayed in color, sometimes as a multicolor image. While this
is particularly true of emoji symbols, there are other cases where non-emoji
symbols, such as game symbols, may be displayed in color.
• Animation: Some characters in an emoji context are presented in animated
form, usually as a repeating sequence of two to four images.
Emoji symbols may be presented using color or animation, but need not be. Because many
characters in the carrier emoji sets or other sources are unified with Unicode characters
that originally came from other sources, it may not always be clear whether a character
should be presented using an emoji style. However, for most such characters, variation
sequences have been defined which can specify text or emoji presentation. Unicode Tech-
nical Standard #51, “Unicode Emoji,” provides some guidance about which characters
should have which presentation style in various environments.
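In encoded text, such a presentation request is simply the base character followed by a variation selector, as in this Python sketch (illustrative only, using a character that has both presentations defined):

    heart = "\u2764"            # HEAVY BLACK HEART
    print(heart + "\uFE0E")     # with VS15: text presentation requested
    print(heart + "\uFE0F")     # with VS16: emoji presentation requested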
Color Words in Unicode Character Names. The representative glyph shown in the code
charts for a character is always monochrome. The character name may include a term such
as black or white, or in the case of characters from the carrier emoji sets, other color
terms such as blue or orange. Neither the monochrome nature of the representative
glyph nor any color term in the character name are meant to imply any requirement or
limitation on how the glyph may be presented (see also “Images in the Code Charts and
Character Lists” in Section 24.1, Character Names List). The use of black or white in
names such as black medium square or white medium square is generally intended to
contrast filled versus outline shapes, or a darker color fill versus a lighter color fill; it is not
intended to suggest that the character must be presented in black or white, respectively.
Similarly, the color terms in names such as blue heart or orange book are intended only
to help identify the corresponding characters in the carrier emoji sets; the characters may
be presented using color, or in monochrome using different styles of shading or cross-
hatching, for example.
symbols, the Yijing (I Ching) trigrams, planet and zodiacal symbols, game symbols, musi-
cal dingbats, and recycling symbols. (For other moon phases, see the circle-based shapes in
the Geometric Shapes block.)
Standards. The symbols in these blocks are derived from a large range of national and ven-
dor character standards. Among them, characters from the Japanese Association of Radio
Industries and Business (ARIB) standard STD-B24 are widely represented in the Miscella-
neous Symbols block. The symbols from ARIB were initially used in the context of digital
broadcasting, but in many cases their usage has evolved to more generic purposes. The
Miscellaneous Symbols and Pictographs block includes many characters from the carrier
emoji sets and the Wingdings/Webdings collections.
Weather Symbols. The characters in the ranges U+2600..U+2603, U+26C4..U+26CB, and
U+1F321..U+1F32C, as well as U+2614 umbrella with rain drops are weather symbols.
These commonly occur as map symbols or in other contexts related to weather forecasting
in digital broadcasting or on websites.
Moon and Sun Symbols. There are a variety of moon and sun symbols encoded in the Mis-
cellaneous Symbols block (U+2609, U+263C..U+263E) and in the Miscellaneous Symbols
and Pictographs block (U+1F311..U+1F31E). Some of these are used in astrological charts,
while others are merely playful symbols showing faces. Various crescent signs for the moon
do not necessarily represent particular phases of the moon.
The moon symbols in the range U+1F311..U+1F318, in particular, represent a systematic
set of eight symbols for the phases of the moon. These symbols appear, for example, in
moon charts, almanacs, tide tables, and similar documents to represent particular phases
of the moon. There is a notable difference in interpretation of symbols for phases of the
moon between Northern Hemisphere users and Southern Hemisphere users, with the
graphical orientation of waxing and waning phases reversed. So, for example, in the South-
ern Hemisphere, U+1F312 waxing crescent moon symbol would usually be interpreted
as representing the waning crescent moon, instead.
The use of these moon symbols (U+1F311..U+1F318) should follow the shape of the
graphic symbols, as shown in the code charts. Users should not simply assume from the
character names that the symbols are intended to represent astronomical positions of the
moon.
Traffic Signs. In general, traffic signs are quite diverse, tend to be elaborate in form and
differ significantly between countries and locales. For the most part they are inappropriate
for encoding as characters. However, there are a small number of conventional symbols
which have been used as characters in contexts such as digital broadcasting or mobile
phones. The characters in the ranges U+26CC..U+26CD and U+26CF..U+26E1 are traffic
sign symbols of this sort, encoded for use in digital broadcasting. Additional traffic signs
are included in the Transport and Map Symbols block.
Dictionary and Map Symbols. The characters in the range U+26E8..U+26FF are dictio-
nary and map symbols used in the context of digital broadcasting. Numerous other sym-
bols in this block and scattered in other blocks also have conventional uses as dictionary or
map symbols. For example, these may indicate special uses for words, or indicate types of
buildings, points of interest, particular activities or sports, and so on.
Plastic Bottle Material Code System. The seven numbered logos encoded from U+2673 to
U+2679 (♳ ♴ ♵ ♶ ♷ ♸ ♹) are from “The Plastic Bottle Material Code System,” which was
introduced in 1988 by the Society of the Plastics Industry (SPI). This set consistently uses
thin, two-dimensional curved arrows suitable for use in plastics molding. In actual use, the
symbols often are combined with an abbreviation of the material class below the triangle.
Such abbreviations are not universal; therefore, they are not present in the representative
glyphs in the code charts.
Recycling Symbol for Generic Materials. An unnumbered plastic resin code symbol
U+267A ♺ recycling symbol for generic materials is not formally part of the SPI sys-
tem but is found in many fonts. Occasional use of this symbol as a generic materials code
symbol can be found in the field, usually with a text legend below, but sometimes also sur-
rounding or overlaid by other text or symbols. Sometimes the universal recycling sym-
bol is substituted for the generic symbol in this context.
Universal Recycling Symbol. The Unicode Standard encodes two common glyph variants
of this symbol: U+2672 ♲ universal recycling symbol and U+267B ♻ black univer-
sal recycling symbol. Both are used to indicate that the material is recyclable. The white
form is the traditional version of the symbol, but the black form is sometimes substituted,
presumably because the thin outlines of the white form do not always reproduce well.
Paper Recycling Symbols. The two paper recycling symbols, U+267C ♼ recycled paper
symbol and U+267D ♽ partially-recycled paper symbol, can be used to distinguish
between fully and partially recycled fiber content in paper products or packaging. They are
usually accompanied by additional text.
Gender Symbols. The characters in the range U+26A2..U+26A9 are gender symbols. These
are part of a set with U+2640 female sign, U+2642 male sign, U+26AA medium white
circle, and U+26B2 neuter. They are used in sexual studies and biology, for example.
Some of these symbols have other uses as well, as astrological or alchemical symbols.
Genealogical Symbols. The characters in the range U+26AD..U+26B1 are sometimes seen
in genealogical tables, where they indicate marriage and burial status. They may be aug-
mented by other symbols, including the small circle indicating betrothal.
Game Symbols. The Miscellaneous Symbols block also contains a variety of small symbol
sets intended for the representation of common game symbols or tokens in text. These
include symbols for playing card suits, often seen in manuals for bridge and other card
games, as well as a set of dice symbols. The chess symbols are often seen in figurine alge-
braic notation. In addition, there are symbols for game pieces or notation markers for go,
shogi (Japanese chess), and draughts (checkers).
Larger sets of game symbols are encoded in their own blocks. See the discussion of playing
cards, chess symbols, mahjong tile symbols, and domino tile symbols later in this section.
Animal Symbols. The animal symbol characters in the range U+1F400..U+1F42C are
encoded primarily to cover the emoji sets used by Japanese cell phone carriers. Animal
symbols are widely used in Asia as signs of the zodiac, and that is part of the reason for their
inclusion in the cell phone sets. However, the particular animal symbols seen in Japan and
China are not the only animals used as zodiacal symbols throughout Asia. The set of ani-
mal symbols encoded in this block includes other animal symbols used as zodiacal symbols
in Vietnam, Thailand, Persia, and other Asian countries. These zodiacal uses are specifi-
cally annotated in the Unicode code charts.
Other animal symbols have no zodiacal associations, and are included simply to cover the
carrier emoji sets. A few of the animal symbols have conventional uses to designate types of
meat on menus. Later additions of animal symbols fill perceived gaps in the set, responding
to the wide popularity of animal symbols in Unicode-based emoji implementations.
Cultural Symbols. The five cultural symbols encoded in the range U+1F5FB..U+1F5FF
mostly designate cultural landmarks of particular importance to Japan. They are encoded
for compatibility with emoji sets used by Japanese cell phone carriers, and are not intended
to set a precedent for encoding additional sets of cultural landmarks or other pictographic
cultural symbols as characters.
Hand Symbols. The pictographic symbols for hands encoded at U+1F90F and in the ranges
U+1F918..U+1F91F, U+1F446..U+1F450, and U+1F58E..U+1F5A3, as well as in the
U+270A..U+270D range in the Dingbats block, represent various hand gestures. The inter-
pretations associated with such gestures vary significantly among cultures.
Emoji Modifiers. The emoji modifiers U+1F3FB..U+1F3FF designate five different skin
tones based on the Fitzpatrick scale. These may be displayed in isolation as color or half-
tone swatches, or they may form a ligature with a preceding emoji character representing a
person or body part in order to specify a particular appearance for that character.
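Because the five modifiers are contiguous, checking for them is a simple range test. The following C sketch (the function name and the choice of U+1F44B waving hand sign as a base are illustrative, not part of the standard) shows the test and an emoji modifier sequence stored as code points:

#include <stdbool.h>
#include <stdint.h>

/* True if cp is one of the five emoji modifiers U+1F3FB..U+1F3FF. */
static bool is_emoji_modifier(uint32_t cp) {
    return cp >= 0x1F3FB && cp <= 0x1F3FF;
}

/* An emoji modifier sequence: a base pictograph (here U+1F44B waving
   hand sign) followed by U+1F3FD emoji modifier fitzpatrick type-4.
   A supporting renderer displays the pair as a single modified glyph. */
static const uint32_t waving_hand_type_4[] = { 0x1F44B, 0x1F3FD };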
Miscellaneous Symbols in Other Blocks. In addition to the blocks described in this sec-
tion, which are devoted entirely to sets of miscellaneous symbols, there are many other
blocks which contain small numbers of otherwise uncategorized symbols. See, for example,
the Miscellaneous Symbols and Arrows block U+2B00..U+2B7F and the Enclosed Alpha-
numeric Supplement block U+1F100..U+1F1FF. Some of these blocks contain symbols
which extend or complement sets of symbols contained in the Miscellaneous Symbols
block.
Emoticons: U+1F600–U+1F64F
Emoticons (from “emotion” plus “icon”) originated as a way to convey emotion or attitude
in e-mail messages, using ASCII character combinations such as :-) to indicate a smile—
and by extension, a joke—and :-( to indicate a frown. In East Asia, a number of more elab-
orate sequences have been developed, such as (")(-_-)(") showing an upset face with hands
raised.
Over time, many systems began replacing such sequences with images, and also began pro-
viding a way to input emoticon images directly, such as a menu or palette. The carrier
emoji sets used by Japanese cell phone providers contain a large number of characters for
emoticon images, and most of the characters in this block are from those sets. They are
divided into a set of humanlike faces, a smaller set of cat faces that parallel some of the
humanlike faces, and a set of gesture symbols that combine a human or monkey face with
arm and hand positions.
Several emoticons are also encoded in the Miscellaneous Symbols block at U+2639..
U+263B and in the Supplemental Symbols and Pictographs block at U+1F910..U+1F917
and U+1F920..U+1F927.
Dingbats: U+2700–U+27BF
Most of the characters in the Dingbats block are derived from a well-established set of
glyphs, the ITC Zapf Dingbats series 100, which constitutes the industry standard “Zapf
Dingbat” font currently available in most laser printers. The order of the Dingbats block
basically follows the PostScript encoding. Dingbat characters derived from the Wingdings
and Webdings sets are encoded in other blocks, particularly in the Miscellaneous Symbols
and Pictographs block, U+1F300..U+1F5FF.
Unifications and Additions. Where a dingbat from the ITC Zapf Dingbats series 100 could
be unified with a generic symbol widely used in other contexts, only the generic symbol
was encoded. Examples of such unifications include card suits, black star, black tele-
phone, and black right-pointing index (see the Miscellaneous Symbols block); black
circle and black square (see the Geometric Shapes block); white encircled numbers 1 to
10 (see the Enclosed Alphanumerics block); and several generic arrows (see the Arrows
block). Those four blocks are described elsewhere in this chapter. Other dingbat-like characters,
primarily from the carrier emoji sets, are encoded in the gaps that resulted from this unifi-
cation.
In other instances, glyphs from the ITC Zapf Dingbats series 100 have come
to be recognized as having applicability as generic symbols, despite having originally been
encoded in the Dingbats block. For example, the series of negative (black) circled numbers
1 to 10 are now treated as generic symbols for this sequence, the continuation of which can
be found in the Enclosed Alphanumerics block. Other examples include U+2708 airplane
and U+2709 envelope, which have definite semantics independent of the specific glyph
shape, and which therefore should be considered generic symbols rather than symbols rep-
resenting only the Zapf Dingbats glyph shapes.
For many of the remaining characters in the Dingbats block, their semantic value is pri-
marily their shape; unlike characters that represent letters from a script, there is no well-
established range of typeface variations for a dingbat that will retain its identity and there-
fore its semantics. It would be incorrect to arbitrarily replace U+279D triangle-headed
rightwards arrow with any other right arrow dingbat or with any of the generic arrows
from the Arrows block (U+2190..U+21FF). However, exact shape retention for the glyphs
is not always required to maintain the relevant distinctions. For example, ornamental char-
acters such as U+2741 eight petalled outlined black florette have been successfully
implemented in font faces other than Zapf Dingbats with glyph shapes that are similar, but
not identical to the ITC Zapf Dingbats series 100.
The following guidelines are provided for font developers wishing to support this block of
characters. Characters showing large sets of contrastive glyph shapes in the Dingbats block,
and in particular the various arrow shapes at U+2794..U+27BE, should have glyphs that
are closely modeled on the ITC Zapf Dingbats series 100, which are shown as representa-
tive glyphs in the code charts. The same applies to the various stars, asterisks, snowflakes,
drop-shadowed squares, check marks, and x’s, many of which are ornamental and have
elaborate names describing their glyphs.
Where the preceding guidelines do not apply, or where dingbats have more generic appli-
cability as symbols, their glyphs do not need to match the representative glyphs in the code
charts in every detail.
Ornamental Brackets. The 14 ornamental brackets encoded at U+2768..U+2775 are part
of the set of Zapf Dingbats. Although they have always been included in Zapf Dingbats
fonts, they were unencoded in PostScript versions of the fonts on some platforms. The
Unicode Standard treats these brackets as punctuation characters.
Alchemical Symbols: U+1F700–U+1F77F
European alchemical writing used several parallel systems of symbols while retaining many symbols created by
Greek, Syriac, and medieval Arabic writers. Alchemical works published in what is best
described as a textbook tradition in the seventeenth and eighteenth centuries routinely
included tables of symbols that probably served to spread their use. They became obsolete
as alchemy gave way to chemistry. Nevertheless, alchemical symbols continue to be used
extensively today in scholarly literature, creative works, New Age texts, and in the gaming
and graphics industries.
This block contains a core repertoire of symbols recognized and organized into tables by
European writers working in the alchemical textbook tradition approximately 1620–1720.
This core repertoire includes all symbols found in the vast majority of the alchemical works
of major figures such as Newton, Boyle, and Paracelsus. Some of the most common
alchemical symbols have multiple meanings, and are encoded in the Miscellaneous Sym-
bols block, where their usage as alchemical symbols is annotated. For example, U+2609
sun is also an alchemical symbol for gold.
The character names for the alchemical symbols are in English. Their equivalent Latin
names, which often were in greater currency during the period of greatest use of these sym-
bols, are provided as aliases in the code charts. Some alchemical names in English directly
derive from the Latin name, such as aquafortis and aqua regia, so in a number of cases the
English and Latin names are identical.
Domino Tiles: U+1F030–U+1F09F
Domino tile symbols are used for the “double-six” set of tiles, which is the most common
set of dominoes and the only one widely attested in manuals and textual discussion using
graphical tile symbols.
The domino tile symbols do not represent the domino pieces per se, but instead constitute
graphical symbols for particular orientations of the dominoes, because orientation of the
tiles is significant in discussion of dominoes play. Each visually distinct rotation of a dom-
ino tile is separately encoded. Thus, for example, both U+1F081 domino tile vertical-
04-02 and U+1F04F domino tile horizontal-04-02 are encoded, as well as U+1F075
domino tile vertical-02-04 and U+1F043 domino tile horizontal-02-04. All four of
those symbols represent the same game tile, but each orientation of the tile is visually dis-
tinct and requires its own symbol for text. The digits in the character names for the domino
tile symbols reflect the dot patterns on the tiles.
Two symbols do not represent particular tiles of the double-six set of dominoes, but
instead are graphical symbols for a domino tile turned facedown.
and receiver. Without such information or agreement, someone viewing an online docu-
ment may see substantially different glyphs from what the writer intended.
Basic playing card suit symbols are encoded in the Miscellaneous Symbols block in the
range U+2660..U+2667.
STD B24, and from various East Asian industry standards, such as the Japanese cell phone
carrier emoji sets, or corporate glyph registries.
Allocation. The Unicode Standard includes five blocks allocated for the encoding of vari-
ous enclosed and square symbols. Each of those blocks is described briefly in the text that
follows, to indicate which subsets of these symbols it contains and to highlight any other
special considerations that may apply to each block. In addition, there are a number of cir-
cled digit and number symbols encoded in the Dingbats block (U+2700..U+27BF). Those
circled symbols occur in the ITC Zapf Dingbats series 100, and most of them were encoded
with other Zapf dingbat symbols, rather than being allocated in the separate blocks for
enclosed and square symbols. Finally, a small number of circled symbols from ISO/IEC
8859-1 or other sources can be found in the Latin-1 Supplement block (U+0080..U+00FF)
or the Letterlike Symbols block (U+2100..U+214F).
Decomposition. Nearly all of the enclosed and square symbols in the Unicode Standard are
considered compatibility characters, encoded for interoperability with other character sets.
A significant majority of those are also compatibility decomposable characters, given
explicit compatibility decompositions in the Unicode Character Database. The general
patterns for these decompositions are described here. For full details for any particular one
of these symbols, see the code charts or consult the data files in the UCD.
Parenthesized symbols are decomposed to sequences of opening and closing parentheses
surrounding the letter or digit(s) of the symbol. Square symbols consisting of digit(s) fol-
lowed by a full stop or a comma are decomposed into the digit sequence and the full stop or
comma. Square symbols consisting of stacks of Katakana syllables are decomposed into the
corresponding sequence of Katakana characters and are given the decomposition tag
“<square>”. Similar principles apply to square symbols consisting of sequences of Latin let-
ters and symbols. Chinese telegraphic symbols, consisting of sequences of digits and CJK
ideographs, are given compatibility decompositions, but do not have the decomposition tag
“<square>”.
Circled symbols consisting of a single letter or digit surrounded by a simple circular
graphic element are given compatibility decompositions with the decomposition tag “<cir-
cle>”. Circled symbols with more complex graphic styles, including double circled and
negative circled symbols, are simply treated as atomic symbols, and are not decomposed.
The same pattern is applied to enclosed symbols where the enclosure is a square graphic
element instead of a circle, except that the decomposition tag in those cases is “<square>”.
Occasionally a “circled” symbol that involves a sequence of Latin letters is preferentially
represented with an ellipse surrounding the letters, as for U+1F12E circled wz, the
German Warenzeichen. Such elliptic shape is considered to be a typographical adaptation of
the circle, and does not constitute a distinct decomposition type in the Unicode Standard.
It is important to realize that the decomposition of enclosed symbols in the Unicode Stan-
dard does not make them canonical equivalents to letters or digits in sequence with com-
bining enclosing marks such as U+20DD combining enclosing circle. The
combining enclosing marks are provided in the Unicode Standard to enable the represen-
tation of occasional enclosed symbols not otherwise encoded as characters. There is also
no defined way of indicating the application of a combining enclosing mark to more than
a single base character. Furthermore, full rendering support of the application of enclosing
combining marks, even to single base characters, is not widely available. Hence, in most
instances, if an enclosed symbol is available in the Unicode Standard as a single encoded
character, it is recommended to simply make use of that composed symbol.
Casing. There are special considerations for the casing relationships of enclosed or square
symbols involving letters of the Latin alphabet. The circled letters of the Latin alphabet
come in an uppercase set (U+24B6..U+24CF) and a lowercase set (U+24D0..U+24E9).
Largely because the compatibility decompositions for those symbols are to a single letter
each, these two sets are given the derived properties, Uppercase and Lowercase, respec-
tively, and case map to each other. The superficially similar parenthesized letters of the
Latin alphabet also come in an uppercase set (U+1F110..U+1F129) and a lowercase set
(U+249C..U+24B5), but are not case mapped to each other and are not given derived cas-
ing properties. This difference is in part because the compatibility decompositions for
these parenthesized symbols are to sequences involving parentheses, instead of single let-
ters, and in part because the uppercase set was encoded many years later than the lower-
case set. Square symbols consisting of arbitrary sequences of Latin letters, which
themselves may be of mixed case, are simply treated as caseless symbols in the Unicode
Standard.
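Because the uppercase and lowercase circled-letter sets are parallel ranges, the case mapping between them is a fixed offset of 26 code points. A minimal C sketch of that mapping (the function names are illustrative; a full implementation would use the case mapping data in the UCD):

#include <stdint.h>

/* Map a circled uppercase Latin letter (U+24B6..U+24CF) to its
   circled lowercase counterpart (U+24D0..U+24E9), and vice versa.
   Other code points are returned unchanged. */
static uint32_t circled_to_lowercase(uint32_t cp) {
    return (cp >= 0x24B6 && cp <= 0x24CF) ? cp + 26 : cp;
}

static uint32_t circled_to_uppercase(uint32_t cp) {
    return (cp >= 0x24D0 && cp <= 0x24E9) ? cp - 26 : cp;
}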
The enclosed symbols in the range U+3248..U+324F, which consist of circled numbers ten
through eighty on white circles centered on black squares, are encoded for compatibility
with the Japanese television standard, ARIB STD B24. In that standard, they are intended
to represent symbols for speed limit signs, expressed in kilometers per hour.
regional indicator symbols should be rendered. However, current industry practice widely
interprets pairs of regional indicator symbols as representing a flag associated with the cor-
responding ISO 3166 region code. This practice is detailed in the separate Unicode Tech-
nical Standard #51, “Unicode Emoji.” That specification includes data tables that list
precisely which pairs are interpreted for any given version of UTS #51. Charts are also
available showing representative flag glyphs for these interpreted pairs, displayed as part of
the emoji symbol sets for many mobile platforms.
Conformance to the Unicode Standard does not require conformance to UTS #51. How-
ever, the interpretation and display of pairs of regional indicator symbols as specified in
UTS #51 is now widely deployed, so in practice it is not advisable to attempt to interpret
pairs of regional indicator symbols as representing anything other than an emoji flag.
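Under that interpretation, a pair of regional indicator symbols is derived from a two-letter region code by offsetting each letter from “A” to U+1F1E6 regional indicator symbol letter a. A C sketch of the mapping (the function name is illustrative; whether a given pair is actually displayed as a flag is governed by the UTS #51 data, not by this arithmetic):

#include <stdint.h>

#define REGIONAL_INDICATOR_A 0x1F1E6u  /* regional indicator symbol letter a */

/* Map an uppercase ASCII region code such as "FR" to the pair of
   regional indicator symbols <U+1F1EB, U+1F1F7>. */
static void region_to_indicator_pair(const char code[2], uint32_t out[2]) {
    out[0] = REGIONAL_INDICATOR_A + (uint32_t)(code[0] - 'A');
    out[1] = REGIONAL_INDICATOR_A + (uint32_t)(code[1] - 'A');
}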
Regional indicator symbols have specialized properties and behavior related to segmenta-
tion, which help to keep interpreted pairs together for line breaking, word segmentation,
and so forth.
The file EmojiSources.txt in the Unicode Character Database provides more information
about source mappings from pairs of regional indicator symbols to flag emoji in older car-
rier emoji sets. Provision of roundtrip mappings to those flag emoji was the original impe-
tus to include regional indicator symbols in the Unicode Standard.
Chapter 23
Special Areas and Format Characters
This chapter describes several kinds of characters that have special properties as well as
areas of the codespace that are set aside for special purposes:
The Unicode Standard contains code positions for the 64 control characters and the DEL
character found in ISO standards and many vendor character sets. The choice of control
function associated with a given character code is outside the scope of the Unicode Stan-
dard, with the exception of those control characters specified in this chapter.
Layout controls are not themselves rendered visibly, but influence the behavior of algo-
rithms for line breaking, word breaking, glyph selection, and bidirectional ordering.
Surrogate code points are restricted use. The numeric values for surrogates are used in
pairs in UTF-16 to access 1,048,576 supplementary code points in the range
U+10000..U+10FFFF.
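The arithmetic that pairs surrogates with supplementary code points is fixed by the UTF-16 encoding form; a minimal C sketch (the function name is illustrative):

#include <stdint.h>

/* Combine a leading (high) surrogate in U+D800..U+DBFF and a trailing
   (low) surrogate in U+DC00..U+DFFF into the supplementary code point
   they represent. */
static uint32_t code_point_from_surrogates(uint16_t lead, uint16_t trail) {
    return 0x10000u + (((uint32_t)(lead - 0xD800) << 10) | (uint32_t)(trail - 0xDC00));
}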
Variation selectors allow the specification of standardized variants of characters. This abil-
ity is particularly useful where the majority of implementations would treat the two vari-
ants as two forms of the same character, but where some implementations need to
differentiate between the two. By using a variation selector, such differentiation can be
made explicit.
Private-use characters are reserved for private use. Their meaning is defined by private
agreement.
Noncharacters are code points that are permanently reserved and will never have charac-
ters assigned to them.
The Specials block contains characters that are neither graphic characters nor traditional
controls.
Tag characters were intended to support a general scheme for the internal tagging of text
streams in the absence of other mechanisms, such as markup languages. The use of tag
characters for language tagging is deprecated.
extra-textual information. When converting escape sequences into and out of Unicode text,
they should be converted on a character-by-character basis. For instance, “ESC-A” <1B
41> would be converted into the Unicode coded character sequence <001B, 0041>. Inter-
pretation of U+0041 as part of the escape sequence, rather than as latin capital letter a, is
the responsibility of the higher-level protocol that makes use of such escape sequences.
This approach allows for low-level conversion processes to conformantly convert escape
sequences into and out of the Unicode Standard without needing to actually recognize the
escape sequences as such.
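As a sketch of such character-by-character conversion, assume (purely for illustration) that the source bytes are Latin-1, so that each byte value is also its code point; the escape sequence itself is passed through uninterpreted:

#include <stddef.h>
#include <stdint.h>

/* Convert bytes to code points one by one; <1B 41> becomes
   <U+001B, U+0041>.  Recognizing "ESC-A" as an escape sequence is
   left to a higher-level protocol. */
static void bytes_to_code_points(const unsigned char *in, size_t n, uint32_t *out) {
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i];  /* Latin-1 assumption: byte value == code point */
}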
If a process uses escape sequences or other configurations of control code sequences to
embed additional information about text (such as formatting attributes or structure), then
such sequences constitute a higher-level protocol that is outside the scope of the Unicode
Standard.
The control codes in Table 23-1 have the Bidi_Class property values of S, B, or WS, rather
than the default of BN used for other control codes. (See Unicode Standard Annex #9,
“Unicode Bidirectional Algorithm.”) In particular, U+001C..U+001E and U+001F have the
Bidi_Class property values B and S, respectively, so that the Bidirectional Algorithm recog-
nizes their separator semantics.
The control codes U+0009..U+000D and U+0085 have the White_Space property. They
also have line breaking property values that differ from the default CM value for other con-
trol codes. (See Unicode Standard Annex #14, “Unicode Line Breaking Algorithm.”)
U+0000 null may be used as a Unicode string terminator, as in the C language. Such usage
is outside the scope of the Unicode Standard, which does not require any particular formal
language representation of a string or any particular usage of null.
Newline Function. In particular, one or more of the control codes U+000A line feed,
U+000D carriage return, and the Unicode equivalent of the EBCDIC next line can encode
a newline function. A newline function can act like a line separator or a paragraph separator,
depending on the application. See Section 23.2, Layout Controls, for information on how to
interpret a line or paragraph separator. The exact encoding of a newline function depends
on the application domain. For information on how to identify a newline function, see
Section 5.8, Newline Guidelines.
Zero Width Space. The U+200B zero width space indicates a word break or line break
opportunity, even though there is no intrinsic width associated with this character. Zero-
width space characters are intended to be used in languages that have no visible word spac-
ing to represent word break or line break opportunities, such as Thai, Myanmar, Khmer,
and Japanese.
The “zero width” in the character name for ZWSP should not be understood too literally.
While this character ordinarily does not result in a visible space between characters, text
justification algorithms may add inter-character spacing (letter spacing) between charac-
ters separated by a ZWSP. For example, in Table 23-2, the row labeled “Display 4” illus-
trates incorrect suppression of inter-character spacing in the context of a ZWSP.
This behavior for ZWSP contrasts with that for fixed-width space characters, such as
U+2002 en space. Such spaces have a specified width that is typically unaffected by justifi-
cation and which should not be increased (or reduced) by inter-character spacing (see
Section 6.2, General Punctuation).
In some languages such as German and Russian, increased letter spacing is used to indicate
emphasis. Implementers should be aware of this issue.
Zero-Width Spaces and Joiner Characters. The zero-width spaces are not to be confused
with the zero-width joiner characters. U+200C zero width non-joiner and U+200D
zero width joiner have no effect on word or line break boundaries, and zero width no-
break space and zero width space have no effect on joining or linking behavior. The
zero-width joiner characters should be ignored when determining word or line break
boundaries. See “Cursive Connection” later in this section.
Hyphenation. U+00AD soft hyphen (SHY) indicates an intraword break point, where a
line break is preferred if a word must be hyphenated or otherwise broken across lines. Such
break points are generally determined by an automatic hyphenator. SHY can be used with
any script, but its use is generally limited to situations where users need to override the
behavior of such a hyphenator. The visible rendering of a line break at an intraword break
point, whether automatically determined or indicated by a SHY, depends on the surround-
ing characters, the rules governing the script and language used, and, at times, the meaning
of the word. The precise rules are outside the scope of this standard, but see Unicode Stan-
dard Annex #14, “Unicode Line Breaking Algorithm,” for additional information. A com-
mon default rendering is to insert a hyphen before the line break, but this is insufficient or
even incorrect in many situations.
Contrast this usage with U+2027 hyphenation point, which is used for a visible indica-
tion of the place of hyphenation in dictionaries. For a complete list of dash characters in
the Unicode Standard, including all the hyphens, see Table 6-3.
The Unicode Standard includes two nonbreaking hyphen characters: U+2011 non-
breaking hyphen and U+0F0C tibetan mark delimiter tsheg bstar. See Section 13.4,
Tibetan, for more discussion of the Tibetan-specific line breaking behavior.
Line and Paragraph Separator. The Unicode Standard provides two unambiguous char-
acters, U+2028 line separator and U+2029 paragraph separator, to separate lines and
paragraphs. They are considered the default form of denoting line and paragraph boundar-
ies in Unicode plain text. A new line is begun after each line separator. A new paragraph
is begun after each paragraph separator. As these characters are separator codes, it is not
necessary either to start the first line or paragraph or to end the last line or paragraph with
them. Doing so would indicate that there was an empty paragraph or line following. The
paragraph separator can be inserted between paragraphs of text. Its use allows the cre-
ation of plain text files, which can be laid out on a different line width at the receiving end.
The line separator can be used to indicate an unconditional end of line.
A paragraph separator indicates where a new paragraph should start. Any interparagraph
formatting would be applied. This formatting could cause, for example, the line to be bro-
ken, any interparagraph line spacing to be applied, and the first line to be indented. A line
separator indicates that a line break should occur at this point; although the text continues
on the next line, it does not start a new paragraph—no interparagraph line spacing or para-
graphic indentation is applied. For more information on line separators, see Section 5.8,
Newline Guidelines.
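A sketch of how a plain-text layout process might classify these two separators (the function names are illustrative; interparagraph formatting itself is up to the application):

#include <stdbool.h>
#include <stdint.h>

#define LINE_SEPARATOR      0x2028u
#define PARAGRAPH_SEPARATOR 0x2029u

/* Both separators end the current line; only the paragraph separator
   additionally starts a new paragraph, with whatever interparagraph
   spacing or indentation the application applies. */
static bool ends_line(uint32_t cp) {
    return cp == LINE_SEPARATOR || cp == PARAGRAPH_SEPARATOR;
}

static bool starts_new_paragraph(uint32_t cp) {
    return cp == PARAGRAPH_SEPARATOR;
}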
ligature to create the most appropriate line layout. However, the rendering system cannot
define the locations where ligatures are possible because there are many languages in
which ligature formation requires more information. For example, in some languages, lig-
atures are never formed across syllable boundaries.
On occasion, an author may wish to override the normal automatic selection of connecting
glyphs or ligatures. Typically, this choice is made to achieve one of the following effects:
• Cause nondefault joining appearance (for example, as is sometimes required in
writing Persian using the Arabic script)
• Exhibit the joining-variant glyphs themselves in isolation
• Request a ligature to be formed where it normally would not be
• Request a ligature not to be formed where it normally would be
The Unicode Standard provides two characters that influence joining and ligature glyph
selection: U+200C zero width non-joiner and U+200D zero width joiner. The zero
width joiner and non-joiner request a rendering system to have more or less of a connec-
tion between characters than they would otherwise have. Such a connection may be a sim-
ple cursive link, or it may include control of ligatures.
The zero width joiner and non-joiner characters are designed for use in plain text; they
should not be used where higher-level ligation and cursive control is available. (See the
W3C specification, “Unicode in XML and Other Markup Languages,” for more informa-
tion.) Moreover, they are essentially requests for the rendering system to take into account
when laying out the text; while a rendering system should consider them, it is perfectly
acceptable for the system to disregard these requests.
The ZWJ and ZWNJ are designed for marking the unusual cases where ligatures or cursive
connections are required or prohibited. These characters are not to be used in all cases
where ligatures or cursive connections are desired; instead, they are meant only for over-
riding the normal behavior of the text.
Joiner. U+200D zero width joiner is intended to produce a more connected rendering
of adjacent characters than would otherwise be the case, if possible. In particular:
• If the two characters could form a ligature but do not normally, ZWJ requests
that the ligature be used.
• Otherwise, if either of the characters could cursively connect but do not nor-
mally, ZWJ requests that each of the characters take a cursive-connection form
where possible.
In a sequence like <X, ZWJ, Y>, where a cursive form exists for X but not for Y, the presence
of ZWJ requests a cursive form for X. Otherwise, where neither a ligature nor a cursive con-
nection is available, the ZWJ has no effect. In other words, given the three broad categories
below, ZWJ requests that glyphs in the highest available category (for the given font) be
used:
1. Ligated
2. Cursively connected
3. Unconnected
Non-joiner. U+200C zero width non-joiner is intended to break both cursive connec-
tions and ligatures in rendering.
ZWNJ requests that glyphs in the lowest available category (for the given font) be used.
For those unusual circumstances where someone wants to forbid ligatures in a sequence
XY but promote cursive connection, the sequence <X, ZWJ, ZWNJ, ZWJ, Y> can be used.
The ZWNJ breaks ligatures, while the two adjacent joiners cause the X and Y to take adja-
cent cursive forms (where they exist). Similarly, if someone wanted to have X take a cursive
form but Y be isolated, then the sequence <X, ZWJ, ZWNJ, Y> could be used (as in previous
versions of the Unicode Standard). Examples are shown in Figure 23-3.
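Expressed as code point sequences, using the sad and lam of Figure 23-1 as X and Y (the array names are illustrative only), those two constructions look like this:

#include <stdint.h>

#define ZWNJ 0x200Cu  /* zero width non-joiner */
#define ZWJ  0x200Du  /* zero width joiner */

/* <X, ZWJ, ZWNJ, ZWJ, Y>: cursive forms for X and Y, no ligature. */
static const uint32_t cursive_no_ligature[]  = { 0x0635, ZWJ, ZWNJ, ZWJ, 0x0644 };

/* <X, ZWJ, ZWNJ, Y>: cursive form for X, Y left isolated. */
static const uint32_t x_cursive_y_isolated[] = { 0x0635, ZWJ, ZWNJ, 0x0644 };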
Cursive Connection. For cursive connection, the joiner and non-joiner characters typically
do not modify the contextual selection process itself, but instead change the context of a
particular character occurrence. By providing a non-joining adjacent character where the
adjacent character otherwise would be joining, or vice versa, they indicate that the render-
ing process should select a different joining glyph. This process can be used in two ways: to
prevent a cursive joining or to exhibit joining glyphs in isolation.
In Figure 23-1, the insertion of the ZWNJ overrides the normal cursive joining of sad and
lam.
[Figure 23-1 shows <U+0635 sad, U+0644 lam> rendered with a cursive join, and <U+0635 sad, U+200C ZWNJ, U+0644 lam> rendered without it.]
In Figure 23-2, the normal display of ghain without ZWJ before or after it uses the nominal
(isolated) glyph form. When preceded and followed by ZWJ characters, however, the ghain
is rendered with its medial form glyph in isolation.
[Figure 23-2 shows U+063A ghain alone rendered with its nominal (isolated) glyph, and <U+200D ZWJ, U+063A ghain, U+200D ZWJ> rendered with the medial form glyph in isolation.]
The examples in Figure 23-1 and Figure 23-2 are adapted from the Iranian national coded
character set standard, ISIRI 3342, which defines ZWNJ and ZWJ as “pseudo space” and
“pseudo connection,” respectively.
Examples. Figure 23-3 provides samples of desired renderings when the joiner or non-
joiner is inserted between two characters. The examples presume that all of the glyphs are
available in the font. If, for example, the ligatures are not available, the display would fall
back to the unligated forms. Each of the entries in the first column of Figure 23-3 shows
two characters in visual display order. The column headings show characters to be inserted
between those two characters. The cells below show the respective display when the joiners
in the heading row are inserted between the original two characters.
[Figure 23-3: rows for the pairs <U+0066, U+0069>, <U+0627, U+0644>, <U+062C, U+0645>, and <U+062C, U+0648>; columns for the pair as is and with ZWJ or ZWNJ inserted between the two characters.]
For backward compatibility, between Arabic characters a ZWJ acts just like the sequence
<ZWJ, ZWNJ, ZWJ>, preventing a ligature from forming instead of requesting the use of a
ligature that would not normally be used. As a result, there is no plain text mechanism for
requesting the use of a ligature in Arabic text.
Transparency. The property value of Joining_Type = Transparent applies to characters
that should not interfere with cursive connection, even when they occur in sequence
between two characters that are connected cursively. These include all nonspacing marks
and most format control characters, except for ZWJ and ZWNJ themselves. Note, in partic-
ular, that enclosing combining marks are also transparent as regards cursive connection.
For example, using U+20DD combining enclosing circle to circle an Arabic letter in a
sequence should not cause that Arabic letter to change its cursive connections to neighbor-
ing letters. See Section 9.2, Arabic, for more on joining classes and the details regarding Ara-
bic cursive joining.
Joiner and Non-joiner in Indic Scripts. In Indic text, the ZWJ and ZWNJ are used to
request particular display forms. A ZWJ after a sequence of consonant plus virama requests
what is called a “half-form” of that consonant. A ZWNJ after a sequence of consonant plus
Blocking Reordering. The CGJ has no visible glyph and no other format effect on neigh-
boring characters but simply blocks reordering of combining marks. It can therefore be
used as a tool to distinguish two alternative orderings of a sequence of combining marks
for some exceptional processing or rendering purpose, whenever normalization would
otherwise eliminate the distinction between the two sequences.
For example, using CGJ to block reordering is one way to maintain distinction between
differently ordered sequences of certain Hebrew accents and marks. These distinctions are
necessary for analytic and text representational purposes. However, these characters were
assigned fixed-position combining classes despite the fact that they interact typographi-
cally. As a result, normalization treats differently ordered sequences as equivalent. In par-
ticular, the sequence
<lamed, patah, hiriq, finalmem>
is canonically equivalent to
<lamed, hiriq, patah, finalmem>
because the canonical combining classes of U+05B4 hebrew point hiriq and U+05B7
hebrew point patah are distinct. However, the sequence
<lamed, patah, CGJ, hiriq, finalmem>
is not canonically equivalent to the other two. The presence of the combining grapheme
joiner, which has ccc = 0, blocks the reordering of hiriq before patah by canonical reorder-
ing and thus allows a patah following a hiriq and a patah preceding a hiriq to be reliably
distinguished, whether for display or for other processing.
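The three sequences, written out as code points (U+05DC lamed, U+05B4 hiriq, U+05B7 patah, U+05DD final mem, U+034F CGJ; the array names are illustrative):

#include <stdint.h>

#define CGJ 0x034Fu  /* combining grapheme joiner */

/* The first two sequences are canonically equivalent and normalize to
   the same order; the third is not, because CGJ (ccc = 0) blocks the
   canonical reordering of hiriq before patah. */
static const uint32_t lamed_patah_hiriq[]     = { 0x05DC, 0x05B7, 0x05B4, 0x05DD };
static const uint32_t lamed_hiriq_patah[]     = { 0x05DC, 0x05B4, 0x05B7, 0x05DD };
static const uint32_t lamed_patah_cgj_hiriq[] = { 0x05DC, 0x05B7, CGJ, 0x05B4, 0x05DD };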
The use of CGJ with double diacritics is discussed in Section 7.9, Combining Marks; see
Figure 7-11.
CGJ and Collation. The Unicode Collation Algorithm normalizes Unicode text strings
before applying collation weighting. The combining grapheme joiner is ordinarily ignored
in collation key weighting in the UCA. However, whenever it blocks the reordering of com-
bining marks in a string, it affects the order of secondary key weights associated with those
combining marks, giving the two strings distinct keys. That makes it possible to treat them
distinctly in searching and sorting without having to tailor the weights for either the com-
bining grapheme joiner or the combining marks.
The CGJ can also be used to prevent the formation of contractions in the Unicode Colla-
tion Algorithm. For example, while “ch” is sorted as a single unit in a tailored Slovak colla-
tion, the sequence <c, CGJ, h> will sort as a “c” followed by an “h”. The CGJ can also be
used in German, for example, to distinguish in sorting between “ü” in the meaning of u-
umlaut, which is the more common case and often sorted like <u,e>, and “ü” in the mean-
ing u-diaeresis, which is comparatively rare and sorted like “u” with a secondary key
weight. This also requires no tailoring of either the combining grapheme joiner or the
sequence. Because CGJ is invisible and has the Default_Ignorable_Code_Point property,
data that are marked up with a CGJ should not cause problems for other processes.
It is possible to give sequences of characters that include the combining grapheme joiner
special tailored weights. Thus the sequence <c, CGJ, h> could be weighted completely dif-
ferently from the contraction “ch” or from the way “c” and “h” would have sorted without
the contraction. However, such an application of CGJ is not recommended. For more
information on the use of CGJ with sorting, matching, and searching, see Unicode Technical
Standard #10, “Unicode Collation Algorithm.”
Rendering. For rendering, the combining grapheme joiner is invisible. However, some
older implementations may treat a sequence of grapheme clusters linked by combining
grapheme joiners as a single unit for the application of enclosing combining marks. For
more information on grapheme clusters, see Unicode Standard Annex #29, “Unicode
Text Segmentation.” For more information on enclosing combining marks, see
Section 3.11, Normalization Forms.
CGJ and Joiner Characters. The combining grapheme joiner must not be confused with
the zero width joiner or the word joiner, which have very different functions. In particular,
inserting a combining grapheme joiner between two characters should have no effect on
their ligation or cursive joining behavior. Where the prevention of line breaking is the
desired effect, the word joiner should be used. For more information on the behavior of
these characters in line breaking, see Unicode Standard Annex #14, “Unicode Line Break-
ing Algorithm.”
As with other format control characters, bidirectional ordering controls affect the layout of
the text in which they are contained but should be ignored for other text processes, such as
sorting or searching. However, text processes that modify text content must maintain these
characters correctly, because matching pairs of bidirectional ordering controls must be
coordinated, so as not to disrupt the layout and interpretation of bidirectional text. Each
instance of a lre, rle, lro, or rlo is normally paired with a corresponding pdf. Likewise,
each instance of an lri, rli, or fsi is normally paired with a corresponding pdi.
U+200E left-to-right mark, U+200F right-to-left mark, and U+061C arabic let-
ter mark have the semantics of an invisible character of zero width, except that these char-
acters have strong directionality. They are intended to be used to resolve cases of
ambiguous directionality in the context of bidirectional texts; they are not paired. Unlike
U+200B zero width space, these characters carry no word breaking semantics. (See Uni-
code Standard Annex #9, “Unicode Bidirectional Algorithm,” for more information.)
The bidirectional overrides, embeddings, and isolates, as well as the annotation characters
are reasonably robust, because their behavior terminates at paragraph boundaries. Paired
format controls for representation of beams and slurs in music are recommended only for
specialized musical layout software, and also have limited scope.
Bidirectional overrides, embeddings, and isolates are default ignorable (that is,
Default_Ignorable_Code_Point = True); if they are not supported by an implementation,
they should not be rendered with a visible glyph. The paired stateful controls for musical
beams and slurs are likewise default ignorable.
The annotation characters, however, are different. When they are used and correctly inter-
preted by an implementation, they separate annotation text from the annotated text, and
the fully rendered text will typically distinguish the two parts quite clearly. Simply omitting
any display of the annotation characters by an implementation which does not interpret
them would have the potential to cause significant misconstrual of text content. Hence, the
annotation characters are not default ignorable; an implementation which does not inter-
pret them should render them with visible glyphs, using one of the techniques discussed in
Section 5.3, Unknown and Missing Characters. See “Annotation Characters” in Section 23.8,
Specials for more discussion.
Other paired stateful controls in the standard are deprecated, and their use should be
avoided. They are listed in Table 23-5.
The tag characters, originally intended for the representation of language tags, are particu-
larly fragile under editorial operations that move spans of text around. See Section 5.10,
Language Information in Plain Text, for more information about language tagging.
presentation forms (for example, U+FE80..U+FEFC), then these forms should be pre-
sented without shape modification. This state (inhibited) is the default state in the absence
of any character shaping selector or a higher-level protocol.
From the point of encountering a U+206D activate arabic form shaping format char-
acter up to a subsequent U+206C inhibit arabic form shaping (if any), any Arabic
presentation forms that appear in the backing store should be presented with shape modi-
fication by means of the character shaping (glyph selection) process.
The shaping selectors have no effect on nominal Arabic characters (U+0600..U+06FF),
which are always subject to character shaping (glyph selection).
Numeric Shape Selectors. The numeric shape selector format characters allow the selec-
tion of the shapes in which the digits U+0030..U+0039 are to be rendered. These format
characters do not nest.
From the point of encountering a U+206E national digit shapes format character up to
a subsequent U+206F nominal digit shapes (if any), the European digits (U+0030..
U+0039) should be depicted using the appropriate national digit shapes as specified by
means of appropriate agreements. For example, they could be displayed with shapes such
as the arabic-indic digits (U+0660..U+0669). The actual character shapes (glyphs) used
to display national digit shapes are not specified by the Unicode Standard.
From the point of encountering a U+206F nominal digit shapes format character up to
a subsequent U+206E national digit shapes (if any), the European digits (U+0030..
U+0039) should be depicted using glyphs that represent the nominal digit shapes shown in
the code tables for these digits. This state (nominal) is the default state in the absence of
any numeric shape selector or a higher-level protocol.
be treated as a case mapping pair, or a private agreement could specify that a private-use
character is to be rendered and otherwise treated as a combining mark.
To exchange private-use characters in a semantically consistent way, users may also
exchange privately defined data which describes how each private-use character is to be
interpreted. The Unicode Standard provides no predefined format for such a data
exchange.
Normalization. The canonical and compatibility decompositions of any private-use char-
acter are equal to the character itself (for example, U+E000 decomposes to U+E000). The
Canonical_Combining_Class of private-use characters is defined as 0 (Not_Reordered).
These values are normatively defined by the Unicode Standard and cannot be changed by
private agreement. The treatment of all private-use characters for normalization forms
NFC, NFD, NFKD, and NFKC is also normatively defined by the Unicode Standard on the
basis of these decompositions. (See Unicode Standard Annex #15, “Unicode Normaliza-
tion Forms.”) No private agreement may change these forms—for example, by changing
the standard canonical or compatibility decompositions for private-use characters. The
implication is that all private-use characters, no matter what private agreements they are
subject to, always normalize to themselves and are never reordered in any Unicode nor-
malization form.
This does not preclude private agreements on other transformations. Thus one could
define a transformation “MyCompanyComposition” that was identical to NFC except that
it mapped U+E000 to “a”. The forms NFC, NFD, NFKD, and NFKC themselves, however,
cannot be changed by such agreements.
example of the latter case would be the assignment of a character code to a vendor-specific
logo character such as Apple’s apple character.
Note, however, that systems vendors may need to support full end-user definability for all
private-use characters, for such purposes as gaiji support or for transient cross-mapping
tables. The use of noncharacters (see Section 23.7, Noncharacters, and Definition D14 in
Section 3.4, Characters and Encoding) is the preferred way to make use of non-interchange-
able internal system sentinels of various sorts.
End-User Subarea. The end-user subarea is intended for private-use character definitions
by end users or for scratch allocations of character space by end-user applications.
Allocation of Subareas. Vendors may choose to reserve ranges of private-use characters in
the corporate use subarea and make some defined portion of the end-user subarea avail-
able for completely free end-user definition. The convention of separating the two subareas
is merely a suggestion for the convenience of system vendors and software developers. No
firm dividing line between the two subareas is defined in this standard, as different users
may have different requirements. No provision is made in the Unicode Standard for avoid-
ing a “stack-heap collision” between the two subareas; in other words, there is no guaran-
tee that end users will not define a private-use character at a code point that overlaps and
conflicts with a particular corporate private-use definition at the same code point. Avoid-
ing such overlaps in definition is up to implementations and users.
23.7 Noncharacters
Noncharacters: U+FFFE, U+FFFF, and Others
Noncharacters are code points that are permanently reserved in the Unicode Standard for
internal use. They are not recommended for use in open interchange of Unicode text data.
See Section 3.2, Conformance Requirements and Section 3.4, Characters and Encoding, for
the formal definition of noncharacters and conformance requirements related to their use.
The Unicode Standard sets aside 66 noncharacter code points. The last two code points of
each plane are noncharacters: U+FFFE and U+FFFF on the BMP, U+1FFFE and
U+1FFFF on Plane 1, and so on, up to U+10FFFE and U+10FFFF on Plane 16, for a total
of 34 code points. In addition, there is a contiguous range of another 32 noncharacter code
points in the BMP: U+FDD0..U+FDEF. For historical reasons, the range
U+FDD0..U+FDEF is contained within the Arabic Presentation Forms-A block, but those
noncharacters are not “Arabic noncharacters” or “right-to-left noncharacters,” and are not
distinguished in any other way from the other noncharacters, except in their code point
values.
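Because the noncharacters fall into one contiguous range plus a fixed bit pattern at the end of each plane, a test for them is compact. A C sketch (the function name is illustrative):

#include <stdbool.h>
#include <stdint.h>

/* True for the 66 noncharacter code points: U+FDD0..U+FDEF plus the
   last two code points of each of the 17 planes (U+FFFE, U+FFFF,
   U+1FFFE, U+1FFFF, ..., U+10FFFE, U+10FFFF). */
static bool is_noncharacter(uint32_t cp) {
    return (cp >= 0xFDD0 && cp <= 0xFDEF)
        || ((cp & 0xFFFEu) == 0xFFFEu && cp <= 0x10FFFF);
}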
Applications are free to use any of these noncharacter code points internally. They have no
standard interpretation when exchanged outside the context of internal use. However, they
are not illegal in interchange, nor does their presence cause Unicode text to be ill-formed.
The intent of noncharacters is that they are permanently prohibited from being assigned
interchangeable meanings by the Unicode Standard. They are not prohibited from occur-
ring in valid Unicode strings which happen to be interchanged. This distinction, which
might be seen as too finely drawn, ensures that noncharacters are correctly preserved when
“interchanged” internally, as when used in strings in APIs, in other interprocess protocols,
or when stored.
If a noncharacter is received in open interchange, an application is not required to inter-
pret it in any way. It is good practice, however, to recognize it as a noncharacter and to take
appropriate action, such as replacing it with U+FFFD replacement character, to indi-
cate the problem in the text. It is not recommended to simply delete noncharacter code
points from such text, because of the potential security issues caused by deleting uninter-
preted characters. (See conformance clause C7 in Section 3.2, Conformance Requirements,
and Unicode Technical Report #36, “Unicode Security Considerations.”)
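A sketch of the replacement approach described above, applied to a buffer of code points (the noncharacter test is repeated inline so the fragment stands alone; the function name is illustrative):

#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHARACTER 0xFFFDu

/* Replace, rather than delete, any noncharacter received in open
   interchange. */
static void scrub_noncharacters(uint32_t *text, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        uint32_t cp = text[i];
        if ((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFEu) == 0xFFFEu)
            text[i] = REPLACEMENT_CHARACTER;
    }
}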
In effect, noncharacters can be thought of as application-internal private-use code points.
Unlike the private-use characters discussed in Section 23.5, Private-Use Characters, which
are assigned characters and which are intended for use in open interchange, subject to
interpretation by private agreement, noncharacters are permanently reserved (unassigned)
and have no interpretation whatsoever outside of their possible application-internal pri-
vate uses.
U+FFFF and U+10FFFF. These two noncharacter code points have the attribute of being
associated with the largest code unit values for particular Unicode encoding forms. In
UTF-16, U+FFFF is associated with the largest 16-bit code unit value, 0xFFFF. U+10FFFF
is associated with the largest legal UTF-32 32-bit code unit value, 0x10FFFF. This attribute
renders these two noncharacter code points useful for internal purposes as sentinels. For
example, they might be used to indicate the end of a list, to represent a value in an index
guaranteed to be higher than any valid character value, and so on.
U+FFFE. This noncharacter has the intended peculiarity that, when represented in UTF-
16 and then serialized, it has the opposite byte sequence of U+FEFF, the byte order mark.
This means that applications should reserve U+FFFE as an internal signal that a UTF-16
text stream is in a reversed byte format. Detection of U+FFFE at the start of an input
stream should be taken as a strong indication that the input stream should be byte-
swapped before interpretation. For more on the use of the byte order mark and its interac-
tion with the noncharacter U+FFFE, see Section 23.8, Specials.
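A sketch of that detection for a buffer of UTF-16 code units already read under the receiver's assumed byte order (the function name is illustrative):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* If the first code unit reads as U+FFFE under the assumed byte order,
   the stream is very likely byte-swapped UTF-16 whose first character
   is the byte order mark U+FEFF. */
static bool looks_byte_swapped(const uint16_t *units, size_t count) {
    return count > 0 && units[0] == 0xFFFE;
}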
23.8 Specials
The Specials block contains code points that are interpreted as neither control nor graphic
characters but that are provided to facilitate current software practices.
For information about the noncharacter code points U+FFFE and U+FFFF, see
Section 23.7, Noncharacters.
In UTF-8, the BOM corresponds to the byte sequence <EF BB BF>. Although there
are never any questions of byte order with UTF-8 text, this sequence can serve as signature
for UTF-8 encoded text where the character set is unmarked. As with a BOM in UTF-16,
this sequence of bytes will be extremely rare at the beginning of text files in other character
encodings. For example, in systems that employ Microsoft Windows ANSI Code Page
1252, <EF BB BF> corresponds to the sequence <i diaeresis, guillemet, inverted ques-
tion mark> “ï » ¿”.
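Checking for that signature is a straightforward comparison of the first three bytes; a minimal C sketch (the function name is illustrative):

#include <stdbool.h>
#include <stddef.h>

/* True if the byte stream begins with <EF BB BF>, the UTF-8 encoded
   form of U+FEFF. */
static bool has_utf8_signature(const unsigned char *bytes, size_t n) {
    return n >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF;
}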
For compatibility with versions of the Unicode Standard prior to Version 3.2, the code
point U+FEFF has the word-joining semantics of zero width no-break space when it is not
used as a BOM. In new text, these semantics should be encoded by U+2060 word joiner.
See “Line and Word Breaking” in Section 23.2, Layout Controls, for more information.
Where the byte order is explicitly specified, such as in UTF-16BE or UTF-16LE, then all
U+FEFF characters—even at the very beginning of the text—are to be interpreted as zero
width no-break spaces. Similarly, where Unicode text has known byte order, initial U+FEFF
characters are not required, but for backward compatibility are to be interpreted as zero
width no-break spaces. For example, for strings in an API, the memory architecture of the
processor provides the explicit byte order. For databases and similar structures, it is much
more efficient and robust to use a uniform byte order for the same field (if not the entire
database), thereby avoiding use of the byte order mark.
Systems that use the byte order mark must recognize when an initial U+FEFF signals the
byte order. In those cases, it is not part of the textual content and should be removed before
processing, because otherwise it may be mistaken for a legitimate zero width no-break
space. To represent an initial U+FEFF zero width no-break space in a UTF-16 file, use
U+FEFF twice in a row. The first one is a byte order mark; the second one is the initial zero
width no-break space. See Table 23-6 for a summary of encoding scheme signatures.
If U+FEFF had only the semantics of a signature code point, it could be freely deleted from
text without affecting the interpretation of the rest of the text. Carelessly appending files
together, for example, can result in a signature code point in the middle of text. Unfortu-
nately, U+FEFF also has significance as a character. As a zero width no-break space, it indi-
cates that line breaks are not allowed between the adjoining characters. Thus U+FEFF
affects the interpretation of text and cannot be freely deleted. The overloading of semantics
for this code point has caused problems for programs and protocols. The new character
U+2060 word joiner has the same semantics in all cases as U+FEFF, except that it cannot
be used as a signature. Implementers are strongly encouraged to use word joiner in those
circumstances whenever word joining semantics are intended.
An initial U+FEFF also takes a characteristic form in other charsets designed for Unicode
text. (The term “charset” refers to a wide range of text encodings, including encoding
schemes as well as compression schemes and text-specific transformation formats.) The
characteristic sequences of bytes associated with an initial U+FEFF can serve as signatures
in those cases, as shown in Table 23-7.
Most signatures can be deleted either before or after conversion of an input stream into a
Unicode encoding form. However, in the case of BOCU-1 and UTF-7, the input byte
sequence must be converted before the initial U+FEFF can be deleted, because stripping
the signature byte sequence without conversion destroys context necessary for the correct
interpretation of subsequent bytes in the input sequence.
Specials: U+FFF0–U+FFF8
The nine unassigned Unicode code points in the range U+FFF0..U+FFF8 are reserved for
special character definitions.
[Figure residue: annotation characters in the text stream and the corresponding text display.]
For example, consider the task of embedding a language tag for Japanese. The Japanese tag
from BCP 47 is “ja” (composed of ISO 639 language id) or, alternatively, “ja-JP” (composed
of ISO 639 language id plus ISO 3166 country id). Because BCP 47 specifies that language
tags are not case significant, it is recommended that for language tags, the entire tag be low-
ercased before conversion to tag characters.
Thus the entire language tag “ja-JP” would be converted to the tag characters as follows:
<U+E0001, U+E006A, U+E0061, U+E002D, U+E006A, U+E0070>
The language tag, in its shorter, “ja” form, would be expressed as follows:
<U+E0001, U+E006A, U+E0061>
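A C sketch of this conversion for an ASCII-range BCP 47 tag, lowercasing as recommended above (the function name and the caller-supplied output buffer are illustrative assumptions):

#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

#define LANGUAGE_TAG 0xE0001u  /* language tag identification character */

/* Emit U+E0001 followed by each lowercased ASCII byte of the tag
   shifted into the tag character range, so "ja-JP" becomes
   <U+E0001, U+E006A, U+E0061, U+E002D, U+E006A, U+E0070>.
   Returns the number of code points written. */
static size_t language_tag_to_tag_characters(const char *tag, uint32_t *out, size_t cap) {
    size_t k = 0;
    if (k < cap)
        out[k++] = LANGUAGE_TAG;
    for (const char *p = tag; *p != '\0' && k < cap; ++p)
        out[k++] = 0xE0000u + (uint32_t)tolower((unsigned char)*p);
    return k;
}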
Tag Scope and Nesting. The value of an established tag continues from the point at which
the tag is embedded in text until either
A. The text itself goes out of scope, as defined by the application (for
example, for line-oriented protocols, when reaching the end-of-line or
end-of-string; for text streams, when reaching the end-of-stream; and so
on),
or
B. The tag is explicitly canceled by the U+E007F cancel tag character.
Tags of the same type cannot be nested in any way. For example, if a new embedded lan-
guage tag occurs following text that was already language tagged, the tagged value for sub-
sequent text simply changes to that specified in the new tag.
Tags of different types can have interdigitating scope, but not hierarchical scope. In effect,
tags of different types completely ignore each other, so that the use of language tags can be
completely asynchronous with the use of future tag types. These relationships are illus-
trated in Figure 23-5.
Canceling Tag Values. The main function of cancel tag is to make possible operations
such as blind concatenation of strings in a tagged context without the propagation of inap-
propriate tag values across the string boundaries. There are two uses of cancel tag. To
cancel a tag value of a particular type, prefix the cancel tag character with the tag identi-
fication character of the appropriate type. For example, the complete string to cancel a lan-
guage tag is <U+E0001, U+E007F>. The value of the relevant tag type returns to the default
state for that tag type—namely, no tag value specified, the same as untagged text. To cancel
any tag values of any type that may be in effect, use cancel tag without a prefixed tag iden-
tification character.
Currently there is no observable difference between the two uses of cancel tag, because only
one tag identification character (and therefore one tag type) is defined. Inserting a bare
cancel tag in places where only the language tag needs to be canceled could lead to unan-
ticipated side effects if this text were to be inserted in the future into a text that supports
more than one tag type.
When it is necessary to make the tags themselves visible, it is advisable that the tag characters
be rendered using the corresponding ASCII character glyphs (perhaps modified systematically to differentiate them
from normal ASCII characters). The tag character values have been chosen, however, so that
the tag characters will be interpretable in most debuggers even without display support.
Processing. Sequential access to the text is generally straightforward. If language codes are
not relevant to the particular processing operation, then they should be ignored. Random
access to stateful tags is more problematic. Because the current state of the text depends on
tags that precede it, the text must be searched backward, sometimes all the
way to the start. With these exceptions, tags pose no particular difficulties as long as no
modifications are made to the text.
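As a sketch of such a backward search, the following C function (illustrative only; the function name is an assumption, the text is taken to be in the UTF-32 encoding form, and pos is assumed to be a valid index) locates the language tag governing a given position, if any.

#include <stdint.h>

/* Scan backward from pos for the most recent U+E0001 language tag that
   has not been canceled. Returns the index of the tag introducer, or -1
   if no language tag value is in effect at pos. Illustrative sketch only. */
long findGoverningLanguageTag(const uint32_t *text, long pos)
{
    for (long i = pos - 1; i >= 0; --i) {
        if (text[i] == 0xE0001) {
            if (text[i + 1] == 0xE007F)
                return -1;     /* <U+E0001, U+E007F> cancels the language tag */
            return i;
        }
        if (text[i] == 0xE007F && (i == 0 || text[i - 1] != 0xE0001))
            return -1;         /* a bare cancel tag cancels all tag values */
    }
    return -1;                 /* untagged text */
}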
Range Checking for Tag Characters. Tag characters are encoded in Plane 14 to support
easy range checking. The following C/C++ source code snippets show efficient implemen-
tations of range checks for characters U+E0000..U+E007F expressed in each of the three
significant Unicode encoding forms. Range checks allow implementations that do not
want to support these tag characters to efficiently filter for them.
Range check expressed in UTF-32:
if ( ((unsigned) *s) - 0xE0000 <= 0x7F )
Range check expressed in UTF-16:
if ( ( *s == 0xDB40 ) && ( ((unsigned)*(s+1)) - 0xDC00 <= 0x7F ) )
Range check expressed in UTF-8:
if ( ( *s == 0xF3 ) && ( *(s+1) == 0xA0 ) &&
( ( *(s+2) & 0xFE ) == 0x80 ) )
Alternatively, the range checks for UTF-32 and UTF-16 can be coded with bit masks. Both
versions should be equally efficient.
Range check expressed in UTF-32:
if ( ((*s) & 0xFFFFFF80) == 0xE0000 )
Range check expressed in UTF-16:
if ( ( *s == 0xDB40 ) && ( ( *(s+1) & 0xFF80 ) == 0xDC00 ) )
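For example, an implementation that chooses not to support tag characters might remove them with a routine such as the following C sketch, which applies the UTF-32 form of the range check; the function name is an illustrative assumption, not part of the standard.

#include <stddef.h>
#include <stdint.h>

/* Remove all tag characters (U+E0000..U+E007F) from a UTF-32 buffer,
   compacting it in place and returning the new length.
   Illustrative sketch only. */
size_t filterTagCharacters(uint32_t *text, size_t length)
{
    size_t out = 0;
    for (size_t in = 0; in < length; ++in) {
        if (text[in] - 0xE0000 <= 0x7F)    /* the UTF-32 range check above */
            continue;                      /* drop the tag character */
        text[out++] = text[in];
    }
    return out;
}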
Editing and Modification. Inline tags present particular problems for text changes,
because they are stateful. Any modifications of the text are more complicated, as those
modifications need to be aware of the current language status and the <start>...<end> tags
must be properly maintained. If an editing program is unaware that certain tags are stateful
and cannot process them correctly, then it is very easy for the user to modify text in ways
that corrupt it. For example, a user might delete part of a tag or paste text including a tag
into the wrong context.
Dangers of Incomplete Support. Even programs that do not interpret the tags should not
allow editing operations to break initial tags or leave tags unpaired. Unpaired tags should
be discarded upon a save or send operation.
Chapter 24
About the Code Charts
Disclaimer
Character images shown in the code charts are not prescriptive. In actual fonts,
considerable variations are to be expected.
The Unicode code charts present the characters of the Unicode Standard. This chapter
explains the conventions used in the code charts and provides other useful information
about the accompanying names lists.
Characters are organized into related groups called blocks (see D10b in Section 3.4, Charac-
ters and Encoding). Many scripts are fully contained within a single block, but other scripts,
including some of the most widely used scripts, have characters divided across several
blocks. Separate blocks contain common punctuation characters and different types of
symbols.
A character names list follows the code chart for each block. The character names list item-
izes every character in that block and provides supplementary information in many cases.
A full list of the character names and associated annotations, formatted as a text file,
NamesList.txt, is available in the Unicode Character Database. That text file contains syn-
tax conventions which are used by the tooling that formats the PDF versions of the code
charts and character names lists. For the full specification of those conventions, see
NamesList.html in the Unicode Character Database.
An index to distinctive character names can also be found on the Unicode website.
For information about access to the code charts, the character name index, and the road-
map for future allocations, see Section B.3, Other Unicode Online Resources.
24.1 Character Names List
The fonts used for other scripts are similar to Times in that each represents a common,
widely used design, with variable stroke width and serifs or similar devices, where applica-
ble, to show each character as distinctly as possible. Sans-serif fonts with uniform stroke
width tend to have less visibly distinct characters. In the code charts, sans-serif fonts are
used for archaic scripts that predate the invention of serifs, for example.
Alternative Forms. Some characters have alternative forms. For example, even the ASCII
character U+0061 latin small letter a has two common alternative forms: the “a” used
in Times and the “ɑ” that occurs in many other font styles. In a Times-like font, the character
U+03A5 greek capital letter upsilon looks like “Y”; the form “ϒ” is common in
other font styles.
A different case is U+010F latin small letter d with caron, which is commonly typeset
with the caron rendered as an apostrophe-like mark beside the letter rather than as a full
háček above it. In such cases, the code charts show the more common variant in pref-
erence to a more didactic archetypical shape.
Many characters have been unified and have different appearances in different language
contexts. The shape shown for U+2116 № numero sign is a fullwidth shape as it would be
used in East Asian fonts. In Cyrillic usage, № is the universally recognized glyph. See
Figure 22-2.
In certain cases, characters need to be represented by more or less condensed, shifted, or
distorted glyphs to make them fit the format of the code charts. For example, U+0D10 ഐ
malayalam letter ai is shown in a reduced size to fit the character cell.
When characters are used in context, the surrounding text gives important clues as to iden-
tity, size, and positioning. In the code charts, these clues are absent. For example, U+2075
⁵ superscript five is shown much smaller than it would be in a Times-like text font.
Whenever a more obvious choice for representative glyph may be insufficient to aid in the
proper identification of the encoded character, a more distinct variant has been selected as
representative glyph instead.
Orientation. Representative glyphs for characters in the code charts are oriented as they
would normally appear in text, with the exception of scripts which are predominantly laid
out in vertical lines, such as Mongolian and Phags-pa. Commercial production fonts show
Mongolian glyphs with their images turned 90 degrees counterclockwise, which is the
appropriate orientation for Mongolian text that is laid out horizontally, such as for embed-
ding in horizontally formatted, left-to-right Chinese text. For normal vertical display of
Mongolian text, layout engines typically lay out horizontally, and then rotate the formatted
text 90 degrees clockwise. Starting with Unicode 7.0, the code charts display Mongolian
glyphs in their horizontal orientation, following the conventions of commercial Mongo-
lian fonts. Glyphs in the Phags-pa code chart are treated similarly.
This convention is also used for some graphic characters which are only distinguished by
special behavior from another character of the same appearance.
2011 ‑ non-breaking hyphen
→ 002D - hyphen-minus
→ 00AD soft hyphen
≈ <noBreak> 2010 ‐
The dashed box convention also applies to the glyphs of combining characters which have
no visible display of their own, such as variation selectors (see Section 23.4, Variation Selec-
tors).
FE00 variation selector-1
• these are abbreviated VS1, and so on
Sometimes, the combining status of the character is indicated by including a dotted circle
inside the dashed box, for example for the consonant-stacking viramas.
17D2 khmer sign coeng
• functions to indicate that the following Khmer letter is to be rendered
subscripted
• shape shown is arbitrary and is not visibly rendered
Even though the presence of the dashed box in the code charts indicates that a character is
likely to be a space character, a control character, a format character, or a combining char-
acter, it cannot be used to infer the actual General_Category value of that character.
Reserved Characters. Character codes that are marked “<reserved>” are unassigned and
reserved for future encoding. Reserved codes are indicated by a special glyph. To ensure read-
ability, many instances of reserved characters have been suppressed from the names list.
Reserved codes may also have cross references to assigned characters located elsewhere.
2073 <reserved>
→ 00B3 ³ superscript three
Noncharacters. Character codes that are marked “<not a character>” refer to noncharac-
ters. They are designated code points that will never be assigned to a character. These codes
are indicated by a special glyph. Noncharacters are shown in the code charts only where they
occur together with other characters in the same block. For a complete list of noncharac-
ters, see Section 23.7, Noncharacters.
FFFF <not a character>
Deprecated Characters. Deprecated characters are characters whose use is strongly dis-
couraged, but which are retained in the standard indefinitely so that existing data remain
well defined and can be correctly interpreted. (See D13 in Section 3.4, Characters and
Encoding.) Deprecated characters are explicitly indicated in the Unicode code charts using
annotations or subheads.
Character Names
The character names in the code charts precisely match the normative character names in
the Unicode Character Database. Character names are unique and stable. By convention,
they are in uppercase. For more information on character names, see Section 4.8, Name.
Informative Aliases
An informative alias is an informal, alternate name for a character. Aliases are provided to
assist in the correct identification of characters, in some cases providing more commonly
known names than the normative character name used in the standard. For example:
002E . full stop
= period, dot, decimal point
Informative aliases are indicated with a “=” in the names list, and by convention are shown
in lowercase, except when they include a proper name. (Note that a “=” in the names list
may also introduce a normative alias, which is distinguished from an informative alias by
being shown in uppercase. See the following discussion of normative aliases.)
Multiple aliases for a character may be given in a single informative alias line, in which case
each alias is separated by a comma. In other cases, multiple informative alias lines may
appear in a single entry. Informative aliases can be used to indicate distinct functions that
a character may have; this is particularly common for symbols. For example:
2206 Δ increment
= Laplace operator
= forward difference
= symmetric difference of sets
In some complex cases involving many informative aliases, rather than introduce a sepa-
rate line for each set of related aliases, an informative alias line may also separate groups of
aliases with semicolons:
1F70A 🜊 alchemical symbol for vinegar
= crucible; acid; distill; atrament; vitriol; red sulfur; borax; wine; alkali
salt; mercurius vivus, quick silver
Informative aliases for different characters are not guaranteed to be unique. They are
maintained editorially, and may be changed, added to, or even be deleted in future versions
of the standard, as information accumulates about particular characters and their uses.
Informative aliases may serve as useful alternate choices for identifying characters in user
interfaces. The formal character names in the standard may differ in unexpected ways from
the more commonly used names for the characters. For example:
00B6 ¶ pilcrow sign
= paragraph sign
Unicode 1.0 Names. Some character names from The Unicode Standard, Version 1.0 are
indicated in the names list. These are provided only for their historical interest. Where they
occur, they also are introduced with a “=” and are shown in lowercase. In addition they are
explicitly annotated with a following “1.0” in parentheses. For example:
01C3 ǃ latin letter retroflex click
= latin letter exclamation mark (1.0)
If a Unicode 1.0 name and one or more other informative aliases occurs in a single entry,
the Unicode 1.0 name will be given first. For example:
00A6 ¦ broken bar
= broken vertical bar (1.0)
= parted rule (in typography)
Note that informative aliases other than Unicode 1.0 names may also contain clarifying
annotations in parentheses.
Jamo Short Names. In the Hangul Jamo block, U+1100..U+11FF, the normative jamo
short names from Jamo.txt in the UCD are displayed for convenience of reference. These
are also indicated with a “=” in the names list and are shown in uppercase to imply their
normative status. For example:
1101 ᄁ hangul choseong ssangkiyeok
= GG
The Jamo short names do not actually have the status of alternate names; instead they are
simply string values associated with the jamo characters, for use by the Unicode Hangul
Syllable Name Generation algorithm. See Section 3.12, Conjoining Jamo Behavior.
Normative Aliases
A normative character name alias is a formal, unique, and stable alternate name for a char-
acter. In limited circumstances, characters are given normative character name aliases
where there is a defect in the character name. These normative aliases do not replace the
character name, but rather allow users to refer formally to the character without requiring
the use of a defective name. For more information, see Section 4.8, Name.
Normative aliases which provide information about corrections to defective character
names or which provide alternate names in wide use for a Unicode format character are
printed in the character names list, preceded by a special symbol ※. Normative aliases
serving other purposes, if listed, are shown by convention in all caps, following an “=”.
Normative aliases of type “figment” for control codes are not listed. Normative aliases
which represent commonly used abbreviations for control codes or format characters are
shown in all caps, enclosed in parentheses. In contrast, informative aliases are shown in
lowercase. For the definitive list of normative aliases, also including their type and suitable
for machine parsing, see NameAliases.txt in the UCD.
FE18 ︘ presentation form for vertical right white lenticular brakcet
※ presentation form for vertical right white lenticular bracket
• misspelling of “BRACKET” in character name is a known defect
≈ <vertical> 3017 〗
Cross References
Cross references (preceded by →) are used to indicate a related character of interest, but
without indicating the exact nature of the relation. Cross references are most commonly
used to indicate a different character of similar or occasionally identical appearance, which
might be confused with the character in question. Cross references are also used to indicate
characters with similar names or functions, but with distinct appearances. Cross references
may also be used to show linguistic relationships, such as letters used for transliteration in
a different script. Some blocks start with a list of cross references that simply point to
related characters of interest in other blocks. Examples of various types of cross references
follow.
Explicit Inequality. The cross reference indicates that two (or more) characters are not
identical, although the representative glyphs that depict them are identical or very close in
appearance.
003A : colon
• also used to denote division or scale; for that mathematical use 2236 is
preferred
→ 0589 ։ armenian full stop
→ 05C3 ׃ hebrew punctuation sof pasuq
→ 2236 ∶ ratio
→ A789 ꞉ modifier letter colon
Related Functions. The cross reference indicates that two (or more) characters have simi-
lar functions, although the representative glyphs are distinct. See, for example, the cross
references to division slash, divides, and ratio in the names list entry for U+00F7 divi-
sion sign:
00F7 ÷ division sign
= obelus
• occasionally used as an alternate, more visually distinct version of 2212
or 2011 in some contexts
• historically used as a punctuation mark to denote questionable passages
in manuscripts
→ 070B ܋ syriac harklean obelus
→ 2052 ⁒ commercial minus sign
→ 2212 − minus sign
→ 2215 ∕ division slash
→ 2223 ∣ divides
→ 2236 ∶ ratio
→ 2797 ➗ heavy division sign
In addition to related mathematical functions, cross references may show other related
functions, such as use of distinct symbols in different phonetic transcription systems to
represent the same sounds. For example, the cross reference to U+0296 in the following
entry shows the IPA equivalent for U+01C1:
Case Mappings
When a case mapping corresponds solely to a difference based on small versus capital in
the names of the characters, the case mapping is not given in the names list but only in the
Unicode Character Database.
0041 A latin capital letter a
01F2 Dz latin capital letter d with small letter z
≈ 0044 D 007A z
When the case mapping cannot be predicted from the name, the casing information is
sometimes given in a note.
00DF ß latin small letter sharp s
= Eszett
• German
• not used in Swiss High German
• uppercase is “SS” or 1E9E ẞ
• typographically the glyph for this character can be based on a ligature of
017F ſ with either 0073 s or with an old-style glyph for 007A z (the latter
similar in appearance to 0292 ʒ). Both forms exist interchangeably today.
→ 03B2 β greek small letter beta
For more information about case and case mappings, see Section 4.2, Case.
Decompositions
The decomposition sequence (one or more letters) given for a character is either its canon-
ical mapping or its compatibility mapping. The canonical mapping is marked with an iden-
tical to symbol ≡.
00E5 å latin small letter a with ring above
• Danish, Norwegian, Swedish, Walloon
≡ 0061 a 030A ◌̊
212B Å angstrom sign
≡ 00C5 Å latin capital letter a with ring above
Compatibility mappings are marked with an almost equal to symbol ≈. Formatting infor-
mation may be indicated with a formatting tag, shown inside angle brackets.
01F2 Dz latin capital letter d with small letter z
≈ 0044 D 007A z
FF21 Ａ fullwidth latin capital letter a
≈ <wide> 0041 A
The following compatibility formatting tags are used in the Unicode Character Database:
In the character names list accompanying the code charts, the “<compat>” label is sup-
pressed, but all other compatibility formatting tags are explicitly listed in the compatibility
mapping.
Decomposition mappings are not necessarily full decompositions. For example, the
decomposition for U+212B Å angstrom sign can be further decomposed using the
canonical mapping for U+00C5 Å latin capital letter a with ring above. (For more
information on decomposition, see Section 3.7, Decomposition.)
Compatibility decompositions do not attempt to retain or emulate the formatting of the
original character. For example, compatibility decompositions with the <noBreak> for-
matting tag do not use U+2060 word joiner to emulate nonbreaking behavior; compati-
bility decompositions with the <circle> formatting tag do not use U+20DD combining
enclosing circle; and compatibility decompositions with formatting tags <initial>,
<medial>, <final>, or <isolate> for explicit positional forms do not use ZWJ or ZWNJ. The
one exception is the use of U+2044 fraction slash to express the <fraction> semantics of
compatibility decompositions for vulgar fractions.
In the character names list, each variation sequence for standardized variants is listed in
the entry for the base character for that sequence. In some cases a character may be associ-
ated with multiple variation sequences. A standardized variation sequence is identified in
the character names list with an initial swung dash “~”.
228A ⊊ subset of with not equal to
~ 228A FE00 with stroke through bottom members
The glyphs for most emoji variation sequences cannot be displayed by the font technology
used to print the code charts. In those cases, the glyphs for the standardized variation
sequences are omitted from the names list. Representative glyphs for both the colorful
emoji presentation style and the text style of all emoji variation sequences can be found in
the emoji charts section of the Unicode website.
Characters for which one or more standardized variants have been defined are displayed in
the code charts with a special convention: the code chart cell for such characters has a
small black triangle in its upper-right corner.
Characters which have one or more positional glyph variants, but no standardized variants
have a small open triangle in the upper-right corner of their code chart cell.
Blocks containing characters for which standardized variation sequences and/or positional
glyph variants are shown in the names list also have a separate summary listing at the end
of the block, displaying the variants in a large font size. Each entry in these summary list-
ings is shown as follows:
The list of standardized variation sequences in the character names list matches the list
defined in the data file StandardizedVariants.txt in the Unicode Character Database. Emoji
variation sequences are not included in these summary listings at the ends of blocks,
because of the limitations in font technology used for the code chart display. Ideographic
variation sequences defined in the Ideographic Variation Database are also not included.
See Section 23.4, Variation Selectors for more information.
Standardized Variation Sequences to select glyphs appropriate for display of CJK compati-
bility ideographs are shown not with the corresponding CJK unified ideograph, but rather
with the CJK compatibility ideograph defining the glyph to be selected. All CJK compatibil-
ity ideographs have a canonical decomposition to a CJK unified ideograph for historical
reasons. This means that direct use of CJK compatibility ideographs is problematical,
because they are not stable under normalization. To indicate that one of the compatibility
glyph shapes is desired, the indicated variation selector can be used with the CJK unified
ideograph. In the CJK Compatibility Ideographs and CJK Compatibility Supplement
blocks, the canonical decomposition and the relevant standardized variation sequence are
shown together with respective representative glyphs for the sources defined for the CJK
compatibility ideograph; see Figure 24-5.
Note that there are no indications of variation sequences in the charts for CJK unified ideo-
graphs. See the Ideographic Variation Database (IVD) for information on registered varia-
tion sequences for CJK unified ideographs.
Positional Forms
In cursive scripts which have contextually defined positional forms for letters, such as
Mongolian, the basic positional forms may appear in the charts as shown in Figure 24-1.
This example shows initial, medial, and final forms of a letter for Mongolian. For Mongo-
lian, such forms appear in the charts in the summary listings together with any entries for
standardized variation sequences. Note that the terminology “first form,” “second form,”
and so forth is specific to Mongolian. Identification of contextually defined positional
forms for letters in other scripts may use different terminology. As of Unicode 10.0, the
charts omit such forms for cursive scripts other than Mongolian, but such positional forms
may be added in future versions.
Mongolian currently uses script-specific variation selectors for the second and other forms
of Mongolian characters. Each form is selected by a combination of position in the word
and variation selector, if any, but there is no fixed association between a specific variation
selector and the name for a given form.
Block Headers
The code charts are segmented by the format tooling into blocks. (See Definition D10b in
Section 3.4, Characters and Encoding.) The page headers for the code charts are based on
the normative values of the Block property defined in Blocks.txt in the Unicode Character
Database, with a few exceptions. For example, the ASCII and Latin-1 ranges have their
block headers adjusted editorially to reflect the presence of C0 and C1 control characters in
those ranges. This means that the Block property value for the block associated with the
range U+0080..U+00FF is “Latin-1 Supplement”, but the block header used in the code
charts is “C1 Controls and Latin-1 Supplement”.
The start and end code points printed in the block headers in the code charts and character
names list reflect the ranges that are printed on that page, and thus should not be confused
with the normative ranges listed in Blocks.txt.
On occasion, the code chart format tooling also introduces artificial block headers to
enable the display of code charts for noncharacters that are outside the range of any nor-
mative block range. For example, the two noncharacters U+3FFFE..U+3FFFF are artifi-
cially displayed in a code chart with a block header “Unassigned”, showing a range
U+3FF80..U+3FFFF.
As a result of these and other editorial considerations, implementers are cautioned not to
attempt to pull block range values from the code charts, nor to attempt to parse them from
the NamesList.txt file in the Unicode Character Database. Instead, normative values for
block ranges and names should always be taken from Blocks.txt.
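Each data line of Blocks.txt has the form “0000..007F; Basic Latin”, with comment lines introduced by “#” and blank lines ignored. The following C sketch (illustrative only; the function name and buffer sizes are assumptions) parses one such line into a block range and name.

#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Parse one line of Blocks.txt. Returns 1 when a block range and name
   were read, 0 for comment lines, blank lines, or malformed input.
   Trailing whitespace is not trimmed. Illustrative sketch only. */
int parseBlockLine(const char *line, uint32_t *first, uint32_t *last,
                   char *name, size_t nameSize)
{
    char buffer[256];
    if (nameSize == 0)
        return 0;
    if (line[0] == '#' || line[0] == '\n' || line[0] == '\0')
        return 0;
    if (sscanf(line, "%" SCNx32 "..%" SCNx32 "; %255[^\n]",
               first, last, buffer) != 3)
        return 0;
    strncpy(name, buffer, nameSize - 1);
    name[nameSize - 1] = '\0';
    return 1;
}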
Subheads
The character names list contains a number of informative subheads that help divide up
the list into smaller sublists of similar characters. For example, in the Miscellaneous Sym-
bols block, U+2600..U+26FF, there are subheads for “Astrological symbols,” “Chess sym-
bols,” and so on. Such subheads are editorial and informative; they should not be taken as
providing any definitive, normative status information about characters in the sublists they
mark or about any constraints on what characters could be encoded in the future at
reserved code points within their ranges. The subheads are subject to change.
24.2 CJK Ideographs
To assist in reference and lookup, each CJK Unified Ideograph is accompanied by a repre-
sentative glyph of its Unicode radical and by its Unicode radical-stroke counts. These are
printed directly underneath the Unicode code point for the character. A radical-stroke
index to all of the CJK ideographs is also provided separately on the Unicode website.
Chart for the Main CJK Block. For the CJK Unified Ideographs block (U+4E00..U+9FFF)
the glyphs are arranged in the following order: G, H, and T sources are grouped under the
header “C.” J, K, and V sources are listed under their respective headers. Each row contains
positions for all six sources, and if a particular source is undefined for a given ideograph,
that position is left blank in the row. This format is illustrated by Figure 24-2. If a
character has a U source, it is shown at the H source position, unless other sources are
present, in which case it is shown below the H source position on a line by itself. Note that
this block does not contain any characters with M sources. The KP sources are not shown
due to lack of reliable glyph information.
Figure 24-2. CJK Chart Format for the Main CJK Block
(Figure content not reproduced: the example entry for U+4F1A shows the Unicode radical ⼈ and radical-stroke count 9.4 beneath the code point, with source glyphs labeled G0-3B61, H-894E, T3-2275, J0-3271, K2-216D, and V1-4B24.)
Charts for CJK Extensions. The code charts for all of the extension blocks for CJK Unified
Ideographs use a more condensed format. That format dispenses with the “C, J, K, V, and
H” headers and leaves no holes for undefined sources. For those blocks, sources are always
shown in the following order: G, T, J, K, KP, V, H, M, and U. The first letters of the source
information provide the source type for all sources except G. KP sources are omitted from
the code charts because of the lack of an appropriately vetted font for display.
The multicolumn code charts for CJK Extensions A and B use the condensed format with
three source columns per entry, and with entries arranged in three columns per page. An
entry may have additional rows, if it is associated with more than three sources, as illus-
trated in Figure 24-3 for CJK Extension A.
The multicolumn code charts for the CJK Unified Ideographs Extension B block
(U+20000..U+2A6DF), which are formatted like those for the Extension A block, have the
additional idiosyncrasy that the first source shown always corresponds to the “UCS2003”
representative glyph. Those representative glyphs were the only ones used up through Ver-
sion 5.1 of the standard for that block. The multicolumn code charts for the CJK Unified
Ideographs Extension B block were introduced in Version 5.2. This format is illustrated in
Figure 24-4.
The multicolumn code charts for the other extension blocks for CJK Unified Ideographs
use the condensed format with two source columns per entry, and with entries arranged in
four columns per page. An entry may have additional rows if it is associated with more than
two sources.
Compatibility Ideographs
The format of the code charts for the CJK Compatibility Ideograph blocks is largely similar
to the CJK chart format for Extension A, as illustrated in Figure 24-5. However, several
additional notational elements described in Section 24.1, Character Names List are used. In
particular, for each CJK compatibility ideograph other than the small list of unified ideo-
graphs included in these charts, a canonical decomposition is shown. The standardized
variation sequence for each CJK compatibility ideograph is listed below the canonical
decomposition, introduced with a tilde sign.
The twelve CJK unified ideographs in the CJK Compatibility Ideographs block have no
canonical decompositions or corresponding ideographic variation sequences; instead,
each is clearly labeled with an annotation identifying it as a CJK unified ideograph.
Character names are not provided for any CJK Compatibility Ideograph blocks because
the name of a compatibility ideograph simply consists of its Unicode code point preceded
by cjk compatibility ideograph-.
Appendix A
Notational Conventions
This appendix describes the typographic conventions used throughout this core specifica-
tion.
Code Points
In running text, an individual Unicode code point is expressed as U+n, where n is four to
six hexadecimal digits, using the digits 0–9 and uppercase letters A–F (for 10 through 15,
respectively). Leading zeros are omitted, unless the code point would have fewer than four
hexadecimal digits—for example, U+0001, U+0012, U+0123, U+1234, U+12345,
U+102345.
• U+0416 is the Unicode code point for the character named cyrillic capital
letter zhe.
The U+ may be omitted for brevity in tables or when denoting ranges.
A range of Unicode code points is expressed as U+xxxx–U+yyyy or U+xxxx..U+yyyy, where
xxxx and yyyy are the first and last Unicode values in the range, and the en dash or two dots
indicate a contiguous range inclusive of the endpoints. For ranges involving supplementary
characters, the code points in the ranges are expressed with five or six hexadecimal digits.
• The range U+0900–U+097F contains 128 Unicode code points.
• The Plane 16 private-use characters are in the range U+100000..U+10FFFD.
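As an implementation aside, and not itself a convention of this standard, the zero-padding rule maps directly onto ordinary formatted output in C; the function names in the following sketch are illustrative.

#include <inttypes.h>
#include <stdio.h>

/* Print a code point using the U+n convention: at least four uppercase
   hexadecimal digits, with additional digits only when needed. */
void printCodePoint(uint32_t cp)
{
    printf("U+%04" PRIX32, cp);
}

/* Print a contiguous range using the two-dot convention,
   for example U+0900..U+097F. */
void printCodePointRange(uint32_t first, uint32_t last)
{
    printf("U+%04" PRIX32 "..U+%04" PRIX32, first, last);
}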
Character Names
In running text, a formal Unicode name is shown in small capitals (for example, greek
small letter mu), and alternative names (aliases) appear in italics (for example, umlaut).
Italics are also used to refer to a text element that is not explicitly encoded (for example,
pasekh alef) or to set off a non-English word (for example, the Welsh word ynghyd).
For more information on Unicode character names, see Section 4.8, Name.
For notational conventions used in the code charts, see Section 24.1, Character Names List.
Character Blocks
When referring to the normative names of character blocks in the text of the standard, the
character block name is titlecased and is used with the term “block.” For example:
the Latin Extended-B block
Optionally, an exact range for the character block may also be cited:
the Alphabetic Presentation Forms block (U+FB00..U+FB4F)
These references to normative character block names should not be confused with the
headers used throughout the text of the standard, particularly in the block description
chapters, to refer to particular ranges of characters. Such headers may be abbreviated in
various ways and may refer to subranges within character blocks or ranges that cross char-
acter block boundaries. For example:
Sequences
A sequence of two or more code points may be represented by a comma-delimited list, set
off by angle brackets. For this purpose, angle brackets consist of U+003C less-than sign
and U+003E greater-than sign. Spaces are optional after the comma, and U+ notation
for the code point is also optional—for example, “<U+0061, U+0300>”.
When the usage is clear from the context, a sequence of characters may be represented with
generic short names, as in “<a, grave>”, or the angle brackets may be omitted.
In contrast to sequences of code points, a sequence of one or more code units may be rep-
resented by a list set off by angle brackets, but without comma delimitation or U+ notation.
For example, the notation “<nn nn nn nn>” represents a sequence of bytes, as for the UTF-
8 encoding form of a Unicode character. The notation “<nnnn nnnn>” represents a
sequence of 16-bit code units, as for the UTF-16 encoding form of a Unicode character.
Rendering
A figure such as Figure A-1 depicts how a sequence of characters is typically rendered.
A + ◌̈ → Ä
0041 0308
The sequence under discussion is depicted on the left of the arrow, using representative
glyphs and code points below them. A possible rendering of that sequence is depicted on
the right side of the arrow.
Miscellaneous
Phonemic transcriptions are shown between slashes, as in Khmer /khnyom/.
Phonetic transcriptions are shown between square brackets, using the International Pho-
netic Alphabet. (Full details on the IPA can be found on the International Phonetic Associ-
ation’s website, https://www.internationalphoneticassociation.org/.)
A leading asterisk is used to represent an incorrect or nonoccurring linguistic form.
In this specification, the word “Unicode” when used alone as a noun refers to the Unicode
Standard.
Unambiguous dates of the current common era, such as 1999, are unlabeled. In cases of
ambiguity, ce is used. Dates before the common era are labeled with bce.
The term byte, as used in this standard, always refers to a unit of eight bits. This corre-
sponds to the use of the term octet in some other standards.
Extended BNF
The Unicode Standard and technical reports use an extended BNF format for describing
syntax. As different conventions are used for BNF, Table A-1 lists the notation used here.
For more information about character classes, see Unicode Technical Standard #18, “Uni-
code Regular Expressions.”
Operators
Operators used in this standard are listed in Table A-3.
Appendix B
Unicode Publications and Resources
This appendix provides information about the Unicode Consortium and its activities, par-
ticularly regarding publications other than the Unicode Standard. The Unicode Consor-
tium publishes a number of technical standards and technical reports. Section B.2, Unicode
Publications describes the kinds of reports in more detail.
The Unicode website also has many useful online resources. Section B.3, Other Unicode
Online Resources, provides a guide to the kinds of information available.
Other Activities
Going beyond developing technical standards, the Unicode Consortium acts as registra-
tion authority for the registration of script identifiers under ISO 15924, and it has a techni-
cal committee dedicated to the maintenance of the Unicode Common Locale Data
Repository (CLDR). The repository contains a large and rapidly growing body of data used
in the locale definition for software internationalization. For further information about
these and other activities of the Unicode Consortium, visit:
http://www.unicode.org
Online Unicode Character Database. This page supplies information about the online
Unicode Character Database (UCD), including links to documentation files and the most
up-to-date version of the data files, as well as instructions on how to access any particular
version of the UCD.
http://www.unicode.org/ucd/
Online Unihan Database. The online Unihan Database provides interactive access to all
of the property information associated with CJK ideographs in the Unicode Standard.
http://www.unicode.org/charts/unihan.html
Pipeline. This page lists characters, standardized variation sequences, and named charac-
ter sequences which have reached some level of approval and/or are in international ballot-
ing, but which have not yet been published in a version of the Unicode Standard. The
pipeline provides some visibility about what characters will soon be in the standard.
http://www.unicode.org/alloc/Pipeline.html
Policies. These pages describe Unicode Consortium policies on stability, patents, and Uni-
code website privacy. The stability policies are particularly important for implementers,
documenting invariants for the Unicode Standard that allow implementations to be com-
patible with future and past versions.
http://www.unicode.org/policies/
References. This online page lists sources and up-to-date references for the Unicode Stan-
dard, as well as resources by script.
http://www.unicode.org/references/
Roadmap. This section of the Unicode website provides a roadmap for planning future
allocation of scripts and major blocks of symbols. The roadmap is organized by plane, and
provides information about the locations of published, approved, and proposed blocks,
often with links to current proposals. The roadmap provides the long term perspective on
future work by the encoding committees.
http://www.unicode.org/roadmaps/
Unicode Common Locale Data Repository (CLDR). Machine-readable repository, in
XML format, of locale information for use in application and system development.
http://www.unicode.org/cldr/
Updates and Errata. This page lists periodic updates with corrections of typographic
errors and new clarifications of the text.
http://www.unicode.org/errata/
Versions. This page describes the version numbering used in the Unicode Standard, the
nature of the Unicode character repertoire, and ways to cite and reference the Unicode
Standard, the Unicode Character Database, and Unicode Technical Reports. It also speci-
fies the exact contents of each and every version of the Unicode Standard, back to Unicode
1.0.0.
http://www.unicode.org/versions/
Where Is My Character? This page provides basic guidance to finding Unicode characters,
especially those whose glyphs do not appear in the charts, or that are represented by
sequences of Unicode characters.
http://www.unicode.org/standard/where/
Appendix C
Relationship to ISO/IEC 10646
The Unicode Consortium maintains a strong working relationship with ISO/IEC
JTC1/SC2/WG2, the working group developing International Standard 10646. Today both
organizations are firmly committed to maintaining the synchronization between the Uni-
code Standard and ISO/IEC 10646. Each standard nevertheless uses its own form of refer-
ence and, to some degree, separate terminology. This appendix gives a brief history and
explains how the standards are related.
C.1 History
Having recognized the benefits of developing a single universal character code standard,
members of the Unicode Consortium worked with representatives from the International
Organization for Standardization (ISO) during the summer and fall of 1991 to pursue this
goal. Meetings between the two bodies resulted in mutually acceptable changes to both
Unicode Version 1.0 and the first ISO/IEC Draft International Standard DIS 10646.1,
which merged their combined repertoire into a single numerical character encoding. This
work culminated in The Unicode Standard, Version 1.1.
ISO/IEC 10646-1:1993, Information Technology—Universal Multiple-Octet Coded Charac-
ter Set (UCS)—Part 1: Architecture and Basic Multilingual Plane, was published in May
1993 after final editorial changes were made to accommodate the comments of voting
members. The Unicode Standard, Version 1.1, reflected the additional characters intro-
duced from the DIS 10646.1 repertoire and incorporated minor editorial changes.
Merging The Unicode Standard, Version 1.0, and DIS 10646.1 consisted of aligning the
numerical values of identical characters and then filling in some groups of characters that
were present in DIS 10646.1, but not in the Unicode Standard. As a result, the encoded
characters (code points and names) of ISO/IEC 10646-1:1993 and The Unicode Standard,
Version 1.1, are precisely the same.
Versions 2.0, 2.1, and 3.0 of the Unicode Standard successively added more characters,
matching a series of amendments to ISO/IEC 10646-1. The Unicode Standard, Version 3.0, is
precisely aligned with the second edition of ISO/IEC 10646-1, known as ISO/IEC 10646-
1:2000.
In 2001, Part 2 of ISO/IEC 10646 was published as ISO/IEC 10646-2:2001. Version 3.1 of the
Unicode Standard was synchronized with that publication, which added supplementary
characters for the first time. Subsequently, Versions 3.2 and 4.0 of the Unicode Standard
added characters matching further amendments to both parts of ISO/IEC 10646. The Uni-
code Standard, Version 4.0, is precisely aligned with the third version of ISO/IEC 10646 (first
edition), published as a single standard merging the former two parts: ISO/IEC 10646:2003.
Versions 4.1 and 5.0 of the Unicode Standard added characters matching Amendments 1 and
2 to ISO/IEC 10646:2003. Version 5.0 also added four characters for Sindhi support from
Amendment 3 to ISO/IEC 10646:2003. Version 5.1 added the rest of the characters from
Amendment 3 and all of the characters from Amendment 4 to ISO/IEC 10646:2003. Version
5.2 added all of the characters from Amendments 5 and 6 to ISO/IEC 10646:2003. Version
6.0 added all of the characters from Amendments 7 and 8 to ISO/IEC 10646:2003.
In 2010, ISO approved republication of ISO/IEC 10646 as a second edition, ISO/IEC
10646:2011, consolidating all of the contents of Amendments 1 through 8 to the 2003 first
edition. The Unicode Standard, Version 6.0 is aligned with that second edition of the Inter-
national Standard, with the addition of U+20B9 indian rupee sign, accelerated into Ver-
sion 6.0 based on approval for the third edition of ISO/IEC 10646.
The Unicode Standard, Version 6.1 is aligned with the third edition of the International
Standard: ISO/IEC 10646:2012. The third edition was approved for publication without
an intervening amendment to the second edition. The Unicode Standard, Version 6.2 added
a single character, U+20BA turkish lira sign. Version 6.3 added five more characters,
including new bidirectional format controls.
The Unicode Standard, Version 7.0 is aligned with Amendments 1 and 2 to ISO/IEC
10646:2012. Those amendments include the six characters which were added in Version
6.2 and Version 6.3, as well as many others. Version 7.0 also includes U+20BD ruble sign,
accelerated into Version 7.0 based on approval for the fourth edition of ISO/IEC 10646.
The Unicode Standard, Version 8.0 is aligned with Amendment 1 of ISO/IEC 10646:2014,
the fourth edition of ISO/IEC 10646. Version 8.0 also includes U+20BE lari sign, nine
additional CJK unified ideographs, and 41 emoji characters, based on approval for Amend-
ment 2 to the fourth edition of ISO/IEC 10646.
The Unicode Standard, Version 9.0 is aligned with Amendments 1 and 2 to ISO/IEC
10646:2014, the fourth edition of ISO/IEC 10646. Version 9.0 also includes the Adlam
script (87 characters), the Newa script (92 characters), 72 additional emoji characters, 19
television symbols, two other pictographic symbols, and one other punctuation mark,
based on approval for the fifth edition of ISO/IEC 10646.
The Unicode Standard, Version 10.0 is aligned with ISO/IEC 10646:2017, the fifth edition of
ISO/IEC 10646. Version 10.0 also includes three other Zanabazar Square characters, 285
hentaigana, and 56 emoji characters, based on approval for Amendment 1 to the fifth edi-
tion of ISO/IEC 10646.
The Unicode Standard, Version 11.0 is aligned with Amendment 1 to ISO/IEC 10646:2017,
the fifth edition of ISO/IEC 10646. Version 11.0 also includes 46 Mtavruli Georgian capital
letters, 5 urgently needed CJK unified ideographs, and 66 emoji characters, based on
approval for Amendment 2 to the fifth edition of ISO/IEC 10646.
The Unicode Standard, Version 12.0 is aligned with Amendments 1 and 2 to ISO/IEC
10646:2017, the fifth edition of ISO/IEC 10646. Version 12.0 also includes U+1E94B
adlam nasalization mark and 61 emoji characters, based on approval for the sixth edi-
tion of ISO/IEC 10646.
Table C-1 gives the timeline for these efforts.
Unicode 1.0
The combined repertoire presented in ISO/IEC 10646 is a superset of The Unicode Stan-
dard, Version 1.0, repertoire as amended by The Unicode Standard, Version 1.0.1. The Uni-
code Standard, Version 1.0, was amended by the Unicode 1.0.1 Addendum to make the
Unicode Standard a proper subset of ISO/IEC 10646. This effort entailed both moving and
eliminating a small number of characters.
Unicode 2.0
The Unicode Standard, Version 2.0, covered the repertoire of The Unicode Standard, Version
1.1 (and IS 10646), plus the first seven amendments to IS 10646, as follows:
Amd. 1: UTF-16
Amd. 2: UTF-8
Amd. 3: Coding of C1 Controls
Amd. 4: Removal of Annex G: UTF-1
Amd. 5: Korean Hangul Character Collection
Amd. 6: Tibetan Character Collection
Amd. 7: 33 Additional Characters (Hebrew, Long S, Dong)
In addition, The Unicode Standard, Version 2.0, covered Technical Corrigendum No. 1 (on
renaming of AE ligature to letter) and such Editorial Corrigenda to ISO/IEC 10646 as
were applicable to the Unicode Standard. The euro sign and the object replacement char-
acter were added in Version 2.1, per amendment 18 of ISO/IEC 10646-1.
Unicode 3.0
The Unicode Standard, Version 3.0, is synchronized with the second edition of ISO/IEC
10646-1. The latter contains all of the published amendments to 10646-1; the list includes
the first seven amendments, plus the following:
Amd. 8: Addition of Annex T: Procedure for the Unification and Arrangement of CJK
Ideographs
Amd. 9: Identifiers for Characters
Amd. 10: Ethiopic Character Collection
Amd. 11: Unified Canadian Aboriginal Syllabics Character Collection
Amd. 12: Cherokee Character Collection
Amd. 13: CJK Unified Ideographs with Supplementary Sources (Horizontal Exten-
sion)
Amd. 14: Yi Syllables and Yi Radicals Character Collection
Amd. 15: Kangxi Radicals, Hangzhou Numerals Character Collection
Unicode 4.0
The Unicode Standard, Version 4.0, is synchronized with the third version of ISO/IEC
10646. The third version of ISO/IEC 10646 is the result of the merger of the second edition
of Part 1 (ISO/IEC 10646-1:2000) with the first edition of Part 2 (ISO/IEC 10646-2:2001)
into a single publication. The third version incorporates the published amendments to
10646-1 and 10646-2:
Amd. 1 (to part 1): Mathematical symbols and other characters
Amd. 2 (to part 1): Limbu, Tai Le, Yijing, and other characters
Amd. 1 (to part 2): Aegean, Ugaritic, and other characters
The third version of ISO/IEC 10646 also contains all the Editorial Corrigenda to date.
Unicode 5.0
The Unicode Standard, Version 5.0, is synchronized with ISO/IEC 10646:2003 plus its first
two published amendments:
Unicode 6.0
The Unicode Standard, Version 6.0, is synchronized with the second edition of ISO/IEC
10646. The second edition of the third version of ISO/IEC 10646 consolidates all of the
repertoire additions from the published eight amendments of ISO/IEC 10646:2003. These
include the first two amendments listed under Unicode 5.0, plus the following:
Amd. 3: Lepcha, Ol Chiki, Saurashtra, Vai, and other characters
Amd. 4: Cham, Game Tiles, and other characters
Amd. 5: Tai Tham, Tai Viet, Avestan, Egyptian Hieroglyphs, CJK Unified Ideographs
Extension C, and other characters
Amd. 6: Javanese, Lisu, Meetei Mayek, Samaritan, and other characters
Amd. 7: Mandaic, Batak, Brahmi, and other characters
Amd. 8: Additional symbols, Bamum supplement, CJK Unified Ideographs Extension
D, and other characters
One additional character, for the support of the new Indian currency symbol (U+20B9
indian rupee sign), was accelerated into Version 6.0, based on its approval for the third
edition of ISO/IEC 10646.
Unicode 7.0
The Unicode Standard, Version 7.0, is synchronized with the third edition of ISO/IEC
10646 plus its two published amendments:
Amd. 1: Linear A, Palmyrene, Manichaean, Khojki, Khudawadi, Bassa Vah,
Duployan, and other characters
Amd. 2: Caucasian Albanian, Psalter Pahlavi, Mahajani, Grantha, Modi, Pahawh
Hmong, Mende Kikakui, and other characters
One additional character, for the support of the new Russian currency symbol (U+20BD
ruble sign), was accelerated into Version 7.0, based on its approval for the fourth edition
of ISO/IEC 10646.
Unicode 8.0
The Unicode Standard, Version 8.0, is synchronized with the fourth edition of ISO/IEC
10646, plus its first published amendment:
Amd. 1: Cherokee supplement and other characters
An additional 51 characters were accelerated into Version 8.0, based on their approval for
Amendment 2 to the fourth edition of ISO/IEC 10646. These include U+20BE lari sign,
for the support of the new Georgian currency symbol, nine additional CJK unified ideo-
graphs, and 41 emoji characters.
Unicode 9.0
The Unicode Standard, Version 9.0, is synchronized with the fourth edition of ISO/IEC
10646, plus its two published amendments:
Amd. 1: Cherokee supplement and other characters
Amd. 2: Bhaiksuki, Marchen, Tangut and other characters
An additional 273 characters were accelerated into Version 9.0, based on their approval for
the fifth edition of ISO/IEC 10646. These include characters for the Adlam script and the
Newa script, 72 emoji characters, 19 television symbols, and one other punctuation mark.
Unicode 10.0
The Unicode Standard, Version 10.0, is synchronized with the fifth edition of ISO/IEC
10646.
An additional 344 characters were accelerated into Version 10.0, based on their approval
for Amendment 1 to the fifth edition of ISO/IEC 10646. These include three additional
characters for the Zanabazar Square script, 285 hentaigana characters, and 56 emoji char-
acters.
Unicode 11.0
The Unicode Standard, Version 11.0, is synchronized with the fifth edition of ISO/IEC
10646, plus its first published amendment:
Amd. 1: Dogra, Gunjala Gondi, Makasar, Medefaidrin, Indic Siyaq Numbers, and
other characters
An additional 117 characters were accelerated into Version 11.0, based on their approval
for Amendment 2 to the fifth edition of ISO/IEC 10646. These include Mtavruli uppercase
Georgian letters, five additional CJK unified ideographs, and 66 emoji characters.
Unicode 12.0
The Unicode Standard, Version 12.0, is synchronized with the fifth edition of ISO/IEC
10646, plus its two published amendments:
Amd. 1: Dogra, Gunjala Gondi, Makasar, Medefaidrin, Indic Siyaq Numbers, and
other characters
Amd. 2: Nandinagari, Georgian extension, and other characters
An additional 62 characters were accelerated into Version 12.0, based on their approval for
the sixth edition of ISO/IEC 10646. These include U+1E94B adlam nasalization mark
and 61 emoji characters.
The synchronization of The Unicode Standard, Version 12.0, with ISO/IEC 10646:2017
plus its published amendments means that the repertoire, encoding, and names of all char-
acters are identical between the two standards at those version levels, except for the 62
additional characters from the sixth edition which were accelerated for publication in the
Unicode Standard. All other changes to the text of 10646 that have a bearing on the text of
the Unicode Standard have been taken into account in the revision of the Unicode Stan-
dard.
C.2 Encoding Forms in ISO/IEC 10646
Zero Extending
The character “A”, U+0041 latin capital letter a, has the unchanging numerical value
41 hexadecimal. This value may be extended by any quantity of leading zeros to serve in the
context of the following encoding standards and transformation formats (see Table C-2).
This design eliminates the problem of disparate values in all systems that use either of the
standards and their transformation formats.
C.3 UTF-8 and UTF-16
UTF-8
The ISO/IEC 10646 definition of UTF-8 is identical to UTF-8 as described under Defini-
tion D92 in Section 3.9, Unicode Encoding Forms.
UTF-8 can be used to transmit text data through communications systems that assume
that individual octets in the range of x00 to x7F have a definition according to ISO/IEC
4873, including a C0 set of control functions according to the 8-bit structure of ISO/IEC
2022. UTF-8 also avoids the use of octet values in this range that have special significance
during the parsing of file name character strings in widely used file-handling systems.
UTF-16
The ISO/IEC 10646 definition of UTF-16 is identical to UTF-16 as described under Defi-
nition D91 in Section 3.9, Unicode Encoding Forms.
Appendix D
Version History of the Standard
This appendix provides version history of the standard. Updates to data files are docu-
mented in Unicode Standard Annex #44, “Unicode Character Database.”
The Unicode Technical Committee updates the Unicode Standard to respond to the needs
of implementers and users while maintaining consistency with ISO/IEC 10646. The rela-
tionship between these versions of Unicode and ISO/IEC 10646 is shown in Table D-1. For
more detail on the relationship of Unicode and ISO/IEC 10646, see Appendix C, Relation-
ship to ISO/IEC 10646.
Table D-2, Table D-3, Table D-4, and Table D-5 document the number of code points allo-
cated in the different versions of the Unicode Standard. For an explanation of the types of
characters which count as graphic or format in these summary statistics, see Table 2-3,
Types of Code Points. In each table, the row labeled “Graphic + Format”, shown in boldface,
represents the traditional count of Unicode characters and is the typical answer to the
question, “How many characters are in the Unicode Standard?” The numbers cited in the
row labeled “Han Compatibility” in each table include the 12 CJK unified ideographs
encoded in the CJK Compatibility Ideographs block.
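Readers who wish to reproduce such counts from the Unicode Character Database can tally the General_Category values in UnicodeData.txt. The following C sketch (illustrative only) groups code points as in Table 2-3, counting gc values L*, M*, N*, P*, S*, and Zs as graphic and Cf, Zl, and Zp as format; because it skips the First/Last range sentinels in that file, the large ideographic and Hangul syllable ranges are omitted, so its totals understate the published figures.

    #include <stdio.h>
    #include <string.h>

    /* Illustrative tally of code points listed individually in UnicodeData.txt,
       grouped as in Table 2-3: graphic (gc = L*, M*, N*, P*, S*, Zs) versus
       format (Cf, Zl, Zp). Range sentinel lines ("..., First>" / "..., Last>")
       are skipped for brevity. */
    int main(void) {
        FILE *f = fopen("UnicodeData.txt", "r");
        if (!f) { perror("UnicodeData.txt"); return 1; }

        char line[1024];
        long graphic = 0, format = 0;
        while (fgets(line, sizeof line, f)) {
            char *name = strchr(line, ';');          /* field 2: Name             */
            if (!name) continue;
            name++;
            char *gc = strchr(name, ';');            /* field 3: General_Category */
            if (!gc || !gc[1]) continue;
            gc++;
            if (strstr(name, ", First>") || strstr(name, ", Last>"))
                continue;                            /* skip range sentinel lines */
            if (strchr("LMNPS", gc[0]) || (gc[0] == 'Z' && gc[1] == 's'))
                graphic++;
            else if ((gc[0] == 'C' && gc[1] == 'f') ||
                     (gc[0] == 'Z' && (gc[1] == 'l' || gc[1] == 'p')))
                format++;
        }
        fclose(f);
        printf("graphic: %ld  format: %ld  graphic+format: %ld\n",
               graphic, format, graphic + format);
        return 0;
    }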
Table D-2 lists the allocation of code points by type for the earliest, historic versions of the
Unicode Standard up to the publication of Version 3.0. In some cases the values in
Table D-2 differ slightly from summary statistics published in those very early versions of
the standard, primarily due to a refined accounting of the allocations in Unicode 1.0.
Table D-3 lists the allocation of code points by type for intermediate versions of the Uni-
code Standard, up to the publication of Version 5.1.
Table D-4 lists the allocation of code points for more recent versions of the Unicode Stan-
dard, up to the publication of Version 7.0.
Table D-5 lists the allocation of code points for Versions 8.0 to 12.0 of the Unicode Stan-
dard.
Appendix E
Han Unification History
Efforts to standardize a comprehensive Han character repertoire go back at least as far as
the Eastern Han dynasty, when the important dictionary Shuowen Jiezi (121 CE) codified a
set of some 10,000 characters and variants, crystallizing earlier Qin dynasty initiatives at
orthographic reform. Subsequent dictionaries in China grew larger as each generation re-
combined the Shuowen script elements to create new characters. By the time the Qing
dynasty Kang Xi dictionary was completed in the 18th century, the character set had grown
to include more than 40,000 characters and variants. In relatively recent times many more
characters and variants have been created and catalogued, reflecting modern PRC simplifi-
cation and standardization initiatives, as well as ongoing inventories of legacy printed
texts.
The effort to create a unified Han character encoding was guided by the developing
national standards, driven by offshoots of the dictionary traditions just mentioned, and
focused on modern bibliographic and pedagogical lists of characters in common use in
various genres. Much of the early work to create national and transnational encoding stan-
dards was published in China and Japan in the late 1970s and early 1980s.
The Chinese Character Code for Information Interchange (CCCII), first published in Tai-
wan in 1980, identified a set of some 5,000 characters in frequent use in China, Taiwan,
and Japan. (Subsequent revisions of CCCII considerably expanded the set.) In somewhat
modified form, CCCII was adopted for use in the United States as ANSI Z39.64-1989, also
known as EACC, the East Asian Character Code For Bibliographic Use. EACC encoded
some 16,000 characters and variants, organized using a twelve-layer variant mapping
mechanism.
In 1980, Takahashi Tokutaro of Japan’s National Diet Library proposed ISO standardiza-
tion of a character set for common use among East Asian countries. This proposal included
a report on the first Japanese Industrial Standard for kanji coding (JIS C 6226-1978). Pub-
lished in January 1978, JIS C 6226-1978 was growing in influence: it encoded a total of
6,349 kanji arranged in two levels according to frequency of use, and approximately 500
other characters, including Greek and Cyrillic.
As a result of the agreement to merge the Unicode Standard and ISO/IEC 10646, the Uni-
code Consortium agreed to adopt the unified Han character repertoire that was to be
developed by the CJK-JRG.
The first CJK-JRG meeting was held in Tokyo in July 1991. The group recognized that
there was a compelling requirement for unification of the existing CJK ideographic charac-
ters into one coherent coding standard. Two basic decisions were made: to use GB 13000
(previously merged with the Unicode Han repertoire) as the basis for what would be
termed “The Unified Repertoire and Ordering,” and to verify the unification results based
on rules that had been developed by Professor Miyazawa Akira and other members of the
Japanese delegation.
The formal review of GB 13000 began immediately. Subsequent meetings were held in
Beijing and Hong Kong. On March 27, 1992, the CJK-JRG completed the Unified Reper-
toire and Ordering (URO), Version 2.0. This repertoire was subsequently published both
by the Unicode Consortium in The Unicode Standard, Version 1.0, Volume 2, and by ISO
in ISO/IEC 10646-1:1993.
In 2015, the IRG completed work on the sixth supplement to the URO, a collection of
approximately 7,500 characters from various sources. The Extension F collection was first
encoded in the Unicode Standard, Version 10.0.
The IRG finished work on the seventh supplement to the URO in late 2018. That supplement
includes submissions from China, SAT, South Korea, TCA, the United Kingdom, and the
United States, but it was completed too late for Extension G to be considered for Version 12.0.
Work on the eighth supplement, which also includes submissions from Vietnam, is already
underway.
Appendix F
Documentation of CJK Strokes
This appendix provides additional documentation regarding each of the CJK stroke char-
acters encoded in the CJK Strokes block (U+31C0..U+31EF). For a general introduction to
CJK characters and CJK strokes, see Section 18.1, Han.
The information in Table F-1 gives five types of identifying data for each CJK stroke. Each
stroke is also exemplified in a spanning lower row, with a varying number of examples, as
appropriate. The information contained in each of the five columns and in the examples
row is described more specifically below.
• Stroke: A representative glyph for each CJK stroke character, with its Unicode
code point shown underneath.
• Acronym: The abbreviation used in the Unicode character name for the CJK
stroke character.
• Pinyin: The Hanyu Pinyin (Modern Standard Chinese) romanization of the
stroke name (as given in the next column), hyphenated to make clear the rela-
tionship between the romanization of the stroke name and the acronym value.
• Name: A traditional name for this stroke, as written in Han characters.
• Variant: Alternative (context-specific) forms of the representative glyph for
this stroke, if any.
• Examples: Representative glyphs and variant forms of CJK unified ideographs,
exemplifying typical usage of this stroke type in Han characters. Each example
glyph (or variant) is followed by the Unicode code point for the CJK unified
ideograph character it represents, for easy reference.
The CJK stroke characters in the table are ordered according to the traditional “Five
Types.”
Table F-1. CJK Strokes (text rendering: only the acronym, pinyin, code point, and example code points are reproduced here; the representative glyphs, Han-character stroke names, and variant forms appear in the typeset table)

Acronym  Pinyin                Code Point
H        héng                  U+31D0
T        tí                    U+31C0
S        shù                   U+31D1
SG       shù-gōu               U+31DA
P        piě                   U+31D2
SP       shù-piě               U+31D3
D        diăn                  U+31D4
N        nà                    U+31CF
         Examples: 5927, 4EBA, 5929, (5165), 5C3A, 8D70, 662F, 8FB9, 5EF7
TN       tí-nà                 U+31DD
         Examples: 4E40, 5C10, (516B), (5165), (6587), 590A, (5EFB)
HZ       héng-zhé              U+31D5
HP       héng-piě              U+31C7
HG       héng-gōu              U+31D6
SZ       shù-zhé               U+31D7
SW       shù-wān               U+31C4
ST       shù-tí                U+31D9
PZ       piě-zhé               U+31DC
PD       piě-diăn              U+31DB
PG       piě-gōu               U+31E2
WG       wān-gōu               U+31C1
XG       xié-gōu               U+31C2
BXG      biăn-xié-gōu          U+31C3
         Examples: 5FC3 (心)
         Examples: 534D, 2067F
HZG      héng-zhé-gōu          U+31C6
         Examples: 200CC, 4E60, 7FBD, 5305, 52FB, 8461, 7528, 9752, 752B, 52FA, 6708, 4E5C, 4E5F
HZWG     héng-zhé-wān-gōu      U+31C8
SZZ      shù-zhé-zhé           U+31DE
         Examples: 200D1, 5350, 4E9E, 9F0E, 5433, 4E13, 279AE, 244F7, 249A1
HZZZ     héng-zhé-zhé-zhé      U+31CE
         Examples: 51F8, 21E2D
HZZP     héng-zhé-zhé-piě      U+31CB
HXWG     héng-xié-wān-gōu      U+31E0
HPWG     héng-piě-wān-gōu      U+31CC
SZWG     shù-zhé-wān-gōu       U+31C9
HZZZG    héng-zhé-zhé-zhé-gōu  U+31E1
         Examples: 2010E, 4E43, 5B55, 4ECD
Q        quān                  U+31E3
Index
The index covers the contents of this core specification. To find topics in the Unicode Stan-
dard Annexes, Unicode Technical Standards, and Unicode Technical Reports, use the
search feature on the Unicode website.
For definitions of terms used, see the glossary on the Unicode website. To find the code
points for specific characters or the code ranges for particular scripts, use the Character
Index on the Unicode website. (See Section B.3, Other Unicode Online Resources.)
B
Balinese . . . 683–688
Bamum . . . 767–768
Bangla . . . 472–478
base characters . . . 327
    definition . . . 106
    multiple . . . 59
    ordered before combining marks . . . 220, 327
Basic Multilingual Plane (BMP) . . . 1, 44
    allocation areas . . . 49
    representation in UTF-16 . . . 36
Basque . . . 294
Bassa Vah . . . 769
Batak . . . 694–695
benefits of Unicode . . . 1
Bengali . . . 472–478
Bhaiksuki . . . 575–576
Bidi Class (normative property) . . . 171
Bidi Mirrored (normative property) . . . 178
Bidi Mirroring Glyph (informative property) . . . 179
BidiMirroring.txt . . . 179
Bidirectional Algorithm, Unicode . . . 53, 84
bidirectional ordering . . . 20
    controls . . . 879
bidirectional text . . . 53, 84
    Middle Eastern scripts . . . 359
    nonspacing marks in . . . 223
    punctuation in . . . 261
big-endian . . . 40
    definition . . . 83
Bihari . . . 468
binary comparison and sort order
    caution for UTF-16 . . . 36
    UTF differences . . . 231, 233
    UTF-8 . . . 39
block . . . 45, 90, 255, 903
    headers . . . 916
BMP see Basic Multilingual Plane
BNF (Backus-Naur Form) . . . 923
BOCU-1 see UTN #6, BOCU-1: MIME-Compatible Unicode Compression
Bodhi . . . 521
Bodo . . . 467
BOM (U+FEFF) . . . 40, 67, 130–133, 893–895
Bopomofo . . . 727–729
boundaries, text . . . 61, 189, 217–218, 228
    see also UAX #14, Unicode Line Breaking Algorithm
    see also UAX #29, Unicode Text Segmentation
boustrophedon . . . 53, 349
box drawing symbols . . . 845
Brahmi . . . 445, 563, 565–568, 569, 631
Braille . . . 786–787
Breton . . . 294
Buginese . . . 681–682
Buhid . . . 678
Bulgarian . . . 313
bullets . . . 275
    numeric . . . 819
Burmese see Myanmar
Byelorussian . . . 313
byte order mark (BOM) (U+FEFF) . . . 40, 67, 130–133, 893–895
byte ordering
    changing . . . 81
    conformance . . . 83
byte serialization . . . 40, 67
Byzantine Musical Symbols . . . 794

C
C language
    wchar_t and Unicode . . . 200
C0 and C1 control codes . . . 31, 187, 868
Cambodian see Khmer
Canadian Aboriginal Syllabics . . . 778–779
candrabindu . . . 470, 600
canonical composite characters
    see canonical decomposable characters
canonical composition algorithm . . . 138
canonical decomposable characters
    definition . . . 118
canonical decomposition . . . 63
    definition . . . 117
    mappings . . . 116
canonical equivalence
    definition . . . 118
    nonspacing marks . . . 225
canonical equivalent character sequences
    conformance . . . 81
canonical mappings
    see canonical decomposition mappings
canonical ordering algorithm . . . 137
canonical precomposed characters
    see canonical decomposable characters
Cantonese . . . 710
capital letters . . . 164, 236, 287
Carian . . . 343
carriage return (U+000D) (CR) . . . 209, 869
carriage return and line feed (CRLF) . . . 209
case . . . 295
    and text processes . . . 12
    beyond ASCII . . . 237
    camelcase . . . 239
    case folding . . . 240
    case operations (conformance) . . . 85, 152–158
    case operations and normalization . . . 242
discussion list for Unicode . . . 930
Dogra . . . 627–628
Dogri . . . 467
Domino Tiles . . . 856
dotless i . . . 238, 291
dotted circle
    in code charts . . . 107, 328
    in fallback rendering . . . 222
    to indicate diacritic . . . 55
    to indicate vowel sign placement . . . 56
double diacritics . . . 114, 190, 329
Duployan . . . 798–799
Dutch . . . 293, 294
dynamic composition
    as Unicode design principle . . . 23
Dzongkha . . . 521

E
East Asian scripts . . . 701–750
    writing direction . . . 53
    see also CJK ideographs
Eastern Arabic-Indic digits . . . 371
EBCDIC
    newline function . . . 210
    not used in Unicode . . . 1, 4
editing, text boundaries for . . . 217–218
efficiency
    as Unicode design principle . . . 15
Egyptian hieroglyphs . . . 434–439
    format controls . . . 436–439
Elbasan . . . 353
ellipsis . . . 273–274
Elymaic . . . 422
e-mail discussion list for Unicode . . . 930
emoji . . . 849, 930
    animal symbols . . . 853
    charts . . . 930
    cultural symbols . . . 853
    zodiacal symbols . . . 853
emoji modifiers . . . 853
emoticons . . . 853
Enclosed Alphanumerics . . . 863
enclosing marks . . . 335
    definition . . . 107
encoded characters . . . 7, 29
    allocation . . . 44–52
    definition . . . 92
encoding form conversion
    definition . . . 127
encoding forms . . . 33–39
    ISO/IEC 10646 definitions . . . 942
encoding forms, Unicode
    see Unicode encoding forms
encoding model for Unicode characters . . . 33, 42
    see also UTR #17, Unicode Character Encoding Model
encoding schemes . . . 40–43
encoding schemes, Unicode
    see Unicode encoding schemes
endian ordering
    see byte order mark (BOM) (U+FEFF)
end-user subarea . . . 889
English . . . 293
equivalent sequences . . . 206
    as Unicode design principle . . . 23
    case-insensitivity . . . 231, 240
    combining characters in matching . . . 219
    conformance . . . 82
    Hangul syllables . . . 737
    in sorting and searching . . . 230
    language-specific . . . 118
    security implications . . . 245
    see also canonical equivalence
    see also compatibility equivalence
    see also encoding forms, encoding schemes
errata . . . xxvi, 76, 931
escape sequences . . . 868
Esperanto . . . 294
Estonian . . . 294
Ethiopic . . . 752–755
Etruscan . . . 345
European scripts . . . 287–336
    ancient . . . 337–357
eyelash-RA . . . 459

F
fallback rendering . . . 252
    of nonspacing marks . . . 222
FAQ (Frequently Asked Questions) . . . 930
Faroese . . . 293
Farsi . . . 367, 370
featural syllabaries . . . 257
FF (U+000C form feed) . . . 209, 869
file separator (U+001C) . . . 869
Finnish . . . 293
Finno-Ugric Transcription (FUT)
    see Uralic Phonetic Alphabet (UPA)
fixed-width Unicode encoding form (UTF-32) . . . 35, 124
flat tables . . . 196
Flemish . . . 293
fleurons . . . 855
fonts
    and Unicode characters . . . 16
    for mathematical alphabets . . . 813–815
    style variation for symbols . . . 803
T
tab (U+0009 character tabulation) . . . 869
tab, vertical (U+000B) . . . 209, 869
tables of character data . . . 196–197
    optimization . . . 197
    supplementary characters . . . 197
tag characters . . . 898–902
Tagalog . . . 678
Tagbanwa . . . 678
tags, language . . . 215, 898–902
    use strongly discouraged . . . 901
Tai Laing
    digits . . . 818
Tai Le . . . 657–658
Tai Tham . . . 662–664
    digits . . . 818
Tai Viet . . . 665–667
Tai Xuan Jing symbols . . . 859
Takri . . . 602–603
Tamil . . . 489–498
Tangut . . . 748–750
    components . . . 749–750
    radicals . . . 749
tashkil . . . 368
tashkil, harakat, points . . . 370
TCHAR in Win32 API . . . 200
Technical Reports (UTR) . . . 929
Technical Standards (UTS) . . . xxvi, 929
    abstracts . . . 930
technical symbols . . . 840–844
Telugu . . . 499–501
terminal emulation . . . 804
text boundaries . . . 61, 189, 217–218, 228
    see also UAX #14, Unicode Line Breaking Algorithm
    see also UAX #29, Unicode Text Boundaries
text elements . . . 6, 10, 217
    boundaries . . . 228
    for sorting . . . 230
    variable-width nature . . . 38
text processes . . . 6, 10–13
text rendering . . . 6, 10, 17
text selection, boundaries for . . . 217–218
Thaana . . . 515–516
Thai . . . 631–634
Tibetan . . . 521–531
Tifinagh . . . 757
Tigre . . . 752
tilde (U+007E) . . . 276
Tirhuta . . . 613–615
titlecase . . . 164, 236
Todo . . . 533
tone letters . . . 325–326
tone marks
    Bopomofo spacing . . . 727, 728
    Chinantec . . . 326
    Chinese . . . 326
    Tai Le . . . 657
    Thai . . . 631
    Vietnamese . . . 292
traditional Chinese . . . 709
traffic signs . . . 851
trailing surrogates
    see low-surrogate code units
transcoding . . . 196–197
    tables . . . 196
Transport and Map Symbols . . . 854
triangulation in transcoding . . . 196
tries . . . 196
truncation
    combining character sequences . . . 220–221
    surrogates and . . . 204
Turkish . . . 294
    case mapping of I . . . 238, 291
    cedilla . . . 291
    lira sign . . . 807
two-stage tables . . . 197

U
U+ notation . . . 924
U+10FFFF (not a character code) . . . 891
U+FEFF (BOM) . . . 893–895
U+FFFE (not a character code) . . . 892
U+FFFF (not a character code) . . . 891
UAX (Unicode Standard Annex) . . . xxiv, 929
    as component of Unicode Standard . . . 79
    conformance . . . 85
    list of . . . 85
UCA see Unicode Collation Algorithm and see also UTS #10, Unicode Collation Algorithm
UCD see Unicode Character Database
UCS (Universal Character Set)
    see ISO/IEC 10646
UCS-2 . . . 942
UCS-4 . . . 942
Ugaritic . . . 432
Ukrainian . . . 313
unassigned code points . . . 30, 79, 201
    defined as reserved code points . . . 93
    handling . . . 74
    properties of . . . 97
    semantics . . . 79
    see also reserved code points
underscores . . . 273
undesignated code points . . . 30
Unicode 1.0 Name (informative property) . . . 187
W
Wancho . . . 561
Warang Citi . . . 549
wchar_t
    and Unicode encoding forms . . . 38
    in C language . . . 200
weak directional characters . . . 171
weather symbols . . . 851
website, Unicode Consortium . . . 930
Weierstrass elliptic function symbol . . . 810
well-formed
    definition . . . 122
Welsh . . . 294
Where Is My Character? . . . 932
wide characters
    data type in C . . . 200
wiggly fence (U+29DB) . . . 836
Windows newline function . . . 210
word breaks . . . 219, 871–873
    in South Asian scripts . . . 633, 641, 656
word joiner (U+2060) . . . 871
writing direction see directionality
writing systems . . . 256–260
Wu (Shanghainese) . . . 710

X
Xibe . . . 533
Xishuangbanna Dai . . . 659

Y
Yi . . . 739–741
Yiddish . . . 361
Yijing Hexagram Symbols . . . 858
ypogegrammeni . . . 304

Z
Zanabazar Square . . . 585–587
Zapf Dingbats . . . 854
zero extension relation among encodings . . . 942
zero width joiner (U+200D) . . . 369–370, 874
zero width no-break space (U+FEFF) . . . 67, 83, 871
    initial . . . 133, 894
zero width non-joiner (U+200C) . . . 369–370, 875
zero width space (U+200B) . . . 872
    for word breaks in South Asian scripts . . . 633, 641, 656
zero-width space characters . . . 872
About this Publication
The main text of this book is set in Minion 3, designed by Robert Slimbach at Adobe Systems
Incorporated. The main text was typeset using Adobe FrameMaker 2015 running under
Windows 10. Figures were created with Adobe Illustrator. The code charts were produced
with Unibook chart formatting software supplied by ASMUS, Inc. The text of the character
names list is set in Myriad, designed by Carol Twombly and Robert Slimbach at Adobe Sys-
tems Incorporated.
For acknowledgements of font contributors, see: http://www.unicode.org/charts/fonts.html