Unicode - ARABIC SCRIPT TUTORIAL
1 INTRODUCTION
Like all multi-lingual computing, Arabic computing is now firmly in the domain of
Unicode. Unicode is an industrial protocol with the status of an international agreement. It
is designed to encode the elements of all known script systems in such a way that they
become interchangeable between programs and operating systems. Its implementation is
well underway.
Unicode eliminates the need to tamper with fonts to get special characters, but it is not a
font. For legible text on screen and paper, Unicode depends on compatible fonts with the
required characters, where necessary with additional dedicated font technology.
Thomas Milo – Arabic script Tutorial
Similar letters b t ṯ
Similar letters ǧ ḥ ḫ
Similar letters d ḏ r z
Similar letters s š ṣ ḍ ṭ ẓ
Similar letters ʿ ġ f q
Historical group k l m n
Rest h w y
Traditionally, a repetition of the vowel marks is used at the end of a word to indicate that
the indefinite article /-n/ is attached to the vowel:
Unicode deals with repeated vowel markers as if they are separate characters. This is a
legacy from the metal typesetting era, when it was impossible to compose such minute
superscript or subscript groups:
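The point about repeated vowel markers can be checked against the Unicode character database. The following is a minimal sketch using Python's standard `unicodedata` module; the table name `TANWIN` is only an illustrative choice. It shows that each doubled mark has its own code point and no canonical decomposition into two single marks.

```python
import unicodedata

# The three doubled vowel marks and the single marks they visually repeat.
TANWIN = {
    "\u064B": "\u064E",  # FATHATAN vs FATHA
    "\u064C": "\u064F",  # DAMMATAN vs DAMMA
    "\u064D": "\u0650",  # KASRATAN vs KASRA
}

for double, single in TANWIN.items():
    print(f"U+{ord(double):04X} {unicodedata.name(double)}")
    # An empty string means there is no decomposition mapping at all.
    print("  decomposition:", repr(unicodedata.decomposition(double)))
    # Two single marks in a row are never composed into the doubled mark.
    same = unicodedata.normalize("NFC", single + single) == double
    print("  NFC(single + single) == double:", same)
```

In other words, a text that spells the ending with two single marks and a text that uses the doubled mark are different strings for searching and comparison, even when a font renders them alike.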
d. direction of writing
M → H → M → D
D ← M ← H ← M
MHMD
(pronounced: muḥammad)
For technical and pedagogical reasons, there is a strong tendency to eliminate or simplify
the connectivity of Arabic script; still even the simplest fonts maintain a minimal degree
of connection between letters. This approach removes from Arabic script its synthetic,
ideographic quality and turns it back into the analytic alphabet from which it evolved:
MHMD
Markers have a distinctly graphemic function. They combine with various skeletons to
form other letters, e.g. the dot-above is used by eight Arabic letters:
Other unmarked skeletons by themselves are already meaningful letters that differ from
the ones characterized by a marker, e.g.:
Con: For scholarly work, the merger of skeleton and marker denies the evolutionary
stages of the script, where the use of markers was casual, in a way similar to the use of
vowels. Therefore, modern industrial encoding as inherited by Unicode has the
disadvantage that:
- IT MISREPRESENTS HISTORICAL USAGE
In manuscripts and even in older prints, markers are often incomplete or unreliable:
because markers were secondary, often redundant elements;
because markers were added later to interpret or eliminate ambiguities;
or because double markers sometimes co-exist to maintain original ambivalence.
4 ARCHIGRAPHEMES
ʿabdu “servant”
ʿīd “feast”
Transliteration Shape
EBD
(capitals are used to represent
indeterminate graphemes)
In this kind of spelling the skeletons are not “defective” graphemes, but valid
archigraphemes. An archigrapheme is the common element(s) between two or more
graphemes, minus the marker(s) that disambiguate them. The majority of historic texts
are written with archigraphemes.
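The definition can be illustrated with a small sketch that reduces a transliterated consonant string to its archigraphemes. The table `ARCHIGRAPHEME` and the function `skeleton` are hypothetical names, and the table is deliberately partial: only the capitals E, B and D occur in the example above, and which letters are grouped under them here is an assumption made for illustration (n and y share the B skeleton only in non-final position).

```python
import unicodedata

# Hypothetical, partial table: transliterated grapheme -> archigrapheme.
ARCHIGRAPHEME = {
    "\u02bf": "E",  # ʿ
    "\u0121": "E",  # ġ
    "b": "B",
    "t": "B",
    "\u1e6f": "B",  # ṯ
    "n": "B",       # assumption: non-final position only
    "y": "B",       # assumption: non-final position only
    "d": "D",
    "\u1e0f": "D",  # ḏ
}

def skeleton(consonants):
    """Replace every grapheme by its archigrapheme; '?' marks unknown letters."""
    text = unicodedata.normalize("NFC", consonants)
    return "".join(ARCHIGRAPHEME.get(ch, "?") for ch in text)

print(skeleton("\u02bfbd"))  # consonants of ʿabdu
print(skeleton("\u02bfyd"))  # consonants of ʿīd
```

Both calls give the same skeleton, EBD, which is why an unmarked historical spelling leaves the choice between the two readings to the reader.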
Unicode does not – yet – have the data structure to deal with archigraphemes and
discrete markers as meaningful text elements.
5 GRAPHEMES
y w h n m l k q f ġ ʿ ẓ ṭ ḍ ṣ š s z r ḏ d ḫ ḥ ǧ ṯ t b a
This inconsistency is not a feature of the Arabic writing system, but a consequence of the
legacy approach adopted by Unicode. Accepting all graphemic markers as independent
secondary characters with their own code points would make these cases unambiguous.
The template for this solution already exists: in the latest version of the Unicode Standard,
the combination of composition elements ALEF and HAMZA ABOVE has been declared
canonically equivalent to the legacy pre-composed grapheme ALEF WITH HAMZA ABOVE:
U+0627 ARABIC LETTER ALEF + U+0654 ARABIC HAMZA ABOVE
= U+0623 ARABIC LETTER ALEF WITH HAMZA ABOVE
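This canonical equivalence can be verified with any Unicode normalization library. The sketch below uses Python's standard `unicodedata` module; the helper name `canonically_equivalent` is an illustrative choice, not part of any standard API.

```python
import unicodedata

composed = "\u0623"        # ARABIC LETTER ALEF WITH HAMZA ABOVE
sequence = "\u0627\u0654"  # ARABIC LETTER ALEF + ARABIC HAMZA ABOVE

def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

print(canonically_equivalent(composed, sequence))
# The decomposition mapping recorded for the pre-composed letter:
print(unicodedata.decomposition(composed))
# Composition (NFC) turns the two-element sequence back into one code point:
print(unicodedata.normalize("NFC", sequence) == composed)
```

Software that normalizes before comparing will therefore treat the two spellings as the same text, which is the behaviour the paragraph above proposes extending to all graphemic markers.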
6 WRITING ARABIC
Here are two additional aspects of Arabic script that have consequences for rendering
systems:
a. horizontal and vertical connections
The traditional connection is still reflected in a number of ligatures.
traditional assimilation modified assimilation
ﺣﺤﺢ
b. unstable spelling caused by changing font technology
Spelling and font technology have influenced each other ever since computer technology
for Arabic script first emerged. The rapid development of font technology has had the
unintended result that different fonts may require different spellings for the same
printed image. For instance, most fonts cannot deal with al-lāhu, “God”:
correct data structure, wrong image:
ALEF-FATHA LAM-LAM-SHADDA-FATHA-HEH-DAMMA
wrong data structure, wrong vowel image:
ALEF-FATHA LAM-LAM-HEH-DAMMA
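To make the difference between the two spellings of al-lāhu concrete, the sketch below builds both code point sequences from their Unicode character names and compares them. The helper `spell` and the variable names `full` and `reduced` are illustrative; the character names are the standard ones.

```python
import unicodedata

def spell(*names):
    """Build a string from a sequence of Unicode character names."""
    return "".join(unicodedata.lookup(name) for name in names)

ALEF, LAM, HEH = "ARABIC LETTER ALEF", "ARABIC LETTER LAM", "ARABIC LETTER HEH"
FATHA, DAMMA, SHADDA = "ARABIC FATHA", "ARABIC DAMMA", "ARABIC SHADDA"

# Spelling with the complete vowel marks.
full = spell(ALEF, FATHA, LAM, LAM, SHADDA, FATHA, HEH, DAMMA)
# Spelling with SHADDA and FATHA left out.
reduced = spell(ALEF, FATHA, LAM, LAM, HEH, DAMMA)

for label, text in (("full", full), ("reduced", reduced)):
    print(label, " ".join(f"U+{ord(ch):04X}" for ch in text))

# Normalization does not map one spelling onto the other.
print(unicodedata.normalize("NFC", full) == unicodedata.normalize("NFC", reduced))
```

Because the two strings differ as data, a document written for one font's requirements will not match a search or comparison against a document written for another's.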
A related phenomenon occurs when older font technology cannot handle the
combination of ligatures and vowels, forcing the users into systematically misspelling
words, e.g., the word al-islāmu “Islam”:
[Two samples of al-islāmu, not recoverable here: (1) correct data structure, wrong image;
(2) wrong data structure, approximate image]
For comparison, the correct image representing the above data structures:
complete vowels incomplete and misplaced vowels
a. font technology
A font is an industrial product designed to enable handling Arabic with technology that is
not designed for Arabic. In the design process, Arabic is an object that can be adapted at
will: corners can be cut and rules can be broken. The resulting script can be seen as an
“innovation”.
ِ ْ َ بتـــثب
یتینـــــــ ِ ْ َ ِ یتین ِ ْ َِ
ِ ْ َ بتثب
b. script analysis and synthesis
The term script synthesis describes the effort to analyze and synthesize traditional
calligraphic styles or high quality typesetting systems. In this approach Arabic is the
subject whose integrity needs to be preserved when it is reproduced in digital form. Here
the underlying technology is the innovation.
a. what to encode
Unicode uses a model resulting from earlier conferences about Middle Eastern
computing: contextual shapes of one and the same letter are all attributed to a single
nominal text code. This is the graphemic model:
(the GOAL variants of HEH and TEH MARBUTAH are also calligraphy-based mismatches)
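The graphemic model is visible in the compatibility mappings of the presentation forms: each positional shape points back to one nominal letter. The sketch below uses Python's standard `unicodedata` module on the initial, medial and final forms of HAH; the variable names are illustrative.

```python
import unicodedata

# HAH in initial, medial and final presentation form.
hah_forms = "\uFEA3\uFEA4\uFEA2"

for ch in hah_forms:
    # The decomposition field names the positional form and the nominal letter.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), "->", unicodedata.decomposition(ch))

# Compatibility normalization folds all three shapes into the nominal code.
nominal = unicodedata.normalize("NFKC", hah_forms)
print([f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in nominal])
```

All three shapes reduce to U+062D ARABIC LETTER HAH; choosing the contextual shape is left to the rendering system and the font.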
1 bharam khul ǧāʾē ẓālim tērē qāmat kī darāzi kā - agar us tura ē pur pēč ū ḫam kā pēč u ḫam niklē
“O tyrant, the mistake about the tallness of your figure will be rectified - if the curls and twists of your hair
full of curls and twists are straightened out” (Ġālib, quoted in Finn Thiesen, A Manual of Classical Persian
Prosody with chapters on Urdu, Karakhanidic and Ottoman prosody, Wiesbaden 1982, p. 188)
10 BASIC LAY-OUT
b. Graphemic: Only word-separating spaces and final forms are valid line breaking points:
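The graphemic rule can be sketched as a function that lists the permitted break positions in a string. This is only an approximation under stated assumptions: a "final form" is taken to mean the end of a connected letter group, which inside a word occurs after a letter that never joins to its successor, and the set `RIGHT_JOINING` is a hand-picked subset of such letters rather than the complete joining data of the Unicode Standard. The function name `break_points` is hypothetical.

```python
import unicodedata

# Assumption: partial list of letters that never connect to the following
# letter (ALEF and its variants, TEH MARBUTA, DAL, THAL, REH, ZAIN, WAW).
RIGHT_JOINING = set(
    "\u0622\u0623\u0624\u0625\u0627\u0629\u062F\u0630\u0631\u0632\u0648"
)

def break_points(text):
    """Return every index i such that a line may break before text[i]."""
    points = []
    previous = None  # most recent character that is not a combining mark
    for i, ch in enumerate(text):
        if unicodedata.combining(ch):
            continue  # vowel marks stay attached to their base letter
        after_space = previous == " "
        after_final = previous in RIGHT_JOINING
        if ch != " " and (after_space or after_final):
            points.append(i)
        previous = ch
    return points

# nadīm ʿarab: NOON DAL YEH MEEM, space, AIN REH BEH
print(break_points("\u0646\u062F\u064A\u0645 \u0639\u0631\u0628"))
```

A word such as MHMD, which forms a single connected group, yields no internal break position under this rule, so it can only move to the next line as a whole.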
2 The sample (repeated in the text columns) illustrates the spelling evolution in Arabic, as well as the
complete phonological, lexical and orthographic integration of Arabic words in Uyghur (spoken in China):
Arabic: muḥammad ʿabdu l-lāh nadīm ʿarab miṣrī;
Turkic: muhämmäd abdullah nadim äräb mısırlıq
(Mohammed, Abdallah, Nadeem [personal names], and “Arab”, “Egyptian” – from Arabic miṣr, “Egypt”)
11 LANGUAGES
Languages written with the Arabic script [millions of speakers]3
Arabic [221m]
Qurʾānic Arabic
Classical Arabic
Modern Standard Arabic
Colloquial Arabic dialects
Algerian [22m]
Baharna (Bahrain, Oman)
Chadian
Dhofari (Oman)
Egyptian [46m]
Hadrami (East Yemen, Oman)
Hassaniyya [2.6m] (Mauritania)
Hijazi (KSA)
Judeo-Iraqi (Israel)
Judeo-Moroccan
Judeo-Tripolitanian (Lebanon)
Judeo-Tunisian
Judeo-Yemeni (Yemen, Israel)
Libyan
Mesopotamian [14m] (Iraq, Iran, Syria)
Moroccan / Maghrebi [19.5m]
Najdi [10m] (Saudi Arabia, Iraq, Jordan, Syria)
North Levantine [15m] (Lebanon, Syria)
North Mesopotamian
Omani
Saidi [19m] (Egypt)
Sanaani (North Yemeni)
Shihhi (UAE)
South Levantine
Sudanese [19m] (Sudan)
Ta'izzi-Adeni (South Yemeni)
Tunisian
Indo-Aryan
Kurdish / Kurmanji / Northern Kurdish [26m]
Several of the Kurdish-specific letters in Unicode have no
corresponding positional forms in the PRESENTATION blocks
3
This is a rough compilation that does not distinguish between current and historical use of the Arabic
script; numbers of speakers have not been verified.
Sources: http://en.wikipedia.org; http://www.omniglot.com; http://www.travelphrases.info/fonts.html
Persian
Persian / Western Farsi (Persian of Iran) [70m]
Dari / Eastern Farsi (Persian of Afghanistan) [7m]
Tajiki (Persian of Tajikistan and Afghanistan) [4.4m]
Pashto / Afghan [27m]
alias: Pathan, Pushto, Pashtoe, Pashtu, and Pukhto
Western Balochi / Baluchi (Balochistan: Pakistan, Iran, and Afghanistan;
Turkmenistan, the Arab countries of the Gulf, and Kenya)
Urdu [104m]
Kalami (Pakistan)
Punjabi, Lahnda (Pakistan)
Sindhi [9m] (Pakistan, Sind province, India)
Parkari (Pakistan)
Kashmiri / kashur [4.5m] (India, Pakistan, China, UK)
Saraiki / Multani / Derawali / Western Punjabi (Pakistan)
Pathwari (Pakistan)
Rajasthani (India)
Turkic
Uyghur [7.6m] (China)
Turkmen [6.4m] (Turkmenistan, Afghanistan, Germany, Iran, Iraq, Kazakhstan,
Kyrgyzstan, Pakistan, Russia, Tajikistan, Turkey, USA and Uzbekistan)
Kazakh [8m] (Kazakhstan, Russia and China)
Kyrgyz [1.5m] (Kyrgyzstan, China)
Turkish / Osmanli
Chagatai
Tatar [7m] (Russian Republic of Tatarstan, and also in Afghanistan, Azerbaijan,
Belarus, China, Estonia, Finland, Georgia, Kazakhstan, Kyrgyzstan, Latvia,
Lithuania, Moldova, Tajikistan, Turkey (Europe), Turkmenistan, Ukraine, USA
and Uzbekistan)
African
Hausa / Ajami [39m]
Swahili / Kiswahili (Zanzibar, Tanzania - official, Kenya - official, Malawi,
Mozambique, E. Congo, Uganda, Rwanda, Burundi, Somalia, S Ethiopia.)
Mandinka [1.2m] (Senegal, Gambia (main language), Guinea-Bissau)
Wolof [6.7m] (Senegal - main language, Gambia, Mauritania)
Comorian (Comoros Islands)
Maba [0.25m] (Africa)
SE Asia
Malay / Jawi [18m] (Brunei - co-official script, Malaysia, Indonesia, Singapore,
Thailand) Malay written in Arabic is called Jawi.
Caucasian
Dargwa [2.5m] (Russian Republic of Dagestan)
European
Morisco (Spanish)
Bosnian (Serbian)
Ukrainian
Afghanistan, Algeria, Bahrain, Chad, China, Cyprus, Djibouti, Egypt, Eritrea, Iran, India,
Iraq, Israel, Jordan, Kenya, Kuwait, Lebanon, Libya, Mali, Mauritania, Morocco, Niger,
Oman, Palestinian West Bank & Gaza, Qatar, Saudi Arabia, Somalia, Sudan, Syria,
Tajikistan, Tanzania, Tunisia, Turkey, UAE, Uzbekistan and Yemen.