
UTF-32: Difference between revisions

{{Short description|Encoding Unicode characters as 4 bytes per code point}}
'''UTF-32''' (32-[[bit]] [[Unicode transformation format|Unicode Transformation Format]]), sometimes called UCS-4, is a fixed-length [[Character encoding|encoding]] used to encode Unicode [[code point]]s that uses exactly 32 bits (four [[byte]]s) per code point (the leading 11 bits must be zero, as there are far fewer than 2<sup>32</sup> Unicode code points and only 21 bits are actually needed).<ref name="4_or_3_bytes" /> In contrast, all other Unicode transformation formats are variable-length encodings. Each 32-bit value in UTF-32 represents one Unicode code point and is exactly equal to that code point's numerical value.
 
The main advantage of UTF-32 is that the Unicode code points are directly indexed. Finding the ''Nth'' code point in a sequence of code points is a [[constant time|constant-time]] operation. In contrast, a [[variable-length code]] requires [[linear time]] to count ''N'' code points from the start of the string. This makes UTF-32 a simple replacement in code that uses [[Integer|integers]] that are incremented by one to examine each location in a [[String (computer science)|string]], as was commonly done for [[ASCII]]. However, Unicode code points are rarely processed in complete isolation: [[combining character]] sequences and emoji, for example, span several code points.<ref name=":0">{{Cite web |title=FAQ - UTF-8, UTF-16, UTF-32 & BOM |url=http://unicode.org/faq/utf_bom.html#utf32-2 |access-date=2022-09-04 |website=Unicode }}</ref>
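The difference in indexing cost can be sketched as follows. This is a minimal illustration, not a library API: the function names are hypothetical, and the UTF-8 routine assumes well-formed input.

```cpp
#include <cstddef>
#include <string>

// UTF-32: the Nth code point is simply the Nth element (constant time).
char32_t nth_code_point_utf32(const std::u32string& s, std::size_t n) {
    return s.at(n);  // throws std::out_of_range if n is past the end
}

// UTF-8: walk from the start, skipping continuation bytes (10xxxxxx),
// until the first byte of the Nth code point is reached (linear time).
// Returns the byte offset, or s.size() if the string is too short.
std::size_t nth_code_point_offset_utf8(const std::string& s, std::size_t n) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned char byte = static_cast<unsigned char>(s[i]);
        if ((byte & 0xC0) != 0x80) {  // ASCII byte or lead byte
            if (seen == n) return i;
            ++seen;
        }
    }
    return s.size();
}
```

Both functions index code points, not user-perceived characters, so neither helps with combining sequences.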
 
The main disadvantage of UTF-32 is that it is space-inefficient, using four [[byte]]s per code point, including 11 bits that are always zero. Characters beyond the [[Basic Multilingual Plane|BMP]] are relatively rare in most texts (except, for example, in texts with some popular emojis), and can typically be ignored for sizing estimates. This makes UTF-32 close to twice the size of [[UTF-16]]. It can be up to four times the size of [[UTF-8]] depending on how many of the characters are in the [[ASCII]] subset.<ref name=":0" />
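The size ratios can be checked with a small calculation. The sketch below uses hypothetical helper names and assumes the input holds only valid Unicode scalar values; it totals the bytes each encoding form would need for the same text.

```cpp
#include <cstddef>
#include <string>

// Bytes needed for one code point in each encoding form.
std::size_t utf8_bytes(char32_t cp) {
    if (cp < 0x80) return 1;      // ASCII
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;   // rest of the BMP
    return 4;                     // supplementary planes
}
std::size_t utf16_bytes(char32_t cp) { return cp < 0x10000 ? 2 : 4; }
std::size_t utf32_bytes(char32_t) { return 4; }

struct EncodedSizes {
    std::size_t utf8;
    std::size_t utf16;
    std::size_t utf32;
};

// Total encoded size of a text in all three forms.
EncodedSizes encoded_sizes(const std::u32string& text) {
    EncodedSizes total{0, 0, 0};
    for (char32_t cp : text) {
        total.utf8 += utf8_bytes(cp);
        total.utf16 += utf16_bytes(cp);
        total.utf32 += utf32_bytes(cp);
    }
    return total;
}
```

For pure ASCII text this gives a 1:2:4 ratio, while for text made up only of non-BMP characters all three forms take four bytes per code point.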
 
== History ==
The original [[ISO/IEC 10646]] standard defines a 32-bit ''encoding form'' called '''UCS-4''', in which each code point in the [[Universal Character Set]] (UCS) is represented by a 31-bit value from 0 to 0x7FFFFFFF (the sign bit was unused and zero). In November 2003, Unicode was restricted by RFC&nbsp;3629 to match the constraints of the [[UTF-16]] encoding: explicitly prohibiting code points greater than U+10FFFF (and also the high and low surrogates U+D800 through U+DFFF). This limited subset defines UTF-32.<ref>{{Cite web|title=<!--Publicly Available Standards --> ISO/IEC 10646:2020 |url=https://standards.iso.org/ittf/PubliclyAvailableStandards/index.html|access-date=2021-10-12|website=ISO Standards |quote=Clause 9.4: "Because surrogate code points are not UCS scalar values, UTF-32 code units in the range 0000 D800-0000 DFFF are ill-formed". Clause 4.57: "[UCS codespace] <!--doubt this is in any source, or then redundant previously: codespace--> consisting of the integers from 0 to 10 FFFF (hexadecimal)".<!--in an older(?) 
source like: "uses the UCS codespace which consists of the integers from 0 to 10FFFF."[http://std.dkuug.dk/JTC1/sc2/WG2/docs/n3967.zip/FCD-10646-00-Main.pdf]--> Clause 4.58: "[UCS scalar value] any UCS code point except high-surrogate and low-surrogate code points".}}</ref><ref name="4_or_3_bytes">{{Cite web |title=Mapping codepoints to Unicode encoding forms |url=https://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-AppendixA |first1=Peter |last1=Constable |date=2001-06-13 |access-date=2022-10-03 |website=Computers and Writing Systems - SIL International }}</ref> Although the ISO standard had (as of 1998 in Unicode 2.1) "reserved for private use" 0xE00000 to 0xFFFFFF, and 0x60000000 to 0x7FFFFFFF,<ref>{{Cite web |title=Annex B - The Universal Character Set (UCS) |url=http://std.dkuug.dk/cen/tc304/guidecharactersets/guideannexb.html |access-date=2022-10-03 |website=DKUUG Standardizing |url-status=live |archive-url=https://web.archive.org/web/20220122081513/http://std.dkuug.dk/cen/tc304/guidecharactersets/guideannexb.html |archive-date=2022-01-22 }}</ref> these areas were removed in later versions. Because the Principles and Procedures document of [[ISO/IEC JTC 1/SC 2]] Working Group 2 states that all future assignments of code points will be constrained to the Unicode range, UTF-32 will be able to represent all UCS code points, and UTF-32 and UCS-4 are identical.<ref>{{Cite book |title=The Unicode Standard, version 6.0 |date=February 2011 |publisher=[[Unicode Consortium]] |isbn=978-1-936213-01-6 |location=Mountain View, CA |page=573 |chapter=C.2 Encoding Forms in ISO/IEC 10646 |quote=It [UCS-4] is now treated simply as a synonym for UTF-32, and is considered the canonical form for representation of characters in 10646. |chapter-url=https://www.unicode.org/versions/Unicode6.0.0/appC.pdf}}</ref>
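The restriction described above (no values above U+10FFFF and no surrogates) amounts to a simple range check on each 32-bit code unit. The following is a minimal sketch with hypothetical function names, not code taken from any standard or library.

```cpp
#include <string>

// A 32-bit value is a well-formed UTF-32 code unit only if it is a
// Unicode scalar value: at most U+10FFFF and not a surrogate.
constexpr bool is_valid_utf32_unit(char32_t value) {
    return value <= 0x10FFFF && !(value >= 0xD800 && value <= 0xDFFF);
}

// A UTF-32 string is well-formed if every code unit passes the check.
bool is_well_formed_utf32(const std::u32string& text) {
    for (char32_t unit : text) {
        if (!is_valid_utf32_unit(unit)) return false;
    }
    return true;
}
```

Values in the old UCS-4 range above 0x10FFFF, including the former private-use areas, fail this check.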
 
== Utility of fixed width ==
A fixed number of bytes per code point has a number of theoretical advantages, but each of these has problems in reality:
 
* Truncation becomes easier, but not significantly so compared to [[UTF-8]] and [[UTF-16]] (both of which can search backwards for the point to truncate by looking at 2–4 code units at most).{{efn|For UTF-8: Select point to truncate at. If the byte before it is 0-0x7F, or the byte after it is anything other than the continuation bytes 0x80-0xBF, the string can be truncated at that point. Otherwise search up to 3 bytes backwards for such a point and truncate at that. If not found, truncate at the original position. This works even if there are encoding errors in the UTF-8. UTF-16 is trivial and only has to back up one word at most.}}{{citation needed|date=January 2023}}
* Finding the ''Nth'' character in a string. For fixed width, this is simply an [[Big O notation|O(1) problem]], while it is an [[Big O notation|O(n) problem]] in a variable-width encoding. Novice programmers often vastly overestimate how useful this is.<ref name=manishearth>{{Cite web|title=Let's Stop Ascribing Meaning to Code Points |website=In Pursuit of Laziness |first1=Manish |last1=Goregaokar |date=January 14, 2017 |url=https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/|access-date=2020-06-14|quote=Folks start implying that code points mean something, and that O(1) indexing or slicing at code point boundaries is a useful operation.}}</ref> Also, what a user might call a "character" is still variable-width: for instance, the [[combining character]] sequence {{char|á}} could be two code points, the emoji {{char|👨‍🦲}} is three, and a ligature can be a single code point.
* Quickly knowing the "width" of a string. However, even "fixed-width" fonts have varying widths: [[CJK characters|CJK ideographs]] are often twice as wide,<ref name=manishearth/> in addition to the already-mentioned problem that the number of code points does not equal the number of characters.
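The UTF-8 truncation procedure described in the note on the first point can be sketched roughly as follows. The function names are hypothetical, and the code follows the note's rule as read here: a cut is safe if the byte before it is ASCII or the byte after it is not a continuation byte.

```cpp
#include <cstddef>
#include <string>

// True if cutting the string at byte offset pos does not split a
// multi-byte sequence, following the rule described in the note.
bool is_safe_cut(const std::string& s, std::size_t pos) {
    if (pos == 0 || pos >= s.size()) return true;
    const unsigned char before = static_cast<unsigned char>(s[pos - 1]);
    const unsigned char after = static_cast<unsigned char>(s[pos]);
    return before < 0x80 || (after & 0xC0) != 0x80;
}

// Returns the offset at which to truncate so that at most `limit` bytes
// remain. Searches at most 3 bytes backwards; if no safe point is found
// (malformed input), falls back to the requested position.
std::size_t utf8_truncation_point(const std::string& s, std::size_t limit) {
    if (limit >= s.size()) return s.size();
    for (std::size_t back = 0; back <= 3 && back <= limit; ++back) {
        if (is_safe_cut(s, limit - back)) return limit - back;
    }
    return limit;
}
```

A caller would then keep `s.substr(0, utf8_truncation_point(s, limit))`. The bounded backward search is what keeps truncation cheap even in a variable-width encoding.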
 
== Use ==
The main use of UTF-32 is in internal APIs where the data is single code points or [[Glyph|glyphs]], rather than strings of characters. For instance, in modern text rendering, it is common{{citation needed|date=January 2023}} that the last step is to build a list of structures each containing [[Coordinate system|coordinates (x, y)]], attributes, and a single UTF-32 code point identifying the glyph to draw. Often non-Unicode information is stored in the "unused" 11 bits of each word.{{Citation needed|date=June 2017}}
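As an illustration of how the 11 spare bits could be used, the sketch below packs a code point and a small attribute field into one 32-bit word. The layout (attributes in the top 11 bits) and all names are assumptions made for this example, not a documented convention of any rendering library.

```cpp
#include <cstdint>

// Assumed layout: low 21 bits hold the code point, high 11 bits hold
// application-defined flags.
constexpr std::uint32_t kCodePointMask = 0x1FFFFF;
constexpr std::uint32_t kFlagMask = 0x7FF;
constexpr unsigned kFlagShift = 21;

constexpr std::uint32_t pack_glyph(char32_t cp, std::uint32_t flags) {
    return (static_cast<std::uint32_t>(cp) & kCodePointMask) |
           ((flags & kFlagMask) << kFlagShift);
}

constexpr char32_t glyph_code_point(std::uint32_t word) {
    return static_cast<char32_t>(word & kCodePointMask);
}

constexpr std::uint32_t glyph_flags(std::uint32_t word) {
    return word >> kFlagShift;
}
```

A word packed this way is no longer valid UTF-32 whenever the flags are non-zero, so such values are suitable only for internal use.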
 
Use of UTF-32 strings on Windows (where {{mono|[[wchar_t]]}} is 16 bits) is almost non-existent. On Unix systems, UTF-32 strings are sometimes, but rarely, used internally by applications, due to the type {{mono|wchar_t}} being defined as 32-bit.
 
UTF-32 is also forbidden as an HTML character encoding.<ref>{{cite web|access-date=2024-11-11 |language=en |title=HTML Standard |url=https://html.spec.whatwg.org/multipage/parsing.html#character-encodings |website=html.spec.whatwg.org}}<!-- auto-translated from French by Module:CS1 translator --></ref><ref>{{cite web|access-date=2024-11-11 |language=fr |title=Choisir et appliquer un encodage de caractères |url=https://www.w3.org/International/questions/qa-choosing-encodings.fr.html#avoid:~:text=particuli%C3%A8rement%20celle%20d%E2%80%99-,UTF-32,-. |website=www.w3.org}}<!-- auto-translated from French by Module:CS1 translator --></ref>
 
=== Programming languages ===
[[Python (programming language)|Python]] versions up to 3.2 can be compiled to use UTF-32 strings instead of [[UTF-16]]; from version 3.3 onward, Unicode strings are stored in UTF-32 if the string contains at least one non-[[Basic Multilingual Plane|BMP]] character, and otherwise with the leading zero bytes optimized away "depending on the [code point] with the largest Unicode ordinal (1, 2, or 4 bytes)", so that all code points in a string have the same size.<ref>{{cite web|last1=Löwis|first1=Martin|title=PEP 393 -- Flexible String Representation|url=https://legacy.python.org/dev/peps/pep-0393/|website=python.org|publisher=Python|access-date=26 October 2014}}</ref> <!-- In previous versions of Python, "\U0001F51F" (UTF-32) was equivalent to "\ud83d\udd1f" (UCS-2).
However, this is true in languages like JavaScript. This also means that a non-BMP character is not equivalent to its surrogate pair (example: <code>"\U0001F51F" != "\ud83d\udd1f"</code>) unlike most programming languages. -->
 
The [[Seed7]]<ref>{{cite web|url=http://seed7.sourceforge.net/faq.htm#unicode|title=The usage of UTF-32 has several advantages}}</ref> and [[Lasso (programming language)|Lasso]]{{Citation needed|date=June 2017}} programming languages encode all strings with UTF-32, in the belief that direct indexing is important, whereas the [[Julia (programming language)|Julia]] programming language moved away from built-in UTF-32 support with its 1.0 release, simplifying the language to have only UTF-8 strings (with all the other encodings considered legacy and moved out of the standard library to a package<ref>{{Citation|title=JuliaStrings/LegacyStrings.jl: Legacy Unicode string types|date=2019-05-17|url=https://github.com/JuliaStrings/LegacyStrings.jl|publisher=JuliaStrings|access-date=2019-10-15}}</ref>) following the "UTF-8 Everywhere Manifesto".<ref>{{cite web |url=http://utf8everywhere.org/ |title=UTF-8 Everywhere Manifesto}}</ref>
 
[[C++11]] has 2 built-in data types that use UTF-32. The <code>char32_t</code> data type stores 1 character in UTF-32. The <code>u32string</code> data type stores a string of UTF-32-encoded characters. A UTF-32-encoded character or string literal is marked with <code>U</code> before the character or string literal.<ref>{{Cite web |url=https://cplusplus.com/reference/string/u32string/ |access-date=2024-11-12 |website=cplusplus.com|title = u32string}}</ref><ref>{{Cite web |title=String literal - cppreference.com |url=https://en.cppreference.com/w/cpp/language/string_literal |access-date=2024-11-14 |website=en.cppreference.com}}</ref>
 
<syntaxhighlight lang="c++">
#include <string>
// A single UTF-32 code unit; the literal can also be written as U'\U0001F51F'.
char32_t UTF32_character = U'🔟';
// The literal is an array of const char32_t, copied into the u32string.
std::u32string UTF32_string = U"UTF-32-encoded string";
</syntaxhighlight>

[[C Sharp (programming language)|C#]] has a <code>UTF32Encoding</code> class which represents Unicode characters as bytes, rather than as a string.<ref>{{Cite web |last=dotnet-bot |title=UTF32Encoding Class (System.Text) |url=https://learn.microsoft.com/en-us/dotnet/api/system.text.utf32encoding?view=net-8.0 |access-date=2024-11-27 |website=learn.microsoft.com |language=en-us}}</ref>
 
== Variants ==
Though technically invalid, the surrogate halves are often encoded and allowed. This allows invalid UTF-16 (such as Windows filenames) to be translated to UTF-32, similar to how the [[WTF-8]] variant of UTF-8 works. Sometimes paired surrogates are encoded instead of non-BMP characters, similar to [[CESU-8]]. Due to the large number of unused 32-bit values, it is also possible to preserve invalid UTF-8 by using non-Unicode values to encode UTF-8 errors, though there is no standard for this.
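A lenient conversion of this kind might look like the sketch below. The function name is hypothetical; it combines valid surrogate pairs into one code point and passes unpaired surrogates through unchanged, so the output is not guaranteed to be well-formed UTF-32.

```cpp
#include <cstddef>
#include <string>

// Converts UTF-16 to 32-bit code units, keeping unpaired surrogates
// as-is instead of rejecting them (similar in spirit to WTF-8).
std::u32string utf16_to_utf32_lenient(const std::u16string& in) {
    std::u32string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char32_t unit = in[i];
        const bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
        if (is_high && i + 1 < in.size()) {
            const char32_t next = in[i + 1];
            if (next >= 0xDC00 && next <= 0xDFFF) {
                // Valid pair: combine into one supplementary code point.
                out.push_back(0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00));
                ++i;  // the low surrogate has been consumed
                continue;
            }
        }
        out.push_back(unit);  // BMP code point or unpaired surrogate
    }
    return out;
}
```

Because unpaired surrogates survive the conversion, the original UTF-16 data can be reconstructed exactly from the output.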
 
UTF-32 has two byte-order variants, big-endian and little-endian: '''UTF-32BE''' and '''UTF-32LE'''.
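The two byte orders differ only in how the four bytes of each code unit are laid out. The following is a minimal sketch with hypothetical function names, independent of the host machine's endianness.

```cpp
#include <array>
#include <cstdint>

// Big-endian: most significant byte first.
std::array<std::uint8_t, 4> encode_utf32be(char32_t cp) {
    return {{static_cast<std::uint8_t>((cp >> 24) & 0xFF),
             static_cast<std::uint8_t>((cp >> 16) & 0xFF),
             static_cast<std::uint8_t>((cp >> 8) & 0xFF),
             static_cast<std::uint8_t>(cp & 0xFF)}};
}

// Little-endian: least significant byte first.
std::array<std::uint8_t, 4> encode_utf32le(char32_t cp) {
    return {{static_cast<std::uint8_t>(cp & 0xFF),
             static_cast<std::uint8_t>((cp >> 8) & 0xFF),
             static_cast<std::uint8_t>((cp >> 16) & 0xFF),
             static_cast<std::uint8_t>((cp >> 24) & 0xFF)}};
}

// Reassembles a code unit from big-endian bytes.
char32_t decode_utf32be(const std::array<std::uint8_t, 4>& b) {
    return (static_cast<char32_t>(b[0]) << 24) |
           (static_cast<char32_t>(b[1]) << 16) |
           (static_cast<char32_t>(b[2]) << 8) |
           static_cast<char32_t>(b[3]);
}
```

For U+1F51F, the big-endian bytes are 00 01 F5 1F and the little-endian bytes are 1F F5 01 00.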
 
== See also ==
* [[Comparison of Unicode encodings]]
* [[UTF-16]]
 
 
== Notes ==