UTF-32: Difference between revisions

Revision as of 18:30, 16 December 2023 edit Theknightwho (talk \| contribs) Autopatrolled, Extended confirmed users, Template editors 13,177 edits m →Utility of fixed width ← Previous edit		Latest revision as of 22:02, 29 November 2024 edit undo Symbol & Font Hunter (talk \| contribs) 270 edits Fixed C# →Programming languages Tag: Visual edit
(27 intermediate revisions by 10 users not shown)
{{Short description\|Encoding Unicode characters as 4 bytes per code point}}
'''UTF-32''' (32-[[bit]] [[Unicode transformation format\|Unicode Transformation Format]]), sometimes called UCS-4, is a fixed-length [[Character encoding\|encoding]] for Unicode [[code point]]s that uses exactly 32 bits (four [[byte]]s) per code point. A number of leading bits must be zero, because there are far fewer than 2<sup>32</sup> Unicode code points and only 21 bits are actually needed.<ref name="4_or_3_bytes" /> In contrast, all other Unicode transformation formats are variable-length encodings. Each 32-bit value in UTF-32 represents one Unicode code point and is exactly equal to that code point's numerical value.

The main advantage of UTF-32 is that the Unicode code points are directly indexed. Finding the ''Nth'' code point in a sequence of code points is a [[constant time\|constant-time]] operation. In contrast, a [[variable-length code]] requires [[linear time\|linear time]] to count ''N'' code points from the start of the string. This makes UTF-32 a simple replacement in code that uses [[Integer\|integers]] that are incremented by one to examine each location in a [[String (computer science)\|string]], as was commonly done for [[ASCII]]. However, Unicode code points are rarely processed in complete isolation: [[combining character]] sequences and emoji, for example, often span several code points.<ref name=":0">{{Cite web \|title=FAQ - UTF-8, UTF-16, UTF-32 & BOM \|url=http://unicode.org/faq/utf_bom.html#utf32-2 \|access-date=2022-09-04 \|website=Unicode }}</ref>

The main disadvantage of UTF-32 is that it is space-inefficient, using four [[byte]]s per code point, including 11 bits that are always zero.
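The space cost can be made concrete with a small sketch. The C++ below computes how many bytes a single code point occupies in UTF-8, UTF-16, and UTF-32. The function names (<code>utf8_size</code>, <code>utf16_size</code>, <code>utf32_size</code>, <code>utf8_total</code>) are hypothetical helpers written for this illustration, not part of any standard library, and the code assumes its input is a valid Unicode scalar value.

```cpp
#include <cstddef>
#include <string>

// Bytes needed to encode one code point in each encoding form.
// Assumes cp is a valid Unicode scalar value (no validation is done).
constexpr std::size_t utf8_size(char32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

constexpr std::size_t utf16_size(char32_t cp) {
    return cp < 0x10000 ? 2 : 4;  // one code unit, or a surrogate pair
}

constexpr std::size_t utf32_size(char32_t) {
    return 4;  // always one 32-bit code unit
}

// Total UTF-8 size of a UTF-32 string, for comparison with
// text.size() * 4, which is the size of the UTF-32 form.
std::size_t utf8_total(const std::u32string& text) {
    std::size_t total = 0;
    for (char32_t cp : text) total += utf8_size(cp);
    return total;
}
```

For the ASCII letter <code>U'A'</code> the three per-code-point functions give 1, 2, and 4 bytes respectively; for <code>U'\U0001F51F'</code> (🔟) all three give 4.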
Characters beyond the [[Basic Multilingual Plane\|BMP]] are relatively rare in most texts (except, for example, in texts with some popular emojis), and can typically be ignored for sizing estimates. This makes UTF-32 close to twice the size of [[UTF-16]]. It can be up to four times the size of [[UTF-8]] depending on how many of the characters are in the [[ASCII]] subset.<ref name=":0" />

== History ==
The original [[ISO/IEC 10646]] standard defines a 32-bit ''encoding form'' called '''UCS-4''', in which each code point in the [[Universal Character Set]] (UCS) is represented by a 31-bit value from 0 to 0x7FFFFFFF (the sign bit was unused and zero). In November 2003, Unicode was restricted by RFC 3629 to match the constraints of the [[UTF-16]] encoding: explicitly prohibiting code points greater than U+10FFFF (and also the high and low surrogates U+D800 through U+DFFF). This limited subset defines UTF-32.<ref>{{Cite web\|title=Publicly Available Standards ISO/IEC 10646:2020 \|url=https://standards.iso.org/ittf/PubliclyAvailableStandards/index.html\|access-date=2021-10-12\|website=ISO Standards \|quote=Clause 9.4: "Because surrogate code points are not UCS scalar values, UTF-32 code units in the range 0000 D800-0000 DFFF are ill-formed". Clause 4.57: "[UCS codespace] <!--doubt this is in any source, or then redundant previously: codespace--> consisting of the integers from 0 to 10 FFFF (hexadecimal)".<!--in an older(?) 
source like: "uses the UCS codespace which consists of the integers from 0 to 10FFFF."[http://std.dkuug.dk/JTC1/sc2/WG2/docs/n3967.zip/FCD-10646-00-Main.pdf]--> Clause 4.58: "[UCS scalar value] any UCS code point except high-surrogate and low-surrogate code points".}}</ref><ref name="4_or_3_bytes">{{Cite web \|title=Mapping codepoints to Unicode encoding forms \|url=https://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-AppendixA \|first1=Peter \|last1=Constable \|date=2001-06-13 \|access-date=2022-10-03 \|website=Computers and Writing Systems - SIL International }}</ref>

Although the ISO standard had (as of 1998 in Unicode 2.1) "reserved for private use" 0xE00000 to 0xFFFFFF, and 0x60000000 to 0x7FFFFFFF,<ref>{{Cite web \|title=Annex B - The Universal Character Set (UCS) \|url=http://std.dkuug.dk/cen/tc304/guidecharactersets/guideannexb.html \|access-date=2022-10-03 \|website=DKUUG Standardizing \|url-status=live \|archive-url=https://web.archive.org/web/20220122081513/http://std.dkuug.dk/cen/tc304/guidecharactersets/guideannexb.html \|archive-date= Jan 22, 2022 }}</ref> these areas were removed in later versions. Because the Principles and Procedures document of [[ISO/IEC JTC 1/SC 2]] Working Group 2 states that all future assignments of code points will be constrained to the Unicode range, UTF-32 will be able to represent all UCS code points, and UTF-32 and UCS-4 are identical.<ref>{{Cite book \|title=The Unicode Standard, version 6.0 \|date=February 2011 \|publisher=[[Unicode Consortium]] \|isbn=978-1-936213-01-6 \|location=Mountain View, CA \|pages=573 \|chapter=C.2 Encoding Forms in ISO/IEC 10646 \|quote=It [UCS-4] is now treated simply as a synonym for UTF-32, and is considered the canonical form for representation of characters in 10646. 
\|chapter-url=https://www.unicode.org/versions/Unicode6.0.0/appC.pdf}}</ref>

== Utility of fixed width ==
A fixed number of bytes per code point has theoretical advantages, but each of these has problems in reality:
* Truncation becomes easier, but not significantly so compared to [[UTF-8]] and [[UTF-16]] (both of which can search backwards for the point to truncate by looking at 2–4 code units at most).{{efn\|For UTF-8: Select the point to truncate at. If the byte before it is 0-0x7F, or the byte after it is anything other than the continuation bytes 0x80-0xBF, the string can be truncated at that point. Otherwise search up to 3 bytes backwards for such a point and truncate at that. If not found, truncate at the original position. This works even if there are encoding errors in the UTF-8. UTF-16 is trivial and only has to back up one word at most.}}{{citation needed\|date=January 2023}}
* Finding the ''Nth'' character in a string. For fixed width, this is simply an [[Big O notation\|O(1) problem]], while it is an [[Big O notation\|O(n) problem]] in a variable-width encoding. Novice programmers often vastly overestimate how useful this is.<ref name=manishearth>{{Cite web\|title=Let's Stop Ascribing Meaning to Code Points \|website=In Pursuit of Laziness \|first1=Manish \|last1=Goregaokar \|date=January 14, 2017 \|url=https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/\|access-date=2020-06-14\|quote=Folks start implying that code points mean something, and that O(1) indexing or slicing at code point boundaries is a useful operation. }}</ref> Also, what a user might call a "character" is still variable-width: for instance, the [[combining character]] sequence {{char\|á}} could be two code points, the emoji {{char\|👨‍🦲}} is three,<ref>{{Cite web\|title=👨‍🦲 Man: Bald Emoji\|url=https://emojipedia.org/man-bald/\|access-date=2021-10-12\|website=Emojipedia\|language=en}}</ref> and the ligature {{char\|ﬀ}} is one.
* Quickly knowing the "width" of a string. However, even "fixed-width" fonts have varying widths (often [[CJK characters\|CJK ideographs]] are twice as wide),<ref name=manishearth/> in addition to the already-mentioned problem that the number of code points does not equal the number of characters.

== Use ==
The main use of UTF-32 is in internal APIs where the data is single code points or [[Glyph\|glyphs]], rather than strings of characters. For instance, in modern text rendering, it is common{{citation needed\|date=January 2023}} that the last step is to build a list of structures each containing [[Coordinate system\|coordinates (x, y)]], attributes, and a single UTF-32 code point identifying the glyph to draw. Often non-Unicode information is stored in the "unused" 11 bits of each word.{{Citation needed\|date=June 2017}}

Use of UTF-32 strings on Windows (where {{mono\|[[wchar_t]]}} is 16 bits) is almost non-existent. On Unix systems, UTF-32 strings are sometimes, but rarely, used internally by applications, due to the type {{mono\|wchar_t}} being defined as 32-bit.
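The glyph-list idea described earlier in this section, with extra information kept in the 11 otherwise unused bits, can be sketched as follows. This is only an illustration under assumed conventions: the layout (code point in the low 21 bits, flags in the high 11 bits) and every name here (<code>pack</code>, <code>code_point_of</code>, <code>flags_of</code>, <code>PositionedGlyph</code>) are hypothetical and not taken from any particular rendering library.

```cpp
#include <cstdint>

// Hypothetical layout, for illustration only: the low 21 bits hold the
// code point and the high 11 bits hold application-defined flags.
constexpr std::uint32_t kCodePointMask = 0x1FFFFF;  // low 21 bits
constexpr std::uint32_t kFlagMask = 0x7FF;          // 11 bits of flags
constexpr unsigned kFlagShift = 21;

// Combine a code point and up to 11 bits of flags into one 32-bit word.
constexpr std::uint32_t pack(char32_t code_point, std::uint32_t flags) {
    return (static_cast<std::uint32_t>(code_point) & kCodePointMask) |
           ((flags & kFlagMask) << kFlagShift);
}

// Recover the code point by masking off the flag bits.
constexpr char32_t code_point_of(std::uint32_t word) {
    return static_cast<char32_t>(word & kCodePointMask);
}

// Recover the flags from the high 11 bits.
constexpr std::uint32_t flags_of(std::uint32_t word) {
    return word >> kFlagShift;
}

// One entry in a renderer's draw list: a position plus a packed word.
struct PositionedGlyph {
    float x;
    float y;
    std::uint32_t word;  // code point and flags, as produced by pack()
};
```

A word packed this way is no longer valid UTF-32 whenever any flag bit is set, so it has to be passed through <code>code_point_of</code> before being treated as a code point again.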
UTF-32 is also forbidden as an HTML character encoding.<ref>{{cite web\|access-date=2024-11-11 \|language=en \|title=HTML Standard \|url=https://html.spec.whatwg.org/multipage/parsing.html#character-encodings \|website=html.spec.whatwg.org}}<!-- auto-translated from French by Module:CS1 translator --></ref><ref>{{cite web\|access-date=2024-11-11 \|language=fr \|title=Choisir et appliquer un encodage de caractères 
\|url=https://www.w3.org/International/questions/qa-choosing-encodings.fr.html#avoid:~:text=particuli%C3%A8rement%20celle%20d%E2%80%99-,UTF-32,-. \|website=www.w3.org}}<!-- auto-translated from French by Module:CS1 translator --></ref>

=== Programming languages ===
[[Python (programming language)\|Python]] versions up to 3.2 can be compiled to use UTF-32 strings instead of [[UTF-16]]; from version 3.3 onward, every Unicode string uses a fixed number of bytes per code point, chosen "depending on the [code point] with the largest Unicode ordinal (1, 2, or 4 bytes)", so a string is stored as UTF-32 only if it contains at least one non-[[Basic Multilingual Plane\|BMP]] character.<ref>{{cite web\|last1=Löwis\|first1=Martin\|title=PEP 393 -- Flexible String Representation\|url=https://legacy.python.org/dev/peps/pep-0393/\|website=python.org\|publisher=Python\|access-date=26 October 2014}}</ref> <!-- In previous versions of Python, "\U0001F51F" (UTF-32) was equivalent to "\ud83d\udd1f" (UCS-2). However, this is true in languages like JavaScript. This also means that a non-BMP character is not equivalent to its surrogate pair (example: <code>"\U0001F51F" != "\ud83d\udd1f"</code>) unlike most programming languages. -->
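The width-selection rule quoted above from PEP 393 can be sketched in a few lines. This is an illustration of the rule only, not CPython's actual implementation; <code>width_for</code> and <code>storage_width</code> are hypothetical names, and returning 1 for an empty string is an assumption made here for simplicity.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Bytes per code point chosen from the largest code point in a string,
// following the 1, 2, or 4 byte rule quoted from PEP 393.
constexpr std::size_t width_for(char32_t largest) {
    return largest < 0x100 ? 1 : largest < 0x10000 ? 2 : 4;
}

// Hypothetical helper: the width a whole string would be stored with.
// An empty string is assumed to use the narrowest width.
std::size_t storage_width(const std::u32string& text) {
    if (text.empty()) return 1;
    return width_for(*std::max_element(text.begin(), text.end()));
}
```

Under this rule a string is stored four bytes per code point, that is, as UTF-32, only when <code>width_for</code> sees a code point of U+10000 or above.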
The [[Seed7]]<ref>{{cite web\|url=http://seed7.sourceforge.net/faq.htm#unicode\|title=The usage of UTF-32 has several advantages}}</ref> and [[Lasso (programming language)\|Lasso]]{{Citation needed\|date=June 2017}} programming languages encode all strings with UTF-32, in the belief that direct indexing is important, whereas the [[Julia (programming language)\|Julia]] programming language moved away from built-in UTF-32 support with its 1.0 release, simplifying the language to having only UTF-8 strings (with all the other encodings considered legacy and moved out of the standard library to a package<ref>{{Citation\|title=JuliaStrings/LegacyStrings.jl: Legacy Unicode string types\|date=2019-05-17\|url=https://github.com/JuliaStrings/LegacyStrings.jl\|publisher=JuliaStrings\|access-date=2019-10-15}}</ref>) following the "UTF-8 Everywhere Manifesto".<ref>{{cite web \|url=http://utf8everywhere.org/ \|title=UTF-8 Everywhere Manifesto}}</ref>

[[C++11]] has two built-in data types that use UTF-32. The <code>char32_t</code> data type stores one character in UTF-32. The <code>u32string</code> data type stores a string of UTF-32-encoded characters. 
A UTF-32-encoded character or string literal is marked with <code>U</code> before the character or string literal.<ref>{{Cite web \|url=https://cplusplus.com/reference/string/u32string/ \|access-date=2024-11-12 \|website=cplusplus.com\|title = u32string}}</ref><ref>{{Cite web \|title=String literal - cppreference.com \|url=https://en.cppreference.com/w/cpp/language/string_literal \|access-date=2024-11-14 \|website=en.cppreference.com}}</ref>
<syntaxhighlight lang="c++">
#include <string>
char32_t UTF32_character = U'🔟'; // also written as U'\U0001F51F'
std::u32string UTF32_string = U"UTF-32-encoded string"; // the literal itself has type const char32_t[N]
</syntaxhighlight>
[[C Sharp (programming language)\|C#]] has a <code>UTF32Encoding</code> class which represents Unicode characters as bytes, rather than as a string.<ref>{{Cite web \|last=dotnet-bot \|title=UTF32Encoding Class (System.Text) \|url=https://learn.microsoft.com/en-us/dotnet/api/system.text.utf32encoding?view=net-8.0 \|access-date=2024-11-27 \|website=learn.microsoft.com \|language=en-us}}</ref>

== Variants ==
Though technically invalid, the surrogate halves are often encoded and allowed. This allows invalid UTF-16 (such as Windows filenames) to be translated to UTF-32, similar to how the [[WTF-8]] variant of UTF-8 works. Sometimes paired surrogates are encoded instead of non-BMP characters, similar to [[CESU-8]]. Due to the large number of unused 32-bit values, it is also possible to preserve invalid UTF-8 by using non-Unicode values to encode UTF-8 errors, though there is no standard for this.

UTF-32 has two byte-order variants, big-endian and little-endian: '''UTF-32BE''' and '''UTF-32LE'''.

== See also ==
* [[Comparison of Unicode encodings]]
* [[UTF-16]]

== Notes ==