{{Short description|
'''UTF-32''' (32-[[bit]] [[Unicode transformation format|Unicode Transformation Format]]), sometimes called UCS-4, is a fixed-length [[Character encoding|encoding]] of Unicode [[code point]]s that uses exactly 32 bits (four [[byte]]s) per code point. Because there are far fewer than 2<sup>32</sup> Unicode code points, only 21 bits are actually needed, and the leading 11 bits must be zero.<ref name="4_or_3_bytes" />
The main advantage of UTF-32 is that the Unicode code points are directly indexed. Finding the ''Nth'' code point in a sequence of code points is a [[constant time|constant-time]] operation. In contrast, a [[variable-length code]] requires [[linear time]] to count ''N'' code points from the start of the string. This makes UTF-32 a simple replacement in code that uses [[Integer|integers]] that are incremented by one to examine each location in a [[String (computer science)|string]], as was commonly done for [[ASCII]]. However, Unicode code points are rarely processed in complete isolation; for example, [[combining character]] sequences and many emoji consist of several code points.<ref name=":0">{{Cite web |title=FAQ - UTF-8, UTF-16, UTF-32 & BOM |url=http://unicode.org/faq/utf_bom.html#utf32-2 |access-date=2022-09-04 |website=
The main disadvantage of UTF-32 is that it is space-inefficient, using four [[byte]]s per code point, including 11 bits that are always zero. Characters beyond the [[Basic Multilingual Plane|BMP]] are relatively rare in most texts (except, for
== History ==
The original [[ISO/IEC 10646]] standard defines a 32-bit ''encoding form'' called '''UCS-4''', in which each code point in the [[Universal Character Set]] (UCS) is represented by a 31-bit value from 0 to 0x7FFFFFFF (the sign bit was unused and zero). In November 2003, Unicode was restricted by RFC 3629 to match the constraints of the [[UTF-16]] encoding: explicitly prohibiting code points greater than U+10FFFF (and also the high and low surrogates U+D800 through U+DFFF). This limited subset defines UTF-32.<ref>{{Cite web|title=
== Utility of fixed width ==
A fixed number of bytes per code point has
* Truncation becomes easier, but not significantly so compared to [[UTF-8]] and [[UTF-16]] (both of which can search backwards for the point to truncate by looking at 2–4 code units at most).{{efn|For UTF-8: Select point to truncate at. If the byte before it is 0-0x7F, or the byte after it is anything other than the continuation bytes 0x80-0xBF, the string can be truncated at that point. Otherwise search up to 3 bytes backwards for such a point and truncate at that. If not found, truncate at the original position. This works even if there are encoding errors in the UTF-8. UTF-16 is trivial and only has to back up one word at most.}}{{citation needed|date=January 2023}}
* Finding the ''Nth
* Quickly knowing the "width" of a string.
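The difference between constant-time indexing and the bounded backward search described above can be illustrated with a minimal C++ sketch. The function names are hypothetical and are not taken from any library. The UTF-8 offset helper assumes well-formed input, while the truncation helper follows the footnote's rule of falling back to the original position when no boundary is found within three bytes.

```cpp
#include <cstddef>
#include <string>

// O(1): in UTF-32 the Nth code point is simply the Nth element.
char32_t nth_code_point_utf32(const std::u32string& s, std::size_t n) {
    return s.at(n);  // throws std::out_of_range if n is past the end
}

// True for UTF-8 continuation bytes (0x80-0xBF).
bool is_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// O(n): byte offset at which the Nth code point starts in well-formed UTF-8.
// Returns s.size() if the string holds n or fewer code points.
std::size_t utf8_offset_of_nth(const std::string& s, std::size_t n) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (!is_continuation(static_cast<unsigned char>(s[i]))) {
            if (seen == n) return i;
            ++seen;
        }
    }
    return s.size();
}

// Largest cut position <= limit that does not split a code point.
// Backs up at most 3 bytes; if no boundary is found (malformed input),
// the original limit is returned unchanged.
std::size_t utf8_truncation_point(const std::string& s, std::size_t limit) {
    if (limit >= s.size()) return s.size();
    std::size_t cut = limit;
    for (int back = 0; back < 3 && cut > 0 &&
                       is_continuation(static_cast<unsigned char>(s[cut]));
         ++back) {
        --cut;
    }
    if (is_continuation(static_cast<unsigned char>(s[cut]))) return limit;
    return cut;
}
```

Truncating a UTF-32 string needs no such search, since every element boundary is also a code point boundary.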
== Use ==
The main use of UTF-32 is in internal APIs where the data is single code points or [[Glyph|glyphs]], rather than strings of characters. For instance, in modern text rendering, it is common{{citation needed|date=January 2023}} that the last step is to build a list of structures each containing [[Coordinate system|coordinates (x, y)]], attributes, and a single UTF-32 code point identifying the glyph to draw. Often non-Unicode information is stored in the "unused" 11 bits of each word.{{Citation needed|date=June 2017}}
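As an illustration of storing extra information in the 11 unused bits, the following C++ sketch packs a code point into the low 21 bits of a 32-bit word and an application-defined flag field into the high 11 bits. The layout and the names <code>pack_glyph</code>, <code>glyph_code_point</code>, and <code>glyph_flags</code> are assumptions made for this example; actual renderers may use a different layout.

```cpp
#include <cstdint>

// Assumed layout: bits 0-20 hold the code point, bits 21-31 hold flags.
constexpr std::uint32_t kCodePointMask = 0x1FFFFF;  // low 21 bits
constexpr std::uint32_t kFlagMask = 0x7FF;          // 11 bits of flags
constexpr unsigned kFlagShift = 21;

// Combine a code point with up to 11 bits of non-Unicode data.
constexpr std::uint32_t pack_glyph(char32_t cp, std::uint32_t flags) {
    return (static_cast<std::uint32_t>(cp) & kCodePointMask) |
           ((flags & kFlagMask) << kFlagShift);
}

// Recover the code point by masking off the flag bits.
constexpr char32_t glyph_code_point(std::uint32_t word) {
    return static_cast<char32_t>(word & kCodePointMask);
}

// Recover the flag field from the high 11 bits.
constexpr std::uint32_t glyph_flags(std::uint32_t word) {
    return word >> kFlagShift;
}
```

A word packed this way is no longer valid UTF-32 whenever any flag bit is set, so the code point has to be extracted before it is passed to code that expects UTF-32.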
Use of UTF-32 strings on Windows (where {{mono|[[wchar_t]]}} is 16 bits) is almost non-existent. On Unix systems, UTF-32 strings are sometimes, but rarely, used internally by applications, due to the type {{mono|wchar_t}} being defined as 32-bit.
The [[Seed7]]<ref>{{cite web|url=http://seed7.sourceforge.net/faq.htm#unicode|title=The usage of UTF-32 has several advantages}}</ref> and [[Lasso (programming language)|Lasso]]{{Citation needed|date=June 2017}} programming languages encode all strings with UTF-32, in the belief that direct indexing is important. In contrast, the [[Julia (programming language)|Julia]] programming language moved away from built-in UTF-32 support with its 1.0 release, simplifying the language to have only UTF-8 strings (all other encodings are considered legacy and were moved out of the standard library into a package<ref>{{Citation|title=JuliaStrings/LegacyStrings.jl: Legacy Unicode string types|date=2019-05-17|url=https://github.com/JuliaStrings/LegacyStrings.jl|publisher=JuliaStrings|access-date=2019-10-15}}</ref>), following the "UTF-8 Everywhere Manifesto".<ref>{{cite web |url=http://utf8everywhere.org/ |title=UTF-8 Everywhere Manifesto}}</ref>
UTF-32 is also forbidden as an HTML character encoding.<ref>{{cite web|access-date=2024-11-11 |language=en |title=HTML Standard |url=https://html.spec.whatwg.org/multipage/parsing.html#character-encodings |website=html.spec.whatwg.org}}<!-- auto-translated from French by Module:CS1 translator --></ref><ref>{{cite web|access-date=2024-11-11 |language=fr |title=Choisir et appliquer un encodage de caractères |url=https://www.w3.org/International/questions/qa-choosing-encodings.fr.html#avoid:~:text=particuli%C3%A8rement%20celle%20d%E2%80%99-,UTF-32,-. |website=www.w3.org}}<!-- auto-translated from French by Module:CS1 translator --></ref>
=== Programming languages ===
[[Python (programming language)|Python]] versions up to 3.2 can be compiled to use UTF-32 strings instead of [[UTF-16]]; from version 3.3 onward, every Unicode string is stored with a fixed number of bytes per code point, chosen "depending on the [code point] with the largest Unicode ordinal (1, 2, or 4 bytes)", so a string is stored as UTF-32 only if it contains at least one non-[[Basic Multilingual Plane|BMP]] character.<ref>{{cite web|last1=Löwis|first1=Martin|title=PEP 393 -- Flexible String Representation|url=https://legacy.python.org/dev/peps/pep-0393/|website=python.org|publisher=Python|access-date=26 October 2014}}</ref> <!-- In previous versions of Python, "\U0001F51F" (UTF-32) was equivalent to "\ud83d\udd1f" (UCS-2).
However, this is true in languages like JavaScript. This also means that a non-BMP character is not equivalent to its surrogate pair (example: <code>"\U0001F51F" != "\ud83d\udd1f"</code>) unlike most programming languages. -->
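The width selection described for Python 3.3 and later can be sketched in C++ as follows. This is an illustration of the idea only, not Python's actual implementation, and the function name <code>bytes_per_code_point</code> is hypothetical.

```cpp
#include <algorithm>
#include <string>

// Pick the storage width from the largest code point in the string:
// 1 byte if every code point is at most U+00FF, 2 bytes if every code
// point is at most U+FFFF, and 4 bytes (UTF-32) otherwise.
int bytes_per_code_point(const std::u32string& s) {
    char32_t max_cp = 0;
    for (char32_t cp : s) {
        max_cp = std::max(max_cp, cp);
    }
    if (max_cp <= 0xFF) return 1;
    if (max_cp <= 0xFFFF) return 2;
    return 4;
}
```

Under this scheme a single non-BMP character makes the whole string use four bytes per code point, while indexing stays constant-time in every case.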
[[C++11]] provides two data types for UTF-32. The built-in <code>char32_t</code> type stores one UTF-32 code unit, and the standard library's <code>std::u32string</code> type stores a string of them. A UTF-32-encoded character or string literal is marked with a <code>U</code> prefix.<ref>{{Cite web |url=https://cplusplus.com/reference/string/u32string/ |access-date=2024-11-12 |website=cplusplus.com|title = u32string}}</ref><ref>{{Cite web |title=String literal - cppreference.com |url=https://en.cppreference.com/w/cpp/language/string_literal |access-date=2024-11-14 |website=en.cppreference.com}}</ref>
<syntaxhighlight lang="c++">
#include <string>
char32_t UTF32_character = U'🔟'; // also written as U'\U0001F51F'
std::u32string UTF32_string = U"UTF-32-encoded string"; // the literal itself has type `const char32_t[]`
</syntaxhighlight>
[[C Sharp (programming language)|C#]] has a <code>UTF32Encoding</code> class, which represents a UTF-32 encoding of Unicode characters as a sequence of bytes rather than as a string.<ref>{{Cite web |last=dotnet-bot |title=UTF32Encoding Class (System.Text) |url=https://learn.microsoft.com/en-us/dotnet/api/system.text.utf32encoding?view=net-8.0 |access-date=2024-11-27 |website=learn.microsoft.com |language=en-us}}</ref>
== Variants ==
Though technically invalid, the surrogate halves are often encoded and allowed. This allows invalid UTF-16 (such as Windows filenames) to be translated to UTF-32, similar to how the [[WTF-8]] variant of UTF-8 works. Sometimes paired surrogates are encoded instead of non-BMP characters, similar to [[CESU-8]]. Due to the large number of unused 32-bit values, it is also possible to preserve invalid UTF-8 by using non-Unicode values to encode UTF-8 errors, though there is no standard for this.
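The relationship between strict UTF-32 and these lenient variants can be sketched in C++. The helper names below are hypothetical. <code>is_strict_utf32</code> accepts only values allowed by standard UTF-32, and <code>merge_paired_surrogates</code> converts a CESU-style sequence of paired surrogates into the corresponding non-BMP code points while passing lone surrogates through unchanged.

```cpp
#include <cstddef>
#include <string>

constexpr bool is_high_surrogate(char32_t v) { return v >= 0xD800 && v <= 0xDBFF; }
constexpr bool is_low_surrogate(char32_t v) { return v >= 0xDC00 && v <= 0xDFFF; }

// Strict UTF-32: at most U+10FFFF and not a surrogate half.
constexpr bool is_strict_utf32(char32_t v) {
    return v <= 0x10FFFF && !is_high_surrogate(v) && !is_low_surrogate(v);
}

// Standard UTF-16 arithmetic for combining a surrogate pair.
constexpr char32_t combine_surrogates(char32_t high, char32_t low) {
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

// Replace each high+low surrogate pair with the code point it denotes.
// Unpaired surrogates are copied through unchanged.
std::u32string merge_paired_surrogates(const std::u32string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (is_high_surrogate(in[i]) && i + 1 < in.size() &&
            is_low_surrogate(in[i + 1])) {
            out.push_back(combine_surrogates(in[i], in[i + 1]));
            ++i;  // skip the low surrogate that was just consumed
        } else {
            out.push_back(in[i]);
        }
    }
    return out;
}
```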
UTF-32 has two byte-order variants, big-endian and little-endian: '''UTF-32BE''' and '''UTF-32LE'''.
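The two byte orders differ only in how the four bytes of each code point are arranged. A minimal C++ sketch follows; the function names are chosen for this example and are not part of any standard API.

```cpp
#include <array>
#include <cstdint>

// Big-endian: most significant byte first.
std::array<std::uint8_t, 4> encode_utf32be(char32_t cp) {
    const auto v = static_cast<std::uint32_t>(cp);
    return {{static_cast<std::uint8_t>(v >> 24),
             static_cast<std::uint8_t>(v >> 16),
             static_cast<std::uint8_t>(v >> 8),
             static_cast<std::uint8_t>(v)}};
}

// Little-endian: least significant byte first.
std::array<std::uint8_t, 4> encode_utf32le(char32_t cp) {
    const auto v = static_cast<std::uint32_t>(cp);
    return {{static_cast<std::uint8_t>(v),
             static_cast<std::uint8_t>(v >> 8),
             static_cast<std::uint8_t>(v >> 16),
             static_cast<std::uint8_t>(v >> 24)}};
}

// Reassemble a code point from four big-endian bytes.
char32_t decode_utf32be(const std::array<std::uint8_t, 4>& b) {
    return static_cast<char32_t>((static_cast<std::uint32_t>(b[0]) << 24) |
                                 (static_cast<std::uint32_t>(b[1]) << 16) |
                                 (static_cast<std::uint32_t>(b[2]) << 8) |
                                 static_cast<std::uint32_t>(b[3]));
}
```

For U+1F51F, the big-endian form is the byte sequence 00 01 F5 1F and the little-endian form is 1F F5 01 00.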
== See also ==
* [[Comparison of Unicode encodings]]
* [[UTF-16]]
== Notes ==