You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Extend GB18030 encoding conversion to cover full Unicode range.
Our previous code for GB18030 <-> UTF8 conversion only covered Unicode code
points up to U+FFFF, but the actual spec defines conversions for all code
points up to U+10FFFF. That would be rather impractical as a lookup table,
but fortunately there is a simple algorithmic conversion between the
additional code points and the equivalent GB18030 byte patterns. Make use
of the just-added callback facility in LocalToUtf/UtfToLocal to perform the
additional conversions.
Having created the infrastructure to do that, we can use the same code to
map certain linearly-related subranges of the Unicode space below U+FFFF,
allowing removal of the corresponding lookup table entries. This more
than halves the lookup table size, which is a substantial savings;
utf8_and_gb18030.so drops from nearly a megabyte to about half that.
In support of doing that, replace ISO10646-GB18030.TXT with the data file
gb-18030-2000.xml (retrieved from
http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/ )
in which these subranges have been deleted from the simple lookup entries.
Per bug #12845 from Arjen Nienhuis. The conversion code added here is
based on his proposed patch, though I whacked it around rather heavily.
0 commit comments