Java and Unicode: The Confusion About String and Char in Java
Unicode defines some 136,755 “characters” (and counting) for more than 139
language scripts, plus a rich set of symbols.
Source: https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html
How can you store 136,755 characters in a datatype that is only 16 bits wide
(65,536 possible values)?
By using two char values: one high surrogate followed by one low surrogate,
a so-called surrogate pair.
This allows us to represent many more characters (and symbols) than would
fit in a 16-bit character set (represented by, e.g., a Java char datatype).
For instance, the character “Bomb” (💣) can be represented in Java, but not
stored in one single char value.
Example: representing 💣
The decimal code for 💣 is 128163. But the largest value that fits in a char is
65535.
In hex, the code for the bomb is 0x1F4A3. So, in Java, one can represent that in
a few ways:
int c = 0x1F4A3;
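An int happily holds the code point, and the standard Character.toChars helper turns it into the char values a String needs. A small sketch (the class and variable names are mine):

```java
public class BombDemo {
    public static void main(String[] args) {
        // Code point for the bomb; too big for a single char
        int bomb = 0x1F4A3;

        // Character.toChars builds the surrogate pair for us
        char[] pair = Character.toChars(bomb); // {0xD83D, 0xDCA3}
        String s = new String(pair);

        System.out.println(pair.length);       // 2
        System.out.println(s);                 // the bomb emoji (terminal encoding permitting)
    }
}
```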
Example: representing 💣
Another way to represent 💣 in Java is as a String:
String bomb = "\uD83D\uDCA3";
The int value of the bomb doesn’t fit inside the Java char type. So how long is
that String?
The answer is 2! It takes two chars inside the String to represent it.
The two chars making up the bomb:
First part: 0xD83D (high surrogate)
Second part: 0xDCA3 (low surrogate)
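You can check the split from code. A small sketch (class and variable names are mine):

```java
public class BombLength {
    public static void main(String[] args) {
        String bomb = "\uD83D\uDCA3"; // the surrogate pair for U+1F4A3

        System.out.println(bomb.length());                          // 2 chars...
        System.out.println(bomb.codePointCount(0, bomb.length()));  // ...but 1 code point

        System.out.printf("0x%X%n", (int) bomb.charAt(0)); // 0xD83D (high surrogate)
        System.out.printf("0x%X%n", (int) bomb.charAt(1)); // 0xDCA3 (low surrogate)
    }
}
```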
So, how did Java turn 0x1F4A3 into those two parts?
UTF-16 encoding a character outside 16 bits
Another fun character is 💩 (the turd, officially PILE OF POO). It has value 128169, or 0x1F4A9 in hex.
int c = 0x1F4A9;
// 1. subtract 0x010000
c -= 0x010000; // Keep the low 16 bits, get rid of the highest bit!
// We now have: 1111010010101001
// 2. keep the highest 10 bits, and add 0xD800 in the high surrogate
int high = (c >> 10) + 0xD800;
// First: 1111010010101001 >> 10 = 111101
//
// Then: 111101
// 1101100000000000+ (0xD800)
// -----------------
// 1101100000111101 (0xD83D) -> high
// 3. keep the lowest 10 bits, and add 0xDC00 in the low surrogate
int low = (c & 0x3FF) + 0xDC00;
// 1111010010101001
// 0000001111111111 &
// ----------------
// 0010101001
// Add 0xDC00:
//
// 0010101001
// 1101110000000000 +
// --------------------
// 1101110010101001 (0xDCA9)
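The JDK can do steps 1–3 for us. A quick sanity check, assuming Java 7+ for Character.highSurrogate/lowSurrogate (class and variable names are mine):

```java
public class SurrogateCheck {
    public static void main(String[] args) {
        int codePoint = 0x1F4A9; // the turd

        // Same math as above, done by the JDK
        char high = Character.highSurrogate(codePoint); // 0xD83D
        char low  = Character.lowSurrogate(codePoint);  // 0xDCA9

        // And back again: combine the pair into a code point
        int roundTrip = Character.toCodePoint(high, low);

        System.out.printf("0x%X 0x%X 0x%X%n", (int) high, (int) low, roundTrip);
    }
}
```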
// Now we have high surrogate 0xD83D and low surrogate 0xDCA9
String turd = new String(new char[] {(char) high, (char) low});
// OR:
String turd2 = new String(new int[] {0x1F4A9}, 0, 1);
// What do you think the String class does with the int version?
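You can verify the answer yourself: the int[] constructor performs the same UTF-16 encoding, so both Strings come out identical. A sketch (class and variable names are mine):

```java
public class TurdCompare {
    public static void main(String[] args) {
        // Built by hand from the surrogate pair...
        String turd1 = "\uD83D\uDCA9";
        // ...and built from the code point; the constructor does the UTF-16 encoding
        String turd2 = new String(new int[] {0x1F4A9}, 0, 1);

        System.out.println(turd1.equals(turd2)); // true
        System.out.println(turd2.length());      // 2: stored as a surrogate pair
    }
}
```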
These constants come from java.lang.Character:
public static final char MIN_HIGH_SURROGATE = '\uD800';
public static final char MIN_LOW_SURROGATE = '\uDC00';
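Those surrogate ranges back the Character.isHighSurrogate/isLowSurrogate checks, which let you walk a String char by char and spot the pairs. A small sketch (class name is mine):

```java
public class SurrogateScan {
    public static void main(String[] args) {
        String bomb = "\uD83D\uDCA3";

        for (int i = 0; i < bomb.length(); i++) {
            char c = bomb.charAt(i);
            System.out.printf("char %d: 0x%X high=%b low=%b%n",
                    i, (int) c,
                    Character.isHighSurrogate(c),
                    Character.isLowSurrogate(c));
        }
    }
}
```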