Working With Unicode
68 65 6C 6C 6F
“But if all data is stored as 1s and 0s, how do we see the alphabet?”
Character Encoding 1: ASCII and Windows-1252
ASCII Example:
ASCII is a single-byte character encoding: each character is a 7-bit code (0 to 127) stored in one byte.
Hex: 68 65 6C 6C 6F
ASCII: h e l l o
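The ASCII example can be reproduced in Python 3 as a minimal sketch. The variable names are illustrative only, and the hex string is the one from the slide.

```python
# The five bytes from the slide, written as hex.
raw = bytes.fromhex("68 65 6C 6C 6F")

# In ASCII, each byte is exactly one character.
text = raw.decode("ascii")
print(text)   # hello

# Going the other way: ord() gives the number stored for each character.
codes = [format(ord(ch), "02X") for ch in text]
print(codes)  # ['68', '65', '6C', '6C', '6F']
```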
Character Encoding 2: UTF-8
Variable-width encoding: each character takes 1 to 4 bytes.
Note: the 8 in UTF-8 refers to the 8-bit size of its code units, as opposed to UTF-16, which uses 16-bit code units, and UTF-32, which uses 32-bit code units.
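One way to see the difference in code unit size is to count how many units each encoding needs for the same character. This is a rough sketch in Python 3: `code_units` is a hypothetical helper, and the sample characters are chosen only for illustration.

```python
def code_units(ch):
    """Return how many (8-bit, 16-bit, 32-bit) code units one character needs."""
    # The "-le" codec variants are used so no byte-order mark is added.
    units8 = len(ch.encode("utf-8"))            # 1 byte per unit
    units16 = len(ch.encode("utf-16-le")) // 2  # 2 bytes per unit
    units32 = len(ch.encode("utf-32-le")) // 4  # 4 bytes per unit
    return units8, units16, units32

for ch in ["h", "é", "猫", "😄"]:
    print(ch, code_units(ch))
# h (1, 1, 1)
# é (2, 1, 1)
# 猫 (3, 1, 1)
# 😄 (4, 2, 1)
```

UTF-8 and UTF-16 are both variable-width (the emoji needs two 16-bit units), while UTF-32 always uses exactly one unit per character.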
UTF-8 Example
Hex: E4 BD A0 E5 A5 BD 21 F0 9F 98 84
UTF-8: 你 好 ! 😄
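The same bytes can be decoded in Python 3 to check which bytes belong to which character. This is a small sketch with illustrative variable names.

```python
raw = bytes.fromhex("E4 BD A0 E5 A5 BD 21 F0 9F 98 84")

text = raw.decode("utf-8")
print(text)  # 你好!😄

# Show the bytes used by each character: 3, 3, 1 and 4 bytes respectively.
for ch in text:
    encoded = ch.encode("utf-8")
    print(ch, " ".join(format(b, "02X") for b in encoded))
# 你 E4 BD A0
# 好 E5 A5 BD
# ! 21
# 😄 F0 9F 98 84
```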
Wrong encoding?
When you try to use ASCII or Latin-1 to interpret data meant to be interpreted as UTF-8:
Hex: E4 BD A0 E5 A5 BD 21 F0 9F 98 84
UTF-8: 你 好 ! 😄
Hex: E4 BD A0 E5 A5 BD 21 F0 9F 98 84
ASCII/latin1: ä ½ å ¥ ½ ! ð (the remaining bytes A0, 9F, 98, 84 have no visible character in Latin-1)
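The mismatch can be reproduced in Python 3. In this sketch, Latin-1 decoding always succeeds because every byte maps to some character, while a strict ASCII decode refuses bytes above 7F. Variable names are illustrative.

```python
raw = bytes.fromhex("E4 BD A0 E5 A5 BD 21 F0 9F 98 84")

# Latin-1 maps each byte to one character, so this never fails,
# but the result is 11 unrelated characters instead of 4.
garbled = raw.decode("latin-1")
print(repr(garbled))  # 'ä½\xa0å¥½!ð\x9f\x98\x84'

# Strict ASCII only covers bytes 00 to 7F, so decoding raises an error.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError as err:
    ascii_ok = False
    print(err)
```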
If you see “gibberish” in your data...
See if you can fix the encoding from the source.
https://stackoverflow.com/questions/20151835/how-to-convert-wrongly-encoded-data-to-utf-8
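When the source cannot be fixed, the wrong decode can sometimes be reversed. This is only a sketch under two assumptions: the text was wrongly decoded as Latin-1, and no bytes were lost along the way. `fix_mojibake` is a hypothetical helper, not a library function.

```python
def fix_mojibake(garbled, wrong="latin-1", right="utf-8"):
    """Undo a decode that used the wrong codec (hypothetical helper).

    Re-encoding with the wrong codec recovers the original bytes,
    which can then be decoded with the right one.
    """
    return garbled.encode(wrong).decode(right)

# Simulate the problem: UTF-8 bytes read as Latin-1.
garbled = "你好!😄".encode("utf-8").decode("latin-1")

fixed = fix_mojibake(garbled)
print(fixed)  # 你好!😄
```

If the guess about the wrong codec is incorrect, the call will usually raise a UnicodeEncodeError or UnicodeDecodeError rather than silently returning bad text.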
Python 2
Use Python 3 if you can. It will save you from needing to dive into the seventh circle of hell.
chinese.txt: “hi猫”
>>> u"äöü"
u'\xe4\xf6\xfc'
>>> str(u"äöü")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
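For comparison, Python 3 never converts to ASCII implicitly, but the same error appears if ASCII is requested explicitly. This is a sketch, and the `errors=` handlers shown are just two of the options the standard codecs accept.

```python
text = "äöü"

# Explicitly asking for ASCII fails, just like the implicit str() in Python 2.
try:
    text.encode("ascii")
    message = None
except UnicodeEncodeError as err:
    message = str(err)
    print(message)

# Error handlers trade the exception for lossy or escaped output.
replaced = text.encode("ascii", errors="replace")
escaped = text.encode("ascii", errors="backslashreplace")
print(replaced)  # b'???'
print(escaped)   # b'\\xe4\\xf6\\xfc'
```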
Unicode to String: Encode!
>>> u"äöü".encode('utf-8')
'\xc3\xa4\xc3\xb6\xc3\xbc'
String to Unicode: Decode!
>>> '\xc3\xa4\xc3\xb6\xc3\xbc'.decode('utf8')
u'\xe4\xf6\xfc'
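In Python 3 the same round trip looks like this. It is a minimal sketch: there `str` is the Unicode type and `bytes` is the encoded form, so the `u` prefix is no longer needed.

```python
text = "äöü"                  # str: a sequence of Unicode characters

data = text.encode("utf-8")   # str -> bytes
print(data)                   # b'\xc3\xa4\xc3\xb6\xc3\xbc'

back = data.decode("utf-8")   # bytes -> str
print(back == text)           # True

# Three characters, six bytes: each of these takes two bytes in UTF-8.
print(len(text), len(data))   # 3 6
```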
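\x or \u escapes
\x: the next two hex digits are interpreted as the ordinal number of the character.
\u: the next four hex digits are interpreted as the ordinal number of the Unicode character.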
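A short Python 3 sketch of the escape: the sample characters are chosen for illustration, and `\U` with eight digits is included as an assumption about what a reader might need for characters beyond four hex digits.

```python
# \u takes exactly four hex digits: the code point of the character.
cat = "\u732b"
print(cat)                       # 猫
print(ord(cat), hex(ord(cat)))   # 29483 0x732b

# Characters above FFFF do not fit in four digits; \U takes eight.
smile = "\U0001F604"
print(smile == chr(0x1F604))     # True
print(len(smile))                # 1
```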
Links & References
http://kunststube.net/encoding/
http://www.pgbovine.net/unicode-python.htm
https://pythonhosted.org/kitchen/unicode-frustrations.html
https://stackoverflow.com/questions/20151835/how-to-convert-wrongly-encoded-data-to-utf-8
https://docs.python.org/2/howto/unicode.html
https://docs.python.org/2/tutorial/introduction.html#unicode-strings