Working With Unicode

Working with Unicode can be challenging due to encoding issues that can occur everywhere from Python code to files and databases. Unicode represents text as characters rather than bytes, using encodings like UTF-8. In Python 2, strings are byte sequences while Unicode is used for text, requiring explicit encoding and decoding. Always work with Unicode internally and encode to strings only for I/O operations like printing. Knowing the encoding is also crucial, as incorrectly assuming the encoding can lead to "gibberish".

Uploaded by

PhillyPu

Working with Unicode

(with emphasis on Python 2)


“❤❤❤” vs “⟤⟤⟤Ÿ”
Where can Unicode encoding issues happen?
EVERYWHERE

● Python 2 (“str” vs “unicode”)


● Python 2 Libraries (“csv” module doesn’t write Unicode 😠)
● Erroneously encoded files (“Windows-1252” or “ASCII”)
● MySQL connections (“latin1” vs “utf8” vs “utf8mb4”)
○ MySQL Workbench only does “utf8”, which is only a subset of “utf8mb4”
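The "utf8" vs "utf8mb4" point can be made concrete with a small sketch. It relies on the fact that MySQL's "utf8" character set stores at most three bytes per character, so any character that needs four bytes in UTF-8 (emoji, for instance) requires "utf8mb4". The helper name `needs_utf8mb4` is made up for this illustration, and the code is written for Python 3.

```python
def needs_utf8mb4(text):
    """Return True if any character needs 4 bytes in UTF-8.

    Hypothetical helper: such characters do not fit in MySQL's
    3-byte "utf8" character set and require "utf8mb4".
    """
    return any(len(ch.encode("utf-8")) > 3 for ch in text)


chinese = u"\u4f60\u597d"      # two characters, 3 bytes each in UTF-8
smiley = u"\U0001F604"         # one emoji, 4 bytes in UTF-8

print(needs_utf8mb4(chinese))
print(needs_utf8mb4(smiley))
```

A check like this could be run on incoming text before inserting it through a connection that is still configured for "utf8".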
What the heck is an encoding⸮❔?⁉፧¿
All data is stored in the computer as bits, usually grouped into chunks of 8 (bytes).

For convenience, instead of writing binary, we write in hexadecimal:

01101000 01100101 01101100 01101100 01101111

68 65 6C 6C 6F
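As a small sketch, the same five bytes can be printed in both notations. The variable names are only for illustration; `bytearray` is used so that iterating yields integers.

```python
data = b"hello"  # five bytes

# Each byte as 8 binary digits, then as 2 hexadecimal digits.
binary = " ".join(format(byte, "08b") for byte in bytearray(data))
hexadecimal = " ".join(format(byte, "02X") for byte in bytearray(data))

print(binary)       # 01101000 01100101 01101100 01101100 01101111
print(hexadecimal)  # 68 65 6C 6C 6F
```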

“But if all data is stored as 1s and 0s, how do we see the alphabets????”
Character Encoding 1: ASCII and Windows-1252
ASCII Example:
ASCII is a single-byte character encoding:

1 byte -> 1 character

Hexadecimal (what we store): 68 65 6C 6C 6F

Interpreted as ASCII: h e l l o
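The one-byte-to-one-character mapping can be checked with a short sketch. It is written for Python 3, where `bytes` plays the role of Python 2's `str`; the variable names are illustrative.

```python
raw = bytes(bytearray.fromhex("68656C6C6F"))  # the five stored bytes

text = raw.decode("ascii")          # one byte -> one character
codes = [ord(ch) for ch in text]    # and back to the numeric codes

# ASCII only defines codes 0-127, so a byte such as 0xE4 cannot be decoded.
try:
    b"\xe4".decode("ascii")
    high_byte_ok = True
except UnicodeDecodeError:
    high_byte_ok = False

print(text, codes, high_byte_ok)
```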
Character Encoding 2: UTF-8
Variable-width encoding:

one character can be represented with 1-4 bytes

Every ASCII character has the same single-byte representation in UTF-8 as in ASCII.

ASCII-encoded files can be decoded with UTF-8 without any fiasco.


(Backwards compatibility)

Note: The 8 in UTF-8 refers to the 8-bit size of its code units, as opposed to UTF-16, which uses 16-bit code units, and UTF-32, which uses 32-bit code units.
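A brief sketch of the variable width, the ASCII compatibility, and the code-unit sizes, written for Python 3. The sample characters are chosen for illustration only.

```python
samples = [u"h", u"\u00e4", u"\u4f60", u"\U0001F604"]  # h, ä, 你, 😄

# Number of bytes each character needs in UTF-8.
utf8_widths = [len(ch.encode("utf-8")) for ch in samples]

# ASCII text produces identical bytes under both encodings.
same_bytes = u"hello".encode("ascii") == u"hello".encode("utf-8")

# Size of one code unit: 1 byte in UTF-8, 2 in UTF-16, 4 in UTF-32.
unit_sizes = [len(u"h".encode(codec))
              for codec in ("utf-8", "utf-16-le", "utf-32-le")]

print(utf8_widths, same_bytes, unit_sizes)
```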
UTF-8 Example

Hex: E4 BD A0 E5 A5 BD 21 F0 9F 98 84

UTF-8: 你 好 ! 😄
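The grouping of bytes into characters can be reproduced with a short sketch (Python 3). It prints code points rather than the characters themselves so that it also runs on a terminal that cannot display them.

```python
raw = bytes(bytearray.fromhex("E4BDA0E5A5BD21F09F9884"))
text = raw.decode("utf-8")  # 11 bytes become 4 characters

# Show which bytes belong to which character.
for ch in text:
    encoded = bytearray(ch.encode("utf-8"))
    hex_bytes = " ".join(format(byte, "02X") for byte in encoded)
    print("U+%04X <- %s" % (ord(ch), hex_bytes))
```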
Wrong encoding?
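When you try to use a single-byte encoding such as Windows-1252 to interpret data meant to be interpreted with UTF-8:

Hex: E4 BD A0 E5 A5 BD 21 F0 9F 98 84

UTF-8: 你 好 ! 😄

Hex: E4 BD A0 E5 A5 BD 21 F0 9F 98 84

Windows-1252: ä ½ (no-break space) å ¥ ½ ! ð Ÿ ˜ „

(Strict ASCII only defines bytes 00-7F, so it cannot decode most of these bytes at all.)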
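How the gibberish arises can be sketched in a few lines of Python 3. The same bytes are decoded three ways; `cp1252` is Python's codec name for Windows-1252.

```python
raw = bytes(bytearray.fromhex("E4BDA0E5A5BD21F09F9884"))

right = raw.decode("utf-8")    # 4 characters, as intended
wrong = raw.decode("cp1252")   # 11 characters of gibberish, one per byte

# A strict ASCII decode does not produce gibberish; it fails outright.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

print(len(right), len(wrong), ascii_failed)
print(["U+%04X" % ord(ch) for ch in wrong])
```

Nothing is lost at this stage: the gibberish still maps back to the original bytes, which is why this kind of damage is often repairable.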
If you see “gibberish” in your data...
See if you can fix the encoding from the source.

If the data is already stored in MySQL, see if this command works:

select convert(binary convert(field_name using latin1) using utf8) from table_name

https://stackoverflow.com/questions/20151835/how-to-convert-wrongly-encoded-data-to-utf-8
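The same repair idea can be sketched in Python for text that has already been loaded: re-encode the gibberish with the encoding that was wrongly assumed, then decode the resulting bytes as UTF-8. The helper name `repair_mojibake` and its default of Windows-1252 are assumptions for this illustration; the right "wrong" encoding depends on where the data was damaged.

```python
def repair_mojibake(garbled, wrong_encoding="cp1252"):
    """Undo a wrong decode: recover the bytes, then decode them as UTF-8.

    Hypothetical helper. If the round trip fails, the text was probably
    not damaged in this particular way, so it is returned unchanged.
    """
    try:
        return garbled.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return garbled


garbled = u"\u00c3\u00a4\u00c3\u00b6\u00c3\u00bc"  # looks like "Ã¤Ã¶Ã¼"
fixed = repair_mojibake(garbled)                    # u"\u00e4\u00f6\u00fc"

print(fixed == u"\u00e4\u00f6\u00fc")
```

Returning the input unchanged on failure is a cautious design choice: text that is already correct usually fails the round trip and is left alone.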
Python2
Use Python 3 if you can. It will save you from needing to dive into the seventh circle of hell.

Python 2’s str is a series of bytes.

chinese.txt: “hi猫”

>>> text = open('chinese.txt').read()
>>> text
'hi\xe7\x8c\xab'
>>> type(text)
<type 'str'>
>>> len(text)
5

Example from http://www.pgbovine.net/unicode-python.htm
Python2’s Unicode
>>> unicode_text = text.decode('utf-8')
>>> type(unicode_text)
<type 'unicode'>
>>> unicode_text
u'hi\u732b'
>>> len(unicode_text)
3
>>> unicode_text[0]
u'h'
>>> unicode_text[1]
u'i'
>>> unicode_text[2]
u'\u732b'
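One way to avoid the separate decode step is to let the file object decode on read. The sketch below uses `io.open`, which accepts an `encoding` argument in both Python 2 and Python 3. The temporary file stands in for chinese.txt and is created only so the example is self-contained.

```python
import io
import os
import tempfile

# Create a stand-in for chinese.txt containing the UTF-8 bytes of "hi猫".
path = os.path.join(tempfile.mkdtemp(), "chinese.txt")
with io.open(path, "wb") as handle:
    handle.write(b"hi\xe7\x8c\xab")

# Binary mode returns the raw bytes: 5 of them.
with io.open(path, "rb") as handle:
    raw = handle.read()

# Text mode with an explicit encoding returns decoded text: 3 characters.
with io.open(path, "r", encoding="utf-8") as handle:
    text = handle.read()

print(len(raw), len(text))
```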
Working in Unicode… Now I want to write my data!
>>> u"abc"
u'abc'
>>> str(u"abc")
'abc'

>>> u"äöü"
u'\xe4\xf6\xfc'
>>> str(u"äöü")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
Unicode to String: Encode!
>>> u"äöü".encode('utf-8')
'\xc3\xa4\xc3\xb6\xc3\xbc'
String to Unicode: Decode!
>>> '\xc3\xa4\xc3\xb6\xc3\xbc'.decode('utf8')
u'\xe4\xf6\xfc'

or

>>> unicode('\xc3\xa4\xc3\xb6\xc3\xbc', 'utf8')
u'\xe4\xf6\xfc'
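When the target encoding cannot represent every character, `encode` accepts a second argument that says what to do instead of raising. A short sketch of the round trip and of three standard error handlers follows (Python 3 syntax for the byte results).

```python
text = u"\u00e4\u00f6\u00fc"  # äöü

# Round trip through UTF-8 gives back the original text.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

# The default ("strict") handler raises for characters outside ASCII.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

replaced = text.encode("ascii", "replace")           # b'???'
ignored = text.encode("ascii", "ignore")             # b''
as_xml = text.encode("ascii", "xmlcharrefreplace")   # b'&#228;&#246;&#252;'

print(utf8_bytes, strict_failed, replaced, ignored, as_xml)
```

The `replace` and `ignore` handlers lose information, so they are best kept for display purposes rather than for data that will be stored.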
Why can’t I print this Unicode???
>>> x = u'\u732b'
>>> type(x)
<type 'unicode'>
>>> print x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u732b' in position 0: ordinal not in range(128)

Some terminals report sys.stdout.encoding as “US-ASCII”, so Python 2 tries to encode anything printed to stdout as ASCII.

Don’t print unicode objects directly, as that can raise an exception. Encode to a str first and print that.


What should I do to print my pretty Unicode text?
>>> x = u'\u732b'
>>> import sys
>>> sys.stdout.encoding
'US-ASCII'
>>> x.encode('utf-8')
'\xe7\x8c\xab'
>>> print x.encode('utf-8')
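If the output encoding is not known in advance, one defensive option is to encode with an error handler so that printing never raises. The helper below is a hypothetical sketch, not a standard function: it falls back to UTF-8 when the stream reports no encoding and escapes anything the encoding cannot represent.

```python
import sys


def encode_for_output(text, encoding=None):
    """Encode text for output, escaping characters the encoding lacks.

    Hypothetical helper: defaults to sys.stdout's encoding, or UTF-8
    when the stream does not report one.
    """
    if encoding is None:
        encoding = getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(encoding, "backslashreplace")


cat = u"\u732b"
as_ascii = encode_for_output(cat, "ascii")   # escaped, but no exception
as_utf8 = encode_for_output(cat, "utf-8")    # the real UTF-8 bytes

print(as_ascii, as_utf8)
```

In Python 2 the returned str can be passed to print directly; in Python 3 the bytes would typically be written to `sys.stdout.buffer` instead.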

Takeaways
● Unicode → Str : Encode
● Str → Unicode : Decode
● Working with text in Python2? The moment you read the data in, decode
it into unicode objects.
○ NEVER COMBINE UNICODE WITH STR
● Finished working with your (properly decoded) unicode objects? Encode
into str objects again.
● Always know what encoding you’re working with.
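These takeaways can be sketched as a decode, process, encode pipeline. The function name `shout` and its parameters are made up for illustration; the point is that bytes appear only at the edges and everything in between works on text.

```python
def shout(raw_bytes, input_encoding="utf-8", output_encoding="utf-8"):
    """Hypothetical example of decoding early and encoding late."""
    text = raw_bytes.decode(input_encoding)   # bytes -> text on the way in
    text = text.upper()                       # all processing on text
    return text.encode(output_encoding)       # text -> bytes on the way out


result = shout(b"\xc3\xa4\xc3\xb6\xc3\xbc")  # UTF-8 bytes of "äöü"

# Mixing text and bytes fails: Python 3 raises TypeError, and Python 2
# raises UnicodeDecodeError from its implicit ASCII decode.
try:
    u"\u00e4" + b"\xc3\xa4"
    mixing_error = None
except (TypeError, UnicodeDecodeError) as exc:
    mixing_error = type(exc).__name__

print(result, mixing_error)
```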
Notes: Python2 escape sequences
\x

The next two characters are interpreted as hex digits for a character code (for example, \xe4).

\u

The next four characters are interpreted as hex digits for the code point of a Unicode character (for example, \u732b).
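A few checks make the escape sequences concrete. This sketch is written for Python 3 and also shows `\U`, which takes eight hex digits and is needed for code points that do not fit in four.

```python
a_umlaut = u"\xe4"         # two hex digits
same_char = u"\u00e4"      # four hex digits, same character
cat = u"\u732b"            # the cat character from the earlier examples
smiley = u"\U0001F604"     # eight hex digits for code points above FFFF

# In a byte string, \x denotes a single raw byte rather than a character.
one_byte = b"\xe4"

print(a_umlaut == same_char, hex(ord(cat)), hex(ord(smiley)), len(one_byte))
```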
Links & References
http://kunststube.net/encoding/

http://www.pgbovine.net/unicode-python.htm

https://pythonhosted.org/kitchen/unicode-frustrations.html

https://stackoverflow.com/questions/20151835/how-to-convert-wrongly-encoded-data-to-utf-8

https://docs.python.org/2/howto/unicode.html

https://docs.python.org/2/tutorial/introduction.html#unicode-strings
