Working With Unicode

Working with Unicode can be challenging because encoding issues can occur everywhere, from Python code to files and databases. Unicode represents text as characters (code points) rather than bytes; encodings like UTF-8 map those characters to bytes. In Python 2, str objects are byte sequences while unicode objects hold text, so converting between them requires explicit encoding and decoding. Always work with unicode internally and encode to str only for I/O operations like printing. Knowing the encoding is also crucial, as assuming the wrong one produces "gibberish".

Uploaded by

PhillyPu

Working with Unicode

(with emphasis on Python 2)

“❤❤❤” vs “â¤â¤â¤”
Where can Unicode encoding issues happen?
EVERYWHERE

● Python 2 (“str” vs “unicode”)

● Python 2 Libraries (“csv” module doesn’t write Unicode 😠)
● Erroneously encoded files (“Windows-1252” or “ASCII”)
● MySQL connections (“latin1” vs “utf8” vs “utf8mb4”)
○ MySQL Workbench only does “utf8”, which is only a subset of “utf8mb4”
What the heck is an encoding⸮❔?⁉፧¿
All data is stored in the computer as bits, usually grouped into chunks of 8 (bytes).

For convenience, instead of writing binary, we write in hexadecimal:

01101000 01100101 01101100 01101100 01101111

68 65 6C 6C 6F
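
The relationship between the two notations above can be checked in a few lines of Python 3. This is only an illustrative sketch; the variable names are made up for the example.

```python
# The five bytes from the slide, written in binary.
bits = "01101000 01100101 01101100 01101100 01101111"

# Parse each 8-bit group as a base-2 integer.
byte_values = [int(group, 2) for group in bits.split()]

# Format each byte as two uppercase hexadecimal digits.
hex_text = " ".join("{:02X}".format(value) for value in byte_values)
print(hex_text)  # 68 65 6C 6C 6F
```

Each group of four bits corresponds to exactly one hex digit, which is why hexadecimal is the usual shorthand for bytes.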

“But if all data is stored as 1s and 0s, how do we see the alphabets????”
Character Encoding 1: ASCII and Windows-1252
ASCII Example:
ASCII is a single-byte character encoding:

1 byte -> 1 character

Hexadecimal (what we store): 68 65 6C 6C 6F

Interpreted as ASCII: h e l l o
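
As a rough Python 3 sketch of the same lookup (variable names are illustrative, not from the slides), the hex digits can be turned into bytes and then decoded as ASCII:

```python
# Hexadecimal text from the slide, turned into raw bytes.
raw = bytes.fromhex("68 65 6C 6C 6F")

# In ASCII, every byte maps to exactly one character.
text = raw.decode("ascii")

print(text)                 # hello
print(len(raw), len(text))  # 5 5
```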
Character Encoding 2: UTF-8
Variable-width encoding:

one character can be represented with 1-4 bytes

ASCII characters have the same byte representation in UTF-8.

ASCII-encoded files can therefore be decoded as UTF-8 without any fiasco.

(Backwards compatibility)

Note: UTF-8’s 8 refers to the 8-bit size of its code units, as opposed to UTF-16 which uses 16-bit and UTF-32 which
uses 32-bit
UTF-8 Example

Hex: E4 BD A0 E5 A5 BD 21 F0 9F 98 84

UTF-8: 你好 ! 😄
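
The variable width can be made visible with a short Python 3 sketch that decodes the bytes above and counts how many bytes each character occupies. The variable names are assumptions for illustration, and only lengths are printed so the output does not depend on the terminal's encoding.

```python
# The eleven bytes from the slide.
raw = bytes.fromhex("E4 BD A0 E5 A5 BD 21 F0 9F 98 84")
text = raw.decode("utf-8")

# How many bytes UTF-8 uses for each character, in order.
widths = [len(ch.encode("utf-8")) for ch in text]

print(len(raw), "bytes ->", len(text), "characters")  # 11 bytes -> 4 characters
print(widths)  # [3, 3, 1, 4]
```

The two Chinese characters take three bytes each, the ASCII exclamation mark takes one, and the emoji takes four.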
Wrong encoding?
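When you use a single-byte encoding such as latin1 to interpret data that was encoded as UTF-8 (strict ASCII cannot represent bytes above 7F at all):

Hex: E4 BD A0 E5 A5 BD 21 F0 9F 98 84

UTF-8: 你好 ! 😄

Hex: E4 BD A0 E5 A5 BD 21 F0 9F 98 84

latin1: ä ½ å ¥ ½ ! ð (the remaining bytes map to invisible characters)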
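
A small Python 3 sketch of how this kind of gibberish arises, assuming the bytes really are UTF-8. The names are illustrative. Note that latin1 never raises an error because it assigns a character to every byte, while strict ASCII refuses the data.

```python
original = "\u4f60\u597d!"           # the first three characters from the slide
utf8_bytes = original.encode("utf-8")

# Decoding with the wrong codec does not fail: latin1 accepts every byte,
# so the result is silently wrong text instead of an error.
garbled = utf8_bytes.decode("latin1")

print(len(original), len(garbled))   # 3 7
print(ascii(garbled))                # escaped form, safe on any terminal

# Strict ASCII rejects the same bytes outright.
ascii_error = None
try:
    utf8_bytes.decode("ascii")
except UnicodeDecodeError as exc:
    ascii_error = exc
print(type(ascii_error).__name__)    # UnicodeDecodeError
```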
If you see “gibberish” in your data...
See if you can fix the encoding at the source.

If stored in MySQL already, see if this command works:

select convert(binary convert(field_name using latin1) using utf8) from table_name

https://stackoverflow.com/questions/20151835/how-to-convert-wrongly-encoded-data-to-utf-8
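
The same idea can be sketched in Python 3: turn the garbled text back into the bytes it came from, then decode those bytes with the encoding they were actually written in. `repair_mojibake` is a hypothetical helper written for this example, not a library function, and it only works if the text really was UTF-8 misread as latin1 exactly once.

```python
def repair_mojibake(garbled, wrong="latin1", right="utf-8"):
    """Undo one round of 'decoded with the wrong codec'.

    Hypothetical helper for illustration only. It raises an error or
    returns more garbage if the assumption about the codecs is wrong.
    """
    return garbled.encode(wrong).decode(right)


garbled = "\xe7\x8c\xab"   # UTF-8 bytes of U+732B misread as latin1
fixed = repair_mojibake(garbled)

print(len(garbled), len(fixed))  # 3 1
print(fixed == "\u732b")         # True
```

Check a sample of the repaired values by eye before applying such a fix to a whole table, since text that was stored correctly would be damaged by it.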
Python2
Use Python 3 if you can. It will save you from needing to dive into the seventh circle of hell.

Python 2’s str is a series of bytes.

chinese.txt: “hi猫”

>>> text = open('chinese.txt').read()

>>> text
'hi\xe7\x8c\xab'
>>> type(text)
<type 'str'>
>>> len(text)
5

Example from http://www.pgbovine.net/unicode-python.htm
Python2’s Unicode
>>> unicode_text = text.decode('utf-8')
>>> type(unicode_text)
<type 'unicode'>
>>> unicode_text
u'hi\u732b'
>>> len(unicode_text)
3
>>> unicode_text[0]
u'h'
>>> unicode_text[1]
u'i'
>>> unicode_text[2]
u'\u732b'
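
One way to avoid the manual decode step is to let the file object decode while reading. The sketch below uses `io.open`, which exists in Python 2.6+ and in Python 3; the temporary file and variable names are assumptions made so the example is self-contained.

```python
import io
import os
import tempfile

# Create a small UTF-8 file containing "hi" followed by U+732B.
path = os.path.join(tempfile.mkdtemp(), "chinese.txt")
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(u"hi\u732b")

# Binary mode returns the raw bytes, like the plain open().read() above.
with io.open(path, "rb") as handle:
    raw = handle.read()

# Text mode with an explicit encoding decodes while reading.
with io.open(path, "r", encoding="utf-8") as handle:
    text = handle.read()

print(len(raw), len(text))  # 5 3
```

Passing the encoding explicitly keeps the result independent of the platform default.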
Working in Unicode… Now I want to write my data!
>>> u"abc"
u'abc'
>>> str(u"abc")
'abc'

>>> u"äöü"
u'\xe4\xf6\xfc'
>>> str(u"äöü")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
Unicode to String: Encode!
>>> u"äöü".encode('utf-8')
'\xc3\xa4\xc3\xb6\xc3\xbc'
String to Unicode: Decode!
>>> '\xc3\xa4\xc3\xb6\xc3\xbc'.decode('utf8')
u'\xe4\xf6\xfc'

>>> unicode('\xc3\xa4\xc3\xb6\xc3\xbc', 'utf8')

u'\xe4\xf6\xfc'
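
For comparison, here is a Python 3 sketch of the same round trip, where `str` is the text type and `bytes` is the byte type. The variable names are illustrative. It also shows that encoding to a codec that lacks the characters fails unless an error handler is requested explicitly.

```python
text = "\xe4\xf6\xfc"               # the three characters from the example above

encoded = text.encode("utf-8")      # text -> bytes
decoded = encoded.decode("utf-8")   # bytes -> text

print(encoded)          # b'\xc3\xa4\xc3\xb6\xc3\xbc'
print(decoded == text)  # True

# Encoding to a codec that lacks these characters fails loudly...
try:
    text.encode("ascii")
    failed = False
except UnicodeEncodeError:
    failed = True

# ...unless an error handler is supplied explicitly (this loses data).
replaced = text.encode("ascii", errors="replace")
print(failed, replaced)  # True b'???'
```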
Why can’t I print this Unicode???
>>> x = u'\u732b'
>>> type(x)
<type 'unicode'>
>>> print x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u732b' in position 0: ordinal not in range(128)

On some terminals sys.stdout.encoding is set to “US-ASCII”, so Python tries to encode everything written to stdout as ASCII.

Don’t print unicode objects directly, as that can raise an exception. Encode to str first and print that.

What should I do to print my pretty Unicode text?
>>> x = u'\u732b'
>>> import sys
>>> sys.stdout.encoding
'US-ASCII'
>>> x.encode('utf-8')
'\xe7\x8c\xab'
>>> print x.encode('utf-8')
猫
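
An alternative to encoding before every print is to wrap the byte stream once with an explicit encoding. The Python 3 sketch below uses `io.TextIOWrapper` around an in-memory `io.BytesIO` as a stand-in for a real output; using `sys.stdout.buffer` in its place is an assumption about how this would be applied in a real program.

```python
import io

# Stand-in for a byte-oriented output such as a file or pipe
# (assumption: a real program might wrap sys.stdout.buffer instead).
byte_sink = io.BytesIO()

# The wrapper encodes every string written to it with the given encoding.
writer = io.TextIOWrapper(byte_sink, encoding="utf-8")
writer.write("\u732b")
writer.flush()

written = byte_sink.getvalue()
print(written)  # b'\xe7\x8c\xab'
```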
Takeaways
● Unicode → Str : Encode
● Str → Unicode : Decode
● Working with text in Python2? The moment you read the data in, decode it into unicode objects.
○ NEVER COMBINE UNICODE WITH STR
● Finished working with your (properly decoded) unicode objects? Encode into str objects again.
● Always know what encoding you’re working with.
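
The takeaways above amount to: decode at the input boundary, work with text in the middle, encode at the output boundary. A minimal Python 3 sketch follows; `shout` is a hypothetical function invented for this example, and UTF-8 is an assumed default.

```python
def shout(raw_input_bytes, encoding="utf-8"):
    """Hypothetical pipeline step: bytes in, bytes out, text in between."""
    text = raw_input_bytes.decode(encoding)   # decode at the input boundary
    result = text.upper()                     # work only with text
    return result.encode(encoding)            # encode at the output boundary


output = shout(b"h\xc3\xa4llo")
print(output)  # b'H\xc3\x84LLO'

# Python 3 enforces the "never combine" rule: mixing text and bytes
# raises a TypeError instead of silently guessing an encoding.
mixing_rejected = False
try:
    "abc" + b"def"
except TypeError:
    mixing_rejected = True
print(mixing_rejected)  # True
```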
Notes: Python2 escape sequences
\x

Next two characters are interpreted as hex digits of a character code

\u

Next four characters are interpreted as hex digits of the Unicode character's code point (ordinal)
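
A short Python 3 sketch of these escapes inside text literals (the variable names are illustrative). In Python 3 there is also `\U` followed by eight hex digits, which is needed for characters beyond four hex digits such as emoji.

```python
a_umlaut = "\xe4"       # \x + two hex digits
cat = "\u732b"          # \u + four hex digits
emoji = "\U0001f604"    # \U + eight hex digits

print(hex(ord(a_umlaut)))  # 0xe4
print(hex(ord(cat)))       # 0x732b
print(hex(ord(emoji)))     # 0x1f604

# Each escape sequence produces exactly one character.
print(len(a_umlaut + cat + emoji))  # 3
```

Inside a bytes literal such as `b"\xe4"`, the same `\x` escape denotes a single byte rather than a character.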
Links & References
http://kunststube.net/encoding/

http://www.pgbovine.net/unicode-python.htm

https://pythonhosted.org/kitchen/unicode-frustrations.html

https://stackoverflow.com/questions/20151835/how-to-convert-wrongly-encoded-data-to-utf-8

https://docs.python.org/2/howto/unicode.html

https://docs.python.org/2/tutorial/introduction.html#unicode-strings
