Python Unicode Objects http://effbot.org/zone/unicode-objects.htm

Python Unicode Objects

Some Observations on Working With Non-ASCII Character Sets
This note provides some brief information on best practices for working with
non-ASCII data in Python 2.0 and later. As everything else on this site, this is
a work in progress.
Updated June 21, 2004 | February 11, 2002 | Fredrik Lundh

Python's Unicode string type stores characters from the Unicode character set.
In this set, each distinct character has its own number, the code point.
Unicode supports more than one million code points. Unicode characters
don't have an encoding; each character is represented by its code point. The
Unicode string type uses an unspecified internal mechanism to store the
characters; in your Python code, Unicode strings simply appear as sequences of
characters, just like 8-bit strings appear as sequences of bytes.
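
As a small illustration of the difference between characters and bytes, the sketch below inspects one sample string. The sample text is chosen here for illustration and is not from the original note; the `u"..."` literals are used so that the snippet should behave the same on Python 2.6+ and Python 3.3+.

```python
# Illustrative sample: "cafe" with an accented e, followed by a euro sign.
text = u"caf\xe9\u20ac"

# Each character has a code point, independent of any encoding.
code_points = [ord(ch) for ch in text]

# The same characters occupy a different number of bytes once encoded.
utf8_bytes = text.encode("utf-8")

print(code_points)      # one integer per character
print(len(text))        # number of characters
print(len(utf8_bytes))  # number of bytes in the UTF-8 encoding
```

The character count and the byte count differ because UTF-8 uses more than one byte for each of the two non-ASCII characters.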

Observations:

1. Text files always contain encoded text, not characters. Each character in
the text is encoded as one or more bytes in the file.

2. Most popular encodings (UTF-8, ISO-8859-X, etc.) are supersets of
ASCII. This means that the first 128 characters have the usual meaning,
and that the usual characters are used for line endings. In other words,
readline() will work just fine.
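
A minimal sketch of why readline() keeps working on encoded text: the newline is the same single ASCII byte in both encodings tried here. The sample text and the use of `io.BytesIO` as a stand-in for a binary file are assumptions made for this illustration.

```python
import io

# Illustrative data: two lines containing non-ASCII characters.
text = u"sm\xf6rg\xe5s\nna\xefve\n"

first_lines = {}
for encoding in ("utf-8", "iso-8859-1"):
    stream = io.BytesIO(text.encode(encoding))  # stand-in for a binary file
    # readline() stops at the ASCII newline byte in either encoding.
    first_lines[encoding] = stream.readline()

# Pure ASCII text encodes to identical bytes in both encodings.
same_ascii = (u"plain ascii\n".encode("utf-8")
              == u"plain ascii\n".encode("iso-8859-1"))
```

The two entries in `first_lines` hold different byte sequences, but each decodes back to the same first line when its own encoding is used.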

3. You can mix Python Unicode strings with 8-bit Python strings, as long
as the 8-bit string only contains ASCII characters. A Unicode-aware
library may choose to use 8-bit strings for text that only contains ASCII,
to save space and time.

4. If you read a line of text from a file, you get bytes, not characters.

5. To decode an encoded string into a string of well-defined characters,
you have to know what encoding it uses.

6. To decode a string, use the decode() method on the input string, and
pass it the name of the encoding:

    fileencoding = "iso-8859-1"

    raw = file.readline()
    txt = raw.decode(fileencoding)

(the result is a Python Unicode string).

The decode method was added in Python 2.2. In earlier versions (or if
you think it reads better), use the unicode constructor instead:

    txt = unicode(raw, fileencoding)
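
The read-then-decode step can be tried without a real file. In the sketch below, `io.BytesIO` stands in for a file opened in binary mode, and the sample bytes are hypothetical; it also shows what may happen when the encoding is guessed wrongly.

```python
import io

file_encoding = "iso-8859-1"  # must be known in advance; assumed here

# Stand-in for a file opened in binary mode (hypothetical sample data).
stream = io.BytesIO(b"caf\xe9 au lait\nsecond line\n")

raw = stream.readline()          # encoded bytes, not characters
txt = raw.decode(file_encoding)  # Unicode string

# Decoding with the wrong encoding either fails or yields wrong characters.
try:
    raw.decode("utf-8")
    wrong_guess_failed = False
except UnicodeDecodeError:
    wrong_guess_failed = True
```

Here the UTF-8 attempt fails because the byte for the accented character is not valid UTF-8 in that position. A wrong guess does not always fail this visibly, which is why the encoding has to be known rather than guessed.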

7. Python's regular expression engine supports Unicode. You can apply
the same pattern to either 8-bit (encoded) or Unicode strings. To create
a regular expression pattern that uses Unicode character classes for \w
(and \s, and \b), use the (?u) flag prefix, or the re.UNICODE flag:

    pattern = re.compile("(?u)pattern")

    pattern = re.compile("pattern", re.UNICODE)
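
A short sketch of the flag in use, with sample text made up for this illustration. On Python 3, str patterns are already Unicode-aware, so there the flag is accepted but redundant.

```python
import re

# With the Unicode flag, \w matches letters outside ASCII as well.
word = re.compile(u"(?u)\\w+")

text = u"na\xefve caf\xe9, \xfcber"
words = word.findall(text)

print(len(words))  # number of words found
```

Without the flag on Python 2, `\w` would only match ASCII letters, digits, and the underscore (unless the locale flag is used), so the accented words would be split apart.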

8. To write a Unicode string to a file or other device, you have to convert it
to the encoding used by the file. The encode method converts from
Unicode to an encoded string.

    out = txt.encode(encoding)

If the string contains characters that cannot be represented in the given
encoding, Python raises an exception. You can change this by passing in
a second argument to encode:

    # skip bad chars
    out = txt.encode(encoding, "ignore")

    # replace bad chars with "?"
    out = txt.encode(encoding, "replace")

For more on string encoding, see Converting Unicode Strings to 8-bit
Strings.
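
The three behaviours can be compared side by side. The sketch below uses a sample string containing a euro sign, which ISO-8859-1 cannot represent; the sample text and variable names are illustrative only.

```python
text = u"5 \u20ac caf\xe9"  # the euro sign is not in ISO-8859-1
encoding = "iso-8859-1"

try:
    text.encode(encoding)   # default (strict) error handling
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

ignored = text.encode(encoding, "ignore")    # unencodable character dropped
replaced = text.encode(encoding, "replace")  # unencodable character becomes "?"
```

Note that the accented character survives in both results, since ISO-8859-1 can represent it; only the euro sign is affected.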

9. To print a Unicode string to your output device, you have to convert it
to the encoding used by your terminal. The encode() method converts
from Unicode back to an encoded string. You can use the
locale.getdefaultlocale() function to get the current output
encoding.

    import locale
    language, output_encoding = locale.getdefaultlocale()

    print txt.encode(output_encoding)
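
A slightly more defensive variant, as a sketch only. The helper name `encode_for_output` is hypothetical, and it swaps in `locale.getpreferredencoding()` for `locale.getdefaultlocale()`, with an ASCII fallback in case the locale reports no encoding. It also uses "replace" so that characters the terminal cannot show do not raise an exception.

```python
import locale

def encode_for_output(text, encoding=None):
    """Encode a Unicode string for the terminal (hypothetical helper)."""
    if encoding is None:
        # Assumption: getpreferredencoding() is used here in place of
        # getdefaultlocale(); fall back to ASCII if nothing is reported.
        encoding = locale.getpreferredencoding(False) or "ascii"
    # "replace" substitutes "?" for characters the encoding cannot represent.
    return text.encode(encoding, "replace")

encoded = encode_for_output(u"caf\xe9", "ascii")
```

Passing the encoding explicitly, as in the last line, makes the behaviour easy to check; leaving it out picks up whatever the current locale reports.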

There are lots of shortcuts in Python, including coded streams, using default
locales for pattern matching, ISO-8859-1 as a subset of Unicode, etc., but
that's outside the scope of this note. At least for the moment.
