Python Unicode Objects
Python's Unicode string type stores characters from the Unicode character set.
In this set, each distinct character has its own number, the code point.
Unicode supports more than one million code points. Unicode characters
don't have an encoding; each character is represented by its code. The Unicode
string type uses some unknown mechanism to store the characters; in your
Python code, Unicode strings simply appear as sequences of characters, just
like 8-bit strings appear as sequences of bytes.
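As a small illustration of code points, the sketch below uses the built-in ord() to look up the number behind each character, and len() to show that a Unicode string is counted in characters rather than bytes. The sample text is made up for this example, and the u"..." prefix is used so that the snippet behaves the same on Python 2 and on Python 3.3 or later.

```python
# Sample text chosen for illustration: \xe9 is e-acute, \u20ac is the euro sign.
text = u"caf\xe9 \u20ac"

# Each character has its own code point, independent of any byte encoding.
code_points = [ord(ch) for ch in text]

# The string is a sequence of characters: four letters, a space, the euro sign.
length = len(text)

print(code_points)
print(length)
```

Note that nothing here says how many bytes the string would need in a file; that only becomes a question once an encoding is chosen.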
Observations:
1. Text files always contain encoded text, not characters. Each character in
the text is encoded as one or more bytes in the file.
3. You can mix Python Unicode strings with 8-bit Python strings, as long
as the 8-bit string only contains ASCII characters. A Unicode-aware
library may choose to use 8-bit strings for text that only contains ASCII,
to save space and time.
4. If you read a line of text from a file, you get bytes, not characters.
6. To decode a string, use the decode() method on the input string, and
pass it the name of the encoding:
fileencoding = "iso-8859-1"
raw = file.readline()  # file is an open file object; raw holds encoded bytes
txt = raw.decode(fileencoding)  # txt is a Unicode string
The decode method was added in Python 2.2. In earlier versions (or if
you think it reads better), use the unicode constructor instead:
txt = unicode(raw, fileencoding)
For more on string encoding, see Converting Unicode Strings to 8-bit
Strings.
print txt.encode(output_encoding)
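The observations above can be tied together in a short round trip. This is only a sketch: the sample bytes, the in-memory io.BytesIO object standing in for a real file, and the values chosen for fileencoding and output_encoding are assumptions made for the example. It is written to run on Python 2.6 or later and on Python 3, so it uses b"..." for the 8-bit data and calls print with a single argument.

```python
import io

fileencoding = "iso-8859-1"
output_encoding = "utf-8"

# Stand-in for a file opened in binary mode; the bytes are ISO-8859-1 encoded.
infile = io.BytesIO(b"caf\xe9\n")

raw = infile.readline()          # encoded bytes, not characters
txt = raw.decode(fileencoding)   # Unicode string, one item per character

# The same characters may need a different number of bytes in another encoding.
out = txt.encode(output_encoding)

print("%r %d" % (raw, len(raw)))
print("%r %d" % (txt, len(txt)))
print("%r %d" % (out, len(out)))
```

In ISO-8859-1 every character is a single byte, so raw and txt have the same length here. After encoding to UTF-8, the e-acute takes two bytes, so out is one byte longer than the text has characters.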
There are lots of shortcuts in Python, including coded streams, using default
locales for pattern matching, ISO-8859-1 as a subset of Unicode, etc., but
that's outside the scope of this note. At least for the moment.