Unicode HOWTO
3.11.4
24, 2023
Python Software Foundation
Email: docs@python.org
Contents
1 Introduction to Unicode
1.1 Definitions
1.2 Encodings
1.3 References
2 Python's Unicode Support
2.1 The String Type
2.2 Converting to Bytes
2.3 Unicode Literals in Python Source Code
2.4 Unicode Properties
2.5 Comparing Strings
2.6 Unicode Regular Expressions
2.7 References
3 Reading and Writing Unicode Data
3.1 Unicode Filenames
3.2 Tips for Writing Unicode-aware Programs
3.3 References
4 Acknowledgements
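Release 1.12
This HOWTO discusses Python's support for the Unicode specification for representing textual data, and explains various problems that people commonly encounter when trying to work with Unicode.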
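1 Introduction to Unicode
1.1 Definitions
Today's programs need to handle a wide variety of characters. Applications are often internationalized to display messages and output in several user-selectable languages, and web content can be written in any of them. Python's string type uses the Unicode Standard for representing characters, which lets Python programs work with all these different possible characters.
Unicode (https://www.unicode.org/) is a specification that aims to list every character used by human languages and give each character its own unique code. The Unicode specifications are continually revised and updated to add new languages and symbols.
A character is the smallest possible component of a text. 'A', 'B', 'C', etc., are all different characters. So are 'È' and 'Í'. Characters vary depending on the language or context you're talking about. For example, there's a character for "Roman Numeral One", 'Ⅰ', that's separate from the uppercase letter 'I'. They usually look the same, but they are two different characters with different meanings.
The Unicode standard describes how characters are represented by code points. A code point value is an integer in the range 0 to 0x10FFFF. In the standard and in this document, a code point is written using the notation U+265E to mean the character with value 0x265e.
Strictly, these definitions imply that it's meaningless to say "this is character U+265E". U+265E is a code point, which represents some particular character; in this case, it represents the character 'BLACK CHESS KNIGHT', '♞'. In informal contexts, this distinction between code points and characters will sometimes be forgotten.
A character is represented on a screen or on paper by a set of graphical elements called a glyph. The glyph for an uppercase A, for example, is two diagonal strokes and a horizontal stroke, though the exact details depend on the font being used. Most Python code doesn't need to worry about glyphs; figuring out the correct glyph to display is generally the job of a GUI toolkit or a terminal's font renderer.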
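As a small illustration of the distinction between a code point and a character, the sketch below uses only the built-in ord() and the standard unicodedata module. The variable names are illustrative and not part of the original text.

```python
import unicodedata

knight = '\u265e'                      # the character written as U+265E

code_point = ord(knight)               # integer value of the code point
name = unicodedata.name(knight)        # official Unicode character name
print(f'U+{code_point:04X}', name)

# ROMAN NUMERAL ONE and the uppercase letter I look alike in most fonts,
# but they are different characters with different code points.
roman_one, latin_i = '\u2160', 'I'
print(roman_one == latin_i, hex(ord(roman_one)), hex(ord(latin_i)))
```

Nothing here depends on how the glyph is drawn; only code points and names are involved.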
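1.2 Encodings
To summarize the previous section: a Unicode string is a sequence of code points, which are numbers from 0 through 0x10FFFF. This sequence of code points needs to be represented in memory as a set of code units, and code units are then mapped to 8-bit bytes. The rules for translating a Unicode string into a sequence of bytes are called a character encoding, or just an encoding.
The first encoding you might think of is using 32-bit integers as the code unit, and then using the CPU's representation of 32-bit integers. In this representation, the string "Python" might look like this:
   P           y           t           h           o           n
0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
This representation is straightforward, but using it presents a number of problems.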
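1. It's not portable; different processors order the bytes differently.
2. It's very wasteful of space. In most texts, the majority of the code points are less than 127, or less than 255, so a lot of space is occupied by 0x00 bytes. The string above takes 24 bytes, compared to the 6 bytes needed for an ASCII representation. Increased RAM usage doesn't matter too much (desktop computers have gigabytes of RAM, and strings aren't usually that large), but expanding disk and network bandwidth usage by a factor of 4 is intolerable.
3. It's not compatible with existing C functions such as strlen(), so a new family of wide string functions would need to be used.
Therefore this encoding isn't used very much, and people instead choose other encodings that are more efficient and convenient, such as UTF-8.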
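The space and portability problems can be checked directly. This is a minimal sketch that assumes the 'utf-32-le' codec is an acceptable stand-in for "32-bit integers in the CPU's byte order" on a little-endian machine; the variable names are illustrative.

```python
text = 'Python'

# 'utf-32-le' fixes the byte order and writes no byte-order mark, so each
# code point becomes exactly four bytes, least significant byte first.
fixed_width = text.encode('utf-32-le')
ascii_bytes = text.encode('ascii')

print(fixed_width.hex(' '))
print(len(fixed_width), 'bytes versus', len(ascii_bytes), 'bytes')
print(fixed_width.count(0), 'of those bytes are 0x00 padding')

# The same code points in big-endian order give a different byte sequence,
# which is the portability problem.
print(text.encode('utf-32-be') == fixed_width)
```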
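UTF-8 is one of the most commonly used encodings, and Python often defaults to using it. UTF stands for "Unicode Transformation Format", and the '8' means that 8-bit values are used in the encoding. (There are also UTF-16 and UTF-32 encodings, but they are less frequently used than UTF-8.) UTF-8 uses the following rules:
1. If the code point is < 128, it's represented by the corresponding byte value.
2. If the code point is >= 128, it's turned into a sequence of two, three, or four bytes, where each byte of the sequence is between 128 and 255.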
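The two rules can be observed by encoding a few characters and inspecting the resulting bytes. The sample characters below are an arbitrary choice made for illustration.

```python
samples = ['A', 'é', '€', '♞', '\U0001F600']

lengths = []
for ch in samples:
    encoded = ch.encode('utf-8')
    lengths.append(len(encoded))
    print(f'U+{ord(ch):06X}', len(encoded), 'byte(s):', encoded.hex(' '))

# Rule 1: code points below 128 map to the identical single byte value.
single = 'A'.encode('utf-8')

# Rule 2: every byte of a multi-byte sequence lies between 128 and 255.
multi_bytes_high = all(
    byte >= 128
    for ch in samples if ord(ch) >= 128
    for byte in ch.encode('utf-8')
)
print(multi_bytes_high)
```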
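UTF-8 has several convenient properties:
1. It can handle any Unicode code point.
2. A Unicode string is turned into a sequence of bytes that contains embedded zero bytes only where they represent the null character (U+0000). This means that UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can't handle zero bytes for anything other than end-of-string markers.
3. A string of ASCII text is also valid UTF-8 text.
4. UTF-8 is fairly compact; the majority of commonly used characters can be represented with one or two bytes.
5. If bytes are corrupted or lost, it's possible to determine the start of the next UTF-8-encoded code point and resynchronize. It's also unlikely that random 8-bit data will look like valid UTF-8.
6. UTF-8 is a byte-oriented encoding. The encoding specifies that each character is represented by a specific sequence of one or more bytes. This avoids the byte-ordering issues that can occur with integer- and word-oriented encodings such as UTF-16 and UTF-32, where the sequence of bytes varies depending on the hardware on which the string was encoded.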
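Several of these properties can be checked with a few lines of code. This is a sketch with an arbitrary sample string; the resynchronization check relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx.

```python
ascii_text = 'plain ASCII'
text = 'naïve ♞'
utf8 = text.encode('utf-8')

# Property 3: ASCII text is byte-for-byte identical when encoded as UTF-8.
ascii_is_utf8 = ascii_text.encode('ascii') == ascii_text.encode('utf-8')

# Property 2: no zero bytes appear unless the text contains U+0000.
has_zero = 0 in utf8

# Property 5: a byte that is not a continuation byte (10xxxxxx) starts a new
# code point, so a reader can find character boundaries from any offset.
starts = [i for i, byte in enumerate(utf8) if byte & 0xC0 != 0x80]

# Property 6: UTF-16 depends on byte order, UTF-8 does not.
utf16_differs = text.encode('utf-16-le') != text.encode('utf-16-be')

print(ascii_is_utf8, has_zero, starts, utf16_differs)
```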
1.3 References
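2 Python's Unicode Support
Now that you've learned the rudiments of Unicode, we can look at Python's Unicode features.
2.1 The String Type
Since Python 3.0, the language's str type contains Unicode characters, meaning any string created using "unicode rocks!", 'unicode rocks!', or the triple-quoted string syntax is stored as Unicode.
The default encoding for Python source code is UTF-8, so you can simply include a Unicode character in a string literal: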
try:
    with open('/tmp/input.txt', 'r') as f:
        ...
except OSError:
    # 'File not found' error message.
    print("Fichier non trouvé")
Side note: Python 3 also supports using Unicode characters in identifiers:
répertoire = "/tmp/records.log"
with open(répertoire, "w") as f:
    f.write("test\n")
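If you can't enter a particular character in your editor or want to keep the source code ASCII-only for some reason, you can also use escape sequences in string literals, such as "\N{GREEK CAPITAL LETTER DELTA}", "\u0394", or "\U00000394" for a capital delta. (Depending on your system, you may see the actual capital-delta glyph instead of a \u escape when such a string is displayed.)
One-character Unicode strings can also be created with the chr() built-in function, which takes an integer and returns a Unicode string of length 1 that contains the corresponding code point. The reverse operation is the built-in ord() function, which takes a one-character Unicode string and returns the code point value: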
>>> chr(57344)
'\ue000'
>>> ord('\ue000')
57344
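The escape forms and chr() all produce the same one-character string. The following sketch compares them and also shows that chr() rejects values above 0x10FFFF; the variable names are illustrative.

```python
delta_by_name = '\N{GREEK CAPITAL LETTER DELTA}'   # character name
delta_16 = '\u0394'                                # four hex digits
delta_32 = '\U00000394'                            # eight hex digits
delta_chr = chr(0x394)                             # built-in chr()

all_equal = delta_by_name == delta_16 == delta_32 == delta_chr

# ord() and chr() are inverses over the whole code point range.
round_trip_ok = all(ord(chr(cp)) == cp for cp in (0x41, 0x394, 0xE000, 0x10FFFF))

# Values past the end of the Unicode range are rejected.
try:
    chr(0x110000)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True

print(all_equal, round_trip_ok, out_of_range_rejected)
```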
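2.2 Converting to Bytes
The opposite method of bytes.decode() is str.encode(), which returns a bytes representation of the Unicode string, encoded in the requested encoding. Its errors parameter controls what happens when a character cannot be encoded.
The low-level routines for registering and accessing the available encodings are found in the codecs module. Implementing new encodings also requires understanding the codecs module. However, the encoding and decoding functions returned by this module are usually more low-level than is comfortable, and writing new encodings is a specialized task, so the module won't be covered in this HOWTO.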
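A short sketch of str.encode() with several standard error handlers follows. The sample string is an arbitrary choice; the handler names are the ones documented for the errors argument.

```python
u = chr(40960) + 'abcd' + chr(1972)    # two non-ASCII characters around 'abcd'

utf8 = u.encode('utf-8')               # UTF-8 can encode every code point

# ASCII cannot represent the first and last characters, so the result
# depends on the error handler passed as the second argument.
results = {
    handler: u.encode('ascii', handler)
    for handler in ('ignore', 'replace', 'xmlcharrefreplace',
                    'backslashreplace', 'namereplace')
}
for handler, encoded in results.items():
    print(f'{handler:18}', encoded)

# The default handler, 'strict', raises an exception instead.
try:
    u.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```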
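2.3 Unicode Literals in Python Source Code
In Python source code, specific Unicode code points can be written using the \u escape sequence, which is followed by four hex digits giving the code point. The \U escape sequence is similar, but expects eight hex digits, not four: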
>>> s = "a\xac\u1234\u20ac\U00008000"
... #     ^^^^ two-digit hex escape
... #         ^^^^^^ four-digit Unicode escape
... #                     ^^^^^^^^^^ eight-digit Unicode escape
>>> [ord(c) for c in s]
[97, 172, 4660, 8364, 32768]
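Using escape sequences for code points greater than 127 is fine in small doses, but becomes an annoyance if you're using many accented characters, as you would in a program with messages in French or some other accent-using language. You can also assemble strings using the chr() built-in function, but this is even more tedious.
Ideally, you'd want to be able to write literals in your language's natural encoding. You could then edit Python source code with your favorite editor, which would display the accented characters naturally, and have the right characters used at runtime.
Python supports writing source code in UTF-8 by default, but you can use almost any encoding if you declare the encoding being used. This is done by including a special comment as either the first or second line of the source file: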
#!/usr/bin/env python
# -*- coding: latin-1 -*-
u = 'abcdé'
print(ord(u[-1]))
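The effect of the declaration can be observed without creating a file, because compile() also accepts source code as bytes and honors the coding comment. This is a sketch; the byte string and the placeholder filename passed to compile() are made up for the demonstration.

```python
# Source code stored as Latin-1 bytes: 0xE9 is 'é' in that encoding.
source = b"# -*- coding: latin-1 -*-\nu = 'abcd\xe9'\n"

namespace = {}
exec(compile(source, '<latin-1 example>', 'exec'), namespace)
last = namespace['u'][-1]
print(last, ord(last))

# Without the declaration the bytes are treated as UTF-8, and a lone 0xE9
# byte is not valid UTF-8, so the source is rejected.
try:
    compile(b"u = 'abcd\xe9'\n", '<no declaration>', 'exec')
    rejected = False
except (SyntaxError, UnicodeDecodeError):
    rejected = True
print(rejected)
```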
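2.4 Unicode Properties
The Unicode specification includes a database of information about code points. For each defined code point, the information includes the character's name, its category, and the numeric value if applicable (for characters representing numeric concepts such as Roman numerals or fractions). There are also display-related properties, such as how to use the code point in bidirectional text.
The following program displays some information about several characters, and prints the numeric value of one particular character: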
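import unicodedata
# Sample characters taken from several scripts and symbol blocks.
u = chr(233) + chr(0x0bf2) + chr(3972) + chr(6000) + chr(13231)
for i, c in enumerate(u):
    print(i, '%04x' % ord(c), unicodedata.category(c), end=" ")
    print(unicodedata.name(c))
# Get numeric value of second character
print(unicodedata.numeric(u[1]))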
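The two-letter category codes returned by unicodedata.category() are easier to read with a few familiar characters. The helper describe() below is hypothetical and exists only for this illustration.

```python
import unicodedata

def describe(ch):
    """Return (code point, category, name) for a one-character string."""
    return f'U+{ord(ch):04X}', unicodedata.category(ch), unicodedata.name(ch, '<unnamed>')

# Lowercase letter, uppercase letter, decimal digit, combining mark,
# space separator, and a fraction that has a numeric value.
for ch in ('a', 'A', '9', '\u0302', ' ', '\u00bd'):
    print(*describe(ch))

categories = [unicodedata.category(ch) for ch in ('a', 'A', '9', '\u0302', ' ', '\u00bd')]
half = unicodedata.numeric('\u00bd')
print(categories, half)
```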
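2.5 Comparing Strings
Unicode adds some complication to comparing strings, because the same set of characters can be represented by different sequences of code points. For example, a letter like 'ê' can be represented as a single code point U+00EA, or as U+0065 U+0302, which is the code point for 'e' followed by a code point for 'COMBINING CIRCUMFLEX ACCENT'. These will produce the same output when printed, but one is a string of length 1 and the other is of length 2.
One tool for a case-insensitive comparison is the casefold() string method, which converts a string to a case-insensitive form following an algorithm described by the Unicode Standard. This algorithm has special handling for characters such as the German letter 'ß' (code point U+00DF), which becomes the pair of lowercase letters 'ss'.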
>>> street = 'Gürzenichstraße'
>>> street.casefold()
'gürzenichstrasse'
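A second tool is the unicodedata module's normalize() function, which converts strings to one of several normal forms, where letters followed by a combining character are replaced with single characters. normalize() can be used to perform string comparisons that won't falsely report inequality if two strings use combining characters differently:
import unicodedata
def compare_strs(s1, s2):
    # Compare after normalizing both strings to the same form (NFD).
    def NFD(s):
        return unicodedata.normalize('NFD', s)
    return NFD(s1) == NFD(s2)
single_char = 'ê'
multiple_chars = '\N{LATIN SMALL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'
print('length of first string=', len(single_char))
print('length of second string=', len(multiple_chars))
print(compare_strs(single_char, multiple_chars))
When run, this outputs: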
$ python3 compare-strs.py
length of first string= 1
length of second string= 2
True
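The first argument to the normalize() function is a string giving the desired normalization form, which can be one of 'NFC', 'NFKC', 'NFD', and 'NFKD'.
The Unicode Standard also specifies how to do caseless comparisons:
import unicodedata
def compare_caseless(s1, s2):
    def NFD(s):
        return unicodedata.normalize('NFD', s)
    return NFD(NFD(s1).casefold()) == NFD(NFD(s2).casefold())
# Example usage
single_char = 'ê'
multiple_chars = '\N{LATIN CAPITAL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'
print(compare_caseless(single_char, multiple_chars))
This will print True. (NFD() is invoked twice because a few characters make casefold() return a non-normalized string, so the result needs to be normalized again.)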
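The normalization forms differ in whether they compose or decompose characters, and in whether they also fold compatibility characters. The sketch below uses 'ê' and the 'ﬁ' ligature (U+FB01) as arbitrary examples.

```python
import unicodedata

composed = '\u00ea'        # 'ê' as a single code point
decomposed = 'e\u0302'     # 'e' followed by COMBINING CIRCUMFLEX ACCENT

# NFC composes where possible, NFD decomposes where possible.
nfc = unicodedata.normalize('NFC', decomposed)
nfd = unicodedata.normalize('NFD', composed)
print(len(nfc), len(nfd))

# The "K" forms additionally replace compatibility characters, such as
# the 'ﬁ' ligature, with their plain equivalents.
ligature = '\ufb01'
nfc_ligature = unicodedata.normalize('NFC', ligature)
nfkc_ligature = unicodedata.normalize('NFKC', ligature)
print(repr(nfc_ligature), repr(nfkc_ligature))
```

For plain equality checks it matters less which form is chosen than that both strings are normalized to the same one.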
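2.6 Unicode Regular Expressions
The regular expressions supported by the re module can be provided either as bytes or strings. Some of the special character sequences such as \d and \w have different meanings depending on whether the pattern is supplied as bytes or a string. For example, \d will match the characters [0-9] in bytes, but in strings will match any character that's in the 'Nd' category.
The string in this example has the number 57 written in both Thai and Arabic numerals: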
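import re
p = re.compile(r'\d+')
s = "Over \u0e55\u0e57 57 flavours"
print(repr(p.search(s).group()))
When executed, \d+ will match the Thai numerals and print them out. If you supply the re.ASCII flag to compile(), \d+ will match the substring "57" instead.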
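The difference between string patterns, the re.ASCII flag, and bytes patterns can be compared side by side. This sketch uses a sample string containing the Thai digits five and seven followed by ASCII "57"; the variable names are illustrative.

```python
import re

s = "Over \u0e55\u0e57 57 flavours"     # Thai digits 5 and 7, then ASCII "57"

# String pattern: \d matches any character in the 'Nd' category.
unicode_digits = re.compile(r'\d+').search(s).group()

# re.ASCII restricts \d to [0-9] even for a string pattern.
ascii_digits = re.compile(r'\d+', re.ASCII).search(s).group()

# Bytes pattern: \d only ever matches the bytes for [0-9].
bytes_digits = re.compile(rb'\d+').search(s.encode('utf-8')).group()

print(repr(unicode_digits), repr(ascii_digits), repr(bytes_digits))
```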
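2.7 References
Some good alternative discussions of Python's Unicode support are:
• Processing Text Files in Python 3, by Nick Coghlan.
• Pragmatic Unicode, a PyCon 2012 presentation by Ned Batchelder.
The str type is described in the Python library reference at textseq.
The documentation for the unicodedata module.
The documentation for the codecs module.
Marc-André Lemburg gave a presentation titled "Python and Unicode" at EuroPython 2002 (PDF slides: https://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf). The slides are an excellent overview of the design of Python 2's Unicode features (where the Unicode string type is called unicode and literals start with u).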
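3 Reading and Writing Unicode Data
Once you've written some code that works with Unicode data, the next problem is input/output. How do you get Unicode strings into your program, and how do you convert Unicode into a form suitable for storage or transmission?
It's possible that you may not need to do anything, depending on your input sources and output destinations; you should check whether the libraries used in your application support Unicode natively. XML parsers often return Unicode data, for example. Many relational databases also support Unicode-valued columns and can return Unicode values from an SQL query.
Unicode data is usually converted to a particular encoding before it gets written to disk or sent over a socket. It's possible to do all the work yourself: open a file, read an 8-bit bytes object from it, and convert the bytes with bytes.decode(encoding). However, the manual approach is not recommended.
One problem is the multi-byte nature of encodings; one Unicode character can be represented by several bytes. If you want to read the file in arbitrary-sized chunks (say, 1024 or 4096 bytes), you need to write error-handling code to catch the case where only part of the bytes encoding a single Unicode character are read at the end of a chunk. One solution would be to read the entire file into memory and then perform the decoding, but that prevents you from working with files that are extremely large; if you need to read a 2 GB file, you need 2 GB of RAM.
The solution is to use the low-level decoding interface to catch the case of partial coding sequences. The work of implementing this has already been done for you: the built-in open() function can return a file-like object that assumes the file's contents are in a specified encoding and accepts Unicode parameters for methods such as read() and write().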
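The chunk-boundary problem described above can be reproduced in memory. This sketch uses a deliberately tiny chunk size and the incremental decoder from the codecs module, which is one way to handle partial sequences by hand; for ordinary files, open() with an encoding argument does this buffering for you.

```python
import codecs

data = 'naïve café'.encode('utf-8')
chunk_size = 3     # deliberately tiny so that a two-byte character is split
chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

# Decoding a chunk on its own fails when the chunk ends mid-character.
try:
    chunks[0].decode('utf-8')
    naive_failed = False
except UnicodeDecodeError:
    naive_failed = True

# An incremental decoder keeps the incomplete trailing bytes and completes
# the character when the next chunk arrives.
decoder = codecs.getincrementaldecoder('utf-8')()
pieces = [decoder.decode(chunk) for chunk in chunks]
pieces.append(decoder.decode(b'', final=True))   # raises if bytes are left over
text = ''.join(pieces)

print(naive_failed, repr(text))
```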
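3.1 Unicode filenames
Most of the operating systems in common use today support filenames that contain arbitrary Unicode characters. Usually this is implemented by converting the Unicode string into some encoding that varies depending on the system. Today Python is converging on using UTF-8: Python on MacOS has used UTF-8 for several versions, and Python 3.6 switched to using UTF-8 on Windows as well. On Unix systems, there will only be a filesystem encoding if you've set the LANG or LC_CTYPE environment variables; if you haven't, the default encoding is again UTF-8.
The sys.getfilesystemencoding() function returns the encoding to use on your current system, in case you want to do the encoding manually, but there's not much reason to bother. When opening a file for reading or writing, you can usually just provide the Unicode string as the filename, and it will be automatically converted to the right encoding for you: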
filename = 'filename\u4500abc'
with open(filename, 'w') as f:
    f.write('blah\n')
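Functions in the os module such as os.stat() will also accept Unicode filenames.
The os.listdir() function returns filenames, which raises an issue: should it return the Unicode version of filenames, or should it return bytes containing the encoded versions? os.listdir() can do both, depending on whether you provided the directory path as bytes or a Unicode string. If you pass a Unicode string as the path, filenames will be decoded using the filesystem's encoding and a list of Unicode strings will be returned, while passing a byte path will return the filenames as bytes.
For example, assuming the default filesystem encoding is UTF-8, running the following program: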
fn = 'filename\u4500abc'
f = open(fn, 'w')
f.close()
import os
print(os.listdir(b'.'))
print(os.listdir('.'))
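will produce the following output:
$ python listdir-test.py
[b'filename\xe4\x94\x80abc', ...]
['filename\u4500abc', ...]
The first list contains UTF-8-encoded filenames, and the second list contains the Unicode versions.
Note that on most occasions, you should just stick with using Unicode with these APIs. The bytes APIs should only be used on systems where undecodable file names can be present; that's pretty much only Unix systems now.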
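The relationship between the two forms of os.listdir() can be checked in a temporary directory so that nothing is left behind. This is a sketch that assumes the filesystem encoding can represent the sample name (true for UTF-8); os.fsencode() is used to convert between the str and bytes forms.

```python
import os
import sys
import tempfile

name = 'filename\u4500abc'

with tempfile.TemporaryDirectory() as directory:
    with open(os.path.join(directory, name), 'w') as f:
        f.write('blah\n')

    as_str = os.listdir(directory)                  # str path -> str names
    as_bytes = os.listdir(os.fsencode(directory))   # bytes path -> bytes names

print(sys.getfilesystemencoding())
print(as_str, as_bytes)
```

os.fsdecode() performs the reverse conversion, turning a bytes filename back into a string using the filesystem encoding.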
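3.2 Tips for Writing Unicode-aware Programs
This section provides some suggestions on writing software that deals with Unicode.
The most important tip is:
Software should only work with Unicode strings internally, decoding the input data as soon as possible and encoding the output only at the end.
If you attempt to write processing functions that accept both Unicode and byte strings, you will find your program vulnerable to bugs wherever you combine the two different kinds of strings. There is no automatic encoding or decoding: if you do, for example, str + bytes, a TypeError will be raised.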
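A minimal sketch of this rule follows: decode once at the input boundary, work with str internally, and encode once at the output boundary. The byte string stands in for data received from a file or socket and is made up for the example.

```python
text = 'price: '
payload = b'\xe2\x82\xac10'      # UTF-8 bytes for '€10', e.g. read from a socket

# Mixing the two types is an error; Python never guesses an encoding.
try:
    combined = text + payload
except TypeError as exc:
    mixing_error = exc
else:
    mixing_error = None

# Decode at the input boundary, work with str, encode at the output boundary.
combined = text + payload.decode('utf-8')
encoded_output = combined.encode('utf-8')

print(type(mixing_error).__name__, combined, encoded_output)
```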
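When using data coming from a web browser or some other untrusted source, a common technique is to check for illegal characters in a string before using the string in a generated command line or storing it in a database. If you're doing this, be careful to check the decoded string, not the encoded bytes data; some encodings may have interesting properties, such as not being bijective or not being fully ASCII-compatible.
Converting Between File Encodings
The StreamRecoder class can transparently convert between encodings, taking a stream that returns data in encoding #1 and behaving like a stream returning data in encoding #2.
For example, if you have an input file f that's in Latin-1, you can wrap it with a StreamRecoder to return bytes encoded in UTF-8: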
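new_f = codecs.StreamRecoder(f,
    # en/decoder: used by read() to encode its results and
    # by write() to decode its input.
    codecs.getencoder('utf-8'), codecs.getdecoder('utf-8'),
    # reader/writer: used to read and write to the stream.
    codecs.getreader('latin-1'), codecs.getwriter('latin-1'))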
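A self-contained version of this recoding can be tried with in-memory streams. In this sketch, io.BytesIO objects stand in for real files; the sample text and variable names are illustrative.

```python
import codecs
import io

def recoder(stream):
    """Wrap a Latin-1 byte stream so that callers see UTF-8 bytes."""
    return codecs.StreamRecoder(
        stream,
        codecs.getencoder('utf-8'), codecs.getdecoder('utf-8'),
        codecs.getreader('latin-1'), codecs.getwriter('latin-1'))

# Reading: Latin-1 bytes in the underlying stream come out as UTF-8 bytes.
latin1_source = io.BytesIO('café'.encode('latin-1'))
utf8_data = recoder(latin1_source).read()

# Writing: UTF-8 bytes given to write() land in the stream as Latin-1.
latin1_sink = io.BytesIO()
recoder(latin1_sink).write('café'.encode('utf-8'))

print(utf8_data, latin1_sink.getvalue())
```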
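Files in an Unknown Encoding
What can you do if you need to make a change to a file, but don't know the file's encoding? If you know the encoding is ASCII-compatible and only want to examine or modify the ASCII parts, you can open the file with the surrogateescape error handler. This handler decodes any non-ASCII bytes as code points in a special range, and those code points turn back into the same bytes when the handler is used to encode the data and write it back out.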
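The round trip can be demonstrated on a byte string without touching the filesystem. In this sketch the bytes are made-up sample data in an encoding the program pretends not to know; the same errors="surrogateescape" argument can be passed to open().

```python
# Bytes in some unknown ASCII-compatible encoding (here, Latin-1 for 'café').
raw = b'key=caf\xe9\n'

# Non-ASCII bytes are smuggled through as lone surrogate code points.
text = raw.decode('ascii', errors='surrogateescape')
escaped = text[7]                      # the code point standing in for 0xE9

# Only the ASCII part is modified.
text = text.replace('key', 'name')

# Encoding with the same handler restores the original undecodable bytes.
result = text.encode('ascii', errors='surrogateescape')

print(hex(ord(escaped)), result)
```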
3.3 References
One section of Mastering Python 3 Input/Output, a PyCon 2010 talk by David Beazley, discusses text processing and
binary data handling.
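The PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware Applications in Python" discuss questions of character encodings as well as how to internationalize and localize an application. These slides cover Python 2.x only.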
The Guts of Unicode in Python is a PyCon 2013 talk by Benjamin Peterson that discusses the internal Unicode
representation in Python 3.3.