
Howto Unicode

Uploaded by

ryan suen

Unicode HOWTO

Release 3.11.4

Guido van Rossum and the Python development team

24, 2023
Python Software Foundation
Email: docs@python.org

Contents

1 Introduction to Unicode
1.1 Definitions
1.2 Encodings
1.3 References

2 Python's Unicode Support
2.1 The String Type
2.2 Converting to Bytes
2.3 Unicode Literals in Python Source Code
2.4 Unicode Properties
2.5 Comparing Strings
2.6 Unicode Regular Expressions
2.7 References

3 Reading and Writing Unicode Data
3.1 Unicode filenames
3.2 Tips for Writing Unicode-aware Programs
3.3 References

4 Acknowledgements

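Release 1.12
This HOWTO discusses Python's support for the Unicode specification for representing textual data, and explains various problems that people commonly encounter when trying to work with Unicode.

1 Introduction to Unicode

1.1 Definitions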

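Today's programs need to be able to handle a wide variety of characters. Python's string type uses the Unicode Standard for representing characters, which lets Python programs work with all these different possible characters.
Unicode (https://www.unicode.org/) is a specification that aims to list every character used by human languages and give each character its own unique code.

A character is the smallest possible component of a text. 'A', 'B', 'C', etc., are all different characters. So are 'È' and 'Í'. Characters vary depending on the language or context you're talking about. For example, there's a character for "Roman Numeral One", 'Ⅰ', that's separate from the uppercase letter 'I'. They'll usually look the same, but these are two different characters that have different meanings.

The Unicode standard describes how characters are represented by code points. A code point value is an integer in the range 0 to 0x10FFFF (about 1.1 million values, of which fewer are actually assigned). In the standard and in this document, a code point is written using the notation U+265E to mean the character with value 0x265e (9,822 in decimal).
The Unicode standard contains a lot of tables listing characters and their corresponding code points: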

0061 'a'; LATIN SMALL LETTER A
0062 'b'; LATIN SMALL LETTER B
0063 'c'; LATIN SMALL LETTER C
...
007B '{'; LEFT CURLY BRACKET
...
2167 'Ⅷ'; ROMAN NUMERAL EIGHT
2168 'Ⅸ'; ROMAN NUMERAL NINE
...
265E '♞'; BLACK CHESS KNIGHT
265F '♟'; BLACK CHESS PAWN
...
1F600 '😀'; GRINNING FACE
1F609 '😉'; WINKING FACE
...

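Strictly, these definitions imply that it's meaningless to say "this is character U+265E". U+265E is a code point, which represents some particular character; in this case, it represents the character BLACK CHESS KNIGHT, '♞'. In informal contexts, this distinction between code points and characters will sometimes be forgotten.

A character is represented on a screen or on paper by a set of graphical elements that's called a glyph. The glyph for an uppercase A, for example, is two diagonal strokes and a horizontal stroke, though the exact details will depend on the font being used. Most Python code doesn't need to worry about glyphs; figuring out the correct glyph to display is generally the job of a GUI toolkit or a terminal's font renderer.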
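
As a small illustration of the code point notation above, the following sketch moves between a character, its integer code point, and its name in the Unicode tables. The variable names are chosen for this example only.

```python
import unicodedata

# Build the character from its official Unicode name.
knight = "\N{BLACK CHESS KNIGHT}"

# ord() gives the integer code point; format it in U+XXXX notation.
code_point = ord(knight)
notation = f"U+{code_point:04X}"

# chr() goes the other way, from code point to a one-character string.
same_character = chr(0x265E) == knight

print(notation, code_point, unicodedata.name(knight), same_character)
```

The same three built-ins (ord(), chr() and unicodedata.name()) work for any assigned code point, not only this one.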

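1.2 Encodings

To summarize the previous section: a Unicode string is a sequence of code points, which are numbers from 0 through 0x10FFFF (1,114,111 decimal). This sequence of code points needs to be represented in memory as a set of code units, and code units are then mapped to 8-bit bytes. The rules for translating a Unicode string into a sequence of bytes are called a character encoding, or just an encoding.

The first encoding you might think of is using 32-bit integers as the code unit, and then using the CPU's representation of 32-bit integers. In this representation, the string "Python" might look like this: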

   P           y           t           h           o           n
0x 50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
   0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23

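This representation is straightforward but using it presents a number of problems.
1. It's not portable; different processors order the bytes differently.
2. It's very wasteful of space. In most texts, the majority of the code points are less than 127, or less than 255, so a lot of space is occupied by 0x00 bytes. The above string takes 24 bytes compared to the 6 bytes needed for an ASCII representation. Increased RAM usage doesn't matter too much (desktop computers have gigabytes of RAM, and strings aren't usually that large), but expanding our usage of disk and network bandwidth by a factor of 4 is intolerable.
3. It's not compatible with existing C functions such as strlen(), so a new family of wide string functions would need to be used.
Therefore this encoding isn't used very much, and people instead choose other encodings that are more efficient and convenient, such as UTF-8.
UTF-8 is one of the most commonly used encodings, and Python often defaults to using it. UTF stands for "Unicode Transformation Format", and the '8' means that 8-bit values are used in the encoding. (There are also UTF-16 and UTF-32 encodings, but they are less frequently used than UTF-8.) UTF-8 uses the following rules: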

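1. If the code point is < 128, it's represented by the corresponding byte value.
2. If the code point is >= 128, it's turned into a sequence of two, three, or four bytes, where each byte of the sequence is between 128 and 255.
UTF-8 has several convenient properties:
1. It can handle any Unicode code point.
2. A Unicode string is turned into a sequence of bytes that contains embedded zero bytes only where they represent the null character (U+0000). This means that UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can't handle zero bytes for anything other than end-of-string markers.
3. A string of ASCII text is also valid UTF-8 text.
4. UTF-8 is fairly compact; the majority of commonly used characters can be represented with one or two bytes.
5. If bytes are corrupted or lost, it's possible to determine the start of the next UTF-8-encoded code point and resynchronize. It's also unlikely that random 8-bit data will look like valid UTF-8.
6. UTF-8 is a byte oriented encoding. The encoding specifies that each character is represented by a specific sequence of one or more bytes. This avoids the byte-ordering issues that can occur with integer and word oriented encodings, like UTF-16 and UTF-32, where the sequence of bytes varies depending on the hardware on which the string was encoded.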
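
The size difference and the UTF-8 rules described in this section can be checked with a short sketch. It uses Python's 'utf-32-le' codec as a stand-in for the 32-bit representation (an assumption for illustration: that codec fixes the byte order and writes no byte-order mark), and the sample characters are arbitrary.

```python
text = "Python"

# Four bytes per code point in UTF-32, one byte per ASCII character in UTF-8.
utf32 = text.encode("utf-32-le")
utf8 = text.encode("utf-8")
print(len(utf32), len(utf8))

# Number of UTF-8 bytes needed for code points of increasing size.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]
byte_counts = [len(ch.encode("utf-8")) for ch in samples]
print(byte_counts)

# Every byte of a multi-byte sequence is in the range 128..255.
multi_byte_ok = all(b >= 128 for ch in samples[1:] for b in ch.encode("utf-8"))
print(multi_byte_ok)
```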

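1.3 References

The Unicode Consortium site has character charts, a glossary, and PDF versions of the Unicode specification. Be prepared for some difficult reading. A chronology of the origin and development of Unicode (https://www.unicode.org/history/) is also available on the site.
On the Computerphile Youtube channel, Tom Scott briefly discusses the history of Unicode and UTF-8 (https://www.youtube.com/watch?v=MijmeoH9LT4) (9 minutes 36 seconds).
To help understand the standard, Jukka Korpela has written an introductory guide to reading the Unicode character tables.
Another good introductory article was written by Joel Spolsky (https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-set-no-excuses/). If this introduction didn't make things clear to you, you should try reading this alternate article before continuing.
Wikipedia entries are often helpful; see the entries for "character encoding" and UTF-8, for example.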

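2 Python's Unicode Support

Now that you've learned the rudiments of Unicode, we can look at Python's Unicode features.

2.1 The String Type

Since Python 3.0, the language's str type contains Unicode characters, meaning any string created using "unicode rocks!", 'unicode rocks!', or the triple-quoted string syntax is stored as Unicode.
The default encoding for Python source code is UTF-8, so you can simply include a Unicode character in a string literal: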

try:
    with open('/tmp/input.txt', 'r') as f:
        ...
except OSError:
    # 'File not found' error message.
    print("Fichier non trouvé")

Side note: Python 3 also supports using Unicode characters in identifiers:

répertoire = "/tmp/records.log"
with open(répertoire, "w") as f:
    f.write("test\n")

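If you can't enter a particular character in your editor or want to keep the source code ASCII-only for some reason, you can also use escape sequences in string literals. (Depending on your system, you may see the actual capital-delta glyph instead of a \u escape.)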

>>> "\N{GREEK CAPITAL LETTER DELTA}"  # Using the character name
'\u0394'
>>> "\u0394"                          # Using a 16-bit hex value
'\u0394'
>>> "\U00000394"                      # Using a 32-bit hex value
'\u0394'

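In addition, one can create a string using the decode() method of bytes. This method takes an encoding argument, such as UTF-8, and optionally an errors argument.
The errors argument specifies the response when the input can't be converted according to the encoding's rules. Legal values for this argument are 'strict' (raise a UnicodeDecodeError exception), 'replace' (use U+FFFD, REPLACEMENT CHARACTER), 'ignore' (just leave the character out of the Unicode result), or 'backslashreplace' (insert a \xNN escape sequence). The following examples show the differences: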

>>> b'\x80abc'.decode("utf-8", "strict")
Traceback (most recent call last):
    ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0:
  invalid start byte
>>> b'\x80abc'.decode("utf-8", "replace")
'\ufffdabc'
>>> b'\x80abc'.decode("utf-8", "backslashreplace")
'\\x80abc'
>>> b'\x80abc'.decode("utf-8", "ignore")
'abc'

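Encodings are specified as strings containing the encoding's name. Python comes with roughly 100 different encodings; see the Standard Encodings section of the Python Library Reference for a list. Some encodings have multiple names; for example, 'latin-1', 'iso_8859_1' and '8859' are all synonyms for the same encoding.

One-character Unicode strings can also be created with the chr() built-in function, which takes integers and returns a Unicode string of length 1 that contains the corresponding code point. The reverse operation is the built-in ord() function that takes a one-character Unicode string and returns the code point value:
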
>>> chr(57344)
'\ue000'
>>> ord('\ue000')
57344
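
To see that the encoding names mentioned in this section really are synonyms, one possible check is to resolve each alias with codecs.lookup() and compare the canonical names it reports. This is only a sketch; the exact canonical name printed may differ between Python versions, so the example compares the names with each other rather than with a fixed string.

```python
import codecs

aliases = ("latin-1", "iso_8859_1", "8859")

# codecs.lookup() returns a CodecInfo object whose .name is the canonical name.
canonical_names = {codecs.lookup(alias).name for alias in aliases}
print(canonical_names)

# All three aliases decode the same byte to the same character.
decoded = {b"\xe9".decode(alias) for alias in aliases}
print(decoded)
```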

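2.2 Converting to Bytes

The opposite method of bytes.decode() is str.encode(), which returns a bytes representation of the Unicode string, encoded in the requested encoding.
The errors parameter is the same as the parameter of the decode() method but supports a few more possible handlers. As well as 'strict', 'ignore', and 'replace' (which in this case inserts a question mark instead of the unencodable character), there is also 'xmlcharrefreplace' (inserts an XML character reference), 'backslashreplace' (inserts a \uNNNN escape sequence) and 'namereplace' (inserts a \N{...} escape sequence).
The following example shows the different results: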

>>> u = chr(40960) + 'abcd' + chr(1972)
>>> u.encode('utf-8')
b'\xea\x80\x80abcd\xde\xb4'
>>> u.encode('ascii')
Traceback (most recent call last):
    ...
UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in
  position 0: ordinal not in range(128)
>>> u.encode('ascii', 'ignore')
b'abcd'
>>> u.encode('ascii', 'replace')
b'?abcd?'
>>> u.encode('ascii', 'xmlcharrefreplace')
b'&#40960;abcd&#1972;'
>>> u.encode('ascii', 'backslashreplace')
b'\\ua000abcd\\u07b4'
>>> u.encode('ascii', 'namereplace')
b'\\N{YI SYLLABLE IT}abcd\\u07b4'

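The low-level routines for registering and accessing the available encodings are found in the codecs module. Implementing new encodings also requires understanding the codecs module. However, the encoding and decoding functions returned by this module are usually more low-level than is comfortable, and writing new encodings is a specialized task, so the module won't be covered in this HOWTO.

2.3 Unicode Literals in Python Source Code

In Python source code, specific Unicode code points can be written using the \u escape sequence, which is followed by four hex digits giving the code point. The \U escape sequence is similar, but expects eight hex digits, not four: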

>>> s = "a\xac\u1234\u20ac\U00008000"
... #     ^^^^ two-digit hex escape
... #         ^^^^^^ four-digit Unicode escape
... #                     ^^^^^^^^^^ eight-digit Unicode escape
>>> [ord(c) for c in s]
[97, 172, 4660, 8364, 32768]

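Using escape sequences for code points greater than 127 is fine in small doses, but becomes an annoyance if you're using many accented characters, as you would in a program with messages in French or some other accent-using language. You can also assemble strings using the chr() built-in function, but this is even more tedious.

Ideally, you'd want to be able to write literals in your language's natural encoding. You could then edit Python source code with your favorite editor, which would display the accented characters naturally, and have the right characters used at runtime.

Python supports writing source code in UTF-8 by default, but you can use almost any encoding if you declare the encoding being used. This is done by including a special comment as either the first or second line of the source file: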

#!/usr/bin/env python
# -*- coding: latin-1 -*-

u = 'abcdé'
print(ord(u[-1]))

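The syntax is inspired by Emacs's notation for specifying variables local to a file. Emacs supports many different variables, but Python only supports 'coding'. The -*- symbols indicate to Emacs that the comment is special; they have no significance to Python but are a convention. Python looks for coding: name or coding=name in the comment.
If you don't include such a comment, the default encoding used will be UTF-8 as already mentioned. See also PEP 263 for more information.

2.4 Unicode Properties

The Unicode specification includes a database of information about code points. For each defined code point, the information includes the character's name, its category, and the numeric value if applicable (for characters representing numeric concepts such as the Roman numerals, or fractions such as one-third and four-fifths). There are also display-related properties, such as how to use the code point in bidirectional text.
The following program displays some information about several characters, and prints the numeric value of one particular character: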

import unicodedata

u = chr(233) + chr(0x0bf2) + chr(3972) + chr(6000) + chr(13231)

for i, c in enumerate(u):
    print(i, '%04x' % ord(c), unicodedata.category(c), end=" ")
    print(unicodedata.name(c))

# Get numeric value of second character
print(unicodedata.numeric(u[1]))

0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
1 0bf2 No TAMIL NUMBER ONE THOUSAND
2 0f84 Mn TIBETAN MARK HALANTA
3 1770 Lo TAGBANWA LETTER SA
4 33af So SQUARE RAD OVER S SQUARED
1000.0

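The category codes are abbreviations describing the nature of the character. These are grouped into categories such as "Letter", "Number", "Punctuation", or "Symbol", which in turn are broken up into subcategories. To take the codes from the above output, 'Ll' means "Letter, lowercase", 'No' means "Number, other", 'Mn' is "Mark, nonspacing", and 'So' is "Symbol, other". See the General Category Values section of the Unicode Character Database documentation (https://www.unicode.org/reports/tr44/#General_Category_Values) for a list of category codes.

2.5 Comparing Strings

Unicode adds some complication to comparing strings, because the same set of characters can be represented by different sequences of code points. For example, a letter like 'ê' can be represented as a single code point U+00EA, or as U+0065 U+0302, which is the code point for 'e' followed by a code point for COMBINING CIRCUMFLEX ACCENT. These will produce the same output when printed, but one is a string of length 1 and the other is of length 2.
One tool for a case-insensitive comparison is the casefold() string method, which converts a string to a case-insensitive form following an algorithm described by the Unicode Standard. This algorithm has special handling for characters such as the German letter 'ß' (code point U+00DF), which becomes the pair of lowercase letters 'ss'.
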
>>> street = 'Gürzenichstraße'
>>> street.casefold()
'gürzenichstrasse'

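A second tool is the unicodedata module's normalize() function, which converts strings to one of several normal forms, where letters followed by a combining character are replaced with single characters. normalize() can be used to perform string comparisons that won't falsely report inequality if two strings use combining characters differently: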

import unicodedata

def compare_strs(s1, s2):
    def NFD(s):
        return unicodedata.normalize('NFD', s)

    return NFD(s1) == NFD(s2)

single_char = 'ê'
multiple_chars = '\N{LATIN SMALL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'
print('length of first string=', len(single_char))
print('length of second string=', len(multiple_chars))
print(compare_strs(single_char, multiple_chars))

$ python3 compare-strs.py
length of first string= 1
length of second string= 2
True

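The first argument to the normalize() function is a string giving the desired normalization form, which can be one of 'NFC', 'NFKC', 'NFD', and 'NFKD'.
The Unicode Standard also specifies how to do caseless comparisons: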

import unicodedata

def compare_caseless(s1, s2):
    def NFD(s):
        return unicodedata.normalize('NFD', s)

    return NFD(NFD(s1).casefold()) == NFD(NFD(s2).casefold())

# Example usage
single_char = 'ê'
multiple_chars = '\N{LATIN CAPITAL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'

print(compare_caseless(single_char, multiple_chars))

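This will print True. (Why is NFD() invoked twice? Because there are a few characters that make casefold() return a non-normalized string, so the result needs to be normalized again. See section 3.13 of the Unicode Standard for a discussion and an example.)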
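
As a rough illustration of how the normalization forms accepted by normalize() differ, the sketch below applies them to the 'ê' example from this section and to a ligature character. The ligature U+FB01 is an extra sample chosen for this example, not one used elsewhere in this document.

```python
import unicodedata

composed = "\u00ea"        # 'ê' as a single code point
decomposed = "e\u0302"     # 'e' followed by COMBINING CIRCUMFLEX ACCENT

# NFC composes, NFD decomposes; both describe the same text.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(len(nfc), len(nfd))

# The "K" forms additionally replace compatibility characters,
# such as the single-character ligature LATIN SMALL LIGATURE FI.
ligature = "\ufb01"
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)
print(repr(nfc_ligature), repr(nfkc_ligature))
```

Because the compatibility forms can change what the text looks like, they are usually better suited to searching and matching than to storing text for later display.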

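2.6 Unicode Regular Expressions

The regular expressions supported by the re module can be provided either as bytes or strings. Some of the special character sequences such as \d and \w have different meanings depending on whether the pattern is supplied as bytes or a string. For example, \d will match the characters [0-9] in bytes but in strings will match any character that's in the 'Nd' category.
The string in this example has the number 57 written in both Thai and Arabic numerals: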

import re
p = re.compile(r'\d+')

s = "Over \u0e55\u0e57 57 flavours"
m = p.search(s)
print(repr(m.group()))

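When executed, \d+ will match the Thai numerals and print them out. If you supply the re.ASCII flag to compile(), \d+ will match the substring "57" instead.
Similarly, \w matches a wide variety of Unicode characters but only [a-zA-Z0-9_] in bytes or if re.ASCII is supplied, and \s will match either Unicode whitespace characters or [ \t\n\r\f\v].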
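
The effect of re.ASCII and of bytes patterns described above can be compared side by side. This is a minimal sketch that reuses the sample string from this section; the variable names are illustrative.

```python
import re

s = "Over \u0e55\u0e57 57 flavours"

# str pattern, default flags: \d matches any character in category 'Nd'.
unicode_match = re.search(r"\d+", s).group()

# str pattern with re.ASCII: \d is restricted to [0-9].
ascii_match = re.search(r"\d+", s, re.ASCII).group()

# bytes pattern: \d always means [0-9].
bytes_match = re.search(rb"\d+", s.encode("utf-8")).group()

print(repr(unicode_match), repr(ascii_match), repr(bytes_match))
```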

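2.7 References

Some good alternative discussions of Python's Unicode support are:
• Processing Text Files in Python 3, by Nick Coghlan.
• Pragmatic Unicode, a PyCon 2012 presentation by Ned Batchelder.
The str type is described in the Python library reference at the Text Sequence Type (str) section.
The documentation for the unicodedata module.
The documentation for the codecs module.
Marc-André Lemburg gave a presentation titled "Python and Unicode" (PDF slides: https://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf) at EuroPython 2002. The slides are an excellent overview of the design of Python 2's Unicode features (where the Unicode string type is called unicode and literals start with u).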

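3 Reading and Writing Unicode Data

Once you've written some code that works with Unicode data, the next problem is input/output. How do you get Unicode strings into your program, and how do you convert Unicode into a form suitable for storage or transmission?
It's possible that you may not need to do anything depending on your input sources and output destinations; you should check whether the libraries used in your application support Unicode natively. XML parsers often return Unicode data, for example. Many relational databases also support Unicode-valued columns and can return Unicode values from an SQL query.
Unicode data is usually converted to a particular encoding before it gets written to disk or sent over a socket. It's possible to do all the work yourself: open a file, read an 8-bit bytes object from it, and convert the bytes with bytes.decode(encoding). However, the manual approach is not recommended.

One problem is the multi-byte nature of encodings; one Unicode character can be represented by several bytes. If you want to read the file in arbitrary-sized chunks (say, 1024 or 4096 bytes), you need to write error-handling code to catch the case where only part of the bytes encoding a single Unicode character are read at the end of a chunk. One solution would be to read the entire file into memory and then perform the decoding, but that prevents you from working with files that are extremely large; if you need to read a 2 GB file, you need 2 GB of RAM. (More, really, since for at least a moment you'd need to have both the encoded string and its Unicode version in memory.)

The solution would be to use the low-level decoding interface to catch the case of partial coding sequences. The work of implementing this has already been done for you: the built-in open() function can return a file-like object that assumes the file's contents are in a specified encoding and accepts Unicode parameters for methods such as read() and write(). This works through open()'s encoding and errors parameters, which are interpreted just like those in str.encode() and bytes.decode().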
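
To make the chunking problem concrete, here is a sketch of what handling partial sequences by hand could look like, using an incremental decoder from the codecs module. The helper name decode_in_chunks and the tiny chunk size are assumptions made for this illustration; in practice open() with an encoding argument does this work for you.

```python
import codecs
import io

def decode_in_chunks(stream, encoding="utf-8", chunk_size=4):
    """Decode a binary stream piece by piece (hypothetical helper)."""
    # The incremental decoder buffers incomplete multi-byte sequences
    # between calls instead of raising an error at a chunk boundary.
    decoder = codecs.getincrementaldecoder(encoding)()
    pieces = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pieces.append(decoder.decode(chunk))
    # final=True makes the decoder report any truncated trailing sequence.
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)

original = "\u4500 blah \u20ac"
raw = original.encode("utf-8")

# A chunk size of 2 deliberately splits the three-byte characters.
result = decode_in_chunks(io.BytesIO(raw), chunk_size=2)
print(result == original)
```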

Reading Unicode from a file is therefore simple:

with open('unicode.txt', encoding='utf-8') as f:
    for line in f:
        print(repr(line))

It's also possible to open files in update mode, allowing both reading and writing:

with open('test', encoding='utf-8', mode='w+') as f:
    f.write('\u4500 blah blah blah\n')
    f.seek(0)
    print(repr(f.readline()[:1]))

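The Unicode character U+FEFF is used as a byte-order mark (BOM), and is often written as the first character of a file in order to assist with autodetection of the file's byte ordering. Some encodings, such as UTF-16, expect a BOM to be present at the start of a file; when such an encoding is used, the BOM will be automatically written as the first character and will be silently dropped when the file is read. There are variants of these encodings, such as 'utf-16-le' and 'utf-16-be' for little-endian and big-endian encodings, that specify one particular byte ordering and don't skip the BOM.
In some areas, it is also convention to use a "BOM" at the start of UTF-8 encoded files; the name is misleading since UTF-8 is not byte-order dependent. The mark simply announces that the file is encoded in UTF-8. For reading such files, use the 'utf-8-sig' codec to automatically skip the mark if present.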
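
The BOM behaviour described above can be observed directly on bytes objects, without touching the filesystem. The following is a small sketch; the BOM constants come from the codecs module.

```python
import codecs

# The generic 'utf-16' codec writes a BOM in the platform's byte order...
utf16 = "abc".encode("utf-16")
has_bom = utf16.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

# ...and drops it again when decoding.
round_trip = utf16.decode("utf-16")

# The endian-specific variant writes no BOM at all.
utf16_le = "abc".encode("utf-16-le")

# A UTF-8 file that starts with the UTF-8 "BOM" signature.
data = codecs.BOM_UTF8 + b"abc"
kept = data.decode("utf-8")         # the mark survives as U+FEFF
skipped = data.decode("utf-8-sig")  # the mark is skipped

print(has_bom, round_trip, utf16_le, repr(kept), repr(skipped))
```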

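3.1 Unicode filenames

Most of the operating systems in common use today support filenames that contain arbitrary Unicode characters. Usually this is implemented by converting the Unicode string into some encoding that varies depending on the system. Today Python is converging on using UTF-8: Python on MacOS has used UTF-8 for several versions, and Python 3.6 switched to using UTF-8 on Windows as well. On Unix systems, there will only be a filesystem encoding if you've set the LANG or LC_CTYPE environment variables; if you haven't, the default encoding is again UTF-8.
The sys.getfilesystemencoding() function returns the encoding to use on your current system, in case you want to do the encoding manually, but there's not much reason to bother. When opening a file for reading or writing, you can usually just provide the Unicode string as the filename, and it will be automatically converted to the right encoding for you: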

filename = 'filename\u4500abc'
with open(filename, 'w') as f:
    f.write('blah\n')

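Functions in the os module such as os.stat() will also accept Unicode filenames.
The os.listdir() function returns filenames, which raises an issue: should it return the Unicode version of filenames, or should it return bytes containing the encoded versions? os.listdir() can do both, depending on whether you provided the directory path as bytes or a Unicode string. If you pass a Unicode string as the path, filenames will be decoded using the filesystem's encoding and a list of Unicode strings will be returned, while passing a byte path will return the filenames as bytes. For example, assuming the default filesystem encoding is UTF-8, running the following program: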

fn = 'filename\u4500abc'
f = open(fn, 'w')
f.close()

import os
print(os.listdir(b'.'))
print(os.listdir('.'))

$ python listdir-test.py
[b'filename\xe4\x94\x80abc', ...]
['filename\u4500abc', ...]

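The first list contains UTF-8-encoded filenames, and the second list contains the Unicode versions.

Note that on most occasions, you should just stick with using Unicode with these APIs. The bytes APIs should only be used on systems where undecodable file names can be present; that's pretty much only Unix systems now.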

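3.2 Tips for Writing Unicode-aware Programs

This section provides some suggestions on writing software that deals with Unicode.

The most important tip is:

Software should only work with Unicode strings internally, decoding the input data as soon as possible and encoding the output only at the end.

If you attempt to write processing functions that accept both Unicode and byte strings, you will find your program vulnerable to bugs wherever you combine the two different kinds of strings. There is no automatic encoding or decoding: if you do e.g. str + bytes, a TypeError will be raised.
When using data coming from a web browser or some other untrusted source, a common technique is to check for illegal characters in a string before using the string in a generated command line or storing it in a database. If you're doing this, be careful to check the decoded string, not the encoded bytes data; some encodings may have interesting properties, such as not being bijective or not being fully ASCII-compatible. This is especially true if the input data also specifies the encoding, since the attacker can then choose a clever way to hide malicious text in the encoded bytestream.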
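
The "decode early, encode late" tip can be sketched in a few lines. The sample bytes and variable names below are made up for illustration, and UTF-8 is assumed as the input and output encoding.

```python
raw = b"caf\xc3\xa9"   # bytes as they might arrive from a file or socket

# Mixing str and bytes is an error; nothing is converted implicitly.
mixing_failed = False
try:
    "menu: " + raw
except TypeError:
    mixing_failed = True

# Decode once at the boundary, work with str internally...
text = raw.decode("utf-8")
message = "menu: " + text.upper()

# ...and encode only when the result leaves the program.
output = message.encode("utf-8")

print(mixing_failed, message, output)
```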

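Converting Between File Encodings

The StreamRecoder class can transparently convert between encodings, taking a stream that returns data in encoding #1 and behaving like a stream returning data in encoding #2.
For example, if you have an input file f that's in Latin-1, you can wrap it with a StreamRecoder to return bytes encoded in UTF-8: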

import codecs

new_f = codecs.StreamRecoder(f,
    # en/decoder: used by read() to encode its results and
    # by write() to decode its input.
    codecs.getencoder('utf-8'), codecs.getdecoder('utf-8'),

    # reader/writer: used to read and write to the stream.
    codecs.getreader('latin-1'), codecs.getwriter('latin-1') )
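
For a version of this that can be run as is, the sketch below substitutes an in-memory io.BytesIO object for the input file f. That substitution, and the sample text, are assumptions made for this example; any binary stream containing Latin-1 data should behave the same way.

```python
import codecs
import io

# Stand-in for a binary file whose contents are encoded in Latin-1.
f = io.BytesIO("caf\u00e9 au lait\n".encode("latin-1"))

new_f = codecs.StreamRecoder(
    f,
    # en/decoder: applied to data passed between the caller and the recoder.
    codecs.getencoder("utf-8"), codecs.getdecoder("utf-8"),
    # reader/writer: applied to data passed between the recoder and the stream.
    codecs.getreader("latin-1"), codecs.getwriter("latin-1"),
)

# read() returns the same text, now as UTF-8 encoded bytes.
recoded = new_f.read()
print(recoded)
```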

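Files in an Unknown Encoding

What can you do if you need to make a change to a file, but don't know the file's encoding? If you know the encoding is ASCII-compatible and only want to examine or modify the ASCII parts, you can open the file with the surrogateescape error handler: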

with open(fname, 'r', encoding="ascii", errors="surrogateescape") as f:
    data = f.read()

# make changes to the string 'data'

with open(fname + '.new', 'w',
          encoding="ascii", errors="surrogateescape") as f:
    f.write(data)

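The surrogateescape error handler will decode any non-ASCII bytes as code points in a special range running from U+DC80 to U+DCFF. These code points will then turn back into the same bytes when the surrogateescape error handler is used to encode the data and write it back out.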
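
The round trip can be seen on a bytes object without creating a file. In this sketch the byte 0xE9 stands for data in some unknown, ASCII-compatible encoding; the sample content is invented for illustration.

```python
raw = b"name=caf\xe9\n"

# The non-ASCII byte 0xE9 is smuggled through as the code point U+DCE9.
text = raw.decode("ascii", errors="surrogateescape")

# Only the ASCII part is modified.
edited = text.replace("name", "title")

# Encoding with the same handler restores the original byte unchanged.
out = edited.encode("ascii", errors="surrogateescape")

print(repr(text), out)
```

Note that a string containing these escaped code points cannot be encoded with the default 'strict' handler, so it should be written back out with surrogateescape as well.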

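3.3 References

One section of Mastering Python 3 Input/Output, a PyCon 2010 talk by David Beazley, discusses text processing and binary data handling.
The PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware Applications in Python" discuss questions of character encodings as well as how to internationalize and localize an application. These slides cover Python 2.x only.
The Guts of Unicode in Python is a PyCon 2013 talk by Benjamin Peterson that discusses the internal Unicode representation in Python 3.3.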

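4 Acknowledgements

The initial draft of this document was written by Andrew Kuchling. It has since been revised further by Alexander Belopolsky, Georg Brandl, Andrew Kuchling, and Ezio Melotti.
Thanks to the following people who have noted errors or offered suggestions on this article: Éric Araujo, Nicholas Bastin, Nick Coghlan, Marius Gedminas, Kent Johnson, Ken Krugler, Marc-André Lemburg, Martin von Löwis, Terry J. Reedy, Serhiy Storchaka, Eryk Sun, Chad Whitacre, Graham Wideman.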
