Howto Unicode

Unicode HOWTO

Release 3.11.4

Guido van Rossum and the Python development team

24, 2023
Python Software Foundation
Email: docs@python.org

Contents

1 Introduction to Unicode
1.1 Definitions
1.2 Encodings
1.3 References

2 Python's Unicode Support
2.1 The String Type
2.2 Converting to Bytes
2.3 Unicode Literals in Python Source Code
2.4 Unicode Properties
2.5 Comparing Strings
2.6 Unicode Regular Expressions
2.7 References

3 Reading and Writing Unicode Data
3.1 Unicode filenames
3.2 Tips for Writing Unicode-aware Programs
3.3 References

4 Acknowledgements

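Release 1.12

This HOWTO discusses Python's support for the Unicode specification for representing textual data, and explains various problems that people commonly encounter when trying to work with Unicode.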
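1 Introduction to Unicode

1.1 Definitions

Today's programs need to handle a wide variety of characters. Applications are often internationalized to display messages and output in several user-selectable languages, and web content can be written in any language and may include emoji. Python's string type uses the Unicode Standard to represent characters, which lets Python programs work with all of these characters.

Unicode (https://www.unicode.org/) is a specification that aims to list every character used by human languages and give each character its own unique code. The Unicode specifications are continually revised and updated to add new languages and symbols.

A character is the smallest possible component of a text. 'A', 'B', 'C', etc., are all different characters. So are 'È' and 'Í'. Characters vary depending on the language or context you're talking about. For example, there is a character for "Roman Numeral One", 'Ⅰ', that is separate from the uppercase letter 'I'. The two usually look the same, but they are different characters with different meanings.

The Unicode standard describes how characters are represented by code points. A code point value is an integer in the range 0 to 0x10FFFF (about 1.1 million values; the number actually assigned is smaller). In the standard and in this document, a code point is written using the notation U+265E to mean the character with value 0x265e (9,822 in decimal).

The Unicode standard contains many tables listing characters and their corresponding code points: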

0061    'a'; LATIN SMALL LETTER A
0062    'b'; LATIN SMALL LETTER B
0063    'c'; LATIN SMALL LETTER C
...
007B    '{'; LEFT CURLY BRACKET
...
2167    'Ⅷ'; ROMAN NUMERAL EIGHT
2168    'Ⅸ'; ROMAN NUMERAL NINE
...
265E    '♞'; BLACK CHESS KNIGHT
265F    '♟'; BLACK CHESS PAWN
...
1F600   '😀'; GRINNING FACE
1F609   '😉'; WINKING FACE
...

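Strictly, these definitions imply that it is meaningless to say "this is character U+265E". U+265E is a code point, which represents some particular character; in this case, it represents the character 'BLACK CHESS KNIGHT', '♞'. In informal contexts, this distinction between code points and characters is sometimes forgotten.

A character is represented on a screen or on paper by a set of graphical elements called a glyph. The glyph for an uppercase A, for example, is two diagonal strokes and a horizontal stroke, though the exact details depend on the font being used. Most Python code doesn't need to worry about glyphs; figuring out the correct glyph to display is generally the job of a GUI toolkit or a terminal's font renderer.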
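As a small illustration of the code point notation, the sketch below converts between the integer value, the character, and its official name for the chess knight entry in the table. The variable names are chosen here for illustration only.

```python
import unicodedata

code_point = 0x265E                 # U+265E from the table above
character = chr(code_point)         # code point -> one-character string
name = unicodedata.name(character)  # official Unicode character name

# Show the same code point in U+ notation, in decimal, and as a character.
print(f"U+{code_point:04X} = {code_point} decimal = {character!r} ({name})")
```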

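1.2 Encodings

To summarize the previous section: a Unicode string is a sequence of code points, which are numbers from 0 through 0x10FFFF (1,114,111 decimal). This sequence of code points needs to be represented in memory as a set of code units, and code units are then mapped to 8-bit bytes. The rules for translating a Unicode string into a sequence of bytes are called a character encoding, or just an encoding.

The first encoding you might think of uses 32-bit integers as the code unit, stored in the CPU's own representation of 32-bit integers. In this representation, the string "Python" might look like this: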

   P           y           t           h           o           n
0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

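This representation is straightforward, but using it presents a number of problems.

1. It's not portable; different processors order the bytes differently.
2. It's very wasteful of space. In most texts, the majority of the code points are less than 127, or less than 255, so a lot of space is occupied by 0x00 bytes. The above string takes 24 bytes compared to the 6 bytes needed for an ASCII representation. Increased RAM usage doesn't matter too much (desktop computers have gigabytes of RAM, and strings aren't usually that large), but expanding disk and network bandwidth usage by a factor of 4 is intolerable.
3. It's not compatible with existing C functions such as strlen(), so a new family of wide string functions would need to be used.

Therefore this encoding isn't used very much, and people instead choose other encodings that are more efficient and convenient, such as UTF-8.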
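The size and byte-order points above can be checked directly. This is a minimal sketch that uses Python's 'utf-32-le' and 'utf-32-be' codecs as stand-ins for the little-endian and big-endian CPU layouts; the variable names are illustrative.

```python
text = "Python"

# Explicit byte order, so no byte-order mark is prepended.
utf32 = text.encode("utf-32-le")
ascii_bytes = text.encode("ascii")

print(utf32.hex(" "))
print(len(utf32), "bytes as UTF-32 versus", len(ascii_bytes), "bytes as ASCII")

# Same code points laid out big-endian: same values, different byte order.
utf32_be = text.encode("utf-32-be")
print(utf32_be.hex(" "))
```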
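UTF-8 is one of the most commonly used encodings, and Python often defaults to using it. UTF stands for "Unicode Transformation Format", and the '8' means that 8-bit values are used in the encoding. (There are also UTF-16 and UTF-32 encodings, but they are less frequently used than UTF-8.) UTF-8 uses the following rules:

1. If the code point is < 128, it's represented by the corresponding byte value.
2. If the code point is >= 128, it's turned into a sequence of two, three, or four bytes, where each byte of the sequence is between 128 and 255.

UTF-8 has several convenient properties:

1. It can handle any Unicode code point.
2. A Unicode string is turned into a sequence of bytes that contains embedded zero bytes only where they represent the null character (U+0000). This means that UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can't handle zero bytes for anything other than end-of-string markers.
3. A string of ASCII text is also valid UTF-8 text.
4. UTF-8 is fairly compact; the majority of commonly used characters can be represented with one or two bytes.
5. If bytes are corrupted or lost, it's possible to determine the start of the next UTF-8-encoded code point and resynchronize. It's also unlikely that random 8-bit data will look like valid UTF-8.
6. UTF-8 is a byte-oriented encoding. The encoding specifies that each character is represented by a specific sequence of one or more bytes. This avoids the byte-ordering issues that can occur with integer- and word-oriented encodings, like UTF-16 and UTF-32, where the sequence of bytes varies depending on the hardware on which the string was encoded.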
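The two UTF-8 rules can be observed by encoding a few sample characters. The sketch below picks one character from each length class; the sample list is an arbitrary choice made for illustration.

```python
# One sample per UTF-8 length class: a, é, €, 😀
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

lengths = [len(ch.encode("utf-8")) for ch in samples]

# Every byte of a multi-byte sequence falls in the range 128..255.
multibyte_ok = all(b >= 128 for ch in samples[1:] for b in ch.encode("utf-8"))
```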

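1.3 References

The Unicode Consortium site has character charts, a glossary, and PDF versions of the Unicode specification. Be prepared for some difficult reading. A chronology of the origin and development of Unicode is also available on the site <https://www.unicode.org/history/>.

On the Computerphile Youtube channel, Tom Scott briefly discusses the history of Unicode and UTF-8 <https://www.youtube.com/watch?v=MijmeoH9LT4> (9 minutes 36 seconds).

To help understand the standard, Jukka Korpela has written an introductory guide to reading the Unicode character tables.

Another good introductory article was written by Joel Spolsky <https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-set-no-excuses/>. If this introduction didn't make things clear to you, you should try reading this alternate article before continuing.

Wikipedia entries are often helpful; see the entries for "character encoding" and UTF-8, for example.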

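2 Python's Unicode Support

Now that you've learned the rudiments of Unicode, we can look at Python's Unicode features.

2.1 The String Type

Since Python 3.0, the language's str type contains Unicode characters, meaning any string created using "unicode rocks!", 'unicode rocks!', or the triple-quoted string syntax is stored as Unicode.

The default encoding for Python source code is UTF-8, so you can simply include a Unicode character in a string literal: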

try:
    with open('/tmp/input.txt', 'r') as f:
        ...
except OSError:
    # 'File not found' error message.
    print("Fichier non trouvé")

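Side note: Python 3 also supports using Unicode characters in identifiers:

répertoire = "/tmp/records.log"
with open(répertoire, "w") as f:
    f.write("test\n")

If you can't enter a particular character in your editor or want to keep the source code ASCII-only for some reason, you can also use escape sequences in string literals. (Depending on your system, you may see the actual capital-delta glyph instead of a \u escape.)

>>> "\N{GREEK CAPITAL LETTER DELTA}"  # Using the character name
'\u0394'
>>> "\u0394"                          # Using a 16-bit hex value
'\u0394'
>>> "\U00000394"                      # Using a 32-bit hex value
'\u0394'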

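In addition, one can create a string using the decode() method of bytes. This method takes an encoding argument, such as UTF-8, and optionally an errors argument.

The errors argument specifies the response when the input string can't be converted according to the encoding's rules. Legal values for this argument are 'strict' (raise a UnicodeDecodeError exception), 'replace' (use U+FFFD, REPLACEMENT CHARACTER), 'ignore' (just leave the character out of the Unicode result), or 'backslashreplace' (insert a \xNN escape sequence). The following examples show the differences: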

>>> b'\x80abc'.decode("utf-8", "strict")
Traceback (most recent call last):
    ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0:
  invalid start byte
>>> b'\x80abc'.decode("utf-8", "replace")
'\ufffdabc'
>>> b'\x80abc'.decode("utf-8", "backslashreplace")
'\\x80abc'
>>> b'\x80abc'.decode("utf-8", "ignore")
'abc'

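Encodings are specified as strings containing the encoding's name. Python comes with roughly 100 different encodings; see the Python Library Reference at Standard Encodings (standard-encodings) for a list. Some encodings have multiple names; for example, 'latin-1', 'iso_8859_1' and '8859' are all synonyms for the same encoding.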
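One way to confirm that several names refer to the same encoding is to resolve each through codecs.lookup() and compare the canonical names it reports. This is only a sketch; the exact canonical spelling printed may vary between Python versions, so the example compares the names rather than relying on one.

```python
import codecs

# Three spellings that the text above describes as synonyms.
aliases = ["latin-1", "iso_8859_1", "8859"]

# codecs.lookup() returns a CodecInfo whose .name is the canonical name.
canonical = {alias: codecs.lookup(alias).name for alias in aliases}
print(canonical)

same_codec = len(set(canonical.values())) == 1
print("all aliases resolve to one codec:", same_codec)
```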

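One-character Unicode strings can also be created with the chr() built-in function, which takes an integer and returns a Unicode string of length 1 that contains the corresponding code point. The reverse operation is the built-in ord() function, which takes a one-character Unicode string and returns the code point value:
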
>>> chr(57344)
'\ue000'
>>> ord('\ue000')
57344

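2.2 Converting to Bytes

The opposite method of bytes.decode() is str.encode(), which returns a bytes representation of the Unicode string, encoded in the requested encoding.

The errors parameter is the same as the parameter of the decode() method but supports a few more possible handlers. As well as 'strict', 'ignore', and 'replace' (which in this case inserts a question mark instead of the unencodable character), there is also 'xmlcharrefreplace' (inserts an XML character reference), 'backslashreplace' (inserts a \uNNNN escape sequence) and 'namereplace' (inserts a \N{...} escape sequence).

The following example shows the different results: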

>>> u = chr(40960) + 'abcd' + chr(1972)
>>> u.encode('utf-8')
b'\xea\x80\x80abcd\xde\xb4'
>>> u.encode('ascii')
Traceback (most recent call last):
    ...
UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in
  position 0: ordinal not in range(128)
>>> u.encode('ascii', 'ignore')
b'abcd'
>>> u.encode('ascii', 'replace')
b'?abcd?'
>>> u.encode('ascii', 'xmlcharrefreplace')
b'&#40960;abcd&#1972;'
>>> u.encode('ascii', 'backslashreplace')
b'\\ua000abcd\\u07b4'
>>> u.encode('ascii', 'namereplace')
b'\\N{YI SYLLABLE IT}abcd\\u07b4'

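The low-level routines for registering and accessing the available encodings are found in the codecs module. Implementing new encodings also requires understanding the codecs module. However, the encoding and decoding functions returned by this module are usually more low-level than is comfortable, and writing new encodings is a specialized task, so the module won't be covered in this HOWTO.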

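2.3 Unicode Literals in Python Source Code

In Python source code, specific Unicode code points can be written using the \u escape sequence, which is followed by four hex digits giving the code point. The \U escape sequence is similar, but expects eight hex digits, not four: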

>>> s = "a\xac\u1234\u20ac\U00008000"
... #     ^^^^ two-digit hex escape
... #         ^^^^^^ four-digit Unicode escape
... #                     ^^^^^^^^^^ eight-digit Unicode escape
>>> [ord(c) for c in s]
[97, 172, 4660, 8364, 32768]

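Using escape sequences for code points greater than 127 is fine in small doses, but becomes an annoyance if you're using many accented characters, as you would in a program with messages in French or some other accent-using language. You can also assemble strings using the chr() built-in function, but this is even more tedious.

Ideally, you'd want to be able to write literals in your language's natural encoding. You could then edit Python source code with your favorite editor, which would display the accented characters naturally, and have the right characters used at runtime.

Python supports writing source code in UTF-8 by default, but you can use almost any encoding if you declare the encoding being used. This is done by including a special comment as either the first or second line of the source file: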

#!/usr/bin/env python
# -*- coding: latin-1 -*-

u = 'abcdé'
print(ord(u[-1]))

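The syntax is inspired by Emacs's notation for specifying variables local to a file. Emacs supports many different variables, but Python only supports 'coding'. The -*- symbols indicate to Emacs that the comment is special; they have no significance to Python but are a convention. Python looks for coding: name or coding=name in the comment.

If you don't include such a comment, the default encoding used will be UTF-8, as already mentioned. See also PEP 263 for more information.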
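The effect of the declaration can be tried without creating a file, because compile() accepts source code as bytes and honours the coding comment. The sketch below builds a tiny Latin-1 encoded source in memory; the source text and the "<latin-1 demo>" file name are made up for this illustration.

```python
# A two-line program stored as Latin-1 bytes: 0xE9 is 'é' in Latin-1.
source = b"# -*- coding: latin-1 -*-\nu = 'abcd\xe9'\n"

namespace = {}
# compile() reads the coding comment and decodes the bytes accordingly.
exec(compile(source, "<latin-1 demo>", "exec"), namespace)

last = namespace["u"][-1]
print(repr(last), ord(last))
```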

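2.4 Unicode Properties

The Unicode specification includes a database of information about code points. For each defined code point, the information includes the character's name, its category, and the numeric value if applicable (for characters representing numeric concepts such as the Roman numerals, or fractions such as one-third and four-fifths). There are also display-related properties, such as how to use the code point in bidirectional text.

The following program displays some information about several characters, and prints the numeric value of one particular character: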

import unicodedata

u = chr(233) + chr(0x0bf2) + chr(3972) + chr(6000) + chr(13231)

for i, c in enumerate(u):
    print(i, '%04x' % ord(c), unicodedata.category(c), end=" ")
    print(unicodedata.name(c))

# Get numeric value of second character
print(unicodedata.numeric(u[1]))

When run, this prints:

0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
1 0bf2 No TAMIL NUMBER ONE THOUSAND
2 0f84 Mn TIBETAN MARK HALANTA
3 1770 Lo TAGBANWA LETTER SA
4 33af So SQUARE RAD OVER S SQUARED
1000.0

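The category codes are abbreviations describing the nature of the character. These are grouped into categories such as "Letter", "Number", "Punctuation", or "Symbol", which in turn are broken up into subcategories. To take the codes from the above output, 'Ll' means "Letter, lowercase", 'No' means "Number, other", 'Mn' is "Mark, nonspacing", and 'So' is "Symbol, other". See the General Category Values section of the Unicode Character Database documentation <https://www.unicode.org/reports/tr44/#General_Category_Values> for a list of category codes.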
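Category codes are convenient for classifying the characters of a string. The following sketch tallies the general category of each character; category_counts is a hypothetical helper written for this example, not part of unicodedata.

```python
import unicodedata
from collections import Counter

def category_counts(text):
    """Count how many characters of each general category appear in text."""
    return Counter(unicodedata.category(ch) for ch in text)

# Uppercase letter, lowercase letter, decimal digit, space, punctuation.
counts = category_counts("Aa1 .")
print(counts)
```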

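2.5 Comparing Strings

Unicode adds some complication to comparing strings, because the same set of characters can be represented by different sequences of code points. For example, a letter like 'ê' can be represented as a single code point U+00EA, or as U+0065 U+0302, which is the code point for 'e' followed by a code point for 'COMBINING CIRCUMFLEX ACCENT'. These will produce the same output when printed, but one is a string of length 1 and the other is of length 2.

One tool for a case-insensitive comparison is the casefold() string method, which converts a string to a case-insensitive form following an algorithm described by the Unicode Standard. This algorithm has special handling for characters such as the German letter 'ß' (code point U+00DF), which becomes the pair of lowercase letters 'ss'.
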
>>> street = 'Gürzenichstraße'
>>> street.casefold()
'gürzenichstrasse'

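A second tool is the unicodedata module's normalize() function, which converts strings to one of several normal forms, where letters followed by a combining character are replaced with single characters. normalize() can be used to perform string comparisons that won't falsely report inequality if two strings use combining characters differently: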

import unicodedata

def compare_strs(s1, s2):
    def NFD(s):
        return unicodedata.normalize('NFD', s)

    return NFD(s1) == NFD(s2)

single_char = 'ê'
multiple_chars = '\N{LATIN SMALL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'
print('length of first string=', len(single_char))
print('length of second string=', len(multiple_chars))
print(compare_strs(single_char, multiple_chars))

$ python3 compare-strs.py
length of first string= 1
length of second string= 2
True

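The first argument to the normalize() function is a string giving the desired normalization form, which can be one of 'NFC', 'NFKC', 'NFD', and 'NFKD'.

The Unicode Standard also specifies how to do caseless comparisons: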

import unicodedata

def compare_caseless(s1, s2):
    def NFD(s):
        return unicodedata.normalize('NFD', s)

    return NFD(NFD(s1).casefold()) == NFD(NFD(s2).casefold())

# Example usage
single_char = 'ê'
multiple_chars = '\N{LATIN CAPITAL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'

print(compare_caseless(single_char, multiple_chars))

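This will print True. (Why is NFD() invoked twice? Because there are a few characters that make casefold() return a non-normalized string, so the result needs to be normalized again. See section 3.13 of the Unicode Standard for a discussion and an example.)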
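To make the difference between the normalization forms more concrete, here is a minimal sketch. It composes and decomposes the 'ê' example, and uses the 'fi' ligature (U+FB01), chosen here only for illustration, to contrast the canonical and compatibility forms.

```python
import unicodedata

decomposed = "e\u0302"                                 # 'e' + COMBINING CIRCUMFLEX ACCENT
composed = unicodedata.normalize("NFC", decomposed)    # single code point U+00EA
redecomposed = unicodedata.normalize("NFD", composed)  # back to two code points

ligature = "\ufb01"                                    # LATIN SMALL LIGATURE FI
compat = unicodedata.normalize("NFKC", ligature)       # compatibility form: 'f' + 'i'
canonical = unicodedata.normalize("NFC", ligature)     # canonical form keeps the ligature

print(len(decomposed), len(composed), len(redecomposed))
print(repr(compat), repr(canonical))
```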

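2.6 Unicode Regular Expressions

The regular expressions supported by the re module can be provided either as bytes or strings. Some of the special character sequences such as \d and \w have different meanings depending on whether the pattern is supplied as bytes or a string. For example, \d will match the characters [0-9] in bytes, but in strings will match any character that's in the 'Nd' category.

The string in this example has the number 57 written in both Thai and Arabic numerals: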

import re
p = re.compile(r'\d+')

s = "Over \u0e55\u0e57 57 flavours"

m = p.search(s)
print(repr(m.group()))

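When executed, \d+ will match the Thai numerals and print them out. If you supply the re.ASCII flag to compile(), \d+ will match the substring "57" instead.

Similarly, \w matches a wide variety of Unicode characters but only [a-zA-Z0-9_] in bytes or if re.ASCII is supplied, and \s will match either Unicode whitespace characters or [ \t\n\r\f\v].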
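The two behaviours of \d+ described above can be compared side by side. This sketch reuses the example string and only adds the re.ASCII variant; the variable names are illustrative.

```python
import re

s = "Over \u0e55\u0e57 57 flavours"

# Default str pattern: \d matches any character in category 'Nd'.
unicode_digits = re.compile(r"\d+").search(s).group()

# With re.ASCII, \d is restricted to [0-9].
ascii_digits = re.compile(r"\d+", re.ASCII).search(s).group()

print(repr(unicode_digits), repr(ascii_digits))
```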

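2.7 References

Some good alternative discussions of Python's Unicode support are:
• Processing Text Files in Python 3, by Nick Coghlan.
• Pragmatic Unicode, a PyCon 2012 presentation by Ned Batchelder.

The str type is described in the Python library reference at Text Sequence Type (str).

The documentation for the unicodedata module.

The documentation for the codecs module.

Marc-André Lemburg gave a presentation titled "Python and Unicode" (PDF slides <https://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf>) at EuroPython 2002. The slides are an excellent overview of the design of Python 2's Unicode features (where the Unicode string type is called unicode and literals start with u).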

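3 Reading and Writing Unicode Data

Once you've written some code that works with Unicode data, the next problem is input/output. How do you get Unicode strings into your program, and how do you convert Unicode into a form suitable for storage or transmission?

It's possible that you may not need to do anything, depending on your input sources and output destinations; you should check whether the libraries used in your application support Unicode natively. XML parsers often return Unicode data, for example. Many relational databases also support Unicode-valued columns and can return Unicode values from an SQL query.

Unicode data is usually converted to a particular encoding before it gets written to disk or sent over a socket. It's possible to do all the work yourself: open a file, read an 8-bit bytes object from it, and convert the bytes with bytes.decode(encoding). However, the manual approach is not recommended.

One problem is the multi-byte nature of encodings; one Unicode character can be represented by several bytes. If you want to read the file in arbitrary-sized chunks (say, 1024 or 4096 bytes), you need to write error-handling code to catch the case where only part of the bytes encoding a single Unicode character are read at the end of a chunk. One solution would be to read the entire file into memory and then perform the decoding, but that prevents you from working with files that are extremely large; if you need to read a 2 GB file, you need 2 GB of RAM. (More, really, since for at least a moment you'd need to have both the encoded string and its Unicode version in memory.)

The solution would be to use the low-level decoding interface to catch the case of partial coding sequences. The work of implementing this has already been done for you: the built-in open() function can return a file-like object that assumes the file's contents are in a specified encoding and accepts Unicode parameters for methods such as read() and write(). This works through open()'s encoding and errors parameters, which are interpreted just like those in str.encode() and bytes.decode().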
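To show what the chunk-boundary problem looks like, and what the low-level interface does about it, the sketch below uses an incremental decoder from the codecs module. decode_chunks is a hypothetical helper and the chunk boundaries are chosen deliberately to split two characters; in ordinary code, open() with an encoding argument handles this for you.

```python
import codecs

def decode_chunks(chunks, encoding="utf-8"):
    """Decode an iterable of byte chunks, tolerating characters split across chunks."""
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        text = decoder.decode(chunk)   # incomplete trailing bytes are buffered
        if text:
            yield text
    tail = decoder.decode(b"", final=True)  # raises if bytes are left dangling
    if tail:
        yield tail

data = "caf\u00e9 \u20ac5".encode("utf-8")   # 10 bytes
chunks = [data[:4], data[4:7], data[7:]]     # splits both 'é' and '€'

pieces = list(decode_chunks(chunks))
print(pieces)
```

Decoding the first chunk on its own with bytes.decode() would raise UnicodeDecodeError, because it ends in the middle of a two-byte sequence.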

Reading Unicode from a file is therefore simple:

with open('unicode.txt', encoding='utf-8') as f:
    for line in f:
        print(repr(line))

It's also possible to open files in update mode, allowing both reading and writing:

with open('test', encoding='utf-8', mode='w+') as f:
    f.write('\u4500 blah blah blah\n')
    f.seek(0)
    print(repr(f.readline()[:1]))

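The Unicode character U+FEFF is used as a byte-order mark (BOM), and is often written as the first character of a file in order to assist with autodetection of the file's byte ordering. Some encodings, such as UTF-16, expect a BOM to be present at the start of a file; when such an encoding is used, the BOM will be automatically written as the first character and will be silently dropped when the file is read. There are variants of these encodings, such as 'utf-16-le' and 'utf-16-be' for little-endian and big-endian encodings, that specify one particular byte ordering and don't skip the BOM.

In some areas, it is also convention to use a "BOM" at the start of UTF-8 encoded files; the name is misleading since UTF-8 is not byte-order dependent. The mark simply announces that the file is encoded in UTF-8. For reading such files, use the 'utf-8-sig' codec to automatically skip the mark if present.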
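The BOM behaviour can be seen on in-memory bytes without touching a file. This is a small sketch; the short sample text is arbitrary.

```python
text = "hi"

# 'utf-8-sig' writes the UTF-8 encoded BOM (EF BB BF) first.
with_sig = text.encode("utf-8-sig")
plain_read = with_sig.decode("utf-8")      # BOM survives as U+FEFF
sig_read = with_sig.decode("utf-8-sig")    # BOM is skipped

# 'utf-16' writes a BOM in native byte order; 'utf-16-le' writes none.
utf16 = text.encode("utf-16")
utf16_le = text.encode("utf-16-le")

print(with_sig, repr(plain_read), repr(sig_read))
print(utf16[:2].hex(" "), utf16_le)
```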

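3.1 Unicode filenames

Most of the operating systems in common use today support filenames that contain arbitrary Unicode characters. Usually this is implemented by converting the Unicode string into some encoding that varies depending on the system. Today Python is converging on using UTF-8: Python on MacOS has used UTF-8 for several versions, and Python 3.6 switched to using UTF-8 on Windows as well. On Unix systems, there will only be a filesystem encoding if you've set the LANG or LC_CTYPE environment variables; if you haven't, the default encoding is again UTF-8.

The sys.getfilesystemencoding() function returns the encoding to use on your current system, in case you want to do the encoding manually, but there's not much reason to bother. When opening a file for reading or writing, you can usually just provide the Unicode string as the filename, and it will be automatically converted to the right encoding for you: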

filename = 'filename\u4500abc'
with open(filename, 'w') as f:
    f.write('blah\n')

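Functions in the os module such as os.stat() will also accept Unicode filenames.

The os.listdir() function returns filenames, which raises an issue: should it return the Unicode version of filenames, or should it return bytes containing the encoded versions? os.listdir() can do both, depending on whether you provided the directory path as bytes or a Unicode string. If you pass a Unicode string as the path, filenames will be decoded using the filesystem's encoding and a list of Unicode strings will be returned, while passing a byte path will return the filenames as bytes. For example, assuming the default filesystem encoding is UTF-8, running the following program: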

fn = 'filename\u4500abc'
f = open(fn, 'w')
f.close()

import os
print(os.listdir(b'.'))
print(os.listdir('.'))

$ python listdir-test.py
[b'filename\xe4\x94\x80abc', ...]
['filename\u4500abc', ...]

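The first list contains UTF-8-encoded filenames, and the second list contains the Unicode versions.

Note that on most occasions, you can just stick with using Unicode with these APIs. The bytes APIs should only be used on systems where undecodable file names can be present; that's pretty much only Unix systems now.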

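3.2 Tips for Writing Unicode-aware Programs

This section provides some suggestions on writing software that deals with Unicode.

The most important tip is:

Software should only work with Unicode strings internally, decoding the input data as soon as possible and encoding the output only at the end.

If you attempt to write processing functions that accept both Unicode and byte strings, you will find your program vulnerable to bugs wherever you combine the two different kinds of strings. There is no automatic encoding or decoding: if you do e.g. str + bytes, a TypeError will be raised.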
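The "decode early, encode late" tip can be sketched as a function that accepts bytes at its boundary, works on str in the middle, and returns bytes at the end. shout is a hypothetical function invented for this illustration; the second half shows the TypeError raised when the two types are mixed.

```python
def shout(raw, encoding="utf-8"):
    """Hypothetical boundary function: bytes in, bytes out, str in between."""
    text = raw.decode(encoding)      # decode as soon as the bytes arrive
    result = text.upper() + "!"      # all processing happens on str
    return result.encode(encoding)   # encode only when producing output

print(shout("\u00e9t\u00e9".encode("utf-8")))

# Mixing the two types is an error rather than an implicit conversion.
try:
    "abc" + b"def"
except TypeError as exc:
    mixing_error = exc
    print("TypeError:", exc)
```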
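When using data coming from a web browser or some other untrusted source, a common technique is to check for illegal characters in a string before using the string in a generated command line or storing it in a database. If you're doing this, be careful to check the decoded string, not the encoded bytes data; some encodings may have interesting properties, such as not being bijective or not being fully ASCII-compatible. This is especially true if the input data also specifies the encoding, since the attacker can then choose a clever way to hide malicious text in the encoded bytestream.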

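Converting Between File Encodings

The StreamRecoder class can transparently convert between encodings, taking a stream that returns data in encoding #1 and behaving like a stream returning data in encoding #2.

For example, if you have an input file f that's in Latin-1, you can wrap it with a StreamRecoder to return bytes encoded in UTF-8:

import codecs

# f is assumed to be a binary file object whose contents are Latin-1.
new_f = codecs.StreamRecoder(f,
    # en/decoder: used by read() to encode its results and
    # by write() to decode its input.
    codecs.getencoder('utf-8'), codecs.getdecoder('utf-8'),

    # reader/writer: used to read and write to the stream.
    codecs.getreader('latin-1'), codecs.getwriter('latin-1') )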
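A runnable variant of the same wrapping, using in-memory io.BytesIO streams in place of a real file, may help to see which direction each codec works in. The sample data and the make_recoder helper are assumptions made for this sketch.

```python
import codecs
import io

def make_recoder(stream):
    """Wrap a Latin-1 byte stream so it reads and writes UTF-8 bytes."""
    return codecs.StreamRecoder(
        stream,
        codecs.getencoder("utf-8"), codecs.getdecoder("utf-8"),
        codecs.getreader("latin-1"), codecs.getwriter("latin-1"),
    )

# Reading: Latin-1 bytes in the stream come out as UTF-8 bytes.
f = io.BytesIO("caf\u00e9\n".encode("latin-1"))
recoded = make_recoder(f).read()
print(recoded)

# Writing: UTF-8 bytes handed to write() land in the stream as Latin-1.
out = io.BytesIO()
make_recoder(out).write("\u00e9".encode("utf-8"))
print(out.getvalue())
```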

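Files in an Unknown Encoding

What can you do if you need to make a change to a file, but don't know the file's encoding? If you know the encoding is ASCII-compatible and only want to examine or modify the ASCII parts, you can open the file with the surrogateescape error handler: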

with open(fname, 'r', encoding="ascii", errors="surrogateescape") as f:
    data = f.read()

# make changes to the string 'data'

with open(fname + '.new', 'w',
          encoding="ascii", errors="surrogateescape") as f:
    f.write(data)

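The surrogateescape error handler will decode any non-ASCII bytes as code points in a special range running from U+DC80 to U+DCFF. These code points will then turn back into the same bytes when the surrogateescape error handler is used to encode the data and write it back out.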
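The round trip can be demonstrated on a bytes object instead of a file. In this sketch the sample bytes are invented; the two non-ASCII bytes stand in for data in some unknown ASCII-compatible encoding.

```python
raw = b"name=caf\xe9\xff\n"   # 0xE9 and 0xFF are not ASCII

# Non-ASCII bytes become lone surrogates U+DCE9 and U+DCFF.
text = raw.decode("ascii", errors="surrogateescape")

# Edit only the ASCII part of the string.
edited = text.replace("name", "label")

# Encoding with the same handler restores the original bytes.
restored = edited.encode("ascii", errors="surrogateescape")

print(ascii(text))
print(restored)
```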

3.3 References

One section of Mastering Python 3 Input/Output, a PyCon 2010 talk by David Beazley, discusses text processing and
binary data handling.
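The PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware Applications in Python" discuss questions of character encodings as well as how to internationalize and localize an application. These slides cover Python 2.x only.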
The Guts of Unicode in Python is a PyCon 2013 talk by Benjamin Peterson that discusses the internal Unicode
representation in Python 3.3.

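4 Acknowledgements

The initial draft of this document was written by Andrew Kuchling. It has since been revised further by Alexander Belopolsky, Georg Brandl, Andrew Kuchling, and Ezio Melotti.

Thanks to the following people who have noted errors or offered suggestions on this article: Éric Araujo, Nicholas Bastin, Nick Coghlan, Marius Gedminas, Kent Johnson, Ken Krugler, Marc-André Lemburg, Martin von Löwis, Terry J. Reedy, Serhiy Storchaka, Eryk Sun, Chad Whitacre, Graham Wideman.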

