
Commit e3dd7c0

Simplify a bit the special rules generating unaccent.rules
As noted by Thomas Munro, CLDR 36 has added SOUND RECORDING COPYRIGHT (U+2117), and we use CLDR 41, so this can be removed from the set of special cases.

The set of regression tests is expanded for degree signs, which are two of the special cases, and a fancy case with U+210C in Latin-ASCII.xml that we discovered when diving into what could be done for Cyrillic characters (this last part is material for a future patch, not tackled yet).

While on it, some of the assertions of generate_unaccent_rules.py are expanded to report the codepoint on which a failure is found, something useful for debugging.

Extracted from a larger patch by the same author.

Author: Przemysław Sztoch
Discussion: https://postgr.es/m/8478da0d-3b61-d24f-80b4-ce2f5e971c60@sztoch.pl
1 parent 84ad713 commit e3dd7c0

3 files changed, +56 -3 lines changed


contrib/unaccent/expected/unaccent.out

+44
@@ -37,6 +37,18 @@ SELECT unaccent('À'); -- Remove combining diacritical 0x0300
  A
 (1 row)
 
+SELECT unaccent('℃℉'); -- degree signs
+ unaccent
+----------
+ °C°F
+(1 row)
+
+SELECT unaccent('℗'); -- sound recording copyright
+ unaccent
+----------
+ (P)
+(1 row)
+
 SELECT unaccent('unaccent', 'foobar');
  unaccent
 ----------
@@ -67,6 +79,18 @@ SELECT unaccent('unaccent', 'À');
  A
 (1 row)
 
+SELECT unaccent('unaccent', '℃℉');
+ unaccent
+----------
+ °C°F
+(1 row)
+
+SELECT unaccent('unaccent', '℗');
+ unaccent
+----------
+ (P)
+(1 row)
+
 SELECT ts_lexize('unaccent', 'foobar');
  ts_lexize
 -----------
@@ -97,3 +121,23 @@ SELECT ts_lexize('unaccent', 'À');
  {A}
 (1 row)
 
+SELECT ts_lexize('unaccent', '℃℉');
+ ts_lexize
+-----------
+ {°C°F}
+(1 row)
+
+SELECT ts_lexize('unaccent', '℗');
+ ts_lexize
+-----------
+ {(P)}
+(1 row)
+
+-- Controversial case. Black-Letter Capital H (U+210C) is translated by
+-- Latin-ASCII.xml as 'x', but it should be 'H'.
+SELECT unaccent('ℌ');
+ unaccent
+----------
+ x
+(1 row)
+
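Reviewer note: the expected outputs above can be cross-checked against Unicode's own compatibility decompositions. The sketch below uses Python's standard unicodedata module as a stand-in; it is not part of the commit and does not read Latin-ASCII.xml, and the helper name compat_decomposition is made up for illustration. It suggests why the degree signs map to "°C" and "°F", why U+2117 needs a rule from elsewhere (it has no decomposition), and why 'H' is the arguably expected result for U+210C.

```python
import unicodedata


def compat_decomposition(char):
    """Return the NFKD (compatibility) decomposition of one character.

    Hypothetical helper for illustration; the generator script in this
    commit does not use it.
    """
    return unicodedata.normalize("NFKD", char)


# Codepoints touched by this commit's tests and special cases.
for cp in (0x2103, 0x2109, 0x2117, 0x210C):
    ch = chr(cp)
    # %a prints an ASCII-safe repr, so this runs under any locale.
    print("U+%04X %s -> %a" % (cp, unicodedata.name(ch), compat_decomposition(ch)))
```

U+2117 comes back unchanged because Unicode defines no decomposition for it, which fits the commit message: its "(P)" translation now comes from CLDR rather than from a hand-written special case.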

contrib/unaccent/generate_unaccent_rules.py

+2 -3
@@ -134,12 +134,12 @@ def get_plain_letter(codepoint, table):
             return table[codepoint.combining_ids[0]]
 
         # Should not come here
-        assert(False)
+        assert False, 'Codepoint U+%0.2X' % codepoint.id
     elif is_plain_letter(codepoint):
         return codepoint
 
     # Should not come here
-    assert(False)
+    assert False, 'Codepoint U+%0.2X' % codepoint.id
 
 
 def is_ligature(codepoint, table):
@@ -212,7 +212,6 @@ def special_cases():
     # Symbols of "Letterlike Symbols" Unicode Block (U+2100 to U+214F)
     charactersSet.add((0x2103, "\xb0C"))  # DEGREE CELSIUS
     charactersSet.add((0x2109, "\xb0F"))  # DEGREE FAHRENHEIT
-    charactersSet.add((0x2117, "(P)"))  # SOUND RECORDING COPYRIGHT
 
     return charactersSet
 
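Reviewer note: for anyone wondering what the new assertion message looks like, here is a minimal sketch of the format string used in the hunk above. The function name codepoint_label is hypothetical; only the format string itself comes from the patch.

```python
def codepoint_label(codepoint_id):
    """Render an integer codepoint the way the new assertion messages do.

    In '%0.2X' the precision (.2) sets a minimum of two hex digits, so
    small values are zero-padded and larger ones print at full length.
    """
    return 'Codepoint U+%0.2X' % codepoint_id


print(codepoint_label(0x210C))
print(codepoint_label(0xB0))
print(codepoint_label(0x5))
```

Note that values below U+0100 print with two digits rather than the conventional four (for example U+B0 rather than U+00B0). That is probably fine for a debugging aid, though '%04X' would be an alternative if the conventional notation is preferred.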

contrib/unaccent/sql/unaccent.sql

+10
@@ -10,15 +10,25 @@ SELECT unaccent('ёлка');
 SELECT unaccent('ЁЖИК');
 SELECT unaccent('˃˖˗˜');
 SELECT unaccent('À'); -- Remove combining diacritical 0x0300
+SELECT unaccent('℃℉'); -- degree signs
+SELECT unaccent('℗'); -- sound recording copyright
 
 SELECT unaccent('unaccent', 'foobar');
 SELECT unaccent('unaccent', 'ёлка');
 SELECT unaccent('unaccent', 'ЁЖИК');
 SELECT unaccent('unaccent', '˃˖˗˜');
 SELECT unaccent('unaccent', 'À');
+SELECT unaccent('unaccent', '℃℉');
+SELECT unaccent('unaccent', '℗');
 
 SELECT ts_lexize('unaccent', 'foobar');
 SELECT ts_lexize('unaccent', 'ёлка');
 SELECT ts_lexize('unaccent', 'ЁЖИК');
 SELECT ts_lexize('unaccent', '˃˖˗˜');
 SELECT ts_lexize('unaccent', 'À');
+SELECT ts_lexize('unaccent', '℃℉');
+SELECT ts_lexize('unaccent', '℗');
+
+-- Controversial case. Black-Letter Capital H (U+210C) is translated by
+-- Latin-ASCII.xml as 'x', but it should be 'H'.
+SELECT unaccent('ℌ');
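Reviewer note: to make the intent of the new test queries concrete, the sketch below mimics the translations they exercise with a plain per-character lookup. It is a simplification under stated assumptions: the real unaccent dictionary is implemented in C and loads unaccent.rules, and the RULES table and apply_rules function here are hypothetical stand-ins holding only the four mappings visible in the expected output.

```python
# Hypothetical subset of translations, taken from the expected output
# in this commit. This is not the real unaccent.rules file.
RULES = {
    "\u2103": "\u00b0C",  # DEGREE CELSIUS
    "\u2109": "\u00b0F",  # DEGREE FAHRENHEIT
    "\u2117": "(P)",      # SOUND RECORDING COPYRIGHT
    "\u210c": "x",        # BLACK-LETTER CAPITAL H (the controversial case)
}


def apply_rules(text, rules=RULES):
    """Replace each character that has a rule; leave the rest untouched."""
    return "".join(rules.get(ch, ch) for ch in text)


# Mirrors the first new regression query; the result is compared rather
# than printed to avoid locale-dependent output.
assert apply_rules("\u2103\u2109") == "\u00b0C\u00b0F"
```

A string with no matching rule, such as 'foobar', passes through unchanged in this sketch.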
