
Commit d3d0983

Support PG_UNICODE_FAST locale in the builtin collation provider.
The PG_UNICODE_FAST locale uses code point sort order (fast, memcmp-based) combined with Unicode character semantics. The character semantics are based on Unicode full case mapping.

Full case mapping can map a single codepoint to multiple codepoints, such as "ß" uppercasing to "SS". Additionally, it handles context-sensitive mappings like the "final sigma", and it uses titlecase mappings such as "Dž" when titlecasing (rather than plain uppercase mappings).

Importantly, the uppercasing of "ß" as "SS" is specifically mentioned by the SQL standard. In Postgres, UCS_BASIC uses plain ASCII semantics for case mapping and pattern matching, so if we changed it to use the PG_UNICODE_FAST locale, it would offer better compliance with the standard. For now, though, do not change the behavior of UCS_BASIC.

Discussion: https://postgr.es/m/ddfd67928818f138f51635712529bc5e1d25e4e7.camel@j-davis.com
Discussion: https://postgr.es/m/27bb0e52-801d-4f73-a0a4-02cfdd4a9ada@eisentraut.org
Reviewed-by: Peter Eisentraut, Daniel Verite
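
The difference between "simple" and "full" case mapping described in the commit message can be illustrated with a small, self-contained C sketch. It is not PostgreSQL's unicode_strupper(): the function name toy_strupper and its snprintf-style return convention are assumptions made for illustration, and only ASCII letters and U+00DF are handled.

```c
#include <stddef.h>
#include <string.h>

/*
 * Toy uppercasing of a NUL-terminated UTF-8 string (hypothetical helper).
 * Handles only ASCII letters and U+00DF (sharp s, bytes C3 9F); every
 * other byte is copied through unchanged.
 *
 * full != 0: "full" mapping, sharp s becomes "SS" (one code point -> two).
 * full == 0: "simple" mapping, sharp s has no single-code-point uppercase
 *            and is left as is.
 *
 * Returns the number of bytes the complete result needs, excluding the
 * terminating NUL, so the caller can detect truncation.
 */
size_t
toy_strupper(char *dest, size_t destsize, const char *src, int full)
{
    const unsigned char *s = (const unsigned char *) src;
    size_t needed = 0;          /* bytes the full result requires */
    size_t written = 0;         /* bytes actually stored in dest */

    while (*s != '\0')
    {
        char out[2];
        size_t outlen;

        if (s[0] == 0xC3 && s[1] == 0x9F)
        {
            if (full)
            {
                out[0] = 'S';
                out[1] = 'S';
            }
            else
            {
                out[0] = (char) 0xC3;
                out[1] = (char) 0x9F;
            }
            outlen = 2;
            s += 2;
        }
        else
        {
            out[0] = (*s >= 'a' && *s <= 'z') ? (char) (*s - 'a' + 'A')
                                              : (char) *s;
            outlen = 1;
            s += 1;
        }

        /* Store the chunk only if it and the final NUL still fit. */
        if (needed + outlen < destsize)
        {
            memcpy(dest + needed, out, outlen);
            written = needed + outlen;
        }
        needed += outlen;
    }

    if (destsize > 0)
        dest[written] = '\0';
    return needed;
}
```

With full mapping enabled, "straße" becomes "STRASSE"; with simple mapping the sharp s is left in place. In UTF-8 both results happen to need the same number of bytes, because "ß" occupies two bytes and so does "SS".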
1 parent 286a365 commit d3d0983

13 files changed, +283 -16 lines changed

doc/src/sgml/charset.sgml

Lines changed: 27 additions & 2 deletions
@@ -377,8 +377,9 @@ initdb --locale-provider=icu --icu-locale=en
     <listitem>
      <para>
       The <literal>builtin</literal> provider uses built-in operations. Only
-      the <literal>C</literal> and <literal>C.UTF-8</literal> locales are
-      supported for this provider.
+      the <literal>C</literal>, <literal>C.UTF-8</literal>, and
+      <literal>PG_UNICODE_FAST</literal> locales are supported for this
+      provider.
      </para>
      <para>
       The <literal>C</literal> locale behavior is identical to the
@@ -392,6 +393,13 @@ initdb --locale-provider=icu --icu-locale=en
       regular expression character classes are based on the "POSIX
       Compatible" semantics, and the case mapping is the "simple" variant.
      </para>
+     <para>
+      The <literal>PG_UNICODE_FAST</literal> locale is available only when
+      the database encoding is <literal>UTF-8</literal>, and the behavior is
+      based on Unicode. The collation uses the code point values only. The
+      regular expression character classes are based on the "Standard"
+      semantics, and the case mapping is the "full" variant.
+     </para>
     </listitem>
    </varlistentry>

@@ -886,6 +894,23 @@ SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR";
     </listitem>
    </varlistentry>

+   <varlistentry>
+    <term><literal>pg_unicode_fast</literal></term>
+    <listitem>
+     <para>
+      This collation sorts by Unicode code point values rather than natural
+      language order. For the functions <function>lower</function>,
+      <function>initcap</function>, and <function>upper</function> it uses
+      Unicode full case mapping. For pattern matching (including regular
+      expressions), it uses the Standard variant of Unicode <ulink
+      url="https://www.unicode.org/reports/tr18/#Compatibility_Properties">Compatibility
+      Properties</ulink>. Behavior is efficient and stable within a
+      <productname>Postgres</productname> major version. It is only
+      available for encoding <literal>UTF8</literal>.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term><literal>pg_c_utf8</literal></term>
     <listitem>
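
The documentation above says that the collation "uses the code point values only". For valid UTF-8, byte order and code point order coincide, which is why a memcmp-based comparison is enough. A minimal sketch follows; codepoint_cmp is a hypothetical helper, not a PostgreSQL function.

```c
#include <stddef.h>
#include <string.h>

/*
 * Compare two NUL-terminated UTF-8 strings in code point order.
 * Hypothetical helper for illustration. Because UTF-8 preserves code
 * point order under bytewise comparison, memcmp() on the common prefix
 * followed by a length tie-break is sufficient.
 *
 * Returns -1, 0 or 1.
 */
int
codepoint_cmp(const char *a, const char *b)
{
    size_t alen = strlen(a);
    size_t blen = strlen(b);
    size_t minlen = (alen < blen) ? alen : blen;
    int cmp = memcmp(a, b, minlen);

    if (cmp != 0)
        return (cmp < 0) ? -1 : 1;
    if (alen == blen)
        return 0;
    return (alen < blen) ? -1 : 1;
}
```

Under this ordering "Z" sorts before "a", and "z" sorts before "é", which is the "rather than natural language order" behavior the documentation describes.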

doc/src/sgml/ref/create_collation.sgml

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -99,7 +99,8 @@ CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> FROM <replace
      <para>
       If <replaceable>provider</replaceable> is <literal>builtin</literal>,
       then <replaceable>locale</replaceable> must be specified and set to
-      either <literal>C</literal> or <literal>C.UTF-8</literal>.
+      either <literal>C</literal>, <literal>C.UTF-8</literal> or
+      <literal>PG_UNICODE_FAST</literal>.
      </para>
     </listitem>
    </varlistentry>

doc/src/sgml/ref/create_database.sgml

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -168,7 +168,8 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
        If <xref linkend="create-database-locale-provider"/> is
        <literal>builtin</literal>, then <replaceable>locale</replaceable> or
        <replaceable>builtin_locale</replaceable> must be specified and set to
-       either <literal>C</literal> or <literal>C.UTF-8</literal>.
+       either <literal>C</literal>, <literal>C.UTF-8</literal>, or
+       <literal>PG_UNICODE_FAST</literal>.
       </para>
       <tip>
        <para>
@@ -233,7 +234,8 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
       </para>
       <para>
        The locales available for the <literal>builtin</literal> provider are
-       <literal>C</literal> and <literal>C.UTF-8</literal>.
+       <literal>C</literal>, <literal>C.UTF-8</literal> and
+       <literal>PG_UNICODE_FAST</literal>.
       </para>
      </listitem>
     </varlistentry>

doc/src/sgml/ref/initdb.sgml

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -295,8 +295,8 @@ PostgreSQL documentation
        <para>
         If <option>--locale-provider</option> is <literal>builtin</literal>,
         <option>--locale</option> or <option>--builtin-locale</option> must be
-        specified and set to <literal>C</literal> or
-        <literal>C.UTF-8</literal>.
+        specified and set to <literal>C</literal>, <literal>C.UTF-8</literal>
+        or <literal>PG_UNICODE_FAST</literal>.
        </para>
       </listitem>
      </varlistentry>

src/backend/regex/regc_pg_locale.c

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -307,7 +307,7 @@ pg_wc_isdigit(pg_wchar c)
             return (c <= (pg_wchar) 127 &&
                     (pg_char_properties[c] & PG_ISDIGIT));
         case PG_REGEX_STRATEGY_BUILTIN:
-            return pg_u_isdigit(c, true);
+            return pg_u_isdigit(c, !pg_regex_locale->info.builtin.casemap_full);
         case PG_REGEX_STRATEGY_LIBC_WIDE:
             if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
                 return iswdigit_l((wint_t) c, pg_regex_locale->info.lt);
@@ -361,7 +361,7 @@ pg_wc_isalnum(pg_wchar c)
             return (c <= (pg_wchar) 127 &&
                     (pg_char_properties[c] & PG_ISALNUM));
         case PG_REGEX_STRATEGY_BUILTIN:
-            return pg_u_isalnum(c, true);
+            return pg_u_isalnum(c, !pg_regex_locale->info.builtin.casemap_full);
         case PG_REGEX_STRATEGY_LIBC_WIDE:
             if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
                 return iswalnum_l((wint_t) c, pg_regex_locale->info.lt);
@@ -505,7 +505,7 @@ pg_wc_ispunct(pg_wchar c)
             return (c <= (pg_wchar) 127 &&
                     (pg_char_properties[c] & PG_ISPUNCT));
         case PG_REGEX_STRATEGY_BUILTIN:
-            return pg_u_ispunct(c, true);
+            return pg_u_ispunct(c, !pg_regex_locale->info.builtin.casemap_full);
         case PG_REGEX_STRATEGY_LIBC_WIDE:
             if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
                 return iswpunct_l((wint_t) c, pg_regex_locale->info.lt);
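
The three hunks above replace a hard-coded true with !casemap_full as the second argument of the pg_u_is*() calls. Assuming that argument selects the "POSIX Compatible" variant of a character class (as the documentation changes in this commit suggest), the effect on [[:digit:]] can be sketched as below. The function toy_isdigit and its tiny range table are hypothetical; the real Unicode tables cover many more scripts.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    uint32_t first;
    uint32_t last;
} toy_cp_range;

/* A handful of decimal-digit ranges only, for illustration. */
static const toy_cp_range toy_nd_ranges[] = {
    {0x0030, 0x0039},           /* ASCII digits */
    {0x0660, 0x0669},           /* Arabic-Indic digits */
    {0x0966, 0x096F},           /* Devanagari digits */
    {0x0D66, 0x0D6F},           /* Malayalam digits */
};

/*
 * posix == true:  only ASCII '0'..'9' count as digits.
 * posix == false: any code point in the toy decimal-digit table counts.
 */
bool
toy_isdigit(uint32_t cp, bool posix)
{
    size_t i;

    if (posix)
        return cp >= '0' && cp <= '9';

    for (i = 0; i < sizeof(toy_nd_ranges) / sizeof(toy_nd_ranges[0]); i++)
    {
        if (cp >= toy_nd_ranges[i].first && cp <= toy_nd_ranges[i].last)
            return true;
    }
    return false;
}
```

In this sketch U+0D67 (the Malayalam digit used in the regression test for \d) is a digit only when posix is false, which corresponds to a locale with casemap_full set.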

src/backend/utils/adt/pg_locale.c

Lines changed: 6 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1590,8 +1590,11 @@ builtin_locale_encoding(const char *locale)
 {
     if (strcmp(locale, "C") == 0)
         return -1;
-    if (strcmp(locale, "C.UTF-8") == 0)
+    else if (strcmp(locale, "C.UTF-8") == 0)
         return PG_UTF8;
+    else if (strcmp(locale, "PG_UNICODE_FAST") == 0)
+        return PG_UTF8;
+

     ereport(ERROR,
             (errcode(ERRCODE_WRONG_OBJECT_TYPE),
@@ -1616,6 +1619,8 @@ builtin_validate_locale(int encoding, const char *locale)
         canonical_name = "C";
     else if (strcmp(locale, "C.UTF-8") == 0 || strcmp(locale, "C.UTF8") == 0)
         canonical_name = "C.UTF-8";
+    else if (strcmp(locale, "PG_UNICODE_FAST") == 0)
+        canonical_name = "PG_UNICODE_FAST";

     if (!canonical_name)
         ereport(ERROR,
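
The name checks in this file, and the matching ones in initdb.c, follow one pattern: map an accepted spelling to a canonical name and note whether UTF-8 is required. A table-driven sketch of that pattern is shown below. It is an illustration only, not how PostgreSQL structures the code; all identifiers starting with toy_ are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct
{
    const char *accepted;       /* spelling the user may supply */
    const char *canonical;      /* name that would be stored */
    bool        requires_utf8;  /* false: usable with any encoding */
} toy_builtin_locale;

static const toy_builtin_locale toy_builtin_locales[] = {
    {"C", "C", false},
    {"C.UTF-8", "C.UTF-8", true},
    {"C.UTF8", "C.UTF-8", true},
    {"PG_UNICODE_FAST", "PG_UNICODE_FAST", true},
};

/* Exact, case-sensitive lookup; returns NULL for unknown names. */
static const toy_builtin_locale *
toy_builtin_lookup(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(toy_builtin_locales) / sizeof(toy_builtin_locales[0]); i++)
    {
        if (strcmp(name, toy_builtin_locales[i].accepted) == 0)
            return &toy_builtin_locales[i];
    }
    return NULL;
}

/* Canonical name for an accepted spelling, or NULL if not recognized. */
const char *
toy_builtin_canonical(const char *name)
{
    const toy_builtin_locale *entry = toy_builtin_lookup(name);

    return entry ? entry->canonical : NULL;
}

/* True only for recognized locales that need a UTF-8 database encoding. */
bool
toy_builtin_requires_utf8(const char *name)
{
    const toy_builtin_locale *entry = toy_builtin_lookup(name);

    return entry != NULL && entry->requires_utf8;
}
```

A name such as "unicode" is not in the table, so toy_builtin_canonical() returns NULL; the caller would then report an error, much like the "invalid locale name" failure exercised in the regression test.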

src/backend/utils/adt/pg_locale_builtin.c

Lines changed: 9 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -78,7 +78,8 @@ size_t
 strlower_builtin(char *dest, size_t destsize, const char *src, ssize_t srclen,
                  pg_locale_t locale)
 {
-    return unicode_strlower(dest, destsize, src, srclen, false);
+    return unicode_strlower(dest, destsize, src, srclen,
+                            locale->info.builtin.casemap_full);
 }

 size_t
@@ -93,15 +94,17 @@ strtitle_builtin(char *dest, size_t destsize, const char *src, ssize_t srclen,
         .prev_alnum = false,
     };

-    return unicode_strtitle(dest, destsize, src, srclen, false,
+    return unicode_strtitle(dest, destsize, src, srclen,
+                            locale->info.builtin.casemap_full,
                             initcap_wbnext, &wbstate);
 }

 size_t
 strupper_builtin(char *dest, size_t destsize, const char *src, ssize_t srclen,
                  pg_locale_t locale)
 {
-    return unicode_strupper(dest, destsize, src, srclen, false);
+    return unicode_strupper(dest, destsize, src, srclen,
+                            locale->info.builtin.casemap_full);
 }

 pg_locale_t
@@ -142,6 +145,7 @@ create_pg_locale_builtin(Oid collid, MemoryContext context)
     result = MemoryContextAllocZero(context, sizeof(struct pg_locale_struct));

     result->info.builtin.locale = MemoryContextStrdup(context, locstr);
+    result->info.builtin.casemap_full = (strcmp(locstr, "PG_UNICODE_FAST") == 0);
     result->provider = COLLPROVIDER_BUILTIN;
     result->deterministic = true;
     result->collate_is_c = true;
@@ -164,6 +168,8 @@ get_collation_actual_version_builtin(const char *collcollate)
         return "1";
     else if (strcmp(collcollate, "C.UTF-8") == 0)
         return "1";
+    else if (strcmp(collcollate, "PG_UNICODE_FAST") == 0)
+        return "1";
     else
         ereport(ERROR,
                 (errcode(ERRCODE_WRONG_OBJECT_TYPE),

src/bin/initdb/initdb.c

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2489,6 +2489,8 @@ setlocales(void)
         else if (strcmp(datlocale, "C.UTF-8") == 0 ||
                  strcmp(datlocale, "C.UTF8") == 0)
             canonname = "C.UTF-8";
+        else if (strcmp(datlocale, "PG_UNICODE_FAST") == 0)
+            canonname = "PG_UNICODE_FAST";
         else
             pg_fatal("invalid locale name \"%s\" for builtin provider",
                      datlocale);
@@ -2782,7 +2784,9 @@ setup_locale_encoding(void)

     if (locale_provider == COLLPROVIDER_BUILTIN)
     {
-        if (strcmp(datlocale, "C.UTF-8") == 0 && encodingid != PG_UTF8)
+        if ((strcmp(datlocale, "C.UTF-8") == 0 ||
+             strcmp(datlocale, "PG_UNICODE_FAST") == 0) &&
+            encodingid != PG_UTF8)
             pg_fatal("builtin provider locale \"%s\" requires encoding \"%s\"",
                      datlocale, "UTF-8");
     }

src/include/catalog/catversion.h

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -57,6 +57,6 @@
  */

 /*                          yyyymmddN */
-#define CATALOG_VERSION_NO  202501162
+#define CATALOG_VERSION_NO  202501171

 #endif

src/include/catalog/pg_collation.dat

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -33,5 +33,8 @@
   descr => 'sorts by Unicode code point; Unicode and POSIX character semantics',
   collname => 'pg_c_utf8', collprovider => 'b', collencoding => '6',
   colllocale => 'C.UTF-8', collversion => '1' },
+{ oid => '9535', descr => 'sorts by Unicode code point; Unicode character semantics',
+  collname => 'pg_unicode_fast', collprovider => 'b', collencoding => '6',
+  colllocale => 'PG_UNICODE_FAST', collversion => '1' },

 ]

src/include/utils/pg_locale.h

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -108,6 +108,7 @@ struct pg_locale_struct
         struct
         {
             const char *locale;
+            bool        casemap_full;
         }           builtin;
         locale_t    lt;
 #ifdef USE_ICU

src/test/regress/expected/collate.utf8.out

Lines changed: 160 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -160,3 +160,163 @@ SELECT 'δ' ~* '[Γ-Λ]' COLLATE PG_C_UTF8; -- same as above with cases reversed
  t
 (1 row)

+--
+-- Test PG_UNICODE_FAST
+--
+CREATE COLLATION regress_pg_unicode_fast (
+    provider = builtin, locale = 'unicode'); -- fails
+ERROR:  invalid locale name "unicode" for builtin provider
+CREATE COLLATION regress_pg_unicode_fast (
+    provider = builtin, locale = 'PG_UNICODE_FAST');
+CREATE TABLE test_pg_unicode_fast (
+    t TEXT COLLATE PG_UNICODE_FAST
+);
+INSERT INTO test_pg_unicode_fast VALUES
+    ('abc DEF 123abc'),
+    ('ábc sßs ßss DÉF'),
+    ('DŽxxDŽ džxxDž Džxxdž'),
+    ('ȺȺȺ'),
+    ('ⱥⱥⱥ'),
+    ('ⱥȺ');
+SELECT
+    t, lower(t), initcap(t), upper(t),
+    length(convert_to(t, 'UTF8')) AS t_bytes,
+    length(convert_to(lower(t), 'UTF8')) AS lower_t_bytes,
+    length(convert_to(initcap(t), 'UTF8')) AS initcap_t_bytes,
+    length(convert_to(upper(t), 'UTF8')) AS upper_t_bytes
+  FROM test_pg_unicode_fast;
+        t        |      lower      |     initcap      |       upper       | t_bytes | lower_t_bytes | initcap_t_bytes | upper_t_bytes
+-----------------+-----------------+------------------+-------------------+---------+---------------+-----------------+---------------
+ abc DEF 123abc  | abc def 123abc  | Abc Def 123abc   | ABC DEF 123ABC    |      14 |            14 |              14 |            14
+ ábc sßs ßss DÉF | ábc sßs ßss déf | Ábc Sßs Ssss Déf | ÁBC SSSS SSSS DÉF |      19 |            19 |              19 |            19
+ DŽxxDŽ džxxDž Džxxdž  | džxxdž džxxdž džxxdž  | Džxxdž Džxxdž Džxxdž   | DŽXXDŽ DŽXXDŽ DŽXXDŽ    |      20 |            20 |              20 |            20
+ ȺȺȺ             | ⱥⱥⱥ             | Ⱥⱥⱥ              | ȺȺȺ               |       6 |             9 |               8 |             6
+ ⱥⱥⱥ             | ⱥⱥⱥ             | Ⱥⱥⱥ              | ȺȺȺ               |       9 |             9 |               8 |             6
+ ⱥȺ              | ⱥⱥ              | Ⱥⱥ               | ȺȺ                |       5 |             6 |               5 |             4
+(6 rows)
+
+DROP TABLE test_pg_unicode_fast;
+-- test Final_Sigma
+SELECT lower('ΑΣ' COLLATE PG_UNICODE_FAST); -- 0391 03A3
+ lower
+-------
+ ας
+(1 row)
+
+SELECT lower('ΑΣ0' COLLATE PG_UNICODE_FAST); -- 0391 03A3 0030
+ lower
+-------
+ ας0
+(1 row)
+
+SELECT lower('ἈΣ̓' COLLATE PG_UNICODE_FAST); -- 0391 0343 03A3 0343
+ lower
+-------
+ ἀς̓
+(1 row)
+
+SELECT lower('ᾼΣͅ' COLLATE PG_UNICODE_FAST); -- 0391 0345 03A3 0345
+ lower
+-------
+ ᾳςͅ
+(1 row)
+
+-- test !Final_Sigma
+SELECT lower('Σ' COLLATE PG_UNICODE_FAST); -- 03A3
+ lower
+-------
+ σ
+(1 row)
+
+SELECT lower('0Σ' COLLATE PG_UNICODE_FAST); -- 0030 03A3
+ lower
+-------
+ 0σ
+(1 row)
+
+SELECT lower('ΑΣΑ' COLLATE PG_UNICODE_FAST); -- 0391 03A3 0391
+ lower
+-------
+ ασα
+(1 row)
+
+SELECT lower('ἈΣ̓Α' COLLATE PG_UNICODE_FAST); -- 0391 0343 03A3 0343 0391
+ lower
+-------
+ ἀσ̓α
+(1 row)
+
+SELECT lower('ᾼΣͅΑ' COLLATE PG_UNICODE_FAST); -- 0391 0345 03A3 0345 0391
+ lower
+-------
+ ᾳσͅα
+(1 row)
+
+-- properties
+SELECT 'xyz' ~ '[[:alnum:]]' COLLATE PG_UNICODE_FAST;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT 'xyz' !~ '[[:upper:]]' COLLATE PG_UNICODE_FAST;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '@' !~ '[[:alnum:]]' COLLATE PG_UNICODE_FAST;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '=' !~ '[[:punct:]]' COLLATE PG_UNICODE_FAST; -- symbols are not punctuation
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT 'a8a' ~ '[[:digit:]]' COLLATE PG_UNICODE_FAST;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT '൧' ~ '\d' COLLATE PG_UNICODE_FAST;
+ ?column?
+----------
+ t
+(1 row)
+
+-- case mapping
+SELECT 'xYz' ~* 'XyZ' COLLATE PG_UNICODE_FAST;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT 'xAb' ~* '[W-Y]' COLLATE PG_UNICODE_FAST;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT 'xAb' !~* '[c-d]' COLLATE PG_UNICODE_FAST;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT 'Δ' ~* '[γ-λ]' COLLATE PG_UNICODE_FAST;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT 'δ' ~* '[Γ-Λ]' COLLATE PG_UNICODE_FAST; -- same as above with cases reversed
+ ?column?
+----------
+ t
+(1 row)
+
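
The Final_Sigma cases in the expected output above follow the Unicode rule that a capital sigma lowercases to the final form (U+03C2) when it is preceded by a cased letter and not followed by one, with case-ignorable characters skipped on both sides. The sketch below works on arrays of code points and uses deliberately tiny classification functions (basic Latin and Greek letters as "cased", the U+0300..U+036F combining marks as "case-ignorable"). These simplifications and all toy_ names are assumptions for illustration, not PostgreSQL's implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy "cased" test: basic Latin letters and basic Greek letters only. */
static bool
toy_is_cased(uint32_t cp)
{
    return (cp >= 'A' && cp <= 'Z') ||
           (cp >= 'a' && cp <= 'z') ||
           (cp >= 0x0391 && cp <= 0x03A9 && cp != 0x03A2) ||
           (cp >= 0x03B1 && cp <= 0x03C9);
}

/* Toy "case-ignorable" test: the combining diacritical marks block. */
static bool
toy_is_case_ignorable(uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

/*
 * Does the character at text[pos] (assumed to be U+03A3) satisfy the
 * Final_Sigma condition? Case-ignorable characters are skipped first,
 * then the next character in each direction is tested for being cased.
 */
bool
toy_final_sigma(const uint32_t *text, size_t len, size_t pos)
{
    size_t i;

    /* Backward: skip case-ignorables, then require a cased letter. */
    i = pos;
    while (i > 0 && toy_is_case_ignorable(text[i - 1]))
        i--;
    if (i == 0 || !toy_is_cased(text[i - 1]))
        return false;

    /* Forward: skip case-ignorables, then reject a cased letter. */
    i = pos + 1;
    while (i < len && toy_is_case_ignorable(text[i]))
        i++;
    if (i < len && toy_is_cased(text[i]))
        return false;

    return true;
}

/* Lowercase form of a capital sigma at text[pos]. */
uint32_t
toy_lower_sigma(const uint32_t *text, size_t len, size_t pos)
{
    return toy_final_sigma(text, len, pos) ? 0x03C2 : 0x03C3;
}
```

For the sequence 0391 03A3 the sketch yields U+03C2 (final sigma), while for 03A3 alone, 0030 03A3, and 0391 03A3 0391 it yields U+03C3, matching the expected results listed in the regression output.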
