
Commit f41bd4c

Expand collation documentation
Document better how to create custom collations and what locale strings ICU accepts. Explain the ICU examples in more detail. Also update the text on the CREATE COLLATION reference page a bit to take ICU more into account.
1 parent 0703c19 commit f41bd4c

2 files changed

+124
-39
lines changed


doc/src/sgml/charset.sgml

Lines changed: 107 additions & 28 deletions
@@ -515,7 +515,7 @@ SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR";
 <para>
 A collation object provided by <literal>libc</literal> maps to a
 combination of <symbol>LC_COLLATE</symbol> and <symbol>LC_CTYPE</symbol>
-settings. (As
+settings, as accepted by the <literal>setlocale()</literal> system library call. (As
 the name would suggest, the main purpose of a collation is to set
 <symbol>LC_COLLATE</symbol>, which controls the sort order. But
 it is rarely necessary in practice to have an
@@ -640,21 +640,19 @@ SELECT a COLLATE "C" &lt; b COLLATE "POSIX" FROM test1;
 <title>ICU collations</title>
 
 <para>
-Collations provided by ICU are created with names in BCP 47 language tag
+With ICU, it is not sensible to enumerate all possible locale names. ICU
+uses a particular naming system for locales, but there are many more ways
+to name a locale than there are actually distinct locales.
+<command>initdb</command> uses the ICU APIs to extract a set of distinct
+locales to populate the initial set of collations. Collations provided by
+ICU are created in the SQL environment with names in BCP 47 language tag
 format, with a <quote>private use</quote>
 extension <literal>-x-icu</literal> appended, to distinguish them from
-libc locales. So <literal>de-x-icu</literal> would be an example name.
+libc locales.
 </para>
 
 <para>
-With ICU, it is not sensible to enumerate all possible locale names. ICU
-uses a particular naming system for locales, but there are many more ways
-to name a locale than there are actually distinct locales. (In fact, any
-string will be accepted as a locale name.)
-See <ulink url="http://userguide.icu-project.org/locale"></ulink> for
-information on ICU locale naming. <command>initdb</command> uses the ICU
-APIs to extract a set of distinct locales to populate the initial set of
-collations. Here are some example collations that might be created:
+Here are some example collations that might be created:
 
 <variablelist>
 <varlistentry>
@@ -695,32 +693,104 @@ SELECT a COLLATE "C" &lt; b COLLATE "POSIX" FROM test1;
 will draw an error along the lines of <quote>collation "de-x-icu" for
 encoding "WIN874" does not exist</>.
 </para>
+</sect4>
+</sect3>
+
+<sect3 id="collation-create">
+<title>Creating New Collation Objects</title>
+
+<para>
+If the standard and predefined collations are not sufficient, users can
+create their own collation objects using the SQL
+command <xref linkend="sql-createcollation">.
+</para>
+
+<para>
+The standard and predefined collations are in the
+schema <literal>pg_catalog</literal>, like all predefined objects.
+User-defined collations should be created in user schemas. This also
+ensures that they are saved by <command>pg_dump</command>.
+</para>
+
+<sect4>
+<title>libc collations</title>
+
+<para>
+New libc collations can be created like this:
+<programlisting>
+CREATE COLLATION german (provider = libc, locale = 'de_DE');
+</programlisting>
+The exact values that are acceptable for the <literal>locale</literal>
+clause in this command depend on the operating system. On Unix-like
+systems, the command <literal>locale -a</literal> will show a list.
+</para>
+
+<para>
+Since the predefined libc collations already include all collations
+defined in the operating system when the database instance is
+initialized, it is not often necessary to manually create new ones.
+Reasons might be if a different naming system is desired (in which case
+see also <xref linkend="collation-copy">) or if the operating system has
+been upgraded to provide new locale definitions (in which case see
+also <link linkend="functions-admin-collation"><function>pg_import_system_collations()</function></link>).
+</para>
+</sect4>
+
+<sect4>
+<title>ICU collations</title>
 
 <para>
 ICU allows collations to be customized beyond the basic language+country
 set that is preloaded by <command>initdb</command>. Users are encouraged
 to define their own collation objects that make use of these facilities to
-suit the sorting behavior to their requirements. Here are some examples:
+suit the sorting behavior to their requirements.
+See <ulink url="http://userguide.icu-project.org/locale"></ulink>
+and <ulink url="http://userguide.icu-project.org/collation/api"></ulink> for
+information on ICU locale naming. The set of acceptable names and
+attributes depends on the particular ICU version.
+</para>
+
+<para>
+Here are some examples:
 
 <variablelist>
 <varlistentry>
-<term><literal>CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de-u-co-phonebk')</literal></term>
+<term><literal>CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de-u-co-phonebk');</literal></term>
+<term><literal>CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de@collation=phonebook');</literal></term>
 <listitem>
 <para>German collation with phone book collation type</para>
+<para>
+The first example selects the ICU locale using a <quote>language
+tag</quote> per BCP 47. The second example uses the traditional
+ICU-specific locale syntax. The first style is preferred going
+forward, but it is not supported by older ICU versions.
+</para>
+<para>
+Note that you can name the collation objects in the SQL environment
+anything you want. In this example, we follow the naming style that
+the predefined collations use, which in turn also follow BCP 47, but
+that is not required for user-defined collations.
+</para>
 </listitem>
 </varlistentry>
 
 <varlistentry>
-<term><literal>CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = 'und-u-co-emoji')</literal></term>
+<term><literal>CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = 'und-u-co-emoji');</literal></term>
+<term><literal>CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = '@collation=emoji');</literal></term>
 <listitem>
 <para>
 Root collation with Emoji collation type, per Unicode Technical Standard #51
 </para>
+<para>
+Observe how in the traditional ICU locale naming system, the root
+locale is selected by an empty string.
+</para>
 </listitem>
 </varlistentry>
 
 <varlistentry>
-<term><literal>CREATE COLLATION digitslast (provider = icu, locale = 'en-u-kr-latn-digit')</literal></term>
+<term><literal>CREATE COLLATION digitslast (provider = icu, locale = 'en-u-kr-latn-digit');</literal></term>
+<term><literal>CREATE COLLATION digitslast (provider = icu, locale = 'en@colReorder=latn-digit');</literal></term>
 <listitem>
 <para>
 Sort digits after Latin letters. (The default is digits before letters.)
@@ -729,7 +799,8 @@ SELECT a COLLATE "C" &lt; b COLLATE "POSIX" FROM test1;
 </varlistentry>
 
 <varlistentry>
-<term><literal>CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper')</literal></term>
+<term><literal>CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper');</literal></term>
+<term><literal>CREATE COLLATION upperfirst (provider = icu, locale = 'en@colCaseFirst=upper');</literal></term>
 <listitem>
 <para>
 Sort upper-case letters before lower-case letters. (The default is
@@ -739,7 +810,8 @@ SELECT a COLLATE "C" &lt; b COLLATE "POSIX" FROM test1;
 </varlistentry>
 
 <varlistentry>
-<term><literal>CREATE COLLATION special (provider = icu, locale = 'en-u-kf-upper-kr-latn-digit')</literal></term>
+<term><literal>CREATE COLLATION special (provider = icu, locale = 'en-u-kf-upper-kr-latn-digit');</literal></term>
+<term><literal>CREATE COLLATION special (provider = icu, locale = 'en@colCaseFirst=upper;colReorder=latn-digit');</literal></term>
 <listitem>
 <para>
 Combines both of the above options.
@@ -748,7 +820,8 @@ SELECT a COLLATE "C" &lt; b COLLATE "POSIX" FROM test1;
 </varlistentry>
 
 <varlistentry>
-<term><literal>CREATE COLLATION numeric (provider = icu, locale = 'en-u-kn-true')</literal></term>
+<term><literal>CREATE COLLATION numeric (provider = icu, locale = 'en-u-kn-true');</literal></term>
+<term><literal>CREATE COLLATION numeric (provider = icu, locale = 'en@colNumeric=yes');</literal></term>
 <listitem>
 <para>
 Numeric ordering, sorts sequences of digits by their numeric value,
@@ -768,7 +841,8 @@ SELECT a COLLATE "C" &lt; b COLLATE "POSIX" FROM test1;
 repository</ulink>.
 The <ulink url="https://ssl.icu-project.org/icu-bin/locexp">ICU Locale
 Explorer</ulink> can be used to check the details of a particular locale
-definition.
+definition. The examples using the <literal>k*</literal> subtags require
+at least ICU version 54.
 </para>
 
 <para>
@@ -779,10 +853,21 @@ SELECT a COLLATE "C" &lt; b COLLATE "POSIX" FROM test1;
 strings that compare equal according to the collation but are not
 byte-wise equal will be sorted according to their byte values.
 </para>
+
+<note>
+<para>
+By design, ICU will accept almost any string as a locale name and match
+it to the closest locale it can provide, using the fallback procedure
+described in its documentation. Thus, there will be no direct feedback
+if a collation specification is composed using features that the given
+ICU installation does not actually support. It is therefore recommended
+to create application-level test cases to check that the collation
+definitions satisfy one's requirements.
+</para>
+</note>
 </sect4>
-</sect3>
 
-<sect3>
+<sect4 id="collation-copy">
 <title>Copying Collations</title>
 
 <para>
@@ -796,13 +881,7 @@ CREATE COLLATION german FROM "de_DE";
 CREATE COLLATION french FROM "fr-x-icu";
 </programlisting>
 </para>
-
-<para>
-The standard and predefined collations are in the
-schema <literal>pg_catalog</literal>, like all predefined objects.
-User-defined collations should be created in user schemas. This also
-ensures that they are saved by <command>pg_dump</command>.
-</para>
+</sect4>
 </sect3>
 </sect2>
 </sect1>

doc/src/sgml/ref/create_collation.sgml

Lines changed: 17 additions & 11 deletions
@@ -93,10 +93,7 @@ CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> FROM <replace
 <listitem>
 <para>
 Use the specified operating system locale for
-the <symbol>LC_COLLATE</symbol> locale category. The locale
-must be applicable to the current database encoding.
-(See <xref linkend="sql-createdatabase"> for the precise
-rules.)
+the <symbol>LC_COLLATE</symbol> locale category.
 </para>
 </listitem>
 </varlistentry>
@@ -107,10 +104,7 @@ CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> FROM <replace
 <listitem>
 <para>
 Use the specified operating system locale for
-the <symbol>LC_CTYPE</symbol> locale category. The locale
-must be applicable to the current database encoding.
-(See <xref linkend="sql-createdatabase"> for the precise
-rules.)
+the <symbol>LC_CTYPE</symbol> locale category.
 </para>
 </listitem>
 </varlistentry>
@@ -173,8 +167,13 @@ CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> FROM <replace
 </para>
 
 <para>
-See <xref linkend="collation"> for more information about collation
-support in PostgreSQL.
+See <xref linkend="collation-create"> for more information on how to create collations.
+</para>
+
+<para>
+When using the <literal>libc</literal> collation provider, the locale must
+be applicable to the current database encoding.
+See <xref linkend="sql-createdatabase"> for the precise rules.
 </para>
 </refsect1>

@@ -186,7 +185,14 @@ CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> FROM <replace
 <literal>fr_FR.utf8</literal>
 (assuming the current database encoding is <literal>UTF8</literal>):
 <programlisting>
-CREATE COLLATION french (LOCALE = 'fr_FR.utf8');
+CREATE COLLATION french (locale = 'fr_FR.utf8');
+</programlisting>
+</para>
+
+<para>
+To create a collation using the ICU provider using German phone book sort order:
+<programlisting>
+CREATE COLLATION german_phonebook (provider = icu, locale = 'de-u-co-phonebk');
 </programlisting>
 </para>
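
Illustration (not part of the commit): the note added to charset.sgml recommends application-level test cases, because ICU gives no direct feedback when a locale specification falls back to something simpler. A minimal sketch of such a check follows. It assumes a PostgreSQL build with ICU support, a UTF8 database, and that the predefined "de-x-icu" collation is present. The sample strings are chosen for illustration only and the expected results in the comments should be confirmed on the target installation.

```sql
-- Same definition as the reference-page example in this commit.
CREATE COLLATION german_phonebook (provider = icu, locale = 'de-u-co-phonebk');

-- Phone book ordering is expected to treat "ö" like "oe", which should
-- place 'Göthe' before 'Goldmann'. The plain German ordering treats "ö"
-- as a variant of "o", which should place it after 'Goldmann'.
SELECT 'Göthe' < 'Goldmann' COLLATE german_phonebook AS phonebook_order,
       'Göthe' < 'Goldmann' COLLATE "de-x-icu"       AS plain_order;

-- Remove the test object again.
DROP COLLATION german_phonebook;
```

If both columns report the same value, the phonebk collation type was probably not applied by the installed ICU version, which is the silent fallback the note describes.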
