Commit 3ebc061

Make citext's equality and hashing functions collation-insensitive.
This is an ugly hack to get around the fact that significant parts of the core backend assume they don't need to worry about passing collation to equality and hashing functions. That's true for the core string datatypes, but citext should ideally have equality behavior that depends on the specified collation's LC_CTYPE. However, there's no chance of fixing the core before 9.2, so we'll have to live with this compromise arrangement for now.

Per bug #6053 from Regina Obe.

The code changes in this commit should be reverted in full once the core code is up to speed, but be careful about reverting the docs changes: I fixed a number of obsolete statements while at it.
1 parent 1bcdd66 commit 3ebc061

2 files changed: 52 additions, 21 deletions

contrib/citext/citext.c

Lines changed: 16 additions & 7 deletions
@@ -4,6 +4,7 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "catalog/pg_collation.h"
 #include "fmgr.h"
 #include "utils/builtins.h"
 #include "utils/formatting.h"
@@ -48,8 +49,16 @@ citextcmp(text *left, text *right, Oid collid)
 			*rcstr;
 	int32 result;
 
-	lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left), collid);
-	rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right), collid);
+	/*
+	 * We must do our str_tolower calls with DEFAULT_COLLATION_OID, not the
+	 * input collation as you might expect. This is so that the behavior of
+	 * citext's equality and hashing functions is not collation-dependent. We
+	 * should change this once the core infrastructure is able to cope with
+	 * collation-dependent equality and hashing functions.
+	 */
+
+	lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left), DEFAULT_COLLATION_OID);
+	rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right), DEFAULT_COLLATION_OID);
 
 	result = varstr_cmp(lcstr, strlen(lcstr),
 						rcstr, strlen(rcstr),
@@ -93,7 +102,7 @@ citext_hash(PG_FUNCTION_ARGS)
 	char *str;
 	Datum result;
 
-	str = str_tolower(VARDATA_ANY(txt), VARSIZE_ANY_EXHDR(txt), PG_GET_COLLATION());
+	str = str_tolower(VARDATA_ANY(txt), VARSIZE_ANY_EXHDR(txt), DEFAULT_COLLATION_OID);
 	result = hash_any((unsigned char *) str, strlen(str));
 	pfree(str);
 
@@ -122,8 +131,8 @@ citext_eq(PG_FUNCTION_ARGS)
 
 	/* We can't compare lengths in advance of downcasing ... */
 
-	lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left), PG_GET_COLLATION());
-	rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right), PG_GET_COLLATION());
+	lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left), DEFAULT_COLLATION_OID);
+	rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right), DEFAULT_COLLATION_OID);
 
 	/*
 	 * Since we only care about equality or not-equality, we can avoid all the
@@ -152,8 +161,8 @@ citext_ne(PG_FUNCTION_ARGS)
 
 	/* We can't compare lengths in advance of downcasing ... */
 
-	lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left), PG_GET_COLLATION());
-	rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right), PG_GET_COLLATION());
+	lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left), DEFAULT_COLLATION_OID);
+	rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right), DEFAULT_COLLATION_OID);
 
 	/*
 	 * Since we only care about equality or not-equality, we can avoid all the
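
The pattern in the hunks above can be illustrated outside the backend. The following is a minimal, self-contained sketch, not the citext implementation: fold_default, ci_equal, ci_hash, ci_cmp and collate_fn are hypothetical names, the C library's tolower stands in for str_tolower with DEFAULT_COLLATION_OID, an FNV-1a loop stands in for hash_any, and a plain comparator function stands in for the collation that varstr_cmp receives. It assumes single-byte text. The point it shows is that equality and hashing fold with one fixed rule, so two values that compare equal always hash alike, while only the ordering comparison still sees the caller's collation.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a collation: just a string comparator. */
typedef int (*collate_fn) (const char *, const char *);

/*
 * Hypothetical stand-in for str_tolower(..., DEFAULT_COLLATION_OID):
 * returns a malloc'd lower-case copy, folded with one fixed rule that
 * does not depend on any per-call collation.
 */
static char *
fold_default(const char *s)
{
	size_t		len;
	size_t		i;
	char	   *out;

	assert(s != NULL);
	len = strlen(s);
	out = malloc(len + 1);
	if (out == NULL)
		abort();
	for (i = 0; i < len; i++)
		out[i] = (char) tolower((unsigned char) s[i]);
	out[len] = '\0';
	return out;
}

/* Equality: fold both sides the same way, then compare the bytes. */
static int
ci_equal(const char *a, const char *b)
{
	char	   *la = fold_default(a);
	char	   *lb = fold_default(b);
	int			eq = (strcmp(la, lb) == 0);

	free(la);
	free(lb);
	return eq;
}

/* Hashing: FNV-1a over the folded bytes (a stand-in for hash_any). */
static uint32_t
ci_hash(const char *s)
{
	char	   *ls = fold_default(s);
	const unsigned char *p;
	uint32_t	h = 2166136261u;

	for (p = (const unsigned char *) ls; *p != '\0'; p++)
	{
		h ^= *p;
		h *= 16777619u;
	}
	free(ls);
	return h;
}

/* Ordering: fold with the fixed rule, then compare with the caller's collation. */
static int
ci_cmp(const char *a, const char *b, collate_fn collate)
{
	char	   *la = fold_default(a);
	char	   *lb = fold_default(b);
	int			result = collate(la, lb);

	free(la);
	free(lb);
	return result;
}
```

Passing strcmp as the comparator gives byte-order comparison of the folded strings; any other function with the same signature could be substituted to model a different collation without affecting ci_equal or ci_hash.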

doc/src/sgml/citext.sgml

Lines changed: 36 additions & 14 deletions
@@ -58,9 +58,9 @@ SELECT * FROM tab WHERE lower(col) = LOWER(?);
 The <type>citext</> data type allows you to eliminate calls
 to <function>lower</> in SQL queries, and allows a primary key to
 be case-insensitive. <type>citext</> is locale-aware, just
-like <type>text</>, which means that the comparison of upper case and
+like <type>text</>, which means that the matching of upper case and
 lower case characters is dependent on the rules of
-the <literal>LC_CTYPE</> locale setting. Again, this behavior is
+the database's <literal>LC_CTYPE</> setting. Again, this behavior is
 identical to the use of <function>lower</> in queries. But because it's
 done transparently by the data type, you don't have to remember to do
 anything special in your queries.
@@ -97,17 +97,25 @@ SELECT * FROM users WHERE nick = 'Larry';
 
 <sect2>
 <title>String Comparison Behavior</title>
+
+<para>
+<type>citext</> performs comparisons by converting each string to lower
+case (as though <function>lower</> were called) and then comparing the
+results normally. Thus, for example, two strings are considered equal
+if <function>lower</> would produce identical results for them.
+</para>
+
 <para>
 In order to emulate a case-insensitive collation as closely as possible,
-there are <type>citext</>-specific versions of a number of the comparison
+there are <type>citext</>-specific versions of a number of string-processing
 operators and functions. So, for example, the regular expression
 operators <literal>~</> and <literal>~*</> exhibit the same behavior when
-applied to <type>citext</>: they both compare case-insensitively.
+applied to <type>citext</>: they both match case-insensitively.
 The same is true
 for <literal>!~</> and <literal>!~*</>, as well as for the
 <literal>LIKE</> operators <literal>~~</> and <literal>~~*</>, and
 <literal>!~~</> and <literal>!~~*</>. If you'd like to match
-case-sensitively, you can always cast to <type>text</> before comparing.
+case-sensitively, you can cast the operator's arguments to <type>text</>.
 </para>
 
 <para>
@@ -168,10 +176,10 @@ SELECT * FROM users WHERE nick = 'Larry';
 <itemizedlist>
 <listitem>
 <para>
-<type>citext</>'s behavior depends on
+<type>citext</>'s case-folding behavior depends on
 the <literal>LC_CTYPE</> setting of your database. How it compares
-values is therefore determined when
-<application>initdb</> is run to create the cluster. It is not truly
+values is therefore determined when the database is created.
+It is not truly
 case-insensitive in the terms defined by the Unicode standard.
 Effectively, what this means is that, as long as you're happy with your
 collation, you should be happy with <type>citext</>'s comparisons. But
@@ -181,6 +189,20 @@ SELECT * FROM users WHERE nick = 'Larry';
 </para>
 </listitem>
 
+<listitem>
+<para>
+As of <productname>PostgreSQL</> 9.1, you can attach a
+<literal>COLLATE</> specification to <type>citext</> columns or data
+values. Currently, <type>citext</> operators will honor a non-default
+<literal>COLLATE</> specification while comparing case-folded strings,
+but the initial folding to lower case is always done according to the
+database's <literal>LC_CTYPE</> setting (that is, as though
+<literal>COLLATE "default"</> were given). This may be changed in a
+future release so that both steps follow the input <literal>COLLATE</>
+specification.
+</para>
+</listitem>
+
 <listitem>
 <para>
 <type>citext</> is not as efficient as <type>text</> because the
@@ -198,20 +220,20 @@ SELECT * FROM users WHERE nick = 'Larry';
 contexts. The standard answer is to use the <type>text</> type and
 manually use the <function>lower</> function when you need to compare
 case-insensitively; this works all right if case-insensitive comparison
-is needed only infrequently. If you need case-insensitive most of
-the time and case-sensitive infrequently, consider storing the data
+is needed only infrequently. If you need case-insensitive behavior most
+of the time and case-sensitive infrequently, consider storing the data
 as <type>citext</> and explicitly casting the column to <type>text</>
-when you want case-sensitive comparison. In either situation, you
-will need two indexes if you want both types of searches to be fast.
+when you want case-sensitive comparison. In either situation, you will
+need two indexes if you want both types of searches to be fast.
 </para>
 </listitem>
 
 <listitem>
 <para>
 The schema containing the <type>citext</> operators must be
 in the current <varname>search_path</> (typically <literal>public</>);
-if it is not, a normal case-sensitive <type>text</> comparison
-is performed.
+if it is not, the normal case-sensitive <type>text</> operators
+will be invoked instead.
 </para>
 </listitem>
 </itemizedlist>
