Commit 06735e3

Unicode escapes in strings and identifiers
1 parent 05bba3d commit 06735e3

18 files changed: +638 -59 lines changed

doc/src/sgml/syntax.sgml

Lines changed: 134 additions & 8 deletions
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/syntax.sgml,v 1.123 2008/06/26 22:24:42 momjian Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/syntax.sgml,v 1.124 2008/10/29 08:04:52 petere Exp $ -->
 
 <chapter id="sql-syntax">
 <title>SQL Syntax</title>
@@ -189,6 +189,57 @@ UPDATE "my_table" SET "a" = 5;
 ampersands. The length limitation still applies.
 </para>
 
+<para>
+<indexterm><primary>Unicode escape</primary><secondary>in
+identifiers</secondary></indexterm> A variant of quoted
+identifiers allows including escaped Unicode characters identified
+by their code points. This variant starts
+with <literal>U&</literal> (upper or lower case U followed by
+ampersand) immediately before the opening double quote, without
+any spaces in between, for example <literal>U&"foo"</literal>.
+(Note that this creates an ambiguity with the
+operator <literal>&</literal>. Use spaces around the operator to
+avoid this problem.) Inside the quotes, Unicode characters can be
+specified in escaped form by writing a backslash followed by the
+four-digit hexadecimal code point number or alternatively a
+backslash followed by a plus sign followed by a six-digit
+hexadecimal code point number. For example, the
+identifier <literal>"data"</literal> could be written as
+<programlisting>
+U&"d\0061t\+000061"
+</programlisting>
+The following less trivial example writes the Russian
+word <quote>slon</quote> (elephant) in Cyrillic letters:
+<programlisting>
+U&"\0441\043B\043E\043D"
+</programlisting>
+</para>
+
+<para>
+If a different escape character than backslash is desired, it can
+be specified using
+the <literal>UESCAPE</literal><indexterm><primary>UESCAPE</primary></indexterm>
+clause after the string, for example:
+<programlisting>
+U&"d!0061t!+000061" UESCAPE '!'
+</programlisting>
+The escape character can be any single character other than a
+hexadecimal digit, the plus sign, a single quote, a double quote,
+or a whitespace character. Note that the escape character is
+written in single quotes, not double quotes.
+</para>
+
+<para>
+To include the escape character in the identifier literally, write
+it twice.
+</para>
+
+<para>
+The Unicode escape syntax works only when the server encoding is
+UTF8. When other server encodings are used, only code points in
+the ASCII range (up to <literal>\007F</literal>) can be specified.
+</para>
+
 <para>
 Quoting an identifier also makes it case-sensitive, whereas
 unquoted names are always folded to lower case. For example, the
@@ -245,7 +296,7 @@ UPDATE "my_table" SET "a" = 5;
 write two adjacent single quotes, e.g.
 <literal>'Dianne''s horse'</literal>.
 Note that this is <emphasis>not</> the same as a double-quote
-character (<literal>"</>).
+character (<literal>"</>). <!-- font-lock sanity: " -->
 </para>
 
 <para>
@@ -269,14 +320,19 @@ SELECT 'foo' 'bar';
 by <acronym>SQL</acronym>; <productname>PostgreSQL</productname> is
 following the standard.)
 </para>
+</sect3>
 
-<para>
-<indexterm>
+<sect3 id="sql-syntax-strings-escape">
+<title>String Constants with C-Style Escapes</title>
+
+<indexterm zone="sql-syntax-strings-escape">
 <primary>escape string syntax</primary>
 </indexterm>
-<indexterm>
+<indexterm zone="sql-syntax-strings-escape">
 <primary>backslash escapes</primary>
 </indexterm>
+
+<para>
 <productname>PostgreSQL</productname> also accepts <quote>escape</>
 string constants, which are an extension to the SQL standard.
 An escape string constant is specified by writing the letter
@@ -287,7 +343,8 @@ SELECT 'foo' 'bar';
 Within an escape string, a backslash character (<literal>\</>) begins a
 C-like <firstterm>backslash escape</> sequence, in which the combination
 of backslash and following character(s) represent a special byte
-value:
+value, as shown in <xref linkend="sql-backslash-table">.
+</para>
 
 <table id="sql-backslash-table">
 <title>Backslash Escape Sequences</title>
@@ -341,14 +398,24 @@ SELECT 'foo' 'bar';
 </tgroup>
 </table>
 
-It is your responsibility that the byte sequences you create are
-valid characters in the server character set encoding. Any other
+<para>
+Any other
 character following a backslash is taken literally. Thus, to
 include a backslash character, write two backslashes (<literal>\\</>).
 Also, a single quote can be included in an escape string by writing
 <literal>\'</literal>, in addition to the normal way of <literal>''</>.
 </para>
 
+<para>
+It is your responsibility that the byte sequences you create are
+valid characters in the server character set encoding. When the
+server encoding is UTF-8, then the alternative Unicode escape
+syntax, explained in <xref linkend="sql-syntax-strings-uescape">,
+should be used instead. (The alternative would be doing the
+UTF-8 encoding by hand and writing out the bytes, which would be
+very cumbersome.)
+</para>
+
 <caution>
 <para>
 If the configuration parameter
@@ -379,6 +446,65 @@ SELECT 'foo' 'bar';
 </para>
 </sect3>
 
+<sect3 id="sql-syntax-strings-uescape">
+<title>String Constants with Unicode Escapes</title>
+
+<indexterm zone="sql-syntax-strings-uescape">
+<primary>Unicode escape</primary>
+<secondary>in string constants</secondary>
+</indexterm>
+
+<para>
+<productname>PostgreSQL</productname> also supports another type
+of escape syntax for strings that allows specifying arbitrary
+Unicode characters by code point. A Unicode escape string
+constant starts with <literal>U&</literal> (upper or lower case
+letter U followed by ampersand) immediately before the opening
+quote, without any spaces in between, for
+example <literal>U&'foo'</literal>. (Note that this creates an
+ambiguity with the operator <literal>&</literal>. Use spaces
+around the operator to avoid this problem.) Inside the quotes,
+Unicode characters can be specified in escaped form by writing a
+backslash followed by the four-digit hexadecimal code point
+number or alternatively a backslash followed by a plus sign
+followed by a six-digit hexadecimal code point number. For
+example, the string <literal>'data'</literal> could be written as
+<programlisting>
+U&'d\0061t\+000061'
+</programlisting>
+The following less trivial example writes the Russian
+word <quote>slon</quote> (elephant) in Cyrillic letters:
+<programlisting>
+U&'\0441\043B\043E\043D'
+</programlisting>
+</para>
+
+<para>
+If a different escape character than backslash is desired, it can
+be specified using
+the <literal>UESCAPE</literal><indexterm><primary>UESCAPE</primary></indexterm>
+clause after the string, for example:
+<programlisting>
+U&'d!0061t!+000061' UESCAPE '!'
+</programlisting>
+The escape character can be any single character other than a
+hexadecimal digit, the plus sign, a single quote, a double quote,
+or a whitespace character.
+</para>
+
+<para>
+The Unicode escape syntax works only when the server encoding is
+UTF8. When other server encodings are used, only code points in
+the ASCII range (up to <literal>\007F</literal>) can be
+specified.
+</para>
+
+<para>
+To include the escape character in the string literally, write it
+twice.
+</para>
+</sect3>
+
 <sect3 id="sql-syntax-dollar-quoting">
 <title>Dollar-Quoted String Constants</title>
 
src/backend/catalog/sql_features.txt

Lines changed: 2 additions & 2 deletions
@@ -238,8 +238,8 @@ F381 Extended schema manipulation 02 ALTER TABLE statement: ADD CONSTRAINT claus
 F381 Extended schema manipulation 03 ALTER TABLE statement: DROP CONSTRAINT clause YES
 F382 Alter column data type YES
 F391 Long identifiers YES
-F392 Unicode escapes in identifiers NO
-F393 Unicode escapes in literals NO
+F392 Unicode escapes in identifiers YES
+F393 Unicode escapes in literals YES
 F394 Optional normal form specification NO
 F401 Extended joined table YES
 F401 Extended joined table 01 NATURAL JOIN YES
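
Reviewer's sketch (not part of the commit): the escape rules documented in the syntax.sgml hunks can be restated as a small decoder, which may help when checking the examples in the new paragraphs. This is a minimal Python illustration of the documented behaviour only, not the scanner code this commit adds to the backend. The names `check_escape_char`, `decode_unicode_escapes`, and `_code_point` are hypothetical. The sketch takes the text between the quotes as input, and it does not model the server-encoding restriction or any surrogate handling.

```python
import string

HEX_DIGITS = set(string.hexdigits)  # 0-9, a-f, A-F


def check_escape_char(esc):
    """Reject escape characters the documentation disallows for UESCAPE."""
    if len(esc) != 1 or esc in HEX_DIGITS or esc in "+'\"" or esc.isspace():
        raise ValueError(f"invalid UESCAPE character: {esc!r}")


def _code_point(digits, width):
    """Convert exactly `width` hexadecimal digits to one character."""
    if len(digits) != width or any(d not in HEX_DIGITS for d in digits):
        raise ValueError(f"expected {width} hexadecimal digits, got {digits!r}")
    return chr(int(digits, 16))


def decode_unicode_escapes(body, esc="\\"):
    """Decode the text between the quotes of a U&'...' or U&"..." token.

    Hypothetical helper: `body` excludes the U& prefix and the quotes,
    and `esc` is the character given in an optional UESCAPE clause.
    """
    check_escape_char(esc)
    out = []
    i = 0
    while i < len(body):
        if body[i] != esc:
            out.append(body[i])            # ordinary character
            i += 1
        elif body[i + 1:i + 2] == esc:
            out.append(esc)                # doubled escape character
            i += 2
        elif body[i + 1:i + 2] == "+":
            out.append(_code_point(body[i + 2:i + 8], 6))  # esc + six digits
            i += 8
        else:
            out.append(_code_point(body[i + 1:i + 5], 4))  # esc four digits
            i += 5
    return "".join(out)


if __name__ == "__main__":
    print(decode_unicode_escapes(r"d\0061t\+000061"))
    print(decode_unicode_escapes("d!0061t!+000061", esc="!"))
    print(decode_unicode_escapes(r"\0441\043B\043E\043D"))
```

The three calls at the bottom correspond to the `programlisting` examples in the diff: the two spellings of "data" and the Cyrillic word.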
