
Commit 451d280

Fix jsonb Unicode escape processing, and in consequence disallow \u0000.

We've been trying to support \u0000 in JSON values since commit 78ed8e0, and have introduced increasingly worse hacks to try to make it work, such as commit 0ad1a81. However, it fundamentally can't work in the way envisioned, because the stored representation looks the same as for \\u0000 which is not the same thing at all. It's also entirely bogus to output \u0000 when de-escaped output is called for.

The right way to do this would be to store an actual 0x00 byte, and then throw error only if asked to produce de-escaped textual output. However, getting to that point seems likely to take considerable work and may well never be practical in the 9.4.x series.

To preserve our options for better behavior while getting rid of the nasty side-effects of 0ad1a81, revert that commit in toto and instead throw error if \u0000 is used in a context where it needs to be de-escaped. (These are the same contexts where non-ASCII Unicode escapes throw error if the database encoding isn't UTF8, so this behavior is by no means without precedent.)

In passing, make both the \u0000 case and the non-ASCII Unicode case report ERRCODE_UNTRANSLATABLE_CHARACTER / "unsupported Unicode escape sequence" rather than claiming there's something wrong with the input syntax.

Back-patch to 9.4, where we have to do something because 0ad1a81 broke things for many cases having nothing to do with \u0000. 9.3 also has bogus behavior, but only for that specific escape value, so given the lack of field complaints it seems better to leave 9.3 alone.
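To make the ambiguity concrete, here is a minimal C sketch of a toy de-escaper. It is not PostgreSQL code: `toy_deescape_old` is a hypothetical helper that handles only `\\` and ASCII `\uXXXX` escapes, and it mimics the pre-fix hack of leaving `\u0000` in place as six literal characters. Under those assumptions, the inputs `\u0000` and `\\u0000` end up with identical stored bytes.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical toy de-escaper (an illustration, not PostgreSQL's lexer).
 * It resolves "\\" to one backslash and "\uXXXX" to an ASCII character,
 * except that \u0000 is kept verbatim, as the reverted hack did.
 * Any other backslash sequence is copied through unchanged.
 * Returns 0 on success, -1 on unsupported input or a too-small buffer.
 */
static int
toy_deescape_old(const char *in, char *out, size_t outsize)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    if (outsize == 0)
        return -1;

    while (*in != '\0')
    {
        if (in[0] == '\\' && in[1] == '\\')
        {
            /* an escaped backslash is stored as a single backslash */
            if (n + 1 >= outsize)
                return -1;
            out[n++] = '\\';
            in += 2;
        }
        else if (in[0] == '\\' && in[1] == 'u')
        {
            char hex[5];
            long cp;
            int  i;

            /* require exactly four hex digits after \u */
            for (i = 0; i < 4; i++)
            {
                if (!isxdigit((unsigned char) in[2 + i]))
                    return -1;
                hex[i] = in[2 + i];
            }
            hex[4] = '\0';
            cp = strtol(hex, NULL, 16);
            if (cp > 0x7F)
                return -1;      /* the toy handles ASCII only */

            if (cp == 0)
            {
                /* the old hack: keep the six escape characters as text */
                if (n + 6 >= outsize)
                    return -1;
                memcpy(out + n, "\\u0000", 6);
                n += 6;
            }
            else
            {
                if (n + 1 >= outsize)
                    return -1;
                out[n++] = (char) cp;
            }
            in += 6;
        }
        else
        {
            /* ordinary character: copy as-is */
            if (n + 1 >= outsize)
                return -1;
            out[n++] = *in++;
        }
    }
    out[n] = '\0';
    return 0;
}
```

Feeding this helper `null \u0000 escape` and `null \\u0000 escape` leaves the same bytes in the output buffer both times, so nothing downstream can tell an escaped NUL from a literal backslash followed by `u0000`.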
1 parent e40d43f commit 451d280

9 files changed, +250 -160 lines changed

doc/src/sgml/json.sgml

+11 -8
@@ -69,12 +69,14 @@
  regardless of the database encoding, and are checked only for syntactic
  correctness (that is, that four hex digits follow <literal>\u</>).
  However, the input function for <type>jsonb</> is stricter: it disallows
- Unicode escapes for non-ASCII characters (those
- above <literal>U+007F</>) unless the database encoding is UTF8. It also
- insists that any use of Unicode surrogate pairs to designate characters
- outside the Unicode Basic Multilingual Plane be correct. Valid Unicode
- escapes, except for <literal>\u0000</>, are then converted to the
- equivalent ASCII or UTF8 character for storage.
+ Unicode escapes for non-ASCII characters (those above <literal>U+007F</>)
+ unless the database encoding is UTF8. The <type>jsonb</> type also
+ rejects <literal>\u0000</> (because that cannot be represented in
+ <productname>PostgreSQL</productname>'s <type>text</> type), and it insists
+ that any use of Unicode surrogate pairs to designate characters outside
+ the Unicode Basic Multilingual Plane be correct. Valid Unicode escapes
+ are converted to the equivalent ASCII or UTF8 character for storage;
+ this includes folding surrogate pairs into a single character.
  </para>
 
  <note>
@@ -101,7 +103,7 @@
  constitutes valid <type>jsonb</type> data that do not apply to
  the <type>json</type> type, nor to JSON in the abstract, corresponding
  to limits on what can be represented by the underlying data type.
- Specifically, <type>jsonb</> will reject numbers that are outside the
+ Notably, <type>jsonb</> will reject numbers that are outside the
  range of the <productname>PostgreSQL</productname> <type>numeric</> data
  type, while <type>json</> will not. Such implementation-defined
  restrictions are permitted by <acronym>RFC</> 7159. However, in
@@ -134,7 +136,8 @@
  <row>
  <entry><type>string</></entry>
  <entry><type>text</></entry>
- <entry>See notes above concerning encoding restrictions</entry>
+ <entry><literal>\u0000</> is disallowed, as are non-ASCII Unicode
+ escapes if database encoding is not UTF8</entry>
  </row>
  <row>
  <entry><type>number</></entry>
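The "folding surrogate pairs into a single character" mentioned in the new documentation text follows standard UTF-16 arithmetic. A minimal sketch in C, where `toy_fold_surrogates` is a hypothetical helper for illustration and not a PostgreSQL function:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: combine a UTF-16 high/low surrogate pair into one
 * Unicode code point. Returns -1 if the two values do not form a valid
 * pair (for example two high surrogates in a row).
 */
static int32_t
toy_fold_surrogates(uint32_t hi, uint32_t lo)
{
    uint32_t cp;

    if (hi < 0xD800 || hi > 0xDBFF)
        return -1;              /* first unit is not a high surrogate */
    if (lo < 0xDC00 || lo > 0xDFFF)
        return -1;              /* second unit is not a low surrogate */

    /* ten payload bits from each half, offset past the BMP */
    cp = 0x10000 + ((hi & 0x3FF) << 10) + (lo & 0x3FF);
    assert(cp <= 0x10FFFF);
    return (int32_t) cp;
}
```

For the pair `\ud83d\ude04` used in the regression tests this gives U+1F604, while two high surrogates in a row such as `\ud83d\ud83d` are rejected.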

doc/src/sgml/release-9.4.sgml

-16
@@ -101,22 +101,6 @@
  </para>
  </listitem>
 
- <listitem>
- <para>
- Unicode escapes in <link linkend="datatype-json"><type>JSON</type></link>
- text values are no longer rendered with the backslash escaped
- (Andrew Dunstan)
- </para>
-
- <para>
- Previously, all backslashes in text values being formed into JSON
- were escaped. Now a backslash followed by <literal>u</> and four
- hexadecimal digits is not escaped, as this is a legal sequence in a
- JSON string value, and escaping the backslash led to some perverse
- results.
- </para>
- </listitem>
-
  <listitem>
  <para>
  When converting values of type <type>date</>, <type>timestamp</>

src/backend/utils/adt/json.c

+15 -34
@@ -806,14 +806,17 @@ json_lex_string(JsonLexContext *lex)
   * For UTF8, replace the escape sequence by the actual
   * utf8 character in lex->strval. Do this also for other
   * encodings if the escape designates an ASCII character,
-  * otherwise raise an error. We don't ever unescape a
-  * \u0000, since that would result in an impermissible nul
-  * byte.
+  * otherwise raise an error.
   */
 
  if (ch == 0)
  {
-     appendStringInfoString(lex->strval, "\\u0000");
+     /* We can't allow this, since our TEXT type doesn't */
+     ereport(ERROR,
+             (errcode(ERRCODE_UNTRANSLATABLE_CHARACTER),
+              errmsg("unsupported Unicode escape sequence"),
+              errdetail("\\u0000 cannot be converted to text."),
+              report_json_context(lex)));
  }
  else if (GetDatabaseEncoding() == PG_UTF8)
  {
@@ -833,8 +836,8 @@ json_lex_string(JsonLexContext *lex)
  else
  {
      ereport(ERROR,
-             (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
-              errmsg("invalid input syntax for type json"),
+             (errcode(ERRCODE_UNTRANSLATABLE_CHARACTER),
+              errmsg("unsupported Unicode escape sequence"),
               errdetail("Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8."),
               report_json_context(lex)));
  }
@@ -1284,8 +1287,8 @@ json_categorize_type(Oid typoid,
 
  /*
   * We need to get the output function for everything except date and
-  * timestamp types, array and composite types, booleans,
-  * and non-builtin types where there's a cast to json.
+  * timestamp types, array and composite types, booleans, and non-builtin
+  * types where there's a cast to json.
   */
 
  switch (typoid)
@@ -1335,11 +1338,12 @@ json_categorize_type(Oid typoid,
  /* but let's look for a cast to json, if it's not built-in */
  if (typoid >= FirstNormalObjectId)
  {
-     Oid castfunc;
+     Oid castfunc;
      CoercionPathType ctype;
 
      ctype = find_coercion_pathway(JSONOID, typoid,
-                                   COERCION_EXPLICIT, &castfunc);
+                                   COERCION_EXPLICIT,
+                                   &castfunc);
      if (ctype == COERCION_PATH_FUNC && OidIsValid(castfunc))
      {
          *tcategory = JSONTYPE_CAST;
@@ -2382,30 +2386,7 @@ escape_json(StringInfo buf, const char *str)
          appendStringInfoString(buf, "\\\"");
          break;
      case '\\':
-
-         /*
-          * Unicode escapes are passed through as is. There is no
-          * requirement that they denote a valid character in the
-          * server encoding - indeed that is a big part of their
-          * usefulness.
-          *
-          * All we require is that they consist of \uXXXX where the Xs
-          * are hexadecimal digits. It is the responsibility of the
-          * caller of, say, to_json() to make sure that the unicode
-          * escape is valid.
-          *
-          * In the case of a jsonb string value being escaped, the only
-          * unicode escape that should be present is \u0000, all the
-          * other unicode escapes will have been resolved.
-          */
-         if (p[1] == 'u' &&
-             isxdigit((unsigned char) p[2]) &&
-             isxdigit((unsigned char) p[3]) &&
-             isxdigit((unsigned char) p[4]) &&
-             isxdigit((unsigned char) p[5]))
-             appendStringInfoCharMacro(buf, *p);
-         else
-             appendStringInfoString(buf, "\\\\");
+         appendStringInfoString(buf, "\\\\");
          break;
      default:
          if ((unsigned char) *p < ' ')
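With the special case removed from escape_json, every backslash in the input is doubled again, whether or not it is followed by `u` and four hex digits. A minimal C sketch of that rule, assuming a plain caller-supplied buffer instead of a StringInfo; `toy_escape_json` is a hypothetical stand-in that covers only the quote and backslash cases and omits the control-character handling of the real function:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for the '"' and '\\' cases of escape_json after
 * this commit. Writes a double-quoted JSON string into out.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int
toy_escape_json(const char *str, char *out, size_t outsize)
{
    size_t      n = 0;
    const char *p;

    assert(str != NULL && out != NULL);
    if (outsize < 3)
        return -1;              /* need two quotes plus the terminator */

    out[n++] = '"';
    for (p = str; *p != '\0'; p++)
    {
        size_t need = (*p == '"' || *p == '\\') ? 2 : 1;

        /* keep room for the closing quote and the terminating NUL */
        if (n + need + 2 > outsize)
            return -1;
        if (need == 2)
            out[n++] = '\\';    /* unconditional: no \uXXXX pass-through */
        out[n++] = *p;
    }
    out[n++] = '"';
    out[n] = '\0';
    return 0;
}
```

Under this rule the text `\uabcd` is rendered as `"\\uabcd"`, the same way `\abcd` is rendered as `"\\abcd"`.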

src/test/regress/expected/json.out

+42 -16
@@ -426,20 +426,6 @@ select to_json(timestamptz '2014-05-28 12:22:35.614298-04');
  (1 row)
 
  COMMIT;
- -- unicode escape - backslash is not escaped
- select to_json(text '\uabcd');
- to_json
- ----------
- "\uabcd"
- (1 row)
-
- -- any other backslash is escaped
- select to_json(text '\abcd');
- to_json
- ----------
- "\\abcd"
- (1 row)
-
  --json_agg
  SELECT json_agg(q)
  FROM ( SELECT $$a$$ || x AS b, y AS c,
@@ -1400,6 +1386,36 @@ ERROR: invalid input syntax for type json
  DETAIL: Unicode low surrogate must follow a high surrogate.
  CONTEXT: JSON data, line 1: { "a":...
  --handling of simple unicode escapes
+ select json '{ "a": "the Copyright \u00a9 sign" }' as correct_in_utf8;
+ correct_in_utf8
+ ---------------------------------------
+ { "a": "the Copyright \u00a9 sign" }
+ (1 row)
+
+ select json '{ "a": "dollar \u0024 character" }' as correct_everywhere;
+ correct_everywhere
+ -------------------------------------
+ { "a": "dollar \u0024 character" }
+ (1 row)
+
+ select json '{ "a": "dollar \\u0024 character" }' as not_an_escape;
+ not_an_escape
+ --------------------------------------
+ { "a": "dollar \\u0024 character" }
+ (1 row)
+
+ select json '{ "a": "null \u0000 escape" }' as not_unescaped;
+ not_unescaped
+ --------------------------------
+ { "a": "null \u0000 escape" }
+ (1 row)
+
+ select json '{ "a": "null \\u0000 escape" }' as not_an_escape;
+ not_an_escape
+ ---------------------------------
+ { "a": "null \\u0000 escape" }
+ (1 row)
+
  select json '{ "a": "the Copyright \u00a9 sign" }' ->> 'a' as correct_in_utf8;
  correct_in_utf8
  ----------------------
@@ -1412,8 +1428,18 @@ select json '{ "a": "dollar \u0024 character" }' ->> 'a' as correct_everywhere;
  dollar $ character
  (1 row)
 
- select json '{ "a": "null \u0000 escape" }' ->> 'a' as not_unescaped;
- not_unescaped
+ select json '{ "a": "dollar \\u0024 character" }' ->> 'a' as not_an_escape;
+ not_an_escape
+ -------------------------
+ dollar \u0024 character
+ (1 row)
+
+ select json '{ "a": "null \u0000 escape" }' ->> 'a' as fails;
+ ERROR: unsupported Unicode escape sequence
+ DETAIL: \u0000 cannot be converted to text.
+ CONTEXT: JSON data, line 1: { "a":...
+ select json '{ "a": "null \\u0000 escape" }' ->> 'a' as not_an_escape;
+ not_an_escape
  --------------------
  null \u0000 escape
  (1 row)

src/test/regress/expected/json_1.out

+44 -18
@@ -426,20 +426,6 @@ select to_json(timestamptz '2014-05-28 12:22:35.614298-04');
  (1 row)
 
  COMMIT;
- -- unicode escape - backslash is not escaped
- select to_json(text '\uabcd');
- to_json
- ----------
- "\uabcd"
- (1 row)
-
- -- any other backslash is escaped
- select to_json(text '\abcd');
- to_json
- ----------
- "\\abcd"
- (1 row)
-
  --json_agg
  SELECT json_agg(q)
  FROM ( SELECT $$a$$ || x AS b, y AS c,
@@ -1378,7 +1364,7 @@ select * from json_populate_recordset(row('def',99,null)::jpop,'[{"a":[100,200,3
 
  -- handling of unicode surrogate pairs
  select json '{ "a": "\ud83d\ude04\ud83d\udc36" }' -> 'a' as correct_in_utf8;
- ERROR: invalid input syntax for type json
+ ERROR: unsupported Unicode escape sequence
  DETAIL: Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8.
  CONTEXT: JSON data, line 1: { "a":...
  select json '{ "a": "\ud83d\ud83d" }' -> 'a'; -- 2 high surrogates in a row
@@ -1398,8 +1384,38 @@ ERROR: invalid input syntax for type json
  DETAIL: Unicode low surrogate must follow a high surrogate.
  CONTEXT: JSON data, line 1: { "a":...
  --handling of simple unicode escapes
+ select json '{ "a": "the Copyright \u00a9 sign" }' as correct_in_utf8;
+ correct_in_utf8
+ ---------------------------------------
+ { "a": "the Copyright \u00a9 sign" }
+ (1 row)
+
+ select json '{ "a": "dollar \u0024 character" }' as correct_everywhere;
+ correct_everywhere
+ -------------------------------------
+ { "a": "dollar \u0024 character" }
+ (1 row)
+
+ select json '{ "a": "dollar \\u0024 character" }' as not_an_escape;
+ not_an_escape
+ --------------------------------------
+ { "a": "dollar \\u0024 character" }
+ (1 row)
+
+ select json '{ "a": "null \u0000 escape" }' as not_unescaped;
+ not_unescaped
+ --------------------------------
+ { "a": "null \u0000 escape" }
+ (1 row)
+
+ select json '{ "a": "null \\u0000 escape" }' as not_an_escape;
+ not_an_escape
+ ---------------------------------
+ { "a": "null \\u0000 escape" }
+ (1 row)
+
  select json '{ "a": "the Copyright \u00a9 sign" }' ->> 'a' as correct_in_utf8;
- ERROR: invalid input syntax for type json
+ ERROR: unsupported Unicode escape sequence
  DETAIL: Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8.
  CONTEXT: JSON data, line 1: { "a":...
  select json '{ "a": "dollar \u0024 character" }' ->> 'a' as correct_everywhere;
@@ -1408,8 +1424,18 @@ select json '{ "a": "dollar \u0024 character" }' ->> 'a' as correct_everywhere;
  dollar $ character
  (1 row)
 
- select json '{ "a": "null \u0000 escape" }' ->> 'a' as not_unescaped;
- not_unescaped
+ select json '{ "a": "dollar \\u0024 character" }' ->> 'a' as not_an_escape;
+ not_an_escape
+ -------------------------
+ dollar \u0024 character
+ (1 row)
+
+ select json '{ "a": "null \u0000 escape" }' ->> 'a' as fails;
+ ERROR: unsupported Unicode escape sequence
+ DETAIL: \u0000 cannot be converted to text.
+ CONTEXT: JSON data, line 1: { "a":...
+ select json '{ "a": "null \\u0000 escape" }' ->> 'a' as not_an_escape;
+ not_an_escape
  --------------------
  null \u0000 escape
  (1 row)
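
Taken together, the two expected-output files (json_1.out being the one whose errors mention a non-UTF8 server encoding) trace a four-way decision on each `\uXXXX` escape that has to be de-escaped. A minimal C sketch of that decision, assuming the ASCII branch sits in the unchanged code between the two json_lex_string hunks; the enum and `toy_classify_escape` are hypothetical names, not PostgreSQL identifiers:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome labels for a single \uXXXX escape. */
typedef enum
{
    TOY_ESCAPE_CONVERT_UTF8,    /* replace by the UTF8 character */
    TOY_ESCAPE_APPEND_ASCII,    /* replace by the ASCII character */
    TOY_ESCAPE_ERROR_NUL,       /* "\u0000 cannot be converted to text." */
    TOY_ESCAPE_ERROR_NON_ASCII  /* above 007F and encoding is not UTF8 */
} ToyEscapeAction;

/*
 * Decide what to do with code point ch when de-escaped text is needed.
 * The order of the checks mirrors the if / else-if chain in the diff:
 * the NUL test comes first, so it fails even in a UTF8 database.
 */
static ToyEscapeAction
toy_classify_escape(unsigned int ch, bool db_is_utf8)
{
    if (ch == 0)
        return TOY_ESCAPE_ERROR_NUL;
    if (db_is_utf8)
        return TOY_ESCAPE_CONVERT_UTF8;
    if (ch <= 0x007F)
        return TOY_ESCAPE_APPEND_ASCII;
    return TOY_ESCAPE_ERROR_NON_ASCII;
}
```

In terms of the test labels, `\u00a9` is `correct_in_utf8` and an error otherwise, `\u0024` is `correct_everywhere`, and `\u0000` is the `fails` case under either encoding. Both error outcomes correspond to ERRCODE_UNTRANSLATABLE_CHARACTER in the patched code.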
