Commit 4cbf390

Fix jsonb Unicode escape processing, and in consequence disallow \u0000.

We've been trying to support \u0000 in JSON values since commit 78ed8e0, and have introduced increasingly worse hacks to try to make it work, such as commit 0ad1a81. However, it fundamentally can't work in the way envisioned, because the stored representation looks the same as for \\u0000 which is not the same thing at all. It's also entirely bogus to output \u0000 when de-escaped output is called for.

The right way to do this would be to store an actual 0x00 byte, and then throw error only if asked to produce de-escaped textual output. However, getting to that point seems likely to take considerable work and may well never be practical in the 9.4.x series.

To preserve our options for better behavior while getting rid of the nasty side-effects of 0ad1a81, revert that commit in toto and instead throw error if \u0000 is used in a context where it needs to be de-escaped. (These are the same contexts where non-ASCII Unicode escapes throw error if the database encoding isn't UTF8, so this behavior is by no means without precedent.)

In passing, make both the \u0000 case and the non-ASCII Unicode case report ERRCODE_UNTRANSLATABLE_CHARACTER / "unsupported Unicode escape sequence" rather than claiming there's something wrong with the input syntax.

Back-patch to 9.4, where we have to do something because 0ad1a81 broke things for many cases having nothing to do with \u0000. 9.3 also has bogus behavior, but only for that specific escape value, so given the lack of field complaints it seems better to leave 9.3 alone.
1 parent 70da7ae commit 4cbf390
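
The ambiguity described in the commit message can be illustrated with a toy model. The sketch below is not PostgreSQL code: `old_style_store` is a hypothetical helper that mimics the pre-commit scheme, assuming an escaped backslash is de-escaped to one backslash while `\u0000` is passed through as six literal characters. Under that assumption, the JSON string contents `\u0000` and `\\u0000` end up with identical stored bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy model of the PRE-commit storage scheme (hypothetical, not taken
 * from json.c). An escaped backslash becomes one backslash; the NUL
 * escape is copied through verbatim because a 0x00 byte cannot be
 * stored in text. All other input is copied unchanged.
 */
static void
old_style_store(const char *json_str, char *out, size_t outlen)
{
    size_t      n = 0;
    const char *s = json_str;

    assert(json_str != NULL && out != NULL);

    /* Keep room for the longest single step (6 bytes) plus the terminator. */
    while (*s != '\0' && n + 7 < outlen)
    {
        if (s[0] == '\\' && s[1] == '\\')
        {
            out[n++] = '\\';                /* two backslashes -> one */
            s += 2;
        }
        else if (strncmp(s, "\\u0000", 6) == 0)
        {
            memcpy(out + n, "\\u0000", 6);  /* kept as six literal chars */
            n += 6;
            s += 6;
        }
        else
            out[n++] = *s++;
    }
    out[n] = '\0';
}
```

Feeding it the C literals `"\\u0000"` and `"\\\\u0000"` fills both output buffers with the same six characters, so nothing reading the stored value can tell which input produced it.
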

9 files changed: 245 additions & 120 deletions

doc/src/sgml/json.sgml

Lines changed: 11 additions & 8 deletions
@@ -69,12 +69,14 @@
 regardless of the database encoding, and are checked only for syntactic
 correctness (that is, that four hex digits follow <literal>\u</>).
 However, the input function for <type>jsonb</> is stricter: it disallows
-Unicode escapes for non-ASCII characters (those
-above <literal>U+007F</>) unless the database encoding is UTF8. It also
-insists that any use of Unicode surrogate pairs to designate characters
-outside the Unicode Basic Multilingual Plane be correct. Valid Unicode
-escapes, except for <literal>\u0000</>, are then converted to the
-equivalent ASCII or UTF8 character for storage.
+Unicode escapes for non-ASCII characters (those above <literal>U+007F</>)
+unless the database encoding is UTF8. The <type>jsonb</> type also
+rejects <literal>\u0000</> (because that cannot be represented in
+<productname>PostgreSQL</productname>'s <type>text</> type), and it insists
+that any use of Unicode surrogate pairs to designate characters outside
+the Unicode Basic Multilingual Plane be correct. Valid Unicode escapes
+are converted to the equivalent ASCII or UTF8 character for storage;
+this includes folding surrogate pairs into a single character.
 </para>
 
 <note>
@@ -101,7 +103,7 @@
 constitutes valid <type>jsonb</type> data that do not apply to
 the <type>json</type> type, nor to JSON in the abstract, corresponding
 to limits on what can be represented by the underlying data type.
-Specifically, <type>jsonb</> will reject numbers that are outside the
+Notably, <type>jsonb</> will reject numbers that are outside the
 range of the <productname>PostgreSQL</productname> <type>numeric</> data
 type, while <type>json</> will not. Such implementation-defined
 restrictions are permitted by <acronym>RFC</> 7159. However, in
@@ -134,7 +136,8 @@
 <row>
 <entry><type>string</></entry>
 <entry><type>text</></entry>
-<entry>See notes above concerning encoding restrictions</entry>
+<entry><literal>\u0000</> is disallowed, as are non-ASCII Unicode
+escapes if database encoding is not UTF8</entry>
 </row>
 <row>
 <entry><type>number</></entry>
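
As a rough illustration of the surrogate-pair folding mentioned in the documentation change above, the following sketch combines a high and a low surrogate into one code point and encodes it as UTF-8. `fold_surrogates` and `encode_utf8` are hypothetical names chosen for this example, not functions from the PostgreSQL source; the arithmetic follows the standard UTF-16 and UTF-8 definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Combine a UTF-16 surrogate pair into one code point; 0 means "invalid pair". */
static uint32_t
fold_surrogates(uint32_t hi, uint32_t lo)
{
    if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
        return 0;
    return 0x10000 + ((hi & 0x3FF) << 10) + (lo & 0x3FF);
}

/*
 * Encode one code point as UTF-8 into out (at least 4 bytes).
 * Returns the number of bytes written, or 0 for a lone surrogate
 * or a value beyond U+10FFFF.
 */
static size_t
encode_utf8(uint32_t cp, unsigned char *out)
{
    assert(out != NULL);

    if (cp <= 0x7F)
    {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp <= 0x7FF)
    {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* unpaired surrogate is not a character */
    if (cp <= 0xFFFF)
    {
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF)
    {
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For the pair `\ud83d\ude04` that appears in the regression tests, `fold_surrogates` yields U+1F604, which `encode_utf8` writes as the four bytes F0 9F 98 84. Two high surrogates in a row, as in the `\ud83d\ud83d` test case, are rejected.
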

doc/src/sgml/release-9.4.sgml

Lines changed: 0 additions & 16 deletions
@@ -101,22 +101,6 @@
 </para>
 </listitem>
 
-<listitem>
-<para>
-Unicode escapes in <link linkend="datatype-json"><type>JSON</type></link>
-text values are no longer rendered with the backslash escaped
-(Andrew Dunstan)
-</para>
-
-<para>
-Previously, all backslashes in text values being formed into JSON
-were escaped. Now a backslash followed by <literal>u</> and four
-hexadecimal digits is not escaped, as this is a legal sequence in a
-JSON string value, and escaping the backslash led to some perverse
-results.
-</para>
-</listitem>
-
 <listitem>
 <para>
 When converting values of type <type>date</>, <type>timestamp</>

src/backend/utils/adt/json.c

Lines changed: 10 additions & 30 deletions
@@ -807,14 +807,17 @@ json_lex_string(JsonLexContext *lex)
  * For UTF8, replace the escape sequence by the actual
  * utf8 character in lex->strval. Do this also for other
  * encodings if the escape designates an ASCII character,
- * otherwise raise an error. We don't ever unescape a
- * \u0000, since that would result in an impermissible nul
- * byte.
+ * otherwise raise an error.
  */
 
 if (ch == 0)
 {
-    appendStringInfoString(lex->strval, "\\u0000");
+    /* We can't allow this, since our TEXT type doesn't */
+    ereport(ERROR,
+            (errcode(ERRCODE_UNTRANSLATABLE_CHARACTER),
+             errmsg("unsupported Unicode escape sequence"),
+             errdetail("\\u0000 cannot be converted to text."),
+             report_json_context(lex)));
 }
 else if (GetDatabaseEncoding() == PG_UTF8)
 {
@@ -834,8 +837,8 @@ json_lex_string(JsonLexContext *lex)
 else
 {
     ereport(ERROR,
-            (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
-             errmsg("invalid input syntax for type json"),
+            (errcode(ERRCODE_UNTRANSLATABLE_CHARACTER),
+             errmsg("unsupported Unicode escape sequence"),
              errdetail("Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8."),
              report_json_context(lex)));
 }
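
The order of checks that the two hunks above leave in `json_lex_string` can be summarized in a small standalone function. This is a sketch only: `classify_unicode_escape`, the `EscapeVerdict` enum, and the `server_is_utf8` flag are hypothetical stand-ins for the real `ereport` calls and the `GetDatabaseEncoding() == PG_UTF8` test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum
{
    ESCAPE_OK,                      /* can be de-escaped */
    ESCAPE_UNSUPPORTED_NUL,         /* U+0000: cannot be converted to text */
    ESCAPE_UNSUPPORTED_NON_ASCII    /* above 007F, server encoding not UTF8 */
} EscapeVerdict;

/*
 * Mirror the branch order shown in the diff: the NUL check comes first,
 * then the UTF8 case, then the ASCII-only fallback for other encodings.
 */
static EscapeVerdict
classify_unicode_escape(uint32_t ch, bool server_is_utf8)
{
    assert(ch <= 0x10FFFF);     /* caller passes an already decoded code point */

    if (ch == 0)
        return ESCAPE_UNSUPPORTED_NUL;
    if (server_is_utf8)
        return ESCAPE_OK;
    if (ch <= 0x007F)
        return ESCAPE_OK;
    return ESCAPE_UNSUPPORTED_NON_ASCII;
}
```

Both failing verdicts correspond to the same error code and message in the commit ("unsupported Unicode escape sequence"); only the detail text differs. Note that U+0000 fails regardless of the server encoding.
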
@@ -2374,30 +2377,7 @@ escape_json(StringInfo buf, const char *str)
     appendStringInfoString(buf, "\\\"");
     break;
 case '\\':
-
-    /*
-     * Unicode escapes are passed through as is. There is no
-     * requirement that they denote a valid character in the
-     * server encoding - indeed that is a big part of their
-     * usefulness.
-     *
-     * All we require is that they consist of \uXXXX where the Xs
-     * are hexadecimal digits. It is the responsibility of the
-     * caller of, say, to_json() to make sure that the unicode
-     * escape is valid.
-     *
-     * In the case of a jsonb string value being escaped, the only
-     * unicode escape that should be present is \u0000, all the
-     * other unicode escapes will have been resolved.
-     */
-    if (p[1] == 'u' &&
-        isxdigit((unsigned char) p[2]) &&
-        isxdigit((unsigned char) p[3]) &&
-        isxdigit((unsigned char) p[4]) &&
-        isxdigit((unsigned char) p[5]))
-        appendStringInfoCharMacro(buf, *p);
-    else
-        appendStringInfoString(buf, "\\\\");
+    appendStringInfoString(buf, "\\\\");
     break;
 default:
     if ((unsigned char) *p < ' ')
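
With the special case reverted, every backslash in a text value is doubled on output again. A minimal standalone sketch of that rule follows; `escape_json_sketch` is a hypothetical simplification that handles only the quote and backslash cases and writes into a caller-supplied buffer, whereas the real `escape_json` appends to a `StringInfo` and also escapes control characters.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Quote str as a JSON string, escaping '"' and doubling every '\'.
 * Returns the length written (excluding the terminator), or 0 if the
 * buffer is too small. Control characters are out of scope here.
 */
static size_t
escape_json_sketch(const char *str, char *out, size_t outlen)
{
    size_t      n = 0;
    const char *p;

    assert(str != NULL && out != NULL);

    if (outlen < 3)
        return 0;
    out[n++] = '"';
    for (p = str; *p != '\0'; p++)
    {
        /* worst case: two bytes now, plus closing quote and terminator */
        if (n + 4 > outlen)
            return 0;
        if (*p == '"' || *p == '\\')
            out[n++] = '\\';
        out[n++] = *p;
    }
    out[n++] = '"';
    out[n] = '\0';
    return n;
}
```

Given the six-character text value `\uabcd`, the sketch produces `"\\uabcd"`: the sequence is no longer passed through as if it were a Unicode escape, which matches the removal of the `to_json(text '\uabcd')` expectations from the regression tests.
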

src/test/regress/expected/json.out

Lines changed: 42 additions & 16 deletions
@@ -426,20 +426,6 @@ select to_json(timestamptz '2014-05-28 12:22:35.614298-04');
 (1 row)
 
 COMMIT;
--- unicode escape - backslash is not escaped
-select to_json(text '\uabcd');
-to_json
-----------
-"\uabcd"
-(1 row)
-
--- any other backslash is escaped
-select to_json(text '\abcd');
-to_json
-----------
-"\\abcd"
-(1 row)
-
 --json_agg
 SELECT json_agg(q)
 FROM ( SELECT $$a$$ || x AS b, y AS c,
@@ -1400,6 +1386,36 @@ ERROR: invalid input syntax for type json
 DETAIL: Unicode low surrogate must follow a high surrogate.
 CONTEXT: JSON data, line 1: { "a":...
 --handling of simple unicode escapes
+select json '{ "a": "the Copyright \u00a9 sign" }' as correct_in_utf8;
+correct_in_utf8
+---------------------------------------
+{ "a": "the Copyright \u00a9 sign" }
+(1 row)
+
+select json '{ "a": "dollar \u0024 character" }' as correct_everywhere;
+correct_everywhere
+-------------------------------------
+{ "a": "dollar \u0024 character" }
+(1 row)
+
+select json '{ "a": "dollar \\u0024 character" }' as not_an_escape;
+not_an_escape
+--------------------------------------
+{ "a": "dollar \\u0024 character" }
+(1 row)
+
+select json '{ "a": "null \u0000 escape" }' as not_unescaped;
+not_unescaped
+--------------------------------
+{ "a": "null \u0000 escape" }
+(1 row)
+
+select json '{ "a": "null \\u0000 escape" }' as not_an_escape;
+not_an_escape
+---------------------------------
+{ "a": "null \\u0000 escape" }
+(1 row)
+
 select json '{ "a": "the Copyright \u00a9 sign" }' ->> 'a' as correct_in_utf8;
 correct_in_utf8
 ----------------------
@@ -1412,8 +1428,18 @@ select json '{ "a": "dollar \u0024 character" }' ->> 'a' as correct_everywhere;
 dollar $ character
 (1 row)
 
-select json '{ "a": "null \u0000 escape" }' ->> 'a' as not_unescaped;
-not_unescaped
+select json '{ "a": "dollar \\u0024 character" }' ->> 'a' as not_an_escape;
+not_an_escape
+-------------------------
+dollar \u0024 character
+(1 row)
+
+select json '{ "a": "null \u0000 escape" }' ->> 'a' as fails;
+ERROR: unsupported Unicode escape sequence
+DETAIL: \u0000 cannot be converted to text.
+CONTEXT: JSON data, line 1: { "a":...
+select json '{ "a": "null \\u0000 escape" }' ->> 'a' as not_an_escape;
+not_an_escape
 --------------------
 null \u0000 escape
 (1 row)

src/test/regress/expected/json_1.out

Lines changed: 44 additions & 18 deletions
@@ -426,20 +426,6 @@ select to_json(timestamptz '2014-05-28 12:22:35.614298-04');
 (1 row)
 
 COMMIT;
--- unicode escape - backslash is not escaped
-select to_json(text '\uabcd');
-to_json
-----------
-"\uabcd"
-(1 row)
-
--- any other backslash is escaped
-select to_json(text '\abcd');
-to_json
-----------
-"\\abcd"
-(1 row)
-
 --json_agg
 SELECT json_agg(q)
 FROM ( SELECT $$a$$ || x AS b, y AS c,
@@ -1378,7 +1364,7 @@ select * from json_populate_recordset(row('def',99,null)::jpop,'[{"a":[100,200,3
 
 -- handling of unicode surrogate pairs
 select json '{ "a": "\ud83d\ude04\ud83d\udc36" }' -> 'a' as correct_in_utf8;
-ERROR: invalid input syntax for type json
+ERROR: unsupported Unicode escape sequence
 DETAIL: Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8.
 CONTEXT: JSON data, line 1: { "a":...
 select json '{ "a": "\ud83d\ud83d" }' -> 'a'; -- 2 high surrogates in a row
@@ -1398,8 +1384,38 @@ ERROR: invalid input syntax for type json
 DETAIL: Unicode low surrogate must follow a high surrogate.
 CONTEXT: JSON data, line 1: { "a":...
 --handling of simple unicode escapes
+select json '{ "a": "the Copyright \u00a9 sign" }' as correct_in_utf8;
+correct_in_utf8
+---------------------------------------
+{ "a": "the Copyright \u00a9 sign" }
+(1 row)
+
+select json '{ "a": "dollar \u0024 character" }' as correct_everywhere;
+correct_everywhere
+-------------------------------------
+{ "a": "dollar \u0024 character" }
+(1 row)
+
+select json '{ "a": "dollar \\u0024 character" }' as not_an_escape;
+not_an_escape
+--------------------------------------
+{ "a": "dollar \\u0024 character" }
+(1 row)
+
+select json '{ "a": "null \u0000 escape" }' as not_unescaped;
+not_unescaped
+--------------------------------
+{ "a": "null \u0000 escape" }
+(1 row)
+
+select json '{ "a": "null \\u0000 escape" }' as not_an_escape;
+not_an_escape
+---------------------------------
+{ "a": "null \\u0000 escape" }
+(1 row)
+
 select json '{ "a": "the Copyright \u00a9 sign" }' ->> 'a' as correct_in_utf8;
-ERROR: invalid input syntax for type json
+ERROR: unsupported Unicode escape sequence
 DETAIL: Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8.
 CONTEXT: JSON data, line 1: { "a":...
 select json '{ "a": "dollar \u0024 character" }' ->> 'a' as correct_everywhere;
@@ -1408,8 +1424,18 @@ select json '{ "a": "dollar \u0024 character" }' ->> 'a' as correct_everywhere;
 dollar $ character
 (1 row)
 
-select json '{ "a": "null \u0000 escape" }' ->> 'a' as not_unescaped;
-not_unescaped
+select json '{ "a": "dollar \\u0024 character" }' ->> 'a' as not_an_escape;
+not_an_escape
+-------------------------
+dollar \u0024 character
+(1 row)
+
+select json '{ "a": "null \u0000 escape" }' ->> 'a' as fails;
+ERROR: unsupported Unicode escape sequence
+DETAIL: \u0000 cannot be converted to text.
+CONTEXT: JSON data, line 1: { "a":...
+select json '{ "a": "null \\u0000 escape" }' ->> 'a' as not_an_escape;
+not_an_escape
 --------------------
 null \u0000 escape
 (1 row)

src/test/regress/expected/jsonb.out

Lines changed: 56 additions & 8 deletions
@@ -60,12 +60,18 @@ LINE 1: SELECT '"\u000g"'::jsonb;
 ^
 DETAIL: "\u" must be followed by four hexadecimal digits.
 CONTEXT: JSON data, line 1: "\u000g...
-SELECT '"\u0000"'::jsonb; -- OK, legal escape
-jsonb
-----------
-"\u0000"
+SELECT '"\u0045"'::jsonb; -- OK, legal escape
+jsonb
+-------
+"E"
 (1 row)
 
+SELECT '"\u0000"'::jsonb; -- ERROR, we don't support U+0000
+ERROR: unsupported Unicode escape sequence
+LINE 1: SELECT '"\u0000"'::jsonb;
+^
+DETAIL: \u0000 cannot be converted to text.
+CONTEXT: JSON data, line 1: ...
 -- use octet_length here so we don't get an odd unicode char in the
 -- output
 SELECT octet_length('"\uaBcD"'::jsonb::text); -- OK, uppercase and lower case both OK
@@ -1798,20 +1804,62 @@ LINE 1: SELECT jsonb '{ "a": "\ude04X" }' -> 'a';
 DETAIL: Unicode low surrogate must follow a high surrogate.
 CONTEXT: JSON data, line 1: { "a":...
 -- handling of simple unicode escapes
-SELECT jsonb '{ "a": "the Copyright \u00a9 sign" }' ->> 'a' AS correct_in_utf8;
+SELECT jsonb '{ "a": "the Copyright \u00a9 sign" }' as correct_in_utf8;
+correct_in_utf8
+-------------------------------
+{"a": "the Copyright © sign"}
+(1 row)
+
+SELECT jsonb '{ "a": "dollar \u0024 character" }' as correct_everywhere;
+correct_everywhere
+-----------------------------
+{"a": "dollar $ character"}
+(1 row)
+
+SELECT jsonb '{ "a": "dollar \\u0024 character" }' as not_an_escape;
+not_an_escape
+-----------------------------------
+{"a": "dollar \\u0024 character"}
+(1 row)
+
+SELECT jsonb '{ "a": "null \u0000 escape" }' as fails;
+ERROR: unsupported Unicode escape sequence
+LINE 1: SELECT jsonb '{ "a": "null \u0000 escape" }' as fails;
+^
+DETAIL: \u0000 cannot be converted to text.
+CONTEXT: JSON data, line 1: { "a":...
+SELECT jsonb '{ "a": "null \\u0000 escape" }' as not_an_escape;
+not_an_escape
+------------------------------
+{"a": "null \\u0000 escape"}
+(1 row)
+
+SELECT jsonb '{ "a": "the Copyright \u00a9 sign" }' ->> 'a' as correct_in_utf8;
 correct_in_utf8
 ----------------------
 the Copyright © sign
 (1 row)
 
-SELECT jsonb '{ "a": "dollar \u0024 character" }' ->> 'a' AS correct_everyWHERE;
+SELECT jsonb '{ "a": "dollar \u0024 character" }' ->> 'a' as correct_everywhere;
 correct_everywhere
 --------------------
 dollar $ character
 (1 row)
 
-SELECT jsonb '{ "a": "null \u0000 escape" }' ->> 'a' AS not_unescaped;
-not_unescaped
+SELECT jsonb '{ "a": "dollar \\u0024 character" }' ->> 'a' as not_an_escape;
+not_an_escape
+-------------------------
+dollar \u0024 character
+(1 row)
+
+SELECT jsonb '{ "a": "null \u0000 escape" }' ->> 'a' as fails;
+ERROR: unsupported Unicode escape sequence
+LINE 1: SELECT jsonb '{ "a": "null \u0000 escape" }' ->> 'a' as fai...
+^
+DETAIL: \u0000 cannot be converted to text.
+CONTEXT: JSON data, line 1: { "a":...
+SELECT jsonb '{ "a": "null \\u0000 escape" }' ->> 'a' as not_an_escape;
+not_an_escape
 --------------------
 null \u0000 escape
 (1 row)
