
Commit 9b0ce89

Fix backslash-escaping multibyte chars in COPY FROM.
If a multi-byte character is escaped with a backslash in TEXT mode input, and the encoding is one of the client-only encodings where the bytes after the first one can have an ASCII byte "embedded" in the char, we didn't skip the character correctly. After a backslash, we only skipped the first byte of the next character, so if it was a multi-byte character, we would try to process its second byte as if it was a separate character. If it was one of the characters with special meaning, like '\n', '\r', or another '\\', that would cause trouble.

One such example is the byte sequence '\x5ca45c2e666f6f' in Big5 encoding. That's supposed to be [backslash][two-byte character][.][f][o][o], but because the second byte of the two-byte character is 0x5c, we incorrectly treat it as another backslash. And because the next character is a dot, we parse it as an end-of-copy marker, and throw an "end-of-copy marker corrupt" error.

Backpatch to all supported versions.

Reviewed-by: John Naylor, Kyotaro Horiguchi
Discussion: https://www.postgresql.org/message-id/a897f84f-8dca-8798-3139-07da5bb38728%40iki.fi
1 parent 77e760d commit 9b0ce89


1 file changed, +9 -1 lines changed


src/backend/commands/copy.c

Lines changed: 9 additions & 1 deletion
@@ -4281,7 +4281,7 @@ CopyReadLineText(CopyState cstate)
                     break;
                 }
                 else if (!cstate->csv_mode)
-
+                {
                     /*
                      * If we are here, it means we found a backslash followed by
                      * something other than a period. In non-CSV mode, anything
@@ -4292,8 +4292,16 @@ CopyReadLineText(CopyState cstate)
                      * backslashes are not special, so we want to process the
                      * character after the backslash just like a normal character,
                      * so we don't increment in those cases.
+                     *
+                     * Set 'c' to skip whole character correctly in multi-byte
+                     * encodings. If we don't have the whole character in the
+                     * buffer yet, we might loop back to process it, after all,
+                     * but that's OK because multi-byte characters cannot have any
+                     * special meaning.
                      */
                     raw_buf_ptr++;
+                    c = c2;
+                }
             }
 
             /*
