From d3933e0b6c6885a762ae2716f37bda595d4e9a33 Mon Sep 17 00:00:00 2001
From: Alex Dowad
Date: Mon, 14 Nov 2022 11:02:21 +0200
Subject: [PATCH] Fix regression test for GH-9535 on PHP-8.2+

Some of the legacy text encodings which were used in this regression test
are deprecated in PHP-8.2+. The deprecation warnings break the expected
output. Since using these encodings in mbstring is now deprecated, I think
there is little point in keeping them in this test. So they are now removed
from it.

Further, in 219fff376b, I made a change to avoid a situation where the
legacy UTF7-IMAP conversion code gets stuck in a wrong state when its
attempt to emit a character fails. When a Base64-encoded section of input
ended with -, the previous code would FIRST emit a character if necessary
(using the CK or "check" macro, which causes the function to return
immediately if the downstream filter function returns an error code), and
THEN update its own state to indicate that it is now in ASCII rather than
Base64 mode. If the downstream filter function returned an error code, the
CK macro would then cause the UTF7-IMAP filter function to return
immediately WITHOUT setting its own state to indicate that the
Base64-encoded section was done. I fixed this by updating the filter state
as needed BEFORE calling CK... but I missed updating the filter state in
the case where the Base64 section ends normally and there is no need to
emit anything.

Again, in 6d525a425e, I modified the legacy conversion code for ISO-2022-KR
to try to comply more closely with the RFC for this text encoding. The RFC
states that before any occurrence of 'Shift In' or 'Shift Out' codes in an
ISO-2022-KR string, a special escape sequence must appear at least ONCE, at
the beginning of a line. The previous code did not comply with this
requirement. I made it comply by always emitting this escape sequence at
the beginning of the first line.
Since mb_strcut (wrongly) determines when it has consumed enough of the
input string by looking at the length of its output in bytes, this extra
escape sequence makes mb_strcut consume 4 bytes less of an ISO-2022-KR
string than would otherwise be the case. When this strange behavior of
mb_strcut is fixed, this test will have to be adjusted to restore the
previous expected outputs for ISO-2022-KR.
---
 .../libmbfl/filters/mbfilter_utf7imap.c |  3 +
 ext/mbstring/tests/gh9535.phpt          | 60 +++++++------------
 2 files changed, 24 insertions(+), 39 deletions(-)

diff --git a/ext/mbstring/libmbfl/filters/mbfilter_utf7imap.c b/ext/mbstring/libmbfl/filters/mbfilter_utf7imap.c
index 45722ce2c89..f3cbb87f2fd 100644
--- a/ext/mbstring/libmbfl/filters/mbfilter_utf7imap.c
+++ b/ext/mbstring/libmbfl/filters/mbfilter_utf7imap.c
@@ -147,6 +147,9 @@ int mbfl_filt_conv_utf7imap_wchar(int c, mbfl_convert_filter *filter)
 				 * or it could be that it ended on the first half of a surrogate pair */
 				filter->cache = filter->status = 0;
 				CK((*filter->output_function)(MBFL_BAD_INPUT, filter->data));
+			} else {
+				/* Base64-encoded section properly terminated by - */
+				filter->cache = filter->status = 0;
 			}
 		} else { /* illegal character */
 			filter->cache = filter->status = 0;
diff --git a/ext/mbstring/tests/gh9535.phpt b/ext/mbstring/tests/gh9535.phpt
index f22a473561a..9b491673731 100644
--- a/ext/mbstring/tests/gh9535.phpt
+++ b/ext/mbstring/tests/gh9535.phpt
@@ -5,9 +5,6 @@ mbstring
 --FILE--
---EXPECTF--
-BASE64: 宛如繁
-HTML-ENTITIES: 宛如
-Quoted-Printable: %s
+--EXPECT--
 UTF-16: 宛如繁星般宛如
 UTF-16BE: 宛如繁星般宛如
 UTF-16LE: 宛如繁星般宛如
@@ -101,9 +98,6 @@ CP50220: 宛如繁星
 CP50221: 宛如繁星
 CP50222: 宛如繁星
 
-BASE64: 星のように
-HTML-ENTITIES: 星の
-Quoted-Printable: 星の
 UTF-16: 星のように月のように
 UTF-16BE: 星のように月のように
 UTF-16LE: 星のように月のように
@@ -118,9 +112,6 @@ CP50220: 星のように月の
 CP50221: 星のように月の
 CP50222: 星のように月の
 
-BASE64: %s
-HTML-ENTITIES: あa&
-Quoted-Printable: あa
 UTF-16: あaいb
 UTF-16BE: あaいb
 UTF-16LE: あaいb
@@ -135,9 +126,6 @@ CP50220: あa
 CP50221: あa
 CP50222: あa
 
-BASE64: AAAAAA
-HTML-ENTITIES: AAAAAAAAAA
-Quoted-Printable: AAAAAAAAAA
 UTF-16: AAAAA
 UTF-16BE: AAAAA
 UTF-16LE: AAAAA
@@ -146,15 +134,12 @@ UTF7-IMAP: AAAAAAAAAA
 ISO-2022-JP-MS: AAAAAAAAAA
 GB18030: AAAAAAAAAA
 HZ: AAAAAAAAAA
-ISO-2022-KR: AAAAAAAAAA
+ISO-2022-KR: AAAAAA
 ISO-2022-JP-MOBILE#KDDI: AAAAAAAAAA
 CP50220: AAAAAAAAAA
 CP50221: AAAAAAAAAA
 CP50222: AAAAAAAAAA
 
-BASE64:%s
-HTML-ENTITIES: ??
-Quoted-Printable: ??
 UTF-16: ?
 UTF-16BE: ?
 UTF-16LE: ?
@@ -163,25 +148,22 @@ UTF7-IMAP: ??
 ISO-2022-JP-MS: ??
 GB18030: ??
 HZ: ??
-ISO-2022-KR: ??
+ISO-2022-KR: 
 ISO-2022-JP-MOBILE#KDDI: ??
 CP50220: ??
 CP50221: ??
 CP50222: ??
 
-string(0) ""
-string(2) "??"
-string(2) "??"
-string(2) "??"
-string(2) "??"
-string(2) "??"
-string(2) "??"
-string(2) "??"
-string(2) "??"
-string(2) "??"
-string(2) "??"
-string(2) "??"
-string(2) "??"
-string(2) "??"
-string(2) "??"
-string(2) "??"
+UTF-16: ??
+UTF-16BE: ??
+UTF-16LE: ??
+UTF-7: ??
+UTF7-IMAP: ??
+ISO-2022-JP-MS: ??
+GB18030: ??
+HZ: ??
+ISO-2022-KR: 
+ISO-2022-JP-MOBILE#KDDI: ??
+CP50220: ??
+CP50221: ??
+CP50222: ??
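
Note on the UTF7-IMAP change: the state-ordering problem described in the commit message can be shown with a toy model. This is only a sketch, not the libmbfl code. The `toy_filter` struct, `TOY_BAD_INPUT`, the two output callbacks, the three `end_section_*` functions, and the `CK` definition used here are all assumptions made for illustration; only the idea (reset `cache` and `status` on every path that ends a Base64 section, and do it before `CK` can return) comes from the patch.

```c
/* Assumed shape of the "check" macro: return at once on a downstream error. */
#define CK(statement) do { if ((statement) < 0) return -1; } while (0)

/* Hypothetical marker value; not the real MBFL_BAD_INPUT. */
#define TOY_BAD_INPUT 0x7FFFFFFF

/* Hypothetical stand-in for mbfl_convert_filter with only the needed fields. */
typedef struct toy_filter {
	int status; /* 0 = ASCII mode, nonzero = inside a Base64 section */
	int cache;  /* nonzero = partially decoded bits are still pending */
	int (*output_function)(int c, void *data);
	void *data;
} toy_filter;

/* Downstream callback that always reports an error. */
static int failing_output(int c, void *data)
{
	(void)c;
	(void)data;
	return -1;
}

/* Downstream callback that succeeds and counts emitted characters. */
static int counting_output(int c, void *data)
{
	(void)c;
	++*(int *)data;
	return 0;
}

/* Oldest ordering: emit first, reset state afterwards.
 * If CK returns early, the filter stays in Base64 mode. */
static int end_section_emit_first(toy_filter *f)
{
	if (f->cache) {
		CK(f->output_function(TOY_BAD_INPUT, f->data));
	}
	f->cache = f->status = 0; /* never reached when CK returns */
	return 0;
}

/* Intermediate ordering: state is reset before CK, but only on the
 * error path. A cleanly terminated section leaves status untouched. */
static int end_section_missing_else(toy_filter *f)
{
	if (f->cache) {
		f->cache = f->status = 0;
		CK(f->output_function(TOY_BAD_INPUT, f->data));
	}
	return 0;
}

/* Ordering after this patch: state is reset on both paths. */
static int end_section_fixed(toy_filter *f)
{
	if (f->cache) {
		f->cache = f->status = 0;
		CK(f->output_function(TOY_BAD_INPUT, f->data));
	} else {
		/* section properly terminated by '-' */
		f->cache = f->status = 0;
	}
	return 0;
}
```

Calling `end_section_emit_first` on a filter with pending bits and a failing downstream callback returns -1 with `status` still nonzero. Calling `end_section_missing_else` on a filter with `cache == 0` returns 0 but also leaves `status` nonzero, which is the case this patch addresses. `end_section_fixed` leaves `status` at 0 in both situations.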