Fuzzing revealed a small difference between the number of error
markers which the legacy ISO-2022-JP and JIS7/8 conversion code
emitted for truncated escape sequences and those emitted by the
new code. The behavior of the old code seems more reasonable
here, so we will imitate it.
In 04e59c916f, I amended the UTF-8 conversion code, so that when given
invalid input, it would emit a number of errors markers harmonizing
with the WHATWG's specification of the standard UTF-8 decoding
algorithm. (Which, gentle reader of commit logs, you can find online
at https://encoding.spec.whatwg.org/#utf-8-decoder.) However, the code
in 04e59c916f was faulty in the case that a truncated UTF-8 code unit
starts with 0xF1.
Then, in dc1ba61d09, when making a small refactoring to a different
part of the UTF-8 conversion code, I inexplicably broke part of the
working code, causing the same fault which was already present with
truncated UTF-8 code units starting with 0xF1 to also occur with
0xF2 and 0xF3 as well. I don't remember what inane thoughts I was
thinking when I pulled off this feat of utter mental confusion.
None of these cases were covered by unit tests, by the way.
Thankfully, my trusty fuzzer picked up on this when testing the
new implementation of mb_parse_str (since the legacy UTF-8
conversion filter did not suffer from the same problem, and I was
fuzzing to find any differences in behavior between the old and
new implementations).
Fortuitously, the fuzzer also picked up another issue which was
present in 04e59c916f. I was emitting only one error marker for
truncated code units starting with 0xE0 or 0xED, in cases where
the WHATWG standard indicates two should be emitted. Examples
are 0xE0 0x9F <END OF STRING> or 0xED 0xA0 <END OF STRING>.
Code units starting with 0xE0-0xED should have 3 bytes. If the
first byte is 0xE0, the second MUST be 0xA0 or greater. (Otherwise,
the codepoint could have fit in a two-byte code unit.) And if the
first byte is 0xED, the second MUST be 0x9F or less. According to
the WHATWG algorithm, step 4, if the second byte is outside the
legal range, then the decoder should emit an error... AND
reprocess the out-of-range byte. The reprocessing will then
cause another error. That's why the decoder should indicate two
errors and not one.
Fuzzing revealed that something was missed here when making the new
encoding conversion code match the behavior of the old code. In the
next major release of PHP, support for these non-encodings will be
dropped, but in the meantime, it is better to match the legacy
behavior.
This fix is another solution to replace d0527427be, use zend_try and zend_catch to make sure persistent stream will be released when error occurred.
Closes GH-9332.
This reverts commit d0527427be.
This patch makes Swoole/Swow can not work anymore, because Coroutine will yield to another one during socket operation, EG(record_errors) assertion will always fail, and zend_begin_record_errors() was only used during compile time before.
Note: zend_emit_recorded_errors() and the typo fix are reserved.
This is not actually related to SSL handshake but stream socket creation
which does not clean errors if the error handler is set. This fix
prevents emitting errors until the stream is freed.
The comparator function used at ksort in SORT_REGULAR mode
need to be consistent with basic comparison rules. These rules
were changed in PHP-8.0 for numeric strings, but comparator
used at ksort kept the old behaviour. It leads to inconsistent
situations, when after ksort the first key is GREATER than some
of the next ones by according to the basic comparison operators.
Closes GH-9293.