This code (written by yours truly) was very broken on input
strings long enough to require processing in multiple chunks.
Fuzzing revealed this very quickly; after initial rework,
further fuzzing also found a couple of very obscure bugs in
corner cases.
Because of checking for maximum line length *before* certain other checks,
the new conversion filter for QPrint could produce different results from
the old one in some cases. This was discovered while fuzzing the new
implementation of mb_decode_numericentity.
If two codepoints which needed to be collapsed into a single kuten code
were separated, with one at the end of one buffer and the other at the
beginning of the next buffer, they were not converted correctly.
This was discovered while fuzzing the new implementation of
mb_decode_numericentity.
Previously, I had adjusted this code so that if a character which could
be part of a special Docomo/Softbank/KDDI 'keypad' emoji appeared at
the end of one buffer, and the 'keypad' character appeared at the
beginning of the next, they would still be combined. However, this
broke the handling of such a character appearing at the end of one
buffer, and a character which is NOT 'keypad' appearing at the
beginning of the next.
This was found while fuzzing the new implementation of
mb_decode_numericentity.
While fuzzing the new mb_decode_numericentity implementation, I discovered
that the fast conversion filter for 'HTML-ENTITIES' did not correctly
handle an empty named entity ('&;'), nor did it correctly handle
invalid named entities whose names were a prefix of a valid entity.
Also, it did not correctly handle the case where a named entity is
truncated and another named entity starts abruptly.
When I was working on this code before, it really, really
looked like the index into `uhc3_ucs_table` could never
overrun the size of the table. Why did I get this wrong?
Don't know. Anyways, libfuzzer tore away my illusions
and unequivocally demonstrated that the index CAN be
larger than the size of the table.
This new implementation uses the new encoding conversion filters.
Aside from fewer LOC and (hopefully) improved readability,
the differences are as follows:
BEHAVIOR CHANGES:
- The old implementation used signed arithmetic when operating
on the 'convmap'. This meant that results could be surprising when
using convmap entries with 1 in the MSB. Further, types like 'int'
were used rather than those with a specific bit width, such as
'int32_t'. This meant that results could also depend on the
platform width of an 'int'.
Now unsigned arithmetic is used, with explicit bit widths.
- Similarly, while converting decimal numeric entities, the
legacy implementation would ensure that the value never overflowed
INT_MAX, and if it did, the entity would be treated as invalid
and passed through unconverted.
However, that again means that results depend on the platform
size of an 'int'. So now, we use a value with explicit bit width
(32 bits) to hold the value of a deconverted decimal entity, and
ensure that the entity value does not overflow that.
Further, because we are using an UNSIGNED 32-bit value rather
than a signed one, the ceiling for how large a decimal entity
can be is higher now.
All of this will probably not affect anyone, since Unicode
codepoints above U+10FFFF are invalid anyways. To see the
difference, you need to be using a text encoding like UCS-4,
which allows huge 'codepoints'.
- If it saw something which looked like a hex entity, but
turned out not to be a valid numeric entity, the old
implementation would sometimes convert the hexadecimal
digits a-f to A-F (uppercase). The new implementation passes
invalid numeric entities through without performing case
conversion.
- The old implementation of mb_encode_numericentity was
limited in how many decimal/hex digits it could emit.
If a text encoding like UCS-4 was in use, where 'codepoints'
can have huge values (larger than the valid range
stipulated by the Unicode standard), it would not error
out on a 'codepoint' whose value was too large for it,
but would rather mangle the value and emit a numeric
entity which decoded to some other random codepoint.
The new implementation is able to emit enough digits to
express any value which fits in 32 bits.
PERFORMANCE:
Based on micro-benchmarks run on my development machine:
Decoding numeric HTML entities is about 4 times faster, for
both decimal and hexadecimal entities, across a variety of
input string lengths. Encoding is about 3 times faster.
Even for single-character strings, this is about 50% faster for
ASCII, UTF-8, and UTF-16. For long strings, the performance gain is
enormous, since the old code would convert the ENTIRE string, just
to pick out the first codepoint.
In all text conversion filters which require the wchar buffer used for output
to have some minimum size, it's better to include an assertion; this will
help us to catch bugs, and will also help future readers to understand what
we expect of the function arguments.
For UTF-7 and UTF7-IMAP, these assertions were already there, but I have
added comments explaining why the minimum size is what it is.
I didn't think this through carefully enough when first writing this code,
but it's not necessary to reserve space for the 1-2 wchars which may be emitted
before exiting the function.
Why? Well, we are guaranteed that when we enter the function, there are at
least 3 spaces in the wchar buffer. The only way those can be consumed is if
wchars are emitted in the main 'while' loop, but if it does emit any wchars,
it will set 'bits' to zero at the same time, which means the final part will
not emit anything. 'bits' can be incremented again by the main loop, but the
main loop only runs while there are still at least 3 spaces in the buffer.
So basically, we are guaranteed that when the main loop terminates, either
there are 3 or more spaces remaining in the wchar buffer, or else 'bits' is
zero, or both.
In d62f535caa, the legacy mbstring conversion filters for Shift-JIS
was updated to restore backwards-compatible mappings for 0x5C/0x7E.
Make the same change to the newer fast conversion filters.
According to the relevant Japan Industrial Standards Committee standards,
SJIS 0x5C is a Yen sign, and 0x7E is an overline.
However, this conflicts with the implementation of SJIS in various legacy
software (notably Microsoft products), where SJIS 0x5C and 0x7E are taken
as equivalent to the same ASCII bytes.
Prior to PHP 8.1, mbstring's implementation of SJIS handled these bytes
compatibly with Microsoft products. This was changed in PHP 8.1.0, in an
attempt to comply with the JISC specifications. However, after discussion
with various concerned Japanese developers, it seems that the historical
behavior was more useful in the majority of applications which process
SJIS-encoded text.
Since we are now treating SJIS 0x5C as equivalent to U+005C and 0x7E as
equivalent to U+007E, it does not make sense to convert U+203E (OVERLINE)
to 0x7E, nor does it make sense to convert U+00A5 (YEN SIGN) to 0x5C. Restore
the mappings for those codepoints from before PHP 8.1.0.
Fixes GH-8281.
After Nikita Popov found a buffer overrun bug in one of my pull
requests, I was prompted to add more assertions in a38c7e5703 to help
me catch such bugs myself more easily in testing.
Wouldn't you just know it... as soon as I added those assertions, the
mbstring test suite caught another buffer overrun bug in my UTF-7
conversion code, which I wrote the better part of a year ago.
Then, when I started fuzzing the code with libfuzzer, I found
and fixed another buffer overflow:
If we enter the main loop, which normally outputs 3 decoded Base64
characters, where the first half of a surrogate pair had appeared at
the end of the previous run, but the second half does not appear
on this run, we need to output one error marker.
Then, at the end of the main loop, if the Base64 input ends at an
unexpected position AND the last character was not a legal
Base64-encoded character, we need to output two error markers
for that. The three error markers plus two valid, decoded bytes
can push us over the available space in our wchar buffer.
One bug in the previous implementation; when it saw a sequence of
codepoints which looked like they might need to be emitted as a special
KDDI emoji, it would totally forget whether it was in ASCII mode,
JISX 0208 mode, or something else. So it could not reliably emit the
correct escape sequence to switch to the right mode.
Further, if the input ends with a codepoint which looks like it could
be part of a special KDDI emoji, then the legacy code did not emit
an escape sequence to switch back to ASCII mode at the end of the
string. This means that the emitted ISO-2022-JP-KDDI strings could not
always be safely concatenated.
There were bugs in the legacy implementation. Lots of them.
It did not properly track whether it has switched to JISX 0213 plane 1
or plane 2. If it processes a character in plane 1 and then immediately
one in plane 2, it failed to emit the escape code to switch to plane 2.
Further, when converting codepoints from 0x80-0xFF to ISO-2022-JP-2004,
the legacy implementation would totally disregard which mode it was
operating in. Such codepoints would pass through directly to the output
without any escape sequences being emitted.
If that was not enough, all the legacy implementations of JISX 0213:2004
encodings had another common bug; their 'flush function' did not call
the next flush function in the chain of conversion filters. So if any
of these encodings were converted to an encoding where the flush
function was needed to finish the output string, then the output
would be truncated.
All the legacy implementations of JISX 0213:2004 encodings had a
common bug; their 'flush function' did not call the next flush function
in the chain of conversion filters. So if any of these encodings were
converted to an encoding where the flush function was needed to finish
the output string, then the output would be truncated.
All the legacy implementations of JISX 0213:2004 encodings had a
common bug; their 'flush function' did not call the next flush function
in the chain of conversion filters. So if any of these encodings were
converted to an encoding where the flush function was needed to finish
the output string, then the output would be truncated.
When working on this, I read RFC 1557 again and realized that the
comment at the top of the file was totally mistaken. Further, the
legacy code did not obey the RFC. (It would emit the "ESC $ ) C"
sequence anywhere, not just at the beginning of a line as the RFC
requires.)
The new code obeys the RFC; one quirk is that it always emits the
escape sequence at the beginning of each output string, even if the
string is completely ASCII (in which case the escape sequence is
allowed, but not required).
The new code doesn't always generate the same number of error markers
for invalid escapes as the old code did.
The old code could not emit the special KDDI emoji for national flags.
Further, there was a bug in the test which the old code used to
determine whether an 0xF byte should be emitted at the end of a string
(to switch back to ASCII mode). As a result, it would not always switch
back to ASCII mode, meaning that it was not always safe to concatenate
the resulting strings.