1
0
mirror of https://github.com/php/php-src.git synced 2026-04-04 14:42:49 +02:00
Commit Graph

2277 Commits

Author SHA1 Message Date
Eric Norris
09237f6126 Update request startup error messages 2022-07-18 23:19:59 +01:00
Alex Dowad
76a92c26e3 mb_decode_numericentity decodes valid entities which are truncated at end of string
Since mb_decode_numericentity does not require all HTML entities
to end with ';', but allows them to be terminated by ANY non-digit
character, it doesn't make sense that valid entities which butt
up against the end of the input string are not converted.

As it turned out, supporting this case also made it possible
to simplify the code nicely.
2022-07-18 15:11:47 +02:00
Alex Dowad
5d6bd557b3 mb_decode_numericentity converts entities which immediately follow a valid/invalid entity
Thanks to Kamil Tieleka for suggesting that some of the behaviors of
the legacy implementation which the new mb_decode_numericentity
implementation took care to maintain were actually bugs and should
be fixed. Thanks also to Trevor Rowbotham for providing a link to
the HTML specification, showing how HTML numeric entities should
be interpreted.

mb_decode_numericentity now processes numeric entities in the
following situations where the old implementation would not:

- &<ENTITY> (for example, &&#65;)
- &#<ENTITY>
- &#x<ENTITY>
- <VALID BUT UNTERMINATED DECIMAL ENTITY><ENTITY> (for example, &#65&#65;)
- <VALID BUT UNTERMINATED HEX ENTITY><ENTITY>
- <INVALID AND UNTERMINATED DECIMAL ENTITY><ENTITY> (it does not matter why
  the first entity is invalid; the value could be too big, it could have
  too many digits, or it could not match the 'convmap' parameter)
- <INVALID AND UNTERMINATED HEX ENTITY><ENTITY>

This is consistent with the way that web browsers process
HTML entities.
2022-07-18 15:11:32 +02:00
Alex Dowad
30bfeef48d mbfl_strwidth does not need to use legacy conversion filters now
...Because we have the new (faster) conversion filters now for
ALL text encodings supported by mbstring.
2022-07-18 15:11:32 +02:00
Alex Dowad
40f5048aa7 Fix new conversion filter for UUEncode
This code (written by yours truly) was very broken on input
strings long enough to require processing in multiple chunks.
Fuzzing revealed this very quickly; after initial rework,
further fuzzing also found a couple of very obscure bugs in
corner cases.
2022-07-18 15:11:32 +02:00
Alex Dowad
5fee30b630 Fix new conversion filter for QPrint (same order of check as legacy code)
Because of checking for maximum line length *before* certain other checks,
the new conversion filter for QPrint could produce different results from
the old one in some cases. This was discovered while fuzzing the new
implementation of mb_decode_numericentity.
2022-07-18 15:11:32 +02:00
Alex Dowad
3cf432798e Fix new conversion filter for CP50220 (multi-codepoint kana at end of buffer)
If two codepoints which needed to be collapsed into a single kuten code
were separated, with one at the end of one buffer and the other at the
beginning of the next buffer, they were not converted correctly.
This was discovered while fuzzing the new implementation of
mb_decode_numericentity.
2022-07-18 15:11:31 +02:00
Alex Dowad
7559bf77d2 Fix new conversion filters for mobile SJIS variants ('0' at end of buffer)
Previously, I had adjusted this code so that if a character which could
be part of a special Docomo/Softbank/KDDI 'keypad' emoji appeared at
the end of one buffer, and the 'keypad' character appeared at the
beginning of the next, they would still be combined. However, this
broke the handling of such a character appearing at the end of one
buffer, and a character which is NOT 'keypad' appearing at the
beginning of the next.

This was found while fuzzing the new implementation of
mb_decode_numericentity.
2022-07-18 15:11:31 +02:00
Alex Dowad
fa83a8e15e Fix new conversion filter for HTML entities
While fuzzing the new mb_decode_numericentity implementation, I discovered
that the fast conversion filter for 'HTML-ENTITIES' did not correctly
handle an empty named entity ('&;'), nor did it correctly handle
invalid named entities whose names were a prefix of a valid entity.
Also, it did not correctly handle the case where a named entity is
truncated and another named entity starts abruptly.
2022-07-18 15:11:31 +02:00
Alex Dowad
9c3972fb3d Fix legacy conversion filter for HZ 2022-07-18 15:11:31 +02:00
Alex Dowad
1526bab6d0 Fix legacy conversion filter for GB18030 2022-07-18 15:11:31 +02:00
Alex Dowad
6938e35122 Fix legacy conversion filter for CP50220 2022-07-18 15:11:31 +02:00
Alex Dowad
1662f7f79f Fix legacy conversion filter for UTF-7 2022-07-18 15:11:31 +02:00
Alex Dowad
c8e4f313fa Fix legacy conversion filter for ISO-2022-KR
When I was working on this code before, it really, really
looked like the index into `uhc3_ucs_table` could never
overrun the size of the table. Why did I get this wrong?
Don't know. Anyways, libfuzzer tore away my illusions
and unequivocally demonstrated that the index CAN be
larger than the size of the table.
2022-07-18 15:11:31 +02:00
Alex Dowad
cebb8009c6 Fix legacy conversion filters for... almost all 8-bit text encodings 2022-07-18 15:11:31 +02:00
Alex Dowad
2eff19e38f Fix legacy conversion filter for HTML entities 2022-07-18 15:11:31 +02:00
Alex Dowad
87b71595ba Fix legacy conversion filter for Base64 2022-07-18 15:11:31 +02:00
Alex Dowad
7ece8f18b0 Fix legacy conversion filter for MacJapanese 2022-07-18 15:11:31 +02:00
Alex Dowad
d7bab66135 Fix legacy conversion filter for SJIS-2004 2022-07-18 15:11:31 +02:00
Alex Dowad
31cbb7a3a5 Fix legacy conversion filter for QPrint 2022-07-18 15:11:30 +02:00
Alex Dowad
048f6cbcde Fix legacy conversion filter for JIS 2022-07-18 15:11:30 +02:00
Alex Dowad
91969e908f New implementation of mb_{de,en}code_numericentity
This new implementation uses the new encoding conversion filters.
Aside from fewer LOC and (hopefully) improved readability,
the differences are as follows:

BEHAVIOR CHANGES:

- The old implementation used signed arithmetic when operating
on the 'convmap'. This meant that results could be surprising when
using convmap entries with 1 in the MSB. Further, types like 'int'
were used rather than those with a specific bit width, such as
'int32_t'. This meant that results could also depend on the
platform width of an 'int'.

Now unsigned arithmetic is used, with explicit bit widths.

- Similarly, while converting decimal numeric entities, the
legacy implementation would ensure that the value never overflowed
INT_MAX, and if it did, the entity would be treated as invalid
and passed through unconverted.

However, that again means that results depend on the platform
size of an 'int'. So now, we use a value with explicit bit width
(32 bits) to hold the value of a deconverted decimal entity, and
ensure that the entity value does not overflow that.

Further, because we are using an UNSIGNED 32-bit value rather
than a signed one, the ceiling for how large a decimal entity
can be is higher now.

All of this will probably not affect anyone, since Unicode
codepoints above U+10FFFF are invalid anyways. To see the
difference, you need to be using a text encoding like UCS-4,
which allows huge 'codepoints'.

- If it saw something which looked like a hex entity, but
turned out not to be a valid numeric entity, the old
implementation would sometimes convert the hexadecimal
digits a-f to A-F (uppercase). The new implementation passes
invalid numeric entities through without performing case
conversion.

- The old implementation of mb_encode_numericentity was
limited in how many decimal/hex digits it could emit.
If a text encoding like UCS-4 was in use, where 'codepoints'
can have huge values (larger than the valid range
stipulated by the Unicode standard), it would not error
out on a 'codepoint' whose value was too large for it,
but would rather mangle the value and emit a numeric
entity which decoded to some other random codepoint.
The new implementation is able to emit enough digits to
express any value which fits in 32 bits.

PERFORMANCE:

Based on micro-benchmarks run on my development machine:

Decoding numeric HTML entities is about 4 times faster, for
both decimal and hexadecimal entities, across a variety of
input string lengths. Encoding is about 3 times faster.
2022-07-18 15:11:30 +02:00
jcm
dbdef4a55c QA -mb_convert_encoding_array - error for object item in array
Closes GH-9023.
2022-07-15 17:34:35 +02:00
Christoph M. Becker
d6fc165028 Drop useless TODO comment
Cf. <https://github.com/php/php-src/pull/9018#issuecomment-1185481492>.
2022-07-15 14:23:59 +02:00
jcm
30d89b19cf QA - mb_http_input - function returns FALSE for type 'L' or 'l'
Closes GH-9018.
2022-07-15 14:22:39 +02:00
Arnaud Le Blanc
4df3dd7679 Reduce memory allocated by var_export, json_encode, serialize, and other (#8902)
smart_str uses an over-allocated string to optimize for append operations. Functions that use smart_str tend to return the over-allocated string directly. This results in unnecessary memory usage, especially for small strings.

The overhead can be up to 231 bytes for strings smaller than that, and 4095 for other strings. This can be avoided for strings smaller than `4096 - zend_string header size - 1` by reallocating the string.

This change introduces `smart_str_trim_to_size()`, and calls it in `smart_str_extract()`. Functions that use `smart_str` are updated to use `smart_str_extract()`.

Fixes GH-8896
2022-07-08 14:47:46 +02:00
Máté Kocsis
56137cd26e Declare ext/mbstring constants in stubs (#8798) 2022-06-23 17:34:08 +02:00
Alex Dowad
880803a21e Use fast conversion filters to implement php_mb_ord
Even for single-character strings, this is about 50% faster for
ASCII, UTF-8, and UTF-16. For long strings, the performance gain is
enormous, since the old code would convert the ENTIRE string, just
to pick out the first codepoint.
2022-06-12 15:24:41 +02:00
Alex Dowad
9468fa7ff2 mbfl_strlen does not need to use old conversion filters any more 2022-06-12 15:24:41 +02:00
Alex Dowad
950a7db9fe Use fast text conversion filters to implement mb_check_encoding
Benchmarking reveals that this is about 8% slower for UTF-8 strings
which have a bad codepoint at the very beginning of the string.
For good strings, or those where the first bad codepoint is much
later in the string, it is significantly faster (2-3 times faster
in many cases).
2022-06-12 15:24:41 +02:00
Alex Dowad
8533fccd63 Assert minimum size of wchar buffer in text conversion filters
In all text conversion filters which require the wchar buffer used for output
to have some minimum size, it's better to include an assertion; this will
help us to catch bugs, and will also help future readers to understand what
we expect of the function arguments.

For UTF-7 and UTF7-IMAP, these assertions were already there, but I have
added comments explaining why the minimum size is what it is.
2022-06-12 15:24:40 +02:00
Alex Dowad
871e61f942 Fully use available buffer space where converting Base64
I didn't think this through carefully enough when first writing this code,
but it's not necessary to reserve space for the 1-2 wchars which may be emitted
before exiting the function.

Why? Well, we are guaranteed that when we enter the function, there are at
least 3 spaces in the wchar buffer. The only way those can be consumed is if
wchars are emitted in the main 'while' loop, but if it does emit any wchars,
it will set 'bits' to zero at the same time, which means the final part will
not emit anything. 'bits' can be incremented again by the main loop, but the
main loop only runs while there are still at least 3 spaces in the buffer.

So basically, we are guaranteed that when the main loop terminates, either
there are 3 or more spaces remaining in the wchar buffer, or else 'bits' is
zero, or both.
2022-06-12 15:24:40 +02:00
Alex Dowad
13479ee2bd Restore backwards-compatible mappings for 0x5C/0x7E in SJIS (for fast conversion filter)
In d62f535caa, the legacy mbstring conversion filters for Shift-JIS
was updated to restore backwards-compatible mappings for 0x5C/0x7E.
Make the same change to the newer fast conversion filters.
2022-06-11 17:09:16 +02:00
Christoph M. Becker
85a95a2982 Merge branch 'PHP-8.1'
* PHP-8.1:
  Restore backwards-compatible mappings of 0x5C and 0x7E in SJIS
2022-06-11 16:32:33 +02:00
Alex Dowad
d62f535caa Restore backwards-compatible mappings of 0x5C and 0x7E in SJIS
According to the relevant Japan Industrial Standards Committee standards,
SJIS 0x5C is a Yen sign, and 0x7E is an overline.

However, this conflicts with the implementation of SJIS in various legacy
software (notably Microsoft products), where SJIS 0x5C and 0x7E are taken
as equivalent to the same ASCII bytes.

Prior to PHP 8.1, mbstring's implementation of SJIS handled these bytes
compatibly with Microsoft products. This was changed in PHP 8.1.0, in an
attempt to comply with the JISC specifications. However, after discussion
with various concerned Japanese developers, it seems that the historical
behavior was more useful in the majority of applications which process
SJIS-encoded text.

Since we are now treating SJIS 0x5C as equivalent to U+005C and 0x7E as
equivalent to U+007E, it does not make sense to convert U+203E (OVERLINE)
to 0x7E, nor does it make sense to convert U+00A5 (YEN SIGN) to 0x5C. Restore
the mappings for those codepoints from before PHP 8.1.0.

Fixes GH-8281.
2022-06-11 16:31:47 +02:00
Remi Collet
5b2e413e89 Merge branch 'PHP-8.1'
* PHP-8.1:
  NEWS for GH-8685
  NEWS for GH-8685
  Fix GH-8685 mbstring requires pcre
2022-06-03 07:55:51 +02:00
Remi Collet
966a90873d Merge branch 'PHP-8.0' into PHP-8.1
* PHP-8.0:
  NEWS for GH-8685
  Fix GH-8685 mbstring requires pcre
2022-06-03 07:54:58 +02:00
Remi Collet
2eb2f9d74f Fix GH-8685 mbstring requires pcre 2022-06-03 07:53:48 +02:00
Alex Dowad
492021168d php_mb_convert_encoding{,_ex} returns zend_string
That's what all existing callers want anyways. This avoids 2
unnecessary copies of the converted string.
2022-05-28 21:53:39 +02:00
Alex Dowad
e2c4fc5755 Fix buffer overflow bugs in CP50222 text conversion code 2022-05-28 21:53:39 +02:00
Alex Dowad
1f17b5468f Fix buffer overflow bug in HZ text conversion code 2022-05-28 21:53:39 +02:00
Alex Dowad
0154a5ac9f Use fast text conversion filters to implement php_mb_convert_encoding_ex 2022-05-28 21:53:38 +02:00
Alex Dowad
8dddd3cfad Fix buffer overflow bugs in UTF-7 text conversion
After Nikita Popov found a buffer overrun bug in one of my pull
requests, I was prompted to add more assertions in a38c7e5703 to help
me catch such bugs myself more easily in testing.

Wouldn't you just know it... as soon as I added those assertions, the
mbstring test suite caught another buffer overrun bug in my UTF-7
conversion code, which I wrote the better part of a year ago.

Then, when I started fuzzing the code with libfuzzer, I found
and fixed another buffer overflow:

If we enter the main loop, which normally outputs 3 decoded Base64
characters, where the first half of a surrogate pair had appeared at
the end of the previous run, but the second half does not appear
on this run, we need to output one error marker.

Then, at the end of the main loop, if the Base64 input ends at an
unexpected position AND the last character was not a legal
Base64-encoded character, we need to output two error markers
for that. The three error markers plus two valid, decoded bytes
can push us over the available space in our wchar buffer.
2022-05-28 21:53:38 +02:00
Alex Dowad
e4b9aa1870 Add assertions to help catch buffer overflows in mbstring text conversion code 2022-05-28 21:53:38 +02:00
Alex Dowad
a789088527 Add more tests for mbstring encoding conversion
When testing the preceding commits, I used a script to generate a large
number of random strings and try to find strings which would yield
different outputs from the new and old encoding conversion code.
Some were found. In most cases, analysis revealed that the new code
was correct and the old code was not.

In all cases where the new code was incorrect, regression tests were
added. However, there may be some value in adding regression tests
for cases where the old code was incorrect as well. That is done here.

This does not cover every case where the new and old code yielded
different results. Some of them were very obscure, and it is proving
difficult even to reproduce them (since I did not keep a record of
all the input strings which triggered the differing output).
2022-05-28 21:53:38 +02:00
Alex Dowad
b2f963f91c For JIS/ISO-2022-JP, treat a truncated escape sequence as error 2022-05-28 21:53:37 +02:00
Alex Dowad
4afa72126e Implement fast text conversion interface for QPrint 2022-05-28 21:53:37 +02:00
Alex Dowad
3fda9f5095 Implement fast text conversion interface for HTML-ENTITIES 2022-05-28 21:53:37 +02:00
Alex Dowad
dc1ba61d09 Simplify code for converting UTF-8
An overly complex boolean test was used to check if a 3-byte code unit
was valid. Convert it to an equivalent test with fewer terms.
2022-05-28 21:53:37 +02:00
Alex Dowad
85690ae26d Implement fast text conversion interface for Base64 2022-05-28 21:53:37 +02:00