1
0
mirror of https://github.com/php/php-src.git synced 2026-04-17 21:11:02 +02:00
Commit Graph

506 Commits

Author SHA1 Message Date
Alex Dowad
e2c4fc5755 Fix buffer overflow bugs in CP50222 text conversion code 2022-05-28 21:53:39 +02:00
Alex Dowad
1f17b5468f Fix buffer overflow bug in HZ text conversion code 2022-05-28 21:53:39 +02:00
Alex Dowad
0154a5ac9f Use fast text conversion filters to implement php_mb_convert_encoding_ex 2022-05-28 21:53:38 +02:00
Alex Dowad
8dddd3cfad Fix buffer overflow bugs in UTF-7 text conversion
After Nikita Popov found a buffer overrun bug in one of my pull
requests, I was prompted to add more assertions in a38c7e5703 to help
me catch such bugs myself more easily in testing.

Wouldn't you just know it... as soon as I added those assertions, the
mbstring test suite caught another buffer overrun bug in my UTF-7
conversion code, which I wrote the better part of a year ago.

Then, when I started fuzzing the code with libfuzzer, I found
and fixed another buffer overflow:

If we enter the main loop, which normally outputs 3 decoded Base64
characters, where the first half of a surrogate pair had appeared at
the end of the previous run, but the second half does not appear
on this run, we need to output one error marker.

Then, at the end of the main loop, if the Base64 input ends at an
unexpected position AND the last character was not a legal
Base64-encoded character, we need to output two error markers
for that. The three error markers plus two valid, decoded bytes
can push us over the available space in our wchar buffer.
2022-05-28 21:53:38 +02:00
Alex Dowad
e4b9aa1870 Add assertions to help catch buffer overflows in mbstring text conversion code 2022-05-28 21:53:38 +02:00
Alex Dowad
b2f963f91c For JIS/ISO-2022-JP, treat a truncated escape sequence as error 2022-05-28 21:53:37 +02:00
Alex Dowad
4afa72126e Implement fast text conversion interface for QPrint 2022-05-28 21:53:37 +02:00
Alex Dowad
3fda9f5095 Implement fast text conversion interface for HTML-ENTITIES 2022-05-28 21:53:37 +02:00
Alex Dowad
dc1ba61d09 Simplify code for converting UTF-8
An overly complex boolean test was used to check if a 3-byte code unit
was valid. Convert it to an equivalent test with fewer terms.
2022-05-28 21:53:37 +02:00
Alex Dowad
85690ae26d Implement fast text conversion interface for Base64 2022-05-28 21:53:37 +02:00
Alex Dowad
7c2587b1f6 Implement fast text conversion interface for UUENCODE 2022-05-28 21:53:37 +02:00
Alex Dowad
06a15e6395 Implement fast text conversion interface for '8bit' 2022-05-28 21:53:37 +02:00
Alex Dowad
073a88f34c Implement fast text conversion interface for ISO-2022-JP-KDDI
One bug in the previous implementation; when it saw a sequence of
codepoints which looked like they might need to be emitted as a special
KDDI emoji, it would totally forget whether it was in ASCII mode,
JISX 0208 mode, or something else. So it could not reliably emit the
correct escape sequence to switch to the right mode.

Further, if the input ends with a codepoint which looks like it could
be part of a special KDDI emoji, then the legacy code did not emit
an escape sequence to switch back to ASCII mode at the end of the
string. This means that the emitted ISO-2022-JP-KDDI strings could not
always be safely concatenated.
2022-05-28 21:53:36 +02:00
Alex Dowad
c96ad91014 Implement fast text conversion interface for ISO-2022-JP-MS 2022-05-28 21:53:36 +02:00
Alex Dowad
a08f062cad Implement fast text conversion interface for mobile variants of UTF-8 2022-05-28 21:53:36 +02:00
Alex Dowad
321dbd0413 Implement fast text conversion interface for ISO-2022-JP-2004
There were bugs in the legacy implementation. Lots of them.

It did not properly track whether it has switched to JISX 0213 plane 1
or plane 2. If it processes a character in plane 1 and then immediately
one in plane 2, it failed to emit the escape code to switch to plane 2.

Further, when converting codepoints from 0x80-0xFF to ISO-2022-JP-2004,
the legacy implementation would totally disregard which mode it was
operating in. Such codepoints would pass through directly to the output
without any escape sequences being emitted.

If that was not enough, all the legacy implementations of JISX 0213:2004
encodings had another common bug; their 'flush function' did not call
the next flush function in the chain of conversion filters. So if any
of these encodings were converted to an encoding where the flush
function was needed to finish the output string, then the output
would be truncated.
2022-05-28 21:53:36 +02:00
Alex Dowad
29e21c0e6f Implement fast text conversion interface for SJIS-2004
All the legacy implementations of JISX 0213:2004 encodings had a
common bug; their 'flush function' did not call the next flush function
in the chain of conversion filters. So if any of these encodings were
converted to an encoding where the flush function was needed to finish
the output string, then the output would be truncated.
2022-05-28 21:53:36 +02:00
Alex Dowad
e5fdd5cef2 Implement fast text conversion interface for EUC-JP-2004
All the legacy implementations of JISX 0213:2004 encodings had a
common bug; their 'flush function' did not call the next flush function
in the chain of conversion filters. So if any of these encodings were
converted to an encoding where the flush function was needed to finish
the output string, then the output would be truncated.
2022-05-28 21:53:36 +02:00
Alex Dowad
67d83f57c1 Implement fast text conversion interface for mobile SJIS variants 2022-05-28 21:53:36 +02:00
Alex Dowad
0d635d93f5 Implement fast text conversion interface for UTF7-IMAP
The old code would convert a 0x00 byte in the input to 0x00 in the
output, but this clearly violates the RFC which defines UTF7-IMAP.
2022-05-28 21:53:35 +02:00
Alex Dowad
6cf30356e0 Implement fast text conversion interface for SJIS-mac 2022-05-28 21:53:35 +02:00
Alex Dowad
c9479899c6 Implement fast text conversion interface for ISO-2022-KR
When working on this, I read RFC 1557 again and realized that the
comment at the top of the file was totally mistaken. Further, the
legacy code did not obey the RFC. (It would emit the "ESC $ ) C"
sequence anywhere, not just at the beginning of a line as the RFC
requires.)

The new code obeys the RFC; one quirk is that it always emits the
escape sequence at the beginning of each output string, even if the
string is completely ASCII (in which case the escape sequence is
allowed, but not required).

The new code doesn't always generate the same number of error markers
for invalid escapes as the old code did.

The old code could not emit the special KDDI emoji for national flags.

Further, there was a bug in the test which the old code used to
determine whether an 0xF byte should be emitted at the end of a string
(to switch back to ASCII mode). As a result, it would not always switch
back to ASCII mode, meaning that it was not always safe to concatenate
the resulting strings.
2022-05-28 21:53:35 +02:00
Alex Dowad
763284a531 Implement fast text conversion interface for '7bit' 2022-05-28 21:53:35 +02:00
Alex Dowad
3f12d26e3a Merge branch 'PHP-8.1'
* PHP-8.1:
  Error handling for UTF-8 complies with WHATWG specification
2022-04-16 20:32:12 +02:00
Alex Dowad
04e59c916f Error handling for UTF-8 complies with WHATWG specification
In 7502c86342, I adjusted the number of error markers emitted on
invalid UTF-8 text to be more consistent with mbstring's behavior on
other text encodings (generally, it emits one error marker for one
unexpected byte). I didn't expect that anybody would actually care one
way or the other, but felt that it was better to be consistent than
not.

Later, Martin Auswöger kindly pointed out that the WHATWG encoding
specification, which governs how various text encodings are handled
by web browsers, does actually specify how many error markers should
be generated for any given piece of invalid UTF-8 text.

Until now, we have never really paid much attention to the WHATWG
specification, but we do want to comply with as many relevant
specifications as possible. And since PHP is commonly used for web
applications, compatibility with the behavior of web browsers is
obviously a good thing.
2022-04-16 15:04:38 +02:00
Christoph M. Becker
20c0eb47df Merge branch 'PHP-8.1'
* PHP-8.1:
  Fix GH-8208: mb_encode_mimeheader: $indent functionality broken
2022-03-17 17:35:06 +01:00
Christoph M. Becker
5003831260 Merge branch 'PHP-8.0' into PHP-8.1
* PHP-8.0:
  Fix GH-8208: mb_encode_mimeheader: $indent functionality broken
2022-03-17 17:34:31 +01:00
Christoph M. Becker
d0417ebc93 Fix GH-8208: mb_encode_mimeheader: $indent functionality broken
We also need to factor in the indent, when getting the encoder result.

Closes GH-8213.
2022-03-17 17:31:58 +01:00
Alex Dowad
ff76694f28 Merge branch 'PHP-8.1'
* PHP-8.1:
  mb_check_encoding($str, '7bit') rejects strings with bytes over 0x7F
2022-02-22 23:58:57 +02:00
Alex Dowad
8a8533d263 mb_check_encoding($str, '7bit') rejects strings with bytes over 0x7F
This was the old behavior of mb_check_encoding() before 3e7acf901d,
but yours truly broke it. If only we had more thorough tests at that
time, this might not have slipped through the cracks.

Thanks to divinity76 for the report.
2022-02-22 23:56:56 +02:00
Dmitry Stogov
e1782c08bf Fix ASAN undefined behavior (unsigned char << 24)
ext/mbstring/libmbfl/filters/mbfilter_utf32.c:259:20: runtime error: left shift of 128 by 24 places cannot be represented in type 'int'
2022-01-11 09:13:22 +03:00
Alex Dowad
53ffba967c Implement fast text conversion interface for CP5022{0,1,2} 2021-12-26 22:19:51 +02:00
Alex Dowad
01afd9f141 Implement fast text conversion interface for JIS 2021-12-26 22:19:51 +02:00
Alex Dowad
cb4626c5b2 Implement fast text conversion interface for GB18030 2021-12-26 22:19:51 +02:00
Alex Dowad
3e8088dc80 Implement fast text conversion interface for EUC-JP-MS 2021-12-26 22:19:51 +02:00
Alex Dowad
e5af94b74f Implement fast text conversion interface for CP51932 2021-12-26 22:19:51 +02:00
Alex Dowad
6ef1b35223 Implement fast text conversion interface for EUC-CN 2021-12-26 22:19:51 +02:00
Alex Dowad
9bd08a97d9 Implement fast text conversion interface for EUC-TW 2021-12-26 22:19:51 +02:00
Alex Dowad
661a10160b Implement fast text conversion interface for CP936 2021-12-26 22:19:51 +02:00
Alex Dowad
20555371d5 Implement fast text conversion interface for CP932 2021-12-26 22:19:51 +02:00
Alex Dowad
43bb97c539 Implement fast text conversion interface for EUC-KR 2021-12-26 22:19:51 +02:00
Alex Dowad
c0936d48b0 Implement fast text conversion interface for UHC 2021-12-26 22:19:51 +02:00
Alex Dowad
40809cb19f Implement fast text conversion interface for HZ 2021-12-26 22:19:51 +02:00
Alex Dowad
da58d42d94 Implement fast text conversion interface for CP950 2021-12-26 22:19:51 +02:00
Alex Dowad
eac50a360f Implement fast text conversion interface for Big5 2021-12-26 22:19:51 +02:00
Alex Dowad
3c73225125 New internal interface for fast text conversion in mbstring
When converting text to/from wchars, mbstring makes one function call
for each and every byte or wchar to be converted. Typically, each of
these conversion functions contains a state machine, and its state has
to be restored and then saved for every single one of these calls.
It doesn't take much to see that this is grossly inefficient.

Instead of converting one byte or wchar on each call, the new
conversion functions will either fill up or drain a whole buffer of
wchars on each call. In benchmarks, this is about 3-10× faster.

Adding the new, faster conversion functions for all supported legacy
text encodings still needs some work. Also, all the code which uses
the old-style conversion functions needs to be converted to use the
new ones. After that, the old code can be dropped. (The mailparse
extension will also have to be fixed up so it will still compile.)
2021-12-21 08:33:11 +02:00
Christoph M. Becker
97f78b3bb7 Merge branch 'PHP-8.1'
* PHP-8.1:
  Fix #81693: mb_check_encoding(7bit) segfaults
2021-12-03 22:50:27 +01:00
Christoph M. Becker
929d847152 Fix #81693: mb_check_encoding(7bit) segfaults
`php_mb_check_encoding()` now uses conversion to `mbfl_encoding_wchar`.
Since `mbfl_encoding_7bit` has no `input_filter`, no filter can be
found.  Since we don't actually need to convert to wchar, we encode to
8bit.

Closes GH-7712.
2021-12-03 22:49:47 +01:00
Alex Dowad
9962aa9774 Merge branch 'PHP-8.1'
* PHP-8.1:
  mb_detect_encoding will not return non-encodings
  Improve detection accuracy of mb_detect_encoding
2021-10-19 18:11:35 +02:00
Alex Dowad
28b346bc06 Improve detection accuracy of mb_detect_encoding
Originally, `mb_detect_encoding` essentially just checked all candidate
encodings to see which ones the input string was valid in. However, it
was only able to do this for a limited few of all the text encodings
which are officially supported by mbstring.

In 3e7acf901d, I modified it so it could 'detect' any text encoding
supported by mbstring. While this is arguably an improvement, if the
only text encodings one is interested in are those which
`mb_detect_encoding` could originally handle, the old
`mb_detect_encoding` may have been preferable. Because the new one has
more possible encodings which it can guess, it also has more chances to
get the answer wrong.

This commit adjusts the detection heuristics to provide accurate
detection in a wider variety of scenarios. While the previous detection
code would frequently confuse UTF-32BE with UTF-32LE or UTF-16BE with
UTF-16LE, the adjusted code is extremely accurate in those cases.
Detection for Chinese text in Chinese encodings like GB18030 or BIG5
and for Japanese text in Japanese encodings like EUC-JP or SJIS is
greatly improved. Detection of UTF-7 is also greatly improved. An 8KB
table, with one bit for each codepoint from U+0000 up to U+FFFF, is
used to achieve this.

One significant constraint is that the heuristics are completely based
on looking at each codepoint in a string in isolation, treating some
codepoints as 'likely' and others as 'unlikely'. It might still be
possible to achieve great gains in detection accuracy by looking at
sequences of codepoints rather than individual codepoints. However,
this might require huge tables. Further, we might need a huge corpus
of text in various languages to derive those tables.

Accuracy is still dismal when trying to distinguish single-byte
encodings like ISO-8859-1, ISO-8859-2, KOI8-R, and so on. This is
because the valid bytes in these encodings are basically all the same,
and all valid bytes decode to 'likely' codepoints, so our method of
detection (which is based on rating codepoints as likely or unlikely)
cannot tell any difference between the candidates at all. It just
selects the first encoding in the provided list of candidates.

Speaking of which, if one wants to get good results from
`mb_detect_encoding`, it is important to order the list of candidate
encodings according to your prior belief of which are more likely to
be correct. When the function cannot tell any difference between two
candidates, it returns whichever appeared earlier in the array.
2021-10-19 18:05:51 +02:00