1
0
mirror of https://github.com/php/php-src.git synced 2026-04-04 14:42:49 +02:00
Commit Graph

777 Commits

Author SHA1 Message Date
Eric Norris
09237f6126 Update request startup error messages 2022-07-18 23:19:59 +01:00
Alex Dowad
76a92c26e3 mb_decode_numericentity decodes valid entities which are truncated at end of string
Since mb_decode_numericentity does not require all HTML entities
to end with ';', but allows them to be terminated by ANY non-digit
character, it doesn't make sense that valid entities which butt
up against the end of the input string are not converted.

As it turned out, supporting this case also made it possible
to simplify the code nicely.
2022-07-18 15:11:47 +02:00
Alex Dowad
5d6bd557b3 mb_decode_numericentity converts entities which immediately follow a valid/invalid entity
Thanks to Kamil Tieleka for suggesting that some of the behaviors of
the legacy implementation which the new mb_decode_numericentity
implementation took care to maintain were actually bugs and should
be fixed. Thanks also to Trevor Rowbotham for providing a link to
the HTML specification, showing how HTML numeric entities should
be interpreted.

mb_decode_numericentity now processes numeric entities in the
following situations where the old implementation would not:

- &<ENTITY> (for example, &&#65;)
- &#<ENTITY>
- &#x<ENTITY>
- <VALID BUT UNTERMINATED DECIMAL ENTITY><ENTITY> (for example, &#65&#65;)
- <VALID BUT UNTERMINATED HEX ENTITY><ENTITY>
- <INVALID AND UNTERMINATED DECIMAL ENTITY><ENTITY> (it does not matter why
  the first entity is invalid; the value could be too big, it could have
  too many digits, or it could not match the 'convmap' parameter)
- <INVALID AND UNTERMINATED HEX ENTITY><ENTITY>

This is consistent with the way that web browsers process
HTML entities.
2022-07-18 15:11:32 +02:00
Alex Dowad
40f5048aa7 Fix new conversion filter for UUEncode
This code (written by yours truly) was very broken on input
strings long enough to require processing in multiple chunks.
Fuzzing revealed this very quickly; after initial rework,
further fuzzing also found a couple of very obscure bugs in
corner cases.
2022-07-18 15:11:32 +02:00
Alex Dowad
5fee30b630 Fix new conversion filter for QPrint (same order of check as legacy code)
Because of checking for maximum line length *before* certain other checks,
the new conversion filter for QPrint could produce different results from
the old one in some cases. This was discovered while fuzzing the new
implementation of mb_decode_numericentity.
2022-07-18 15:11:32 +02:00
Alex Dowad
3cf432798e Fix new conversion filter for CP50220 (multi-codepoint kana at end of buffer)
If two codepoints which needed to be collapsed into a single kuten code
were separated, with one at the end of one buffer and the other at the
beginning of the next buffer, they were not converted correctly.
This was discovered while fuzzing the new implementation of
mb_decode_numericentity.
2022-07-18 15:11:31 +02:00
Alex Dowad
7559bf77d2 Fix new conversion filters for mobile SJIS variants ('0' at end of buffer)
Previously, I had adjusted this code so that if a character which could
be part of a special Docomo/Softbank/KDDI 'keypad' emoji appeared at
the end of one buffer, and the 'keypad' character appeared at the
beginning of the next, they would still be combined. However, this
broke the handling of such a character appearing at the end of one
buffer, and a character which is NOT 'keypad' appearing at the
beginning of the next.

This was found while fuzzing the new implementation of
mb_decode_numericentity.
2022-07-18 15:11:31 +02:00
Alex Dowad
fa83a8e15e Fix new conversion filter for HTML entities
While fuzzing the new mb_decode_numericentity implementation, I discovered
that the fast conversion filter for 'HTML-ENTITIES' did not correctly
handle an empty named entity ('&;'), nor did it correctly handle
invalid named entities whose names were a prefix of a valid entity.
Also, it did not correctly handle the case where a named entity is
truncated and another named entity starts abruptly.
2022-07-18 15:11:31 +02:00
Alex Dowad
9c3972fb3d Fix legacy conversion filter for HZ 2022-07-18 15:11:31 +02:00
Alex Dowad
1526bab6d0 Fix legacy conversion filter for GB18030 2022-07-18 15:11:31 +02:00
Alex Dowad
6938e35122 Fix legacy conversion filter for CP50220 2022-07-18 15:11:31 +02:00
Alex Dowad
91969e908f New implementation of mb_{de,en}code_numericentity
This new implementation uses the new encoding conversion filters.
Aside from fewer LOC and (hopefully) improved readability,
the differences are as follows:

BEHAVIOR CHANGES:

- The old implementation used signed arithmetic when operating
on the 'convmap'. This meant that results could be surprising when
using convmap entries with 1 in the MSB. Further, types like 'int'
were used rather than those with a specific bit width, such as
'int32_t'. This meant that results could also depend on the
platform width of an 'int'.

Now unsigned arithmetic is used, with explicit bit widths.

- Similarly, while converting decimal numeric entities, the
legacy implementation would ensure that the value never overflowed
INT_MAX, and if it did, the entity would be treated as invalid
and passed through unconverted.

However, that again means that results depend on the platform
size of an 'int'. So now, we use a value with explicit bit width
(32 bits) to hold the value of a deconverted decimal entity, and
ensure that the entity value does not overflow that.

Further, because we are using an UNSIGNED 32-bit value rather
than a signed one, the ceiling for how large a decimal entity
can be is higher now.

All of this will probably not affect anyone, since Unicode
codepoints above U+10FFFF are invalid anyways. To see the
difference, you need to be using a text encoding like UCS-4,
which allows huge 'codepoints'.

- If it saw something which looked like a hex entity, but
turned out not to be a valid numeric entity, the old
implementation would sometimes convert the hexadecimal
digits a-f to A-F (uppercase). The new implementation passes
invalid numeric entities through without performing case
conversion.

- The old implementation of mb_encode_numericentity was
limited in how many decimal/hex digits it could emit.
If a text encoding like UCS-4 was in use, where 'codepoints'
can have huge values (larger than the valid range
stipulated by the Unicode standard), it would not error
out on a 'codepoint' whose value was too large for it,
but would rather mangle the value and emit a numeric
entity which decoded to some other random codepoint.
The new implementation is able to emit enough digits to
express any value which fits in 32 bits.

PERFORMANCE:

Based on micro-benchmarks run on my development machine:

Decoding numeric HTML entities is about 4 times faster, for
both decimal and hexadecimal entities, across a variety of
input string lengths. Encoding is about 3 times faster.
2022-07-18 15:11:30 +02:00
jcm
dbdef4a55c QA -mb_convert_encoding_array - error for object item in array
Closes GH-9023.
2022-07-15 17:34:35 +02:00
jcm
30d89b19cf QA - mb_http_input - function returns FALSE for type 'L' or 'l'
Closes GH-9018.
2022-07-15 14:22:39 +02:00
Christoph M. Becker
85a95a2982 Merge branch 'PHP-8.1'
* PHP-8.1:
  Restore backwards-compatible mappings of 0x5C and 0x7E in SJIS
2022-06-11 16:32:33 +02:00
Alex Dowad
d62f535caa Restore backwards-compatible mappings of 0x5C and 0x7E in SJIS
According to the relevant Japan Industrial Standards Committee standards,
SJIS 0x5C is a Yen sign, and 0x7E is an overline.

However, this conflicts with the implementation of SJIS in various legacy
software (notably Microsoft products), where SJIS 0x5C and 0x7E are taken
as equivalent to the same ASCII bytes.

Prior to PHP 8.1, mbstring's implementation of SJIS handled these bytes
compatibly with Microsoft products. This was changed in PHP 8.1.0, in an
attempt to comply with the JISC specifications. However, after discussion
with various concerned Japanese developers, it seems that the historical
behavior was more useful in the majority of applications which process
SJIS-encoded text.

Since we are now treating SJIS 0x5C as equivalent to U+005C and 0x7E as
equivalent to U+007E, it does not make sense to convert U+203E (OVERLINE)
to 0x7E, nor does it make sense to convert U+00A5 (YEN SIGN) to 0x5C. Restore
the mappings for those codepoints from before PHP 8.1.0.

Fixes GH-8281.
2022-06-11 16:31:47 +02:00
Alex Dowad
a789088527 Add more tests for mbstring encoding conversion
When testing the preceding commits, I used a script to generate a large
number of random strings and try to find strings which would yield
different outputs from the new and old encoding conversion code.
Some were found. In most cases, analysis revealed that the new code
was correct and the old code was not.

In all cases where the new code was incorrect, regression tests were
added. However, there may be some value in adding regression tests
for cases where the old code was incorrect as well. That is done here.

This does not cover every case where the new and old code yielded
different results. Some of them were very obscure, and it is proving
difficult even to reproduce them (since I did not keep a record of
all the input strings which triggered the differing output).
2022-05-28 21:53:38 +02:00
Alex Dowad
4afa72126e Implement fast text conversion interface for QPrint 2022-05-28 21:53:37 +02:00
Alex Dowad
3fda9f5095 Implement fast text conversion interface for HTML-ENTITIES 2022-05-28 21:53:37 +02:00
Alex Dowad
85690ae26d Implement fast text conversion interface for Base64 2022-05-28 21:53:37 +02:00
Alex Dowad
7c2587b1f6 Implement fast text conversion interface for UUENCODE 2022-05-28 21:53:37 +02:00
Alex Dowad
321dbd0413 Implement fast text conversion interface for ISO-2022-JP-2004
There were bugs in the legacy implementation. Lots of them.

It did not properly track whether it has switched to JISX 0213 plane 1
or plane 2. If it processes a character in plane 1 and then immediately
one in plane 2, it failed to emit the escape code to switch to plane 2.

Further, when converting codepoints from 0x80-0xFF to ISO-2022-JP-2004,
the legacy implementation would totally disregard which mode it was
operating in. Such codepoints would pass through directly to the output
without any escape sequences being emitted.

If that was not enough, all the legacy implementations of JISX 0213:2004
encodings had another common bug; their 'flush function' did not call
the next flush function in the chain of conversion filters. So if any
of these encodings were converted to an encoding where the flush
function was needed to finish the output string, then the output
would be truncated.
2022-05-28 21:53:36 +02:00
Alex Dowad
67d83f57c1 Implement fast text conversion interface for mobile SJIS variants 2022-05-28 21:53:36 +02:00
Alex Dowad
0d635d93f5 Implement fast text conversion interface for UTF7-IMAP
The old code would convert a 0x00 byte in the input to 0x00 in the
output, but this clearly violates the RFC which defines UTF7-IMAP.
2022-05-28 21:53:35 +02:00
Alex Dowad
6cf30356e0 Implement fast text conversion interface for SJIS-mac 2022-05-28 21:53:35 +02:00
Alex Dowad
c9479899c6 Implement fast text conversion interface for ISO-2022-KR
When working on this, I read RFC 1557 again and realized that the
comment at the top of the file was totally mistaken. Further, the
legacy code did not obey the RFC. (It would emit the "ESC $ ) C"
sequence anywhere, not just at the beginning of a line as the RFC
requires.)

The new code obeys the RFC; one quirk is that it always emits the
escape sequence at the beginning of each output string, even if the
string is completely ASCII (in which case the escape sequence is
allowed, but not required).

The new code doesn't always generate the same number of error markers
for invalid escapes as the old code did.

The old code could not emit the special KDDI emoji for national flags.

Further, there was a bug in the test which the old code used to
determine whether an 0xF byte should be emitted at the end of a string
(to switch back to ASCII mode). As a result, it would not always switch
back to ASCII mode, meaning that it was not always safe to concatenate
the resulting strings.
2022-05-28 21:53:35 +02:00
Alex Dowad
8b70a7db4f Merge branch 'PHP-8.1'
* PHP-8.1:
  mb_detect_encoding recognizes all letters in Hungarian alphabet
2022-05-25 13:10:23 +02:00
Alex Dowad
58d0aad75c mb_detect_encoding recognizes all letters in Hungarian alphabet 2022-05-25 08:22:07 +02:00
Alex Dowad
212b31b51c Merge branch 'PHP-8.1'
* PHP-8.1:
  mb_detect_encoding recognizes all letters in Czech alphabet
2022-05-25 07:53:39 +02:00
Alex Dowad
6a4b6d2344 mb_detect_encoding recognizes all letters in Czech alphabet 2022-05-25 07:52:39 +02:00
Alex Dowad
83db088fc2 Merge branch 'PHP-8.1'
* PHP-8.1:
  Fix mb_detect_encoding's recognition of Slavic names
2022-05-24 15:33:24 +02:00
Alex Dowad
9bb97ee8bc Fix mb_detect_encoding's recognition of Slavic names
Thanks to Côme Chilliet for reporting that mb_detect_encoding was not
detecting the desired text encoding for strings containing š or Ž.
These characters are used in Czech, Serbian, Croatian, Bosnian,
Macedonian, etc. names.
2022-05-24 15:32:20 +02:00
Christoph M. Becker
c9787b4785 Fix skip clause
The function required is called `mb_ereg_search()`.
2022-05-06 15:41:10 +02:00
Alex Dowad
3f12d26e3a Merge branch 'PHP-8.1'
* PHP-8.1:
  Error handling for UTF-8 complies with WHATWG specification
2022-04-16 20:32:12 +02:00
Alex Dowad
04e59c916f Error handling for UTF-8 complies with WHATWG specification
In 7502c86342, I adjusted the number of error markers emitted on
invalid UTF-8 text to be more consistent with mbstring's behavior on
other text encodings (generally, it emits one error marker for one
unexpected byte). I didn't expect that anybody would actually care one
way or the other, but felt that it was better to be consistent than
not.

Later, Martin Auswöger kindly pointed out that the WHATWG encoding
specification, which governs how various text encodings are handled
by web browsers, does actually specify how many error markers should
be generated for any given piece of invalid UTF-8 text.

Until now, we have never really paid much attention to the WHATWG
specification, but we do want to comply with as many relevant
specifications as possible. And since PHP is commonly used for web
applications, compatibility with the behavior of web browsers is
obviously a good thing.
2022-04-16 15:04:38 +02:00
Christoph M. Becker
20c0eb47df Merge branch 'PHP-8.1'
* PHP-8.1:
  Fix GH-8208: mb_encode_mimeheader: $indent functionality broken
2022-03-17 17:35:06 +01:00
Christoph M. Becker
5003831260 Merge branch 'PHP-8.0' into PHP-8.1
* PHP-8.0:
  Fix GH-8208: mb_encode_mimeheader: $indent functionality broken
2022-03-17 17:34:31 +01:00
Christoph M. Becker
d0417ebc93 Fix GH-8208: mb_encode_mimeheader: $indent functionality broken
We also need to factor in the indent, when getting the encoder result.

Closes GH-8213.
2022-03-17 17:31:58 +01:00
Alex Dowad
ff76694f28 Merge branch 'PHP-8.1'
* PHP-8.1:
  mb_check_encoding($str, '7bit') rejects strings with bytes over 0x7F
2022-02-22 23:58:57 +02:00
Alex Dowad
8a8533d263 mb_check_encoding($str, '7bit') rejects strings with bytes over 0x7F
This was the old behavior of mb_check_encoding() before 3e7acf901d,
but yours truly broke it. If only we had more thorough tests at that
time, this might not have slipped through the cracks.

Thanks to divinity76 for the report.
2022-02-22 23:56:56 +02:00
Christoph M. Becker
58cbee1ce3 Merge branch 'PHP-8.1'
* PHP-8.1:
  Fix GH-7902: mb_send_mail may delimit headers with LF only
2022-01-18 13:11:01 +01:00
Christoph M. Becker
69f6b09b2a Merge branch 'PHP-8.0' into PHP-8.1
* PHP-8.0:
  Fix GH-7902: mb_send_mail may delimit headers with LF only
2022-01-18 13:09:52 +01:00
Christoph M. Becker
03816fba46 Fix GH-7902: mb_send_mail may delimit headers with LF only
Email headers are supposed to be separated with CRLF. Period.

We introduce a `CRLF` macro for better comprehensibility right away.

Closes GH-7907.
2022-01-18 13:08:08 +01:00
Christoph M. Becker
51eec5086f Run mb_send_mail tests on Windows, too
We use the run-tests.php `{MAIL}` abstraction instead of `cat`.

Closes GH-7908.
2022-01-07 22:46:02 +01:00
Alex Dowad
53ffba967c Implement fast text conversion interface for CP5022{0,1,2} 2021-12-26 22:19:51 +02:00
Alex Dowad
c0936d48b0 Implement fast text conversion interface for UHC 2021-12-26 22:19:51 +02:00
Alex Dowad
40809cb19f Implement fast text conversion interface for HZ 2021-12-26 22:19:51 +02:00
Alex Dowad
3c73225125 New internal interface for fast text conversion in mbstring
When converting text to/from wchars, mbstring makes one function call
for each and every byte or wchar to be converted. Typically, each of
these conversion functions contains a state machine, and its state has
to be restored and then saved for every single one of these calls.
It doesn't take much to see that this is grossly inefficient.

Instead of converting one byte or wchar on each call, the new
conversion functions will either fill up or drain a whole buffer of
wchars on each call. In benchmarks, this is about 3-10× faster.

Adding the new, faster conversion functions for all supported legacy
text encodings still needs some work. Also, all the code which uses
the old-style conversion functions needs to be converted to use the
new ones. After that, the old code can be dropped. (The mailparse
extension will also have to be fixed up so it will still compile.)
2021-12-21 08:33:11 +02:00
Alex Dowad
edc6b756c1 Merge branch 'PHP-8.1'
* PHP-8.1:
  mb_convert_encoding will not auto-detect input string as UUEncode, Base64, QPrint
2021-12-20 22:47:18 +02:00
Alex Dowad
f07c193583 mb_convert_encoding will not auto-detect input string as UUEncode, Base64, QPrint
In a2bc57e0e5, mb_detect_encoding was modified to ensure it would never
return 'UUENCODE', 'QPrint', or other non-encodings as the "detected
text encoding". Before mb_detect_encoding was enhanced so that it could
detect any supported text encoding, those were never returned, and they
are not desired. Actually, we want to eventually remove them completely
from mbstring, since PHP already contains other implementations of
UUEncode, QPrint, Base64, and HTML entities.

For more clarity on why we need to suppress UUEncode, etc. from being
detected by mb_detect_encoding, the existing UUEncode implementation
in mbstring *never* treats any input as erroneous. It just accepts
everything. This means that it would *always* be treated as a valid
choice by mb_detect_encoding, and would be returned in many, many cases
where the input is obviously not UUEncoded.

It turns out that the form of mb_convert_encoding where the user passes
multiple candidate encodings (and mbstring auto-detects which one to
use) was also affected by the same issue. Apply the same fix.
2021-12-20 22:09:33 +02:00