mirror of https://github.com/php/php-src.git synced 2026-03-26 01:02:25 +01:00
Commit Graph

742 Commits

Author SHA1 Message Date
Christoph M. Becker
20c0eb47df Merge branch 'PHP-8.1'
* PHP-8.1:
  Fix GH-8208: mb_encode_mimeheader: $indent functionality broken
2022-03-17 17:35:06 +01:00
Christoph M. Becker
5003831260 Merge branch 'PHP-8.0' into PHP-8.1
* PHP-8.0:
  Fix GH-8208: mb_encode_mimeheader: $indent functionality broken
2022-03-17 17:34:31 +01:00
Christoph M. Becker
d0417ebc93 Fix GH-8208: mb_encode_mimeheader: $indent functionality broken
We also need to factor in the indent, when getting the encoder result.

Closes GH-8213.
2022-03-17 17:31:58 +01:00
Alex Dowad
ff76694f28 Merge branch 'PHP-8.1'
* PHP-8.1:
  mb_check_encoding($str, '7bit') rejects strings with bytes over 0x7F
2022-02-22 23:58:57 +02:00
Alex Dowad
8a8533d263 mb_check_encoding($str, '7bit') rejects strings with bytes over 0x7F
This was the old behavior of mb_check_encoding() before 3e7acf901d,
but yours truly broke it. If only we had more thorough tests at that
time, this might not have slipped through the cracks.

Thanks to divinity76 for the report.
2022-02-22 23:56:56 +02:00
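Sketch for 8a8533d263 (illustrative, not taken from the commit): assuming '7bit' simply means that every byte lies in the range 0x00-0x7F, the restored check can be modelled in a few lines of Python. The name `is_7bit` is hypothetical and is not part of mbstring.

```python
def is_7bit(data: bytes) -> bool:
    """Return True only if no byte in `data` is above 0x7F."""
    return all(b <= 0x7F for b in data)
```

Under this model, `is_7bit(b"hello")` holds, while any string containing a byte such as 0xC3 is rejected.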
Christoph M. Becker
58cbee1ce3 Merge branch 'PHP-8.1'
* PHP-8.1:
  Fix GH-7902: mb_send_mail may delimit headers with LF only
2022-01-18 13:11:01 +01:00
Christoph M. Becker
69f6b09b2a Merge branch 'PHP-8.0' into PHP-8.1
* PHP-8.0:
  Fix GH-7902: mb_send_mail may delimit headers with LF only
2022-01-18 13:09:52 +01:00
Christoph M. Becker
03816fba46 Fix GH-7902: mb_send_mail may delimit headers with LF only
Email headers are supposed to be separated with CRLF. Period.

We introduce a `CRLF` macro for better comprehensibility right away.

Closes GH-7907.
2022-01-18 13:08:08 +01:00
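Sketch for 03816fba46 (illustrative, not taken from the commit): the rule that every header line ends in CRLF can be modelled as below. Python is used for brevity; `join_headers` and `has_bare_lf` are hypothetical helper names, and only the `CRLF` constant mirrors the macro mentioned in the commit message.

```python
CRLF = "\r\n"


def join_headers(headers):
    """Serialize (name, value) pairs, ending every header line with CRLF."""
    return "".join(f"{name}: {value}{CRLF}" for name, value in headers)


def has_bare_lf(text):
    """Return True if `text` contains an LF that is not preceded by CR."""
    return any(
        ch == "\n" and (i == 0 or text[i - 1] != "\r")
        for i, ch in enumerate(text)
    )
```

A check like `has_bare_lf` is the kind of assertion a test could make on generated mail headers to catch LF-only delimiters.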
Christoph M. Becker
51eec5086f Run mb_send_mail tests on Windows, too
We use the run-tests.php `{MAIL}` abstraction instead of `cat`.

Closes GH-7908.
2022-01-07 22:46:02 +01:00
Alex Dowad
53ffba967c Implement fast text conversion interface for CP5022{0,1,2} 2021-12-26 22:19:51 +02:00
Alex Dowad
c0936d48b0 Implement fast text conversion interface for UHC 2021-12-26 22:19:51 +02:00
Alex Dowad
40809cb19f Implement fast text conversion interface for HZ 2021-12-26 22:19:51 +02:00
Alex Dowad
3c73225125 New internal interface for fast text conversion in mbstring
When converting text to/from wchars, mbstring makes one function call
for each and every byte or wchar to be converted. Typically, each of
these conversion functions contains a state machine, and its state has
to be restored and then saved for every single one of these calls.
It doesn't take much to see that this is grossly inefficient.

Instead of converting one byte or wchar on each call, the new
conversion functions will either fill up or drain a whole buffer of
wchars on each call. In benchmarks, this is about 3-10× faster.

Adding the new, faster conversion functions for all supported legacy
text encodings still needs some work. Also, all the code which uses
the old-style conversion functions needs to be converted to use the
new ones. After that, the old code can be dropped. (The mailparse
extension will also have to be fixed up so it will still compile.)
2021-12-21 08:33:11 +02:00
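Sketch for 3c73225125 (illustrative, not taken from the commit): the contrast between one call per byte and one call per buffer can be shown with a toy encoding of 2-byte big-endian code units. The real mbstring function signatures do not appear in this log, so `PerByteDecoder` and `decode_buffer` are hypothetical names and shapes, written in Python for brevity.

```python
class PerByteDecoder:
    """Old style (sketch): one call per input byte, state kept between calls."""

    def __init__(self, emit):
        self.emit = emit      # callback invoked once per decoded code unit
        self.pending = None   # first byte of a pair, if one is waiting

    def feed(self, byte):
        if self.pending is None:
            self.pending = byte
        else:
            self.emit((self.pending << 8) | byte)
            self.pending = None


def decode_buffer(data, out, limit):
    """New style (sketch): decode up to `limit` code units in a single call.

    Appends decoded values to `out` and returns the number of input bytes
    consumed, so the caller can resume from that offset.
    """
    pos = 0
    while pos + 1 < len(data) and limit > 0:
        out.append((data[pos] << 8) | data[pos + 1])
        pos += 2
        limit -= 1
    return pos
```

The per-byte version pays for a call and a state check on every byte; the buffer version keeps its position in a local variable for the whole loop, which is the source of the saving the commit message describes.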
Alex Dowad
edc6b756c1 Merge branch 'PHP-8.1'
* PHP-8.1:
  mb_convert_encoding will not auto-detect input string as UUEncode, Base64, QPrint
2021-12-20 22:47:18 +02:00
Alex Dowad
f07c193583 mb_convert_encoding will not auto-detect input string as UUEncode, Base64, QPrint
In a2bc57e0e5, mb_detect_encoding was modified to ensure it would never
return 'UUENCODE', 'QPrint', or other non-encodings as the "detected
text encoding". Before mb_detect_encoding was enhanced so that it could
detect any supported text encoding, those were never returned, and they
are not desired. Actually, we want to eventually remove them completely
from mbstring, since PHP already contains other implementations of
UUEncode, QPrint, Base64, and HTML entities.

For more clarity on why we need to suppress UUEncode, etc. from being
detected by mb_detect_encoding, the existing UUEncode implementation
in mbstring *never* treats any input as erroneous. It just accepts
everything. This means that it would *always* be treated as a valid
choice by mb_detect_encoding, and would be returned in many, many cases
where the input is obviously not UUEncoded.

It turns out that the form of mb_convert_encoding where the user passes
multiple candidate encodings (and mbstring auto-detects which one to
use) was also affected by the same issue. Apply the same fix.
2021-12-20 22:09:33 +02:00
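Sketch for f07c193583 (illustrative, not taken from the commit): a deliberately simplified model of why a decoder that never reports an error is always "detected". Here detection just returns the first candidate whose validator accepts the input; the real heuristics are more involved. All function names are hypothetical, and Python is used for brevity.

```python
def is_valid_utf8(data):
    """Return True if `data` decodes as UTF-8 without error."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def accepts_everything(data):
    """Stand-in for a decoder that never treats any input as erroneous."""
    return True


def detect(data, candidates):
    """Return the name of the first candidate whose validator accepts `data`."""
    for name, validator in candidates:
        if validator(data):
            return name
    return None


def detect_text_only(data, candidates, non_encodings):
    """Same as detect(), but skip names listed in `non_encodings`."""
    kept = [(n, v) for n, v in candidates if n not in non_encodings]
    return detect(data, kept)
```

With `[("UUENCODE", accepts_everything), ("UTF-8", is_valid_utf8)]` as candidates, `detect` returns "UUENCODE" for any input, while `detect_text_only` with `{"UUENCODE"}` excluded falls through to the real text encodings.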
Christoph M. Becker
97f78b3bb7 Merge branch 'PHP-8.1'
* PHP-8.1:
  Fix #81693: mb_check_encoding(7bit) segfaults
2021-12-03 22:50:27 +01:00
Christoph M. Becker
929d847152 Fix #81693: mb_check_encoding(7bit) segfaults
`php_mb_check_encoding()` now uses conversion to `mbfl_encoding_wchar`.
Since `mbfl_encoding_7bit` has no `input_filter`, no filter can be
found.  Since we don't actually need to convert to wchar, we encode to
8bit.

Closes GH-7712.
2021-12-03 22:49:47 +01:00
Alex Dowad
ee3caef8eb Merge branch 'PHP-8.1'
* PHP-8.1:
  Add unit tests for mb_detect_encoding on Polish text
2021-11-26 17:43:40 +02:00
Alex Dowad
1a2c608053 Add unit tests for mb_detect_encoding on Polish text 2021-11-26 17:42:53 +02:00
Alex Dowad
9308974f8c Deprecate use of mbstring to convert text to Base64/QPrint/HTML entities/etc
The purpose of mbstring is for working with Unicode and legacy text
encodings; but Base64, QPrint, etc. are not text encodings and don't
really belong in mbstring. PHP already contains separate implementations
of Base64, QPrint, and HTML entities. It will be better to eventually
remove these non-encodings from mbstring.

Regarding HTML entities... there is a bit more to say. mbstring's
implementation of HTML entities is different from the other built-in
implementation (htmlspecialchars and htmlentities). Those functions
convert <, >, and & to HTML entities, but mbstring does not.

It appears that the original author of mbstring intended for something
to be done with <, >, and &. He used a table to identify which
characters should be converted to HTML entities, and <, >, and & all have a
special value in that table. However, nothing ever checks for that
special value, so the characters are passed through unconverted.

This seems like a very useless implementation of HTML entities. The most
important characters which need to be expressed as entities in HTML
documents are those three!
2021-11-01 11:23:21 +02:00
Alex Dowad
9962aa9774 Merge branch 'PHP-8.1'
* PHP-8.1:
  mb_detect_encoding will not return non-encodings
  Improve detection accuracy of mb_detect_encoding
2021-10-19 18:11:35 +02:00
Alex Dowad
a2bc57e0e5 mb_detect_encoding will not return non-encodings
Among the text encodings supported by mbstring are several which are
not really 'text encodings'. These include Base64, QPrint, UUencode,
HTML entities, '7 bit', and '8 bit'.

Rather than providing an explicit list of text encodings which they are
interested in, users may pass the output of mb_list_encodings to
mb_detect_encoding. Since Base64, QPrint, and so on are included in
the output of mb_list_encodings, mb_detect_encoding can return one of
these as its 'detected encoding' (and in fact, this often happens).
Before mb_detect_encoding was enhanced so it could detect any of the
supported text encodings, this did not happen, and it is never desired.
2021-10-19 18:05:52 +02:00
Alex Dowad
28b346bc06 Improve detection accuracy of mb_detect_encoding
Originally, `mb_detect_encoding` essentially just checked all candidate
encodings to see which ones the input string was valid in. However, it
was only able to do this for a limited few of all the text encodings
which are officially supported by mbstring.

In 3e7acf901d, I modified it so it could 'detect' any text encoding
supported by mbstring. While this is arguably an improvement, if the
only text encodings one is interested in are those which
`mb_detect_encoding` could originally handle, the old
`mb_detect_encoding` may have been preferable. Because the new one has
more possible encodings which it can guess, it also has more chances to
get the answer wrong.

This commit adjusts the detection heuristics to provide accurate
detection in a wider variety of scenarios. While the previous detection
code would frequently confuse UTF-32BE with UTF-32LE or UTF-16BE with
UTF-16LE, the adjusted code is extremely accurate in those cases.
Detection for Chinese text in Chinese encodings like GB18030 or BIG5
and for Japanese text in Japanese encodings like EUC-JP or SJIS is
greatly improved. Detection of UTF-7 is also greatly improved. An 8KB
table, with one bit for each codepoint from U+0000 up to U+FFFF, is
used to achieve this.

One significant constraint is that the heuristics are completely based
on looking at each codepoint in a string in isolation, treating some
codepoints as 'likely' and others as 'unlikely'. It might still be
possible to achieve great gains in detection accuracy by looking at
sequences of codepoints rather than individual codepoints. However,
this might require huge tables. Further, we might need a huge corpus
of text in various languages to derive those tables.

Accuracy is still dismal when trying to distinguish single-byte
encodings like ISO-8859-1, ISO-8859-2, KOI8-R, and so on. This is
because the valid bytes in these encodings are basically all the same,
and all valid bytes decode to 'likely' codepoints, so our method of
detection (which is based on rating codepoints as likely or unlikely)
cannot tell any difference between the candidates at all. It just
selects the first encoding in the provided list of candidates.

Speaking of which, if one wants to get good results from
`mb_detect_encoding`, it is important to order the list of candidate
encodings according to your prior belief of which are more likely to
be correct. When the function cannot tell any difference between two
candidates, it returns whichever appeared earlier in the array.
2021-10-19 18:05:51 +02:00
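Sketch for 28b346bc06 (illustrative, not taken from the commit): the 8KB table holds one bit for each codepoint from U+0000 to U+FFFF (65536 bits / 8 = 8192 bytes). The model below rates each decoded codepoint as likely or unlikely and lets ties go to the earlier candidate. The choice of "likely" ranges and the scoring rule are assumptions made for the example; the commit's actual table contents and weights are not shown in this log. Python's own codecs stand in for mbstring's decoders.

```python
TABLE_BYTES = 0x10000 // 8  # 8192 bytes: one bit per codepoint U+0000..U+FFFF


def build_table(likely_ranges):
    """Set one bit for every codepoint inside the given inclusive ranges."""
    table = bytearray(TABLE_BYTES)
    for lo, hi in likely_ranges:
        for cp in range(lo, hi + 1):
            table[cp >> 3] |= 1 << (cp & 7)
    return table


def is_likely(table, cp):
    """Look up one codepoint; anything above U+FFFF counts as unlikely here."""
    return cp <= 0xFFFF and bool(table[cp >> 3] & (1 << (cp & 7)))


def detect(data, candidates, table):
    """Pick the candidate with the fewest unlikely codepoints.

    Candidates in which `data` is invalid are skipped. On a tie, the
    candidate that appears earlier in the list wins.
    """
    best, best_score = None, None
    for name in candidates:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue
        score = sum(1 for ch in text if not is_likely(table, ord(ch)))
        if best_score is None or score < best_score:  # strict '<' keeps the earlier one
            best, best_score = name, score
    return best


# Hypothetical choice of "likely" codepoints, for illustration only.
LIKELY = build_table([(0x0A, 0x0A), (0x20, 0x7E), (0xA0, 0x24F)])
```

In this model, "hello" encoded as UTF-16LE is also valid UTF-16BE, but decodes there to five codepoints outside the likely set, so UTF-16LE wins. For single-byte candidates such as ISO-8859-1 and ISO-8859-2, a string like `b"caf\xe9"` scores identically in both, and the result depends only on the order of the candidate list.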
Alex Dowad
dcaa010fff Strict validation of conversion flags to mb_convert_kana
mb_convert_kana is controlled by user-provided flags, which specify what it should convert
and to what. These flags come in inverse pairs, for example "fullwidth numerals to halfwidth
numerals" and "halfwidth numerals to fullwidth numerals". It does not make sense to combine
inverse flags.

But, clever reader of commit logs, you will surely say: What if I want all my halfwidth
numerals to become fullwidth, and all my fullwidth numerals to become halfwidth? Much too
clever, you are! Let's put aside the fact that this bizarre switch-up is ridiculous and
will never be used, and face up to another stark reality: mb_convert_kana does not work
for that case, and never has. This was probably never noticed because nobody ever tried.

Disallowing useless combinations of flags gives freedom to rearrange the kana conversion
code without changing behavior.

We can also reject unrecognized flags. This may help users to catch bugs.

Interestingly, the existing tests used a 'Z' flag, which is useless (it's not recognized
at all).
2021-10-01 19:27:39 +02:00
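Sketch for dcaa010fff (illustrative, not taken from the commit): strict validation amounts to two checks, one for unknown letters and one for inverse pairs. The pair table below is an assumption based on the option letters documented for mb_convert_kana; the exact set of flags and combinations that the commit accepts or rejects is not shown in this log. `validate_kana_flags` is a hypothetical name, written in Python for brevity.

```python
# Assumed inverse pairs, based on the documented option letters of
# mb_convert_kana. The commit's real tables may differ.
INVERSE_PAIRS = [
    ("r", "R"), ("n", "N"), ("a", "A"), ("s", "S"),
    ("k", "K"), ("h", "H"), ("c", "C"),
]
KNOWN_FLAGS = {flag for pair in INVERSE_PAIRS for flag in pair} | {"V"}


def validate_kana_flags(flags):
    """Raise ValueError for unrecognized flags or for inverse flags used together."""
    unknown = sorted(set(flags) - KNOWN_FLAGS)
    if unknown:
        raise ValueError(f"unrecognized flag(s): {''.join(unknown)}")
    for lower, upper in INVERSE_PAIRS:
        if lower in flags and upper in flags:
            raise ValueError(f"flags '{lower}' and '{upper}' cannot be combined")
```

Under these assumptions, "KV" passes, "nN" is rejected as an inverse pair, and "Z" is rejected as unrecognized.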
Alex Dowad
1170981b33 Fix mb_str_split on empty strings in variable-length text encodings
Previously, when passed an empty string, and given an encoding which
uses a variable number of bytes per character (and which doesn't have
a 'character length table'), mb_str_split would return an array
containing a single empty string, rather than an empty array.

The ISO-2022 encodings are among those which were affected by this bug.
2021-09-20 11:27:54 +02:00
Alex Dowad
57eafd44c6 Add more tests for mb_decode_numericentity 2021-09-20 11:27:54 +02:00
Alex Dowad
be11d95170 Add more tests for mb_encode_numericentity 2021-09-20 11:27:54 +02:00
Alex Dowad
1c905434b9 Add more tests for mb_substr 2021-09-20 11:27:54 +02:00
Alex Dowad
f663344f33 Merge branch 'PHP-8.1'
* PHP-8.1:
  Bug #81390: mb_detect_encoding should not prematurely stop processing input
  mb_detect_encoding with only one candidate encoding uses mb_check_encoding
  Optimize text encoding detection for speed (eliminate Unicode property lookups)
2021-09-20 11:27:07 +02:00
Alex Dowad
c25a1ef8d0 Bug #81390: mb_detect_encoding should not prematurely stop processing input
As a performance optimization, mb_detect_encoding tries to stop
processing the input string early when there is only one 'candidate'
encoding which the input string is valid in. However, the code which
keeps count of how many candidate encodings have already been rejected
was buggy. This caused mb_detect_encoding to prematurely stop
processing the input when it should have continued.

As a result, it did not notice that in the test case provided by Alec,
the input string was not valid in UTF-16.
2021-09-20 11:21:39 +02:00
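Sketch for c25a1ef8d0 (illustrative, not taken from the commit): the early-exit optimization is only safe if the number of surviving candidates is tracked correctly. The model below keeps the survivors in a list, so the count cannot drift, and stops once at most one remains. It does not reproduce the original bug. Python's incremental codecs stand in for mbstring's filters, and `detect_incremental` is a hypothetical name.

```python
import codecs


def detect_incremental(data, candidates, chunk_size=4):
    """Feed `data` chunk by chunk to every candidate decoder.

    A candidate is dropped the first time its decoder reports an error.
    Processing stops early once at most one candidate is still alive.
    """
    decoders = {name: codecs.getincrementaldecoder(name)() for name in candidates}
    alive = list(candidates)  # order preserved; len(alive) is the live count
    for start in range(0, len(data), chunk_size):
        if len(alive) <= 1:
            break  # nothing left to distinguish
        chunk = data[start:start + chunk_size]
        final = start + chunk_size >= len(data)
        for name in list(alive):
            try:
                decoders[name].decode(chunk, final)
            except UnicodeDecodeError:
                alive.remove(name)  # each candidate is rejected at most once
    return alive[0] if alive else None
```

If the live count were overstated or understated, the loop could exit while two candidates still looked valid, which is the kind of premature stop the commit message describes.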
Alex Dowad
86a0d4b22d Add more tests for mb_convert_kana 2021-09-06 13:16:23 +02:00
Alex Dowad
d7eb442993 Add more tests for ISO-2022-JP-2004 text conversion 2021-09-06 13:16:23 +02:00
Alex Dowad
907d0c3248 Add more tests for UTF7-IMAP text conversion 2021-09-06 13:16:23 +02:00
Alex Dowad
bf940a13ff Add another test for SJIS-Mobile text conversion 2021-09-06 13:16:23 +02:00
Alex Dowad
32df61c558 Add more tests for UTF-7 text conversion 2021-09-06 13:16:23 +02:00
Alex Dowad
ae71bfdee7 Add more tests for UCS-4 text conversion 2021-09-06 13:16:23 +02:00
Alex Dowad
fd0e0c7390 Add another test for UCS-2 text conversion 2021-09-06 13:16:23 +02:00
Alex Dowad
edf2bd95d9 Add more tests for ISO-2022-JP and JIS7/8 text conversion 2021-09-06 13:16:23 +02:00
Alex Dowad
6a2dca3420 Add more tests for ISO-2022-JP-KDDI text conversion 2021-09-06 13:16:23 +02:00
Alex Dowad
d2f5a8b328 Add more tests for SJIS-mac text conversion 2021-09-06 13:16:23 +02:00
Alex Dowad
0957f54eb1 Treat truncated escape sequences for CP5022{0,1,2} as error 2021-09-06 13:16:23 +02:00
Alex Dowad
df32267494 Add more tests for UTF7-IMAP text conversion 2021-08-31 13:41:34 +02:00
Alex Dowad
16a1e0a219 In UTF7-IMAP, reject the 2nd part of surrogate pair if it appears unexpectedly 2021-08-31 13:41:34 +02:00
Alex Dowad
355464935d Add another test for UTF-7 text conversion 2021-08-31 13:41:34 +02:00
Alex Dowad
51b6c687db Add another test for GB18030 text conversion 2021-08-31 13:41:34 +02:00
Alex Dowad
a0415b22ab Add more tests for CP5022{0,1,2} text conversion 2021-08-31 13:41:34 +02:00
Alex Dowad
e3f6a9fbfe CP5022{0,1,2} supports 'IBM extension' codes from ku 115-119
mbstring has always had the conversion tables to support CP932 codes
in ku 115-119, and the conversion code for CP5022x has an 'if' clause
specifically to handle such characters... but that 'if' clause was dead
code, since a guard clause earlier in the same function prevented it
from accepting 2-byte characters with a starting byte of 0x93-0x97.

Adjust the guard clause so that these characters can be converted as
the original author apparently intended.

The code which handles ku 115-119 is the part which reads:

    } else if (s >= cp932ext3_ucs_table_min && s < cp932ext3_ucs_table_max) {
      w = cp932ext3_ucs_table[s - cp932ext3_ucs_table_min];
2021-08-31 13:41:34 +02:00
Alex Dowad
671dcee01e Add test for mb_str_split on UCS-2 text 2021-08-31 13:41:34 +02:00
Alex Dowad
776296e12f mbstring no longer provides 'long' substitutions for erroneous input bytes
Previously, mbstring had a special mode whereby it would convert
erroneous input byte sequences to output like "BAD+XXXX", where "XXXX"
would be the erroneous bytes expressed in hexadecimal. This mode could
be enabled by calling `mb_substitute_character("long")`.

However, accurately reproducing input byte sequences from the cached
state of a conversion filter is often tricky, and this significantly
complicates the implementation. Further, the means used for passing
the erroneous bytes through to where the "BAD+XXXX" text is generated
only allows for up to 3 bytes to be passed, meaning that some erroneous
byte sequences are truncated anyway.

More to the point, a search of publicly available PHP code indicates
that nobody is really using this feature anyway.

Incidentally, this feature also provided error output like "JIS+XXXX"
if the input 'should have' represented a JISX 0208 codepoint, but it
decodes to a codepoint which does not exist in the JISX 0208 charset.
Similarly, specific error output was provided for non-existent
JISX 0212 codepoints, and likewise for JISX 0213, CP932, and a few
other charsets. All of that is now consigned to the flames.

However, "long" error markers also include a somewhat more useful
"U+XXXX" marker for Unicode codepoints which were successfully
decoded from the input text, but cannot be represented in the output
encoding. Those are still supported.

With this change, there is no need to use a variety of special values
in the high bits of a wchar to represent different types of error
values. We can (and will) just use a single error value. This will be
equal to -1.

One complicating factor: Text conversion functions return an integer to
indicate whether the conversion operation should be immediately
aborted, and the magic 'abort' marker is -1. Also, almost all of these
functions would return the received byte/codepoint to indicate success.
That doesn't work with the new error value; if an input filter detects
an error and passes -1 to the output filter, and the output filter
returns it back, that would be taken to mean 'abort'.

Therefore, amend all these functions to return 0 for success.
2021-08-31 13:41:34 +02:00
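Sketch for 776296e12f (illustrative, not taken from the commit): the clash between the new error value and the 'abort' return value can be modelled with two output filters. The values -1 and 0 come from the commit message; the function names and the list-based sink are hypothetical, written in Python for brevity.

```python
ERROR_MARKER = -1  # wchar value meaning "erroneous input"
ABORT = -1         # return value meaning "stop converting now"


def output_filter_old(value, sink):
    """Old convention: echo the received value back to signal success."""
    sink.append(value)
    return value


def output_filter_new(value, sink):
    """New convention: 0 always means success."""
    sink.append(value)
    return 0


def convert(values, output_filter):
    """Pass each value to the output filter, stopping if it returns ABORT."""
    sink = []
    for value in values:
        if output_filter(value, sink) == ABORT:
            break
    return sink
```

With the input `[0x41, ERROR_MARKER, 0x42]`, the old convention echoes -1 back for the error marker and the caller stops after two values; the new convention returns 0 and all three values get through.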
Alex Dowad
15ba73cee3 Add more tests for UTF-8 text conversion 2021-08-30 16:29:58 +02:00