mirror of https://github.com/php/php-src.git synced 2026-04-21 15:08:16 +02:00

481 Commits

Author SHA1 Message Date
Christoph M. Becker 20c0eb47df Merge branch 'PHP-8.1'
* PHP-8.1:
  Fix GH-8208: mb_encode_mimeheader: $indent functionality broken
2022-03-17 17:35:06 +01:00
Christoph M. Becker 5003831260 Merge branch 'PHP-8.0' into PHP-8.1
* PHP-8.0:
  Fix GH-8208: mb_encode_mimeheader: $indent functionality broken
2022-03-17 17:34:31 +01:00
Christoph M. Becker d0417ebc93 Fix GH-8208: mb_encode_mimeheader: $indent functionality broken
We also need to factor in the indent when getting the encoder result.

Closes GH-8213.
2022-03-17 17:31:58 +01:00
Alex Dowad ff76694f28 Merge branch 'PHP-8.1'
* PHP-8.1:
  mb_check_encoding($str, '7bit') rejects strings with bytes over 0x7F
2022-02-22 23:58:57 +02:00
Alex Dowad 8a8533d263 mb_check_encoding($str, '7bit') rejects strings with bytes over 0x7F
This was the old behavior of mb_check_encoding() before 3e7acf901d,
but yours truly broke it. If only we had more thorough tests at that
time, this might not have slipped through the cracks.

Thanks to divinity76 for the report.
2022-02-22 23:56:56 +02:00
Dmitry Stogov e1782c08bf Fix ASAN undefined behavior (unsigned char << 24)
ext/mbstring/libmbfl/filters/mbfilter_utf32.c:259:20: runtime error: left shift of 128 by 24 places cannot be represented in type 'int'
2022-01-11 09:13:22 +03:00
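
The warning quoted in e1782c08bf comes from C's integer promotions: an `unsigned char` operand is promoted to (signed) `int` before the shift, and 128 << 24 does not fit in a 32-bit `int`. A minimal sketch of the usual remedy follows. The helper name `read_u32_be` is hypothetical and the code is not taken from the actual patch.

```c
#include <stdint.h>

/* Hypothetical helper (not from mbfilter_utf32.c): assemble a big-endian
 * 32-bit value from four bytes without undefined behavior. */
uint32_t read_u32_be(const unsigned char *p)
{
    /* Each byte is cast to uint32_t BEFORE shifting. Without the cast,
     * p[0] would be promoted to int, and shifting 128 left by 24 places
     * overflows a 32-bit signed int. */
    return ((uint32_t)p[0] << 24) |
           ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |
            (uint32_t)p[3];
}
```

With the cast in place, the shift is performed on an unsigned 32-bit type, where setting the top bit is well defined.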
Alex Dowad 53ffba967c Implement fast text conversion interface for CP5022{0,1,2} 2021-12-26 22:19:51 +02:00
Alex Dowad 01afd9f141 Implement fast text conversion interface for JIS 2021-12-26 22:19:51 +02:00
Alex Dowad cb4626c5b2 Implement fast text conversion interface for GB18030 2021-12-26 22:19:51 +02:00
Alex Dowad 3e8088dc80 Implement fast text conversion interface for EUC-JP-MS 2021-12-26 22:19:51 +02:00
Alex Dowad e5af94b74f Implement fast text conversion interface for CP51932 2021-12-26 22:19:51 +02:00
Alex Dowad 6ef1b35223 Implement fast text conversion interface for EUC-CN 2021-12-26 22:19:51 +02:00
Alex Dowad 9bd08a97d9 Implement fast text conversion interface for EUC-TW 2021-12-26 22:19:51 +02:00
Alex Dowad 661a10160b Implement fast text conversion interface for CP936 2021-12-26 22:19:51 +02:00
Alex Dowad 20555371d5 Implement fast text conversion interface for CP932 2021-12-26 22:19:51 +02:00
Alex Dowad 43bb97c539 Implement fast text conversion interface for EUC-KR 2021-12-26 22:19:51 +02:00
Alex Dowad c0936d48b0 Implement fast text conversion interface for UHC 2021-12-26 22:19:51 +02:00
Alex Dowad 40809cb19f Implement fast text conversion interface for HZ 2021-12-26 22:19:51 +02:00
Alex Dowad da58d42d94 Implement fast text conversion interface for CP950 2021-12-26 22:19:51 +02:00
Alex Dowad eac50a360f Implement fast text conversion interface for Big5 2021-12-26 22:19:51 +02:00
Alex Dowad 3c73225125 New internal interface for fast text conversion in mbstring
When converting text to/from wchars, mbstring makes one function call
for each and every byte or wchar to be converted. Typically, each of
these conversion functions contains a state machine, and its state has
to be restored and then saved for every single one of these calls.
It doesn't take much to see that this is grossly inefficient.

Instead of converting one byte or wchar on each call, the new
conversion functions will either fill up or drain a whole buffer of
wchars on each call. In benchmarks, this is about 3-10× faster.

Adding the new, faster conversion functions for all supported legacy
text encodings still needs some work. Also, all the code which uses
the old-style conversion functions needs to be converted to use the
new ones. After that, the old code can be dropped. (The mailparse
extension will also have to be fixed up so it will still compile.)
2021-12-21 08:33:11 +02:00
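
To illustrate the idea described in 3c73225125, here is a rough sketch of a decoder that fills a whole buffer of wchars per call instead of handling one byte per call. It is not mbstring's actual interface: the function name `utf16be_to_wchars`, its signature, and the `BAD_CODEPOINT` marker are all assumptions made for this example, and UTF-16BE was chosen only because it is short to write.

```c
#include <stddef.h>
#include <stdint.h>

#define BAD_CODEPOINT 0xFFFFFFFFu /* hypothetical error marker */

/* Read one big-endian 16-bit code unit. */
static uint32_t read_unit(const unsigned char *p)
{
    return ((uint32_t)p[0] << 8) | p[1];
}

/* Hypothetical batch decoder: converts UTF-16BE bytes into 32-bit
 * codepoints, writing at most buf_size of them. *in and *in_len are
 * advanced past the consumed bytes, so the caller can keep calling
 * until *in_len reaches 0. Returns the number of codepoints written. */
size_t utf16be_to_wchars(const unsigned char **in, size_t *in_len,
                         uint32_t *buf, size_t buf_size)
{
    const unsigned char *p = *in;
    const unsigned char *end = p + *in_len;
    uint32_t *out = buf;
    uint32_t *limit = buf + buf_size;

    /* The loop state lives in local variables for the whole batch,
     * rather than being saved and restored for every byte. */
    while (out < limit && end - p >= 2) {
        uint32_t unit = read_unit(p);
        p += 2;
        if (unit >= 0xD800 && unit <= 0xDBFF) {
            /* High surrogate: a low surrogate must follow. */
            uint32_t low = (end - p >= 2) ? read_unit(p) : 0;
            if (low >= 0xDC00 && low <= 0xDFFF) {
                p += 2;
                *out++ = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
            } else {
                *out++ = BAD_CODEPOINT;
            }
        } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
            *out++ = BAD_CODEPOINT; /* stray low surrogate */
        } else {
            *out++ = unit;
        }
    }
    /* A single trailing byte cannot form a code unit. */
    if (out < limit && end - p == 1) {
        p++;
        *out++ = BAD_CODEPOINT;
    }
    *in = p;
    *in_len = (size_t)(end - p);
    return (size_t)(out - buf);
}
```

A caller would loop, passing the same `in`/`in_len` pair and a fixed-size buffer, and process each batch of codepoints before asking for the next one.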
Christoph M. Becker 97f78b3bb7 Merge branch 'PHP-8.1'
* PHP-8.1:
  Fix #81693: mb_check_encoding(7bit) segfaults
2021-12-03 22:50:27 +01:00
Christoph M. Becker 929d847152 Fix #81693: mb_check_encoding(7bit) segfaults
`php_mb_check_encoding()` now uses conversion to `mbfl_encoding_wchar`.
Since `mbfl_encoding_7bit` has no `input_filter`, no filter can be
found.  Since we don't actually need to convert to wchar, we encode to
8bit.

Closes GH-7712.
2021-12-03 22:49:47 +01:00
Alex Dowad 9962aa9774 Merge branch 'PHP-8.1'
* PHP-8.1:
  mb_detect_encoding will not return non-encodings
  Improve detection accuracy of mb_detect_encoding
2021-10-19 18:11:35 +02:00
Alex Dowad 28b346bc06 Improve detection accuracy of mb_detect_encoding
Originally, `mb_detect_encoding` essentially just checked all candidate
encodings to see which ones the input string was valid in. However, it
was only able to do this for a limited few of all the text encodings
which are officially supported by mbstring.

In 3e7acf901d, I modified it so it could 'detect' any text encoding
supported by mbstring. While this is arguably an improvement, if the
only text encodings one is interested in are those which
`mb_detect_encoding` could originally handle, the old
`mb_detect_encoding` may have been preferable. Because the new one has
more possible encodings which it can guess, it also has more chances to
get the answer wrong.

This commit adjusts the detection heuristics to provide accurate
detection in a wider variety of scenarios. While the previous detection
code would frequently confuse UTF-32BE with UTF-32LE or UTF-16BE with
UTF-16LE, the adjusted code is extremely accurate in those cases.
Detection for Chinese text in Chinese encodings like GB18030 or BIG5
and for Japanese text in Japanese encodings like EUC-JP or SJIS is
greatly improved. Detection of UTF-7 is also greatly improved. An 8KB
table, with one bit for each codepoint from U+0000 up to U+FFFF, is
used to achieve this.

One significant constraint is that the heuristics are completely based
on looking at each codepoint in a string in isolation, treating some
codepoints as 'likely' and others as 'unlikely'. It might still be
possible to achieve great gains in detection accuracy by looking at
sequences of codepoints rather than individual codepoints. However,
this might require huge tables. Further, we might need a huge corpus
of text in various languages to derive those tables.

Accuracy is still dismal when trying to distinguish single-byte
encodings like ISO-8859-1, ISO-8859-2, KOI8-R, and so on. This is
because the valid bytes in these encodings are basically all the same,
and all valid bytes decode to 'likely' codepoints, so our method of
detection (which is based on rating codepoints as likely or unlikely)
cannot tell any difference between the candidates at all. It just
selects the first encoding in the provided list of candidates.

Speaking of which, if one wants to get good results from
`mb_detect_encoding`, it is important to order the list of candidate
encodings according to your prior belief of which are more likely to
be correct. When the function cannot tell any difference between two
candidates, it returns whichever appeared earlier in the array.
2021-10-19 18:05:51 +02:00
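
The 8KB table mentioned in 28b346bc06 works out as 65536 codepoints × 1 bit = 8192 bytes. A minimal sketch of such a bitmap and a per-codepoint score is shown below. All names are hypothetical, the meaning of a set bit ("unlikely") is an assumption, and so is the choice to treat codepoints above U+FFFF as not flagged; the real table contents and scoring rules are not reproduced here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One bit per codepoint from U+0000 to U+FFFF: 0x10000 / 8 = 8192 bytes.
 * Here it starts out empty; a real table would be precomputed. */
static uint8_t unlikely_bitmap[0x10000 / 8];

/* Flag a codepoint as 'unlikely' (assumed meaning of a set bit). */
void mark_unlikely(uint32_t cp)
{
    if (cp <= 0xFFFF) {
        unlikely_bitmap[cp >> 3] |= (uint8_t)(1u << (cp & 7));
    }
}

/* Look up one codepoint. Codepoints beyond the table are treated as
 * not flagged, which is an assumption of this sketch. */
bool is_unlikely(uint32_t cp)
{
    return cp <= 0xFFFF && ((unlikely_bitmap[cp >> 3] >> (cp & 7)) & 1u);
}

/* Score a decoded string by counting flagged codepoints, looking at each
 * codepoint in isolation. A lower count suggests a better candidate. */
size_t count_unlikely(const uint32_t *cps, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (is_unlikely(cps[i])) {
            count++;
        }
    }
    return count;
}
```

Under this scheme, each candidate encoding's decoded output is scored independently. When two candidates score the same, picking the earlier one reproduces the tie-breaking behavior described in the commit message.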
Alex Dowad dcaa010fff Strict validation of conversion flags to mb_convert_kana
mb_convert_kana is controlled by user-provided flags, which specify what it should convert
and to what. These flags come in inverse pairs, for example "fullwidth numerals to halfwidth
numerals" and "halfwidth numerals to fullwidth numerals". It does not make sense to combine
inverse flags.

But, clever reader of commit logs, you will surely say: What if I want all my halfwidth
numerals to become fullwidth, and all my fullwidth numerals to become halfwidth? Much too
clever, you are! Let's put aside the fact that this bizarre switch-up is ridiculous and
will never be used, and face up to another stark reality: mb_convert_kana does not work
for that case, and never has. This was probably never noticed because nobody ever tried.

Disallowing useless combinations of flags gives freedom to rearrange the kana conversion
code without changing behavior.

We can also reject unrecognized flags. This may help users to catch bugs.

Interestingly, the existing tests used a 'Z' flag, which is useless (it's not recognized
at all).
2021-10-01 19:27:39 +02:00
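
The validation described in dcaa010fff can be pictured roughly as follows. This is a simplified sketch, not the code from the commit: the function name `kana_flags_valid` is hypothetical, the set of flag letters is an assumption based on the PHP manual and on the 'M'/'m' modes mentioned elsewhere in this log, and the only rule enforced is "no unrecognized letters, and never both letters of an upper/lower pair". The real checks may be stricter.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical validator for a mode string such as "KV".
 * Assumed flag set: pairs r/R, n/N, a/A, s/S, k/K, h/H, c/C, m/M,
 * plus 'V', which has no inverse. */
bool kana_flags_valid(const char *flags)
{
    static const char paired[] = "rnaskhcm"; /* lowercase half of each pair */
    bool seen_lower[26] = {false};
    bool seen_upper[26] = {false};

    for (const char *f = flags; *f != '\0'; f++) {
        unsigned char c = (unsigned char)*f;
        if (c == 'V') {
            continue; /* standalone flag */
        }
        int lower = tolower(c);
        if (strchr(paired, lower) == NULL) {
            return false; /* unrecognized flag, e.g. 'Z' */
        }
        int idx = lower - 'a';
        if (c != (unsigned char)lower) {
            /* Uppercase flag: its lowercase inverse must not be present. */
            if (seen_lower[idx]) {
                return false;
            }
            seen_upper[idx] = true;
        } else {
            /* Lowercase flag: its uppercase inverse must not be present. */
            if (seen_upper[idx]) {
                return false;
            }
            seen_lower[idx] = true;
        }
    }
    return true;
}
```

Under these assumptions, `"KV"` is accepted, `"nN"` is rejected as an inverse pair, and `"Z"` is rejected as unrecognized.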
Alex Dowad 0b32a15eb0 Optimize mb_str{,im}width for performance
Rather than doing a linear search of a table of fullwidth codepoint
ranges for every input character,

1) Short-cut the search if the codepoint is below the first such range
2) Otherwise, do a binary (rather than linear) search
2021-09-29 18:19:01 +02:00
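
The two steps listed in 0b32a15eb0 can be sketched as below. The range table is a small illustrative excerpt invented for this example, not mbstring's real fullwidth table, and the names `is_wide` and `str_width` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct cp_range {
    uint32_t begin; /* first codepoint in the range */
    uint32_t end;   /* last codepoint in the range (inclusive) */
};

/* Illustrative excerpt only. Must be sorted ascending and non-overlapping
 * for the binary search to be valid. */
static const struct cp_range wide_ranges[] = {
    {0x1100, 0x115F},
    {0x3041, 0x3096},
    {0x4E00, 0x9FFF},
    {0xAC00, 0xD7A3},
    {0xFF01, 0xFF60},
};

bool is_wide(uint32_t cp)
{
    size_t lo = 0;
    size_t hi = sizeof(wide_ranges) / sizeof(wide_ranges[0]);

    /* 1) Short-cut: anything below the first range (all of ASCII, for
     *    instance) cannot be in the table. */
    if (cp < wide_ranges[0].begin) {
        return false;
    }
    /* 2) Binary search over the half-open index interval [lo, hi). */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < wide_ranges[mid].begin) {
            hi = mid;
        } else if (cp > wide_ranges[mid].end) {
            lo = mid + 1;
        } else {
            return true;
        }
    }
    return false;
}

/* Width of a decoded string: 2 columns per wide codepoint, 1 otherwise. */
size_t str_width(const uint32_t *cps, size_t n)
{
    size_t width = 0;
    for (size_t i = 0; i < n; i++) {
        width += is_wide(cps[i]) ? 2 : 1;
    }
    return width;
}
```

The short-cut matters because text that is mostly ASCII never reaches the search at all, while the binary search bounds the cost for everything else at a logarithmic number of comparisons.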
Alex Dowad f4365d2c26 Remove unused typedef 'mbfl_encoding_id' 2021-09-29 18:19:01 +02:00
Alex Dowad 3bf431969e Don't check for impossible error condition in mb_substr_count 2021-09-29 18:19:01 +02:00
Alex Dowad 8c32deb605 Don't check for impossible error condition in mb_strwidth 2021-09-29 18:19:01 +02:00
Alex Dowad bf78070cbe Don't check for impossible error condition in mb_strlen 2021-09-29 18:19:01 +02:00
Alex Dowad 07c4b3b8c0 Simplify code for handling mbstring language aliases
Rather than using pointers to pointers to pointers (3 levels of indirection), what
makes sense is two levels. This reduces unnecessary pointer dereference operations.
2021-09-20 11:27:54 +02:00
Alex Dowad 2f096c4039 Remove useless constant MBFL_ENCTYPE_MWC2 2021-09-20 11:27:54 +02:00
Alex Dowad 68176fdfb1 Use char literals in HTML numeric entity {en,de}coding functions 2021-09-20 11:27:54 +02:00
Alex Dowad f663344f33 Merge branch 'PHP-8.1'
* PHP-8.1:
  Bug #81390: mb_detect_encoding should not prematurely stop processing input
  mb_detect_encoding with only one candidate encoding uses mb_check_encoding
  Optimize text encoding detection for speed (eliminate Unicode property lookups)
2021-09-20 11:27:07 +02:00
Alex Dowad c25a1ef8d0 Bug #81390: mb_detect_encoding should not prematurely stop processing input
As a performance optimization, mb_detect_encoding tries to stop
processing the input string early when there is only one 'candidate'
encoding which the input string is valid in. However, the code which
keeps count of how many candidate encodings have already been rejected
was buggy. This caused mb_detect_encoding to prematurely stop
processing the input when it should have continued.

As a result, it did not notice that in the test case provided by Alec,
the input string was not valid in UTF-16.
2021-09-20 11:21:39 +02:00
Alex Dowad 6acd4f7f3a Optimize text encoding detection for speed (eliminate Unicode property lookups)
...by just testing whether the input codepoints fall within a few fixed
ranges instead. This avoids hash lookups in property tables.

From (micro-)benchmarking on my PC, this looks to be a bit less than 4x
faster than the existing code.
2021-09-20 11:20:53 +02:00
Nikita Popov e740907ec9 Merge branch 'PHP-8.1'
* PHP-8.1:
  Update Unicode tables to 14.0.0
2021-09-20 09:58:32 +02:00
Colin O'Dell fe36b81d5e Update Unicode tables to 14.0.0
Closes GH-7502.
2021-09-20 09:58:20 +02:00
Alex Dowad 92fb3de9d7 Remove unused MBFL_FILT_TL_*_MASK constants
Sending more unused, unneeded, unwanted, unrequired, unloved and
uncalled-for code where it belongs.
2021-09-06 13:16:23 +02:00
Alex Dowad 9e1447dbf3 Rename KANA2HIRA and HIRA2KANA constants (for mb_convert_kana)
mb_convert_kana is able to convert fullwidth katakana to fullwidth
hiragana (and vice versa). The constants referring to these modes had
names like MBFL_FILT_TL_ZEN2HAN_KANA2HIRA.

The "ZEN2HAN" part of the name is misleading, since these modes do not
convert fullwidth (zenkaku) kana to halfwidth (hankaku). The converted
characters are fullwidth both before and after the conversion. So...
let's name the constants accordingly.
2021-09-06 13:16:23 +02:00
Alex Dowad c8e65c9d74 Remove COMPAT2 conversion modes for mb_convert_kana
mb_convert_kana has conversion modes selected using 'M'/'m', which
convert a few various punctuation and symbol characters between
'ordinary' and full-width forms. The constants which refer to these
modes have names ending with COMPAT1.

Internally, there are similar conversion modes with names ending in
COMPAT2. They are like COMPAT1 modes, but they operate on a smaller
set of characters. But... that is all just dead code, because there is
no way for user code to select the COMPAT2 modes.

I have no idea what the original author intended those COMPAT2 modes to
actually be used for. Guess it doesn't really matter, anyways. At this
point, it's just more food for the flames.
2021-09-06 13:16:23 +02:00
Alex Dowad d2f5a8b328 Add more tests for SJIS-mac text conversion 2021-09-06 13:16:23 +02:00
Alex Dowad 0957f54eb1 Treat truncated escape sequences for CP5022{0,1,2} as error 2021-09-06 13:16:23 +02:00
Alex Dowad 64e379d81e Declare CP50222 flush function as 'static' 2021-09-06 13:16:23 +02:00
Alex Dowad a312620607 Remove redundant NULL checks in mbstring
Whoever originally wrote mbstring seems to have a deathly fear of NULL
pointers lurking behind every corner. A common pattern is that one
function will check if a pointer is NULL, then pass it to another
function, which will again check if it is NULL, then pass to yet another
function, which will yet again check if it is NULL... it's NULL checks
all the way down.

Remove all the NULL checks in places where pointers could not possibly
be NULL.
2021-09-06 13:16:23 +02:00
Alex Dowad 626f0fec54 Remove some dead code from mbstring
mbstring has a great deal of dead code. Some common types are:

- Default switch clauses which will never be taken
- If clauses intended to convert codepoints which were not present in
  a conversion table... but the codepoint in question *is* in the table,
  so the if clause is not needed.
- Bounds checks in places where it is not possible for a value to ever
  be out of bounds.
- Checks to see if an unmatched Unicode codepoint is in CP932 extension
  range 3... but every codepoint in range 3 is also in range 2, so no
  codepoint will ever be matched and converted by that code.
2021-09-06 13:16:23 +02:00
Alex Dowad 16a1e0a219 In UTF7-IMAP, reject the 2nd part of surrogate pair if it appears unexpectedly 2021-08-31 13:41:34 +02:00
Alex Dowad e3f6a9fbfe CP5022{0,1,2} supports 'IBM extension' codes from ku 115-119
mbstring has always had the conversion tables to support CP932 codes
in ku 115-119, and the conversion code for CP5022x has an 'if' clause
specifically to handle such characters... but that 'if' clause was dead
code, since a guard clause earlier in the same function prevented it
from accepting 2-byte characters with a starting byte of 0x93-0x97.

Adjust the guard clause so that these characters can be converted as
the original author apparently intended.

The code which handles ku 115-119 is the part which reads:

    } else if (s >= cp932ext3_ucs_table_min && s < cp932ext3_ucs_table_max) {
      w = cp932ext3_ucs_table[s - cp932ext3_ucs_table_min];
2021-08-31 13:41:34 +02:00
Alex Dowad f303fc8a9b Use bool in mbfl_filt_conv_output_hex (rather than int) 2021-08-31 13:41:34 +02:00