mirror of https://github.com/php/php-src.git synced 2026-04-04 22:52:40 +02:00
Commit Graph

2163 Commits

Author SHA1 Message Date
Alex Dowad
dcaa010fff Strict validation of conversion flags to mb_convert_kana
mb_convert_kana is controlled by user-provided flags, which specify what it should convert
and to what. These flags come in inverse pairs, for example "fullwidth numerals to halfwidth
numerals" and "halfwidth numerals to fullwidth numerals". It does not make sense to combine
inverse flags.

But, clever reader of commit logs, you will surely say: What if I want all my halfwidth
numerals to become fullwidth, and all my fullwidth numerals to become halfwidth? Much too
clever, you are! Let's put aside the fact that this bizarre switch-up is ridiculous and
will never be used, and face up to another stark reality: mb_convert_kana does not work
for that case, and never has. This was probably never noticed because nobody ever tried.

Disallowing useless combinations of flags gives freedom to rearrange the kana conversion
code without changing behavior.

We can also reject unrecognized flags. This may help users to catch bugs.

Interestingly, the existing tests used a 'Z' flag, which is useless (it's not recognized
at all).
2021-10-01 19:27:39 +02:00
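The strict flag validation described in dcaa010fff could look roughly like the sketch below. It is not the mbstring implementation: the function name `kana_flags_are_valid` and the flag tables are assumptions made for illustration, and the sketch only treats a lowercase letter and its uppercase counterpart as an inverse pair. It rejects any string that combines both halves of a pair or contains a letter outside the tables.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumed flag table for this sketch: each lowercase letter is treated as the
 * inverse of its uppercase counterpart (for example 'n' versus 'N'). */
static const char paired_flags[] = "rnaskhcm";

/* Returns true if `flags` holds only recognized letters and no inverse pair. */
static bool kana_flags_are_valid(const char *flags)
{
    bool seen_lower[sizeof(paired_flags) - 1] = { false };
    bool seen_upper[sizeof(paired_flags) - 1] = { false };

    assert(flags != NULL);

    for (const char *p = flags; *p != '\0'; p++) {
        char c = *p;

        /* Assumed standalone flag with no inverse in this sketch. */
        if (c == 'V') {
            continue;
        }

        bool is_upper = (c >= 'A' && c <= 'Z');
        char lower = is_upper ? (char)(c - 'A' + 'a') : c;
        const char *hit = strchr(paired_flags, lower);
        if (hit == NULL) {
            return false; /* unrecognized flag, such as 'Z' */
        }

        size_t i = (size_t)(hit - paired_flags);
        if (is_upper) {
            seen_upper[i] = true;
        } else {
            seen_lower[i] = true;
        }
        if (seen_upper[i] && seen_lower[i]) {
            return false; /* both directions of one conversion requested */
        }
    }
    return true;
}
```

With these assumed tables, "KV" passes, while "nN" (an inverse pair) and "Z" (not recognized) are rejected.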
Alex Dowad
7800491289 Inline SKIP_LONG_HEADER... macro which is only used once
I don't find that pulling this code out into a macro makes anything
clearer. Not at all.
2021-09-29 18:19:01 +02:00
Alex Dowad
0b32a15eb0 Optimize mb_str{,im}width for performance
Rather than doing a linear search of a table of fullwidth codepoint
ranges for every input character,

1) Short-cut the search if the codepoint is below the first such range
2) Otherwise, do a binary (rather than linear) search
2021-09-29 18:19:01 +02:00
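The two steps listed in 0b32a15eb0 can be sketched as follows. The range table here is a short, hypothetical stand-in for the real table of fullwidth codepoint ranges, and the names `cp_range`, `is_wide` and `display_width` are made up for this example. The only assumption the search relies on is that the table is sorted and its ranges do not overlap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t begin; /* first codepoint in the range (inclusive) */
    uint32_t end;   /* last codepoint in the range (inclusive) */
} cp_range;

/* Abbreviated, illustrative table; must stay sorted and non-overlapping. */
static const cp_range wide_ranges[] = {
    { 0x1100, 0x115F },
    { 0x3000, 0x303E },
    { 0x3041, 0x3096 },
    { 0x4E00, 0x9FFF },
    { 0xFF01, 0xFF60 },
};

static bool is_wide(uint32_t cp)
{
    size_t lo = 0;
    size_t hi = sizeof(wide_ranges) / sizeof(wide_ranges[0]);

    /* Step 1: shortcut for codepoints below the first range (e.g. ASCII). */
    if (cp < wide_ranges[0].begin) {
        return false;
    }

    /* Step 2: binary search over the half-open index interval [lo, hi). */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < wide_ranges[mid].begin) {
            hi = mid;
        } else if (cp > wide_ranges[mid].end) {
            lo = mid + 1;
        } else {
            return true;
        }
    }
    return false;
}

/* Width of a codepoint sequence: 2 columns per wide codepoint, 1 otherwise. */
static size_t display_width(const uint32_t *cps, size_t n)
{
    size_t width = 0;

    assert(cps != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        width += is_wide(cps[i]) ? 2 : 1;
    }
    return width;
}
```

The shortcut pays off because text that is mostly ASCII never enters the search loop at all.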
Alex Dowad
f4365d2c26 Remove unused typedef 'mbfl_encoding_id' 2021-09-29 18:19:01 +02:00
Alex Dowad
3bf431969e Don't check for impossible error condition in mb_substr_count 2021-09-29 18:19:01 +02:00
Alex Dowad
8c32deb605 Don't check for impossible error condition in mb_strwidth 2021-09-29 18:19:01 +02:00
Alex Dowad
bf78070cbe Don't check for impossible error condition in mb_strlen 2021-09-29 18:19:01 +02:00
Alex Dowad
d3f56e5ac9 Rename php_mb_mbchar_bytes_ex to php_mb_mbchar_bytes
...And remove the original php_mb_mbchar_bytes, which was not being
used.
2021-09-29 18:19:01 +02:00
Alex Dowad
774cd960ab No need to null-terminate buffer in php_mb_chr
`mbfl_buffer_converter_feed_result` will not overrun the specified length.
2021-09-29 18:19:01 +02:00
Alex Dowad
abf83e5079 Rename php_mb_safe_strrchr_ex to php_mb_safe_strrchr
...And remove the original php_mb_safe_strrchr, which was not being
used anywhere.
2021-09-29 18:19:01 +02:00
Nikita Popov
c37b35fa41 Merge branch 'PHP-8.1'
* PHP-8.1:
  Use locale-independent case conversion in mb_send_mail()
2021-09-23 17:21:14 +02:00
Nikita Popov
46315defc7 Use locale-independent case conversion in mb_send_mail()
Headers should not be processed in a locale-dependent fashion.
Switch from upper to lowercasing because that's the standard for
PHP and we provide an ASCII implementation of this operation.

This is adapted from GH-7506.
2021-09-23 17:20:54 +02:00
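To illustrate what locale-independent lowercasing means for 46315defc7, here is a minimal sketch. It does not use PHP's own helpers; `ascii_tolower` and `header_name_equals` are hypothetical names. The point is that only the bytes 'A' to 'Z' are ever changed, whereas the C library's `tolower()` consults the current locale and may map other bytes as well (the Turkish dotless i is the commonly cited case).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Lowercase a single byte using ASCII rules only; the locale is never read. */
static unsigned char ascii_tolower(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/* Compare a header name of length `len` with an already-lowercase literal,
 * ignoring ASCII case. Returns 1 on a match, 0 otherwise. */
static int header_name_equals(const char *name, size_t len, const char *lower_literal)
{
    assert(name != NULL && lower_literal != NULL);

    if (strlen(lower_literal) != len) {
        return 0;
    }
    for (size_t i = 0; i < len; i++) {
        if (ascii_tolower((unsigned char)name[i]) != (unsigned char)lower_literal[i]) {
            return 0;
        }
    }
    return 1;
}
```

Bytes outside the ASCII uppercase range, including anything at 0x80 or above, pass through unchanged, so the result is the same under every locale.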
Alex Dowad
4e51810f9b Optimize mbstring upper/lowercasing: use fast path in more cases
The 'fast path' in the uppercase/lowercase functions for Unicode text can be used
for a slightly greater range of characters. This is not expected to have a big
impact on performance, since the number of characters which will use the 'fast path'
is only increased by about 50-60, and these are not very commonly used characters...
but still, it doesn't cost anything.
2021-09-20 11:27:54 +02:00
Alex Dowad
36c979e2b6 Use stack-allocated buffer in php_mb_chr 2021-09-20 11:27:54 +02:00
Alex Dowad
07c4b3b8c0 Simplify code for handling mbstring language aliases
Rather than using pointers to pointers to pointers (3 levels of indirection), what
makes sense is two levels. This reduces unnecessary pointer dereference operations.
2021-09-20 11:27:54 +02:00
Alex Dowad
2f096c4039 Remove useless constant MBFL_ENCTYPE_MWC2 2021-09-20 11:27:54 +02:00
Alex Dowad
1170981b33 Fix mb_str_split on empty strings in variable-length text encodings
Previously, when passed an empty string, and given an encoding which
uses a variable number of bytes per character (and which doesn't have
a 'character length table'), mb_str_split would return an array
containing a single empty string, rather than an empty array.

The ISO-2022 encodings are among those which were affected by this bug.
2021-09-20 11:27:54 +02:00
Alex Dowad
57eafd44c6 Add more tests for mb_decode_numericentity 2021-09-20 11:27:54 +02:00
Alex Dowad
be11d95170 Add more tests for mb_encode_numericentity 2021-09-20 11:27:54 +02:00
Alex Dowad
68176fdfb1 Use char literals in HTML numeric entity {en,de}coding functions 2021-09-20 11:27:54 +02:00
Alex Dowad
1c905434b9 Add more tests for mb_substr 2021-09-20 11:27:54 +02:00
Alex Dowad
f663344f33 Merge branch 'PHP-8.1'
* PHP-8.1:
  Bug #81390: mb_detect_encoding should not prematurely stop processing input
  mb_detect_encoding with only one candidate encoding uses mb_check_encoding
  Optimize text encoding detection for speed (eliminate Unicode property lookups)
2021-09-20 11:27:07 +02:00
Alex Dowad
c25a1ef8d0 Bug #81390: mb_detect_encoding should not prematurely stop processing input
As a performance optimization, mb_detect_encoding tries to stop
processing the input string early when there is only one 'candidate'
encoding which the input string is valid in. However, the code which
keeps count of how many candidate encodings have already been rejected
was buggy. This caused mb_detect_encoding to prematurely stop
processing the input when it should have continued.

As a result, it did not notice that in the test case provided by Alec,
the input string was not valid in UTF-16.
2021-09-20 11:21:39 +02:00
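The early-stop optimization from c25a1ef8d0 depends on an accurate count of the candidates still in play. The sketch below shows the general idea only and makes no claim about what the actual bug was. The `candidate` struct, the per-byte validators and the `detect` function are all hypothetical, and real encodings need stateful validation rather than a byte-at-a-time check. The detail to notice is that each candidate decrements `remaining` at most once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    const char *name;
    bool (*byte_ok)(unsigned char); /* simplified, stateless validity check */
    bool rejected;
} candidate;

/* Toy validators, for illustration only. */
static bool ascii_ok(unsigned char b)     { return b < 0x80; }
static bool printable_ok(unsigned char b) { return b >= 0x20 && b <= 0x7E; }
static bool any_ok(unsigned char b)       { (void)b; return true; }

/* Returns the index of the first surviving candidate, or -1 if none survive.
 * Scanning stops early once at most one candidate is left. */
static int detect(candidate *cands, size_t n, const unsigned char *in, size_t len)
{
    size_t remaining = n;

    assert(cands != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        cands[i].rejected = false;
    }

    for (size_t pos = 0; pos < len && remaining > 1; pos++) {
        for (size_t i = 0; i < n; i++) {
            if (cands[i].rejected) {
                continue; /* already counted; never decrement twice */
            }
            if (!cands[i].byte_ok(in[pos])) {
                cands[i].rejected = true;
                remaining--;
            }
        }
    }

    for (size_t i = 0; i < n; i++) {
        if (!cands[i].rejected) {
            return (int)i;
        }
    }
    return -1;
}
```

If `remaining` dropped faster than candidates were really rejected, the outer loop would exit while two or more candidates were still undecided, which is the kind of premature stop the commit message describes.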
Alex Dowad
ca33ab59ad mb_detect_encoding with only one candidate encoding uses mb_check_encoding
...Because it's about 5% faster.
2021-09-20 11:20:53 +02:00
Alex Dowad
6acd4f7f3a Optimize text encoding detection for speed (eliminate Unicode property lookups)
...By just testing whether the input codepoints fall within a few fixed
ranges instead. This avoids hash lookups in property tables.

From (micro-)benchmarking on my PC, this looks to be a bit less than 4x
faster than the existing code.
2021-09-20 11:20:53 +02:00
Nikita Popov
e740907ec9 Merge branch 'PHP-8.1'
* PHP-8.1:
  Update Unicode tables to 14.0.0
2021-09-20 09:58:32 +02:00
Colin O'Dell
fe36b81d5e Update Unicode tables to 14.0.0
Closes GH-7502.
2021-09-20 09:58:20 +02:00
Alex Dowad
86a0d4b22d Add more tests for mb_convert_kana 2021-09-06 13:16:23 +02:00
Alex Dowad
92fb3de9d7 Remove unused MBFL_FILT_TL_*_MASK constants
Sending more unused, unneeded, unwanted, unrequired, unloved and
uncalled-for code where it belongs.
2021-09-06 13:16:23 +02:00
Alex Dowad
9e1447dbf3 Rename KANA2HIRA and HIRA2KANA constants (for mb_convert_kana)
mb_convert_kana is able to convert fullwidth katakana to fullwidth
hiragana (and vice versa). The constants referring to these modes had
names like MBFL_FILT_TL_ZEN2HAN_KANA2HIRA.

The "ZEN2HAN" part of the name is misleading, since these modes do not
convert fullwidth (zenkaku) kana to halfwidth (hankaku). The converted
characters are fullwidth both before and after the conversion. So...
let's name the constants accordingly.
2021-09-06 13:16:23 +02:00
Alex Dowad
c8e65c9d74 Remove COMPAT2 conversion modes for mb_convert_kana
mb_convert_kana has conversion modes selected using 'M'/'m', which
convert a few various punctuation and symbol characters between
'ordinary' and full-width forms. The constants which refer to these
modes have names ending with COMPAT1.

Internally, there are similar conversion modes with names ending in
COMPAT2. They are like COMPAT1 modes, but they operate on a smaller
set of characters. But... that is all just dead code, because there is
no way for user code to select the COMPAT2 modes.

I have no idea what the original author intended those COMPAT2 modes to
actually be used for. Guess it doesn't really matter, anyways. At this
point, it's just more food for the flames.
2021-09-06 13:16:23 +02:00
Alex Dowad
d7eb442993 Add more tests for ISO-2022-JP-2004 text conversion 2021-09-06 13:16:23 +02:00
Alex Dowad
907d0c3248 Add more tests for UTF7-IMAP text conversion 2021-09-06 13:16:23 +02:00
Alex Dowad
bf940a13ff Add another test for SJIS-Mobile text conversion 2021-09-06 13:16:23 +02:00
Alex Dowad
32df61c558 Add more tests for UTF-7 text conversion 2021-09-06 13:16:23 +02:00
Alex Dowad
ae71bfdee7 Add more tests for UCS-4 text conversion 2021-09-06 13:16:23 +02:00
Alex Dowad
fd0e0c7390 Add another test for UCS-2 text conversion 2021-09-06 13:16:23 +02:00
Alex Dowad
edf2bd95d9 Add more tests for ISO-2022-JP and JIS7/8 text conversion 2021-09-06 13:16:23 +02:00
Alex Dowad
6a2dca3420 Add more tests for ISO-2022-JP-KDDI text conversion 2021-09-06 13:16:23 +02:00
Alex Dowad
d2f5a8b328 Add more tests for SJIS-mac text conversion 2021-09-06 13:16:23 +02:00
Alex Dowad
0957f54eb1 Treat truncated escape sequences for CP5022{0,1,2} as error 2021-09-06 13:16:23 +02:00
Alex Dowad
64e379d81e Declare CP50222 flush function as 'static' 2021-09-06 13:16:23 +02:00
Alex Dowad
a312620607 Remove redundant NULL checks in mbstring
Whoever originally wrote mbstring seems to have a deathly fear of NULL
pointers lurking behind every corner. A common pattern is that one
function will check if a pointer is NULL, then pass it to another
function, which will again check if it is NULL, then pass to yet another
function, which will yet again check if it is NULL... it's NULL checks
all the way down.

Remove all the NULL checks in places where pointers could not possibly
be NULL.
2021-09-06 13:16:23 +02:00
Alex Dowad
626f0fec54 Remove some dead code from mbstring
mbstring has a great deal of dead code. Some common types are:

- Default switch clauses which will never be taken
- If clauses intended to convert codepoints which were not present in
  a conversion table... but the codepoint in question *is* in the table,
  so the if clause is not needed.
- Bounds checks in places where it is not possible for a value to ever
  be out of bounds.
- Checks to see if an unmatched Unicode codepoint is in CP932 extension
  range 3... but every codepoint in range 3 is also in range 2, so no
  codepoint will ever be matched and converted by that code.
2021-09-06 13:16:23 +02:00
Alex Dowad
df32267494 Add more tests for UTF7-IMAP text conversion 2021-08-31 13:41:34 +02:00
Alex Dowad
16a1e0a219 In UTF7-IMAP, reject the 2nd part of surrogate pair if it appears unexpectedly 2021-08-31 13:41:34 +02:00
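The rule behind 16a1e0a219 can be sketched on plain UTF-16 code units, which is what UTF-7 variants carry inside their Base64 sections. This is not the mbstring filter: `decode_utf16_units` and `DECODE_ERROR` are names invented for the example, and the Base64 stage is left out entirely. A low surrogate (0xDC00 to 0xDFFF) is accepted only directly after a high surrogate (0xD800 to 0xDBFF); anywhere else it is reported as an error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Marker stored in the output for each invalid code unit. */
#define DECODE_ERROR UINT32_MAX

/* Decodes `n` UTF-16 code units into codepoints. `out` must have room for
 * at least `n` entries. Returns the number of entries written. */
static size_t decode_utf16_units(const uint16_t *in, size_t n, uint32_t *out)
{
    size_t produced = 0;
    size_t i = 0;

    assert((in != NULL && out != NULL) || n == 0);

    while (i < n) {
        uint16_t u = in[i++];

        if (u >= 0xD800 && u <= 0xDBFF) {
            /* High surrogate: valid only if a low surrogate follows. */
            if (i < n && in[i] >= 0xDC00 && in[i] <= 0xDFFF) {
                out[produced++] = 0x10000
                    + ((uint32_t)(u - 0xD800) << 10)
                    + (uint32_t)(in[i] - 0xDC00);
                i++;
            } else {
                out[produced++] = DECODE_ERROR;
            }
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            /* Low surrogate with no high surrogate before it: reject. */
            out[produced++] = DECODE_ERROR;
        } else {
            out[produced++] = u;
        }
    }
    return produced;
}
```

For the input 0x0041, 0xD83D, 0xDE00, 0xDE00, the sketch produces U+0041, U+1F600 and one error marker for the trailing unpaired low surrogate.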
Alex Dowad
355464935d Add another test for UTF-7 text conversion 2021-08-31 13:41:34 +02:00
Alex Dowad
51b6c687db Add another test for GB18030 text conversion 2021-08-31 13:41:34 +02:00
Alex Dowad
a0415b22ab Add more tests for CP5022{0,1,2} text conversion 2021-08-31 13:41:34 +02:00
Alex Dowad
e3f6a9fbfe CP5022{0,1,2} supports 'IBM extension' codes from ku 115-119
mbstring has always had the conversion tables to support CP932 codes
in ku 115-119, and the conversion code for CP5022x has an 'if' clause
specifically to handle such characters... but that 'if' clause was dead
code, since a guard clause earlier in the same function prevented it
from accepting 2-byte characters with a starting byte of 0x93-0x97.

Adjust the guard clause so that these characters can be converted as
the original author apparently intended.

The code which handles ku 115-119 is the part which reads:

    } else if (s >= cp932ext3_ucs_table_min && s < cp932ext3_ucs_table_max) {
      w = cp932ext3_ucs_table[s - cp932ext3_ucs_table_min];
2021-08-31 13:41:34 +02:00