1
0
mirror of https://github.com/php/php-src.git synced 2026-03-25 16:52:18 +01:00
Commit Graph

827 Commits

Author SHA1 Message Date
David Carlier
c34d4fbbf4 Fix GH-16360 mb_substr overflow on start and length arguments.
occurs when they are negated to start working from the end instead
when set with ZEND_LONG_MIN.
2024-10-11 08:46:48 +01:00
Niels Dossche
bf70d9ba0d Fix GH-16261: Reference invariant broken in mb_convert_variables()
The behaviour is weird in the sense that the reference must get
unwrapped. What ended up happening is that when destroying the old
reference the sources list was not cleaned properly. We add handling for
that. Normally we would use use ZEND_TRY_ASSIGN_STRINGL but that doesn't
work here as it would keep the reference and change values through
references (see bug #26639).

Closes GH-16272.
2024-10-07 17:46:06 +02:00
Yuya Hamada
d840200cea Fix GH-16229: Address overflowed in mb_send_mail when empty string 2024-10-05 18:24:09 +09:00
Peter Kokot
218a93b898 Use EXTENSIONS instead of SKIPIF sections in *.phpt
This also fixes skipped tests due to different naming "zend-test"
instead of "zend_test" and "PDO" instead of "pdo":

- ext/dom/tests/libxml_global_state_entity_loader_bypass.phpt
- ext/simplexml/tests/libxml_global_state_entity_loader_bypass.phpt
- ext/xmlreader/tests/libxml_global_state_entity_loader_bypass.phpt
- ext/zend_test/tests/observer_sqlite_create_function.phpt

EXTENSIONS section is used for the Windows build to load the non-static
extensions.

Closes GH-13276
2024-01-31 11:18:21 +01:00
Alex Dowad
d8ef868b92 Return value of mb_get_info can be NULL
This has been the case at least since PHP 5.4. Thanks to Girgias for
pointing it out.

It appears that there are several global variables internal to mbstring
which can be queried via mb_get_info() and which could be NULL, but
at the very least, we know that "mbstring.http_input" is one of them.
2023-11-27 20:53:37 +02:00
Niels Dossche
297fec099e Merge branch 'PHP-8.1' into PHP-8.2
* PHP-8.1:
  Fix GH-11300: license issue: restricted unicode license headers
2023-07-01 21:56:40 +02:00
Niels Dossche
ee42621ff6 Fix GH-11300: license issue: restricted unicode license headers
Closes GH-11572.
2023-07-01 21:55:21 +02:00
Alex Dowad
c33589ea11 Merge branch 'PHP-8.1' into PHP-8.2
* PHP-8.1:
  Fix mb_strlen is wrong length for CP932 when 0x80.
2023-05-30 13:45:36 -07:00
Yuya Hamada
c50172e812 Fix mb_strlen is wrong length for CP932 when 0x80. 2023-05-30 13:44:30 -07:00
Randy Geraads
c5a623ba5e Added negative offset test for mb_strrpos
Should expose https://github.com/php/php-src/issues/11217
2023-05-15 10:36:37 +02:00
Ilija Tovilo
aa553af911 Fix segfault in mb_strrpos/mb_strripos with ASCII encoding and negative offset
We're setting the encoding from PHP_FUNCTION(mb_strpos), but mbfl_strpos would
discard it, setting it to mbfl_encoding_pass, making zend_memnrstr fail due to a
null-pointer exception.

Fixes GH-11217
Closes GH-11220
2023-05-15 10:36:37 +02:00
pakutoma
b721d0f71e Fix phpGH-10648: add check function pointer into mbfl_encoding
Previously, mbstring used the same logic for encoding validation as for
encoding conversion.

However, there are cases where we want to use different logic for validation
and conversion. For example, if a string ends up with missing input
required by the encoding, or if a character is input that is invalid
as an encoding but can be converted, the conversion should succeed and
the validation should fail.

To achieve this, a function pointer mb_check_fn has been added to
struct mbfl_encoding to implement the logic used for validation.
Also, added implementation of validation logic for UTF-7, UTF7-IMAP,
ISO-2022-JP and JIS.

(The same change has already been made to PHP 8.2 and 8.3; see
6fc8d014df. This commit is backporting the change to PHP 8.1.)
2023-03-25 09:52:10 +02:00
pakutoma
6fc8d014df Fix phpGH-10648: add check function pointer into mbfl_encoding
Previously, mbstring used the same logic for encoding validation as for
encoding conversion.

However, there are cases where we want to use different logic for validation
and conversion. For example, if a string ends up with missing input
required by the encoding, or if a character is input that is invalid
as an encoding but can be converted, the conversion should succeed and
the validation should fail.

To achieve this, a function pointer mb_check_fn has been added to
struct mbfl_encoding to implement the logic used for validation.
Also, added implementation of validation logic for UTF-7, UTF7-IMAP,
ISO-2022-JP and JIS.
2023-03-24 20:34:22 +02:00
Ilija Tovilo
805dafddbb Merge branch 'PHP-8.1' into PHP-8.2
* PHP-8.1:
  Enable GitHub actions cancel-in-progress for PRs
  mb_encode_mimeheader does not crash if provided encoding has no MIME name set
2023-03-07 11:02:00 +01:00
Alex Dowad
7c1ee5a02a mb_encode_mimeheader does not crash if provided encoding has no MIME name set 2023-03-07 11:30:21 +02:00
George Peter Banyard
73f9ffc5cd Merge branch 'PHP-8.1' into PHP-8.2
* PHP-8.1:
  Fix GH-10627: mb_convert_encoding crashes PHP on Windows
  ext/mbstring: fix new_value length check
2023-02-20 13:41:11 +00:00
Niels Dossche
ed0c0df351 Fix GH-10627: mb_convert_encoding crashes PHP on Windows
Fixes GH-10627

The php_mb_convert_encoding() function can return NULL on error, but
this case was not handled, which led to a NULL pointer dereference and
hence a crash.

Closes GH-10628

Signed-off-by: George Peter Banyard <girgias@php.net>
2023-02-20 13:33:11 +00:00
Jakub Zelenka
cc931af35d Fix GH-8086: Introduce mail.mixed_lf_and_crlf INI
When this INI option is enabled, it reverts the line separator for
headers and message to LF which was a non conformant behavior in PHP 7.
It is done because some non conformant MTAs fail to parse CRLF line
separator for headers and body.

This is used for mail and mb_send_mail functions.
2023-01-19 19:05:39 +00:00
Alex Dowad
1751f34cfa Merge branch 'PHP-8.1' into PHP-8.2
* PHP-8.1:
  Use different mblen_table for different SJIS variants
  Correct entry for 0x80,0xFD-FF in SJIS multi-byte character length table
2023-01-06 14:13:21 +02:00
Alex Dowad
d104481af8 Correct entry for 0x80,0xFD-FF in SJIS multi-byte character length table
As a performance optimization, mbstring implements some functions using
tables which give the (byte) length of a multi-byte character using a
lookup based on the value of the first byte. These tables are called
`mblen_table`.

For many years, the mblen_table for SJIS has had '2' in position 0x80.
That is wrong; it should have been '1'. Reasons:

For SJIS, SJIS-2004, and mobile variants of SJIS, 0x80 has never been
treated as the first byte of a 2-byte character. It has always been
treated as a single erroneous byte. On the other hand, 0x80 is a valid
character in MacJapanese... but a 1-byte character, not a 2-byte one.

The same applies to bytes 0xFD-FF; these are 1-byte characters in
MacJapanese, and in other SJIS variants, they are not valid (as the
first byte of a character).

Thanks to the GitHub user 'youkidearitai' for finding this problem.
2023-01-05 14:05:39 +02:00
Alex Dowad
f7a19181d7 Allow 'h' and 'k' flags to be combined for mb_convert_kana
The 'h' flag makes mb_convert_kana convert zenkaku hiragana to hankaku
katakana; 'k' makes it convert zenkaku katakana to hankaku katakana.

When working on the implementation of mb_convert_kana, I added some
additional checks to catch combinations of flags which do not make
sense; but there is no conflict between 'h' and 'k' (they control
conversions for two disjoint ranges of codepoints) and this combination
should not have been restricted.

Thanks to the GitHub user 'akira345' for reporting this problem.

Closes GH-10174.
2022-12-29 20:38:01 +02:00
Alex Dowad
b79a86f53a Merge branch 'PHP-8.1' into PHP-8.2
* PHP-8.1:
  Support Microsoft's "Best Fit" mappings for Windows-1252 text encoding
2022-12-09 15:37:56 +02:00
Alex Dowad
a1a69c3734 Support Microsoft's "Best Fit" mappings for Windows-1252 text encoding
In b5ff87ca71, I made a number of adjustments to our conversion code
for CP1252. One of the adjustments was to make the mappings match those
published by the Unicode Consortium in the file CP1252.TXT. These do
not include mappings for the CP1252 bytes 0x81, 0x8D, 0x8F, 0x90, and
0x9D.

Rostyslav Gulka reported that this caused a problem. His application
stores binary JPEG data in an MS-SQL database. When they SELECT the
binary data out of the database, it is treated as CP1252 text and
automatically converted to UTF-8. To recover the original binary
data, they then do a conversion from UTF-8 to CP1252.

Obviously, that does not work if certain CP1252 bytes do not map to
any Unicode codepoint at all.

While this is a very unusual application of text encoding conversion,
and we might choose not to support it if there was no other basis for
including those mappings, it seems that Microsoft does actually include
them in the Win32 API as "best fit" mappings. These are extra mappings
from Unicode to other text encodings, which the Win32 API function
WideCharToMultiByte uses by default unless the WC_NO_BEST_FIT_CHARS
flag was passed.

A list of these "best fit" mappings for CP1252 can be found here:

https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt
2022-12-09 15:18:37 +02:00
Alex Dowad
8f84192403 Fix mangled kana output for JIS encoding
For JIS encoding, hiragana and katakana can be input in multiple forms.
One form uses JISX 0201 escape sequences. Another is called 'GR-invoked'
kana.

In the context of ISO-2022 encoding, bytes with a zero bit in the MSB
are called "GL" (or "graphics left") and those with the MSB set are
called "GR" (or "graphics right"). Regarding the variants of
ISO-2022-JP which are called "JIS7" and "JIS8", Wikipedia states:

"Other, older variants known as JIS7 and JIS8 build directly on the
7-bit and 8-bit encodings defined by JIS X 0201 and allow use of JIS X
0201 kana from G1 without escape sequences, using Shift Out and Shift
In or setting the eighth bit (GR-invoked), respectively."

In harmony with this, we have always accepted bytes from 0xA3-0xDF and
decoded them to the corresponding hiragana/katakana. However, at some
point I accidentally broke output for these kana. You can see the
problem in 3v4l.org by running this program:

    <?php
    echo bin2hex(mb_convert_encoding("\xA3", 'JIS', 'JIS'));

The results are:

    Output for 8.2rc1 - rc3
    1b244200231b2842
    Output for 7.4.0 - 7.4.33, 8.0.1 - 8.0.25, 8.1.12
    1b2849231b2842
    Output for 8.1.0 - 8.1.11
    1b284923

You can see that from 8.1.0 - 8.1.11, there was a missing escape
sequence at the end. That was caused because the flush functions were
not being called properly, and has already been fixed. However, this
also shows that the output for 8.2rc1-rc3 is completely invalid.
It is trying to output a JISX 0208 sequence, but with 0x00 as one of
the JISX 0208 bytes, which is illegal.

Add the missing code which will make the new text conversion filters
behave the same as the old ones when outputting hiragana/katakana in
JIS encoding.
2022-11-22 15:49:19 +02:00
Alex Dowad
a618682373 For UTF-7, flag unnecessary extra trailing byte in Base64 section as error
This bug was found when I was fuzzing a patch related to mb_strpos.
In some cases, the legacy text conversion code for UTF-7 (and
UTF7-IMAP) would correctly recognize an error for a Base64-encoded
section which was not correctly padded with zero bits, but the new
(and faster) text conversion code would not.

Specifically, if the input string ended abruptly after the 4th or 7th
byte of a Base64-encoded section, the new conversion code would
confirm that the trailing padding bits from the previous byte (3rd or
6th) were zeroes, but would not check whether the 4th or 7th byte
itself encoded any non-zero bits. The legacy conversion code did
perform this check and would treat the input string as invalid.

Actually, even if the 4th or 7th byte does encode only (padding) zero
bits, this is still a problem, because there is no reason to have a
4th (or 7th) byte in that case. The UTF-7 string should have ended
on the previous byte instead.

Apply the same fix for both UTF-7 and UTF7-IMAP.
2022-11-21 14:49:01 +02:00
Alex Dowad
d3933e0b6c Fix regression test for GH-9535 on PHP-8.2+
Some of the legacy text encodings which were used in this regression
test are deprecated in PHP-8.2+. The deprecation warnings break the
expected output. Since using these encodings in mbstring is now
deprecated, I think there is little point in keeping them in this test.
So they are now removed from it.

Further, in 219fff376b, I made a change to avoid a situation where the
legacy UTF7-IMAP conversion code gets stuck in a wrong state when its
attempt to emit a character fails. When a Base64-encoded section of
input ended with -, the previous code would FIRST emit a character if
necessary (using the CK or "check" macro, which causes the function to
return immediately if the downstream filter function returns an error
code), and THEN update its own state to indicate that it is now in
ASCII rather than Base64 mode.

If the downstream filter function returned an error code, the CK macro
would then cause the UTF7-IMAP filter function to return immediately
WITHOUT setting its own state to indicate that the Base64-encoded
section was done.

I fixed this by updating the filter state as needed BEFORE calling CK...
but I missed updating the filter state in the case where the Base64
section ends normally and there is no need to emit anything.

Again, in 6d525a425e, I modified the legacy conversion code for
ISO-2022-KR to try to comply more closely with the RFC for this
text encoding. The RFC states that before any occurrence of 'Shift In'
or 'Shift Out' codes in a ISO-2022-KR string, a special escape
sequence must appear at least ONCE, at the beginning of a line.
The previous code did not comply with this requirement. I made it
comply by always emitting this escape sequence at the beginning of
the first line.

Since mb_strcut (wrongly) determines when it has consumed enough of
the input string by looking at the length of its output in bytes, this
extra escape sequence makes mb_strcut consume 4 bytes less of an
ISO-2022-KR string than would otherwise be the case. When this
strange behavior of mb_strcut is fixed, this test will have to be
adjusted to restore the previous expected outputs for ISO-2022-KR.
2022-11-14 11:46:12 +02:00
Alex Dowad
79ae3090e0 Merge branch 'PHP-8.1' into PHP-8.2
* PHP-8.1:
  [ci skip] NEWS
  Fix GH-9535 (unintended behavior change for mb_strcut in PHP 8.1)
2022-11-13 14:42:57 +02:00
NathanFreeman
fa0401b0b5 Fix GH-9535 (unintended behavior change for mb_strcut in PHP 8.1)
The existing implementation of mb_strcut extracts part of a
multi-byte encoded string by pulling out raw bytes and then running
them through a conversion filter to ensure that the output is valid
in the requested encoding.

If the conversion filter emits error markers when doing the final
'flush' operation which ends the conversion of the extracted bytes,
these error markers may (in some cases) be included in the output.
The conversion operation does not respect the value of
mb_substitute_character; rather, it always uses '?' as an error marker.
So this issue manifests itself as unwanted '?' characters being
inserted into the output.

This issue has existed for a long time, but became noticeable in PHP
8.1 because for at least some of the supported text encodings, mbstring
is now more strict about emitting error markers when strings end in an
illegal state.

The simplest fix is to suppress error markers during the final flush
operation.

While working on a fix for this problem, another problem with mb_strcut
was discovered; since it decides when to stop consuming bytes from
the input by looking at the byte length of its OUTPUT, anything which
causes extra bytes to be emitted to the output may cause mb_strcut to
not consume all the bytes in the requested range.

The one case where we DO emit extra output bytes is for encodings
which have a selectable mode, like ISO-2022-JP; if a string in such
an encoding ends in a mode which is not the default, we emit an ending
escape sequence which changes back to the default mode. This is done
so that concatenating strings in such encodings is safe.

However, as mentioned, this can cause the output of mb_strcut to be
shorter than it logically should be. This bug has existed for a long
time, and fixing it now will be a BC break, so we may not fix it right
away.

Therefore, tests for THIS fix which don't pass because of that OTHER
bug have been split out into a separate test file (gh9535b.phpt), and
that file has been marked XFAIL.
2022-11-13 14:37:55 +02:00
Alex Dowad
a116aaebd9 Merge branch 'PHP-8.1' into PHP-8.2
* PHP-8.1:
  Add regression test for problem with mb_encode_mimeheader reported as GH-9683
  In legacy text conversion filters, reset filter state in 'flush' function
2022-10-10 20:48:10 +09:00
Alex Dowad
faa5425b0f Add regression test for problem with mb_encode_mimeheader reported as GH-9683 2022-10-10 20:46:12 +09:00
Alex Dowad
9beb93f2cf Merge branch 'PHP-8.1' into PHP-8.2
* PHP-8.1:
  Restore backwards-compatible mappings of U+005C and U+007E to SJIS-2004
2022-10-05 12:27:32 +09:00
Alex Dowad
dd00e2f1e3 Restore backwards-compatible mappings of U+005C and U+007E to SJIS-2004
In 0d0029d729 and 315d48b434, I changed the mappings used for Unicode
to Shift-JIS-2004, in an attempt to follow the JISC specification
more closely. However, feedback from Japanese PHP users indicates
that most users of SJIS-2004 expect 0x5C and 0x7E to be treated as
equivalent to the same ASCII bytes. This is due to a long history of
non-complying implementations which then became a de-facto standard.

Therefore, restore the earlier mappings for U+005C and U+007E.

Thanks to the GitHub user 'youkidearitai' for reporting this issue.

Fixes GH-9528.
2022-10-05 12:18:38 +09:00
Alex Dowad
5f8993bc28 Merge branch 'PHP-8.1'
* PHP-8.1:
  Reintroduce legacy 'SJIS-win' text encoding in mbstring
2022-08-16 20:47:04 +02:00
Alex Dowad
371367ce3e Reintroduce legacy 'SJIS-win' text encoding in mbstring
In e2459857af, I combined mbstring's "SJIS-win" text encoding
into CP932. This was done after doing some testing which appeared
to show that the mappings for "SJIS-win" were the same as those
for "CP932".

Later, it was found that there was actually a small difference
prior to e2459857af when converting Unicode to CP932. The
mappings for the following two codepoints were different:

        CP932  SJIS-win
U+203E  0x7E   0x81 0x50
U+00A5  0x5C   0x81 0x8F

As shown, mbstring's "CP932" mapped Unicode's 'OVERLINE' and
'YEN SIGN' to the ASCII bytes which have conflicting uses in
most legacy Japanese text encodings. "SJIS-win" mapped these
to equivalent JIS X 0208 fullwidth characters.

Since e2459867af was not intended to cause any user-visible
change in behavior, I am rolling back the merge of "CP932"
and "SJIS-win".

It seems doubtful whether these two text encodings should
be kept separate or merged in a future release. An extensive
discussion of the related historical background and
compatibility issues involved can be found in this
GitHub thread:

https://github.com/php/php-src/issues/8308
2022-08-16 20:18:54 +02:00
Alex Dowad
93207535fa Add test to exercise _php_mb_encoding_handler_ex with multiple possible input encodings
Thanks to Kamil Tekiela for pointing out that there was no test
case for this.
2022-08-16 16:43:38 +02:00
Alex Dowad
d9269becca Fix problems with ISO-2022-KR conversion
• The legacy conversion code did not emit an error marker if an
  escape sequence was truncated.

• BOTH old and new conversion code would shift from KSC5601
  (KS X 1001) mode to ASCII mode on an invalid escape sequence.
  This doesn't make any sense.
2022-08-16 16:43:27 +02:00
Alex Dowad
bfccdbd858 SJIS-Mobile#SOFTBANK string can end immediately after special escape sequence
SJIS-Mobile#SOFTBANK text encoding supports special escape sequences,
which shift the decoder into a mode where each single byte represents
an emoji. To get out of this mode, a 0xF (SHIFT OUT) byte can be
used.

After one of these special escape sequences, the new conversion
code expected to see at least one more byte. However, there doesn't
seem to be any particular reason why it should be treated as an
error condition if a string ends abruptly after one of these
escapes. Well, the escape sequence is useless in that case, but
it is a complete and valid escape sequence.

The legacy conversion code did allow a string to end immediately
after one of these escape sequences. Amend the new code to allow
the same.
2022-08-16 16:43:27 +02:00
Alex Dowad
983a29d3c0 Legacy conversion code for '7bit' to '8bit' inserts error markers
The use of a special 'vtbl' for converting between '7bit' and
'8bit' text meant that '7bit' text would not be converted to
wchars before going to '8bit'. This meant that the special
value MBFL_BAD_INPUT, which we use to flag an erroneous byte
sequence in input text (and which is required by functions
like mb_check_encoding), would pass directly to the output,
instead of being converted to the error marker specified
by mb_substitute_character.

This issue dates back to the time when I removed the mbfl
'identify filters' and made encoding validity checking and
encoding detection rely only on the conversion filters.
2022-08-16 16:43:27 +02:00
Alex Dowad
c6bd08530e Adjust number of error markers emitted for truncated ISO-2022-JP escape sequence
Fuzzing revealed a small difference between the number of error
markers which the legacy ISO-2022-JP and JIS7/8 conversion code
emitted for truncated escape sequences and those emitted by the
new code. The behavior of the old code seems more reasonable
here, so we will imitate it.
2022-08-16 16:43:27 +02:00
Alex Dowad
128768a450 Adjust number of error markers emitted for truncated UTF-8 code units
In 04e59c916f, I amended the UTF-8 conversion code, so that when given
invalid input, it would emit a number of errors markers harmonizing
with the WHATWG's specification of the standard UTF-8 decoding
algorithm. (Which, gentle reader of commit logs, you can find online
at https://encoding.spec.whatwg.org/#utf-8-decoder.) However, the code
in 04e59c916f was faulty in the case that a truncated UTF-8 code unit
starts with 0xF1.

Then, in dc1ba61d09, when making a small refactoring to a different
part of the UTF-8 conversion code, I inexplicably broke part of the
working code, causing the same fault which was already present with
truncated UTF-8 code units starting with 0xF1 to also occur with
0xF2 and 0xF3 as well. I don't remember what inane thoughts I was
thinking when I pulled off this feat of utter mental confusion.

None of these cases were covered by unit tests, by the way.

Thankfully, my trusty fuzzer picked up on this when testing the
new implementation of mb_parse_str (since the legacy UTF-8
conversion filter did not suffer from the same problem, and I was
fuzzing to find any differences in behavior between the old and
new implementations).

Fortuitously, the fuzzer also picked up another issue which was
present in 04e59c916f. I was emitting only one error marker for
truncated code units starting with 0xE0 or 0xED, in cases where
the WHATWG standard indicates two should be emitted. Examples
are 0xE0 0x9F <END OF STRING> or 0xED 0xA0 <END OF STRING>.

Code units starting with 0xE0-0xED should have 3 bytes. If the
first byte is 0xE0, the second MUST be 0xA0 or greater. (Otherwise,
the codepoint could have fit in a two-byte code unit.) And if the
first byte is 0xED, the second MUST be 0x9F or less. According to
the WHATWG algorithm, step 4, if the second byte is outside the
legal range, then the decoder should emit an error... AND
reprocess the out-of-range byte. The reprocessing will then
cause another error. That's why the decoder should indicate two
errors and not one.
2022-08-16 16:43:27 +02:00
Alex Dowad
a4656895dd Imitate legacy behavior when converting non-encodings using mbstring
Fuzzing revealed that something was missed here when making the new
encoding conversion code match the behavior of the old code. In the
next major release of PHP, support for these non-encodings will be
dropped, but in the meantime, it is better to match the legacy
behavior.
2022-08-16 16:43:27 +02:00
Alex Dowad
aeccb139c3 Use new encoding conversion filters for mb_parse_str and php_mb_post_handler
When micro-benchmarking on relatively short ASCII strings, the new
implementation was about 30% faster than the old one.
2022-08-16 16:43:26 +02:00
Christoph M. Becker
d013d94985 Fix GH-9248: Segmentation fault in mb_strimwidth()
We need to initialize the optional argument `trimmarker` with its
default value.

Closes GH-9273.
2022-08-08 18:35:37 +02:00
Alex Dowad
5370f344d2 mb_strimwidth inserts error markers in invalid input string (for backwards compatibility)
The old implementation did this. It also did the same to the
trim marker, if the trim marker was invalid in the specified
encoding, but I have not imitated that behavior (for performance).
2022-08-02 11:07:06 +02:00
Alex Dowad
7299096095 New implementation of mb_strimwidth
This new implementation of mb_strimwidth uses the new text
encoding conversion filters. Changes from the previous
implementation:

• mb_strimwidth allows a negative 'from' argument, which
should count backwards from the end of the string. However,
the implementation of this feature was buggy (starting right
from when it was first implemented).

It used the following code:

    if ((from < 0) || (width < 0)) {
        swidth = mbfl_strwidth(&string);
    }
    if (from < 0) {
        from += swidth;
    }

Do you see the bug? 'from' is a count of CODEPOINTS, but
'swidth' is a count of TERMINAL COLUMNS. Adding those two
together does not make sense. If there were no fullwidth
characters in the input string, then the two counts coincide
and the feature would work correctly. However, each
fullwidth character would throw the result off by one,
causing more characters to be skipped than was requested.

• mb_strimwidth also allows a negative 'width' argument,
which again counts backwards from the end of the string;
in this case, it is not determining the START of the portion
which we want to extract, but rather, the END of that portion.
Perhaps unsurprisingly, this feature was also buggy.

Code:

    if (width < 0) {
        width = swidth + width - from;
    }

'swidth + width' is fine here; the problem is '- from'.
Again, that is subtracting a count of CODEPOINTS from a
count of TERMINAL COLUMNS. In this case, we really need
to count the terminal width of the string prefix skipped
over by 'from', and subtract that rather than the number
of codepoints which are being skipped.

As a result, if a 'from' count was passed along with a
negative 'width', for every fullwidth character in the
skipped prefix, the result of mb_strimwidth was one
terminal column wider than requested.

Since these situations were covered by unit tests, you
might wonder why the bugs were not caught. Well, as far as
I can see, it looks like the author of the 'tests' just
captured the actual output of mb_strimwidth and defined it
as 'correct'. The tests were written in such a way that it
was difficult to examine them and see whether they made
sense or not; but a careful examination of the inputs and
outputs clearly shows that the legacy tests did not conform
to the documented contract of mb_strimwidth.

• The old implementation would always pass the input string
through decoding/encoding filters before returning it to
the caller, even if it fit within the specified width. This
means that invalid byte sequences would be converted to
error markers. For performance, the new implementation
returns the very same string which was passed in if it
does not exceed the specified width. This means that
erroneous byte sequences are not converted to error markers
unless it is necessary to trim the string.

• The same applies to the 'trim marker' string.

• The old implementation was buggy in the (unusual)
case that the trim marker is wider than the requested
maximum width of the result. It did an unsigned subtraction
of the requested width and the width of the trim marker. If the
width of the trim marker was greater, that subtraction would
underflow and yield a huge number. As a result, mb_strimwidth
would then pass the input string through, even if it was
far wider than the requested maximum width.

In that case, since the input string is wider than the
requested width, and NONE of it will fit together with the
trim marker, the new implementation returns just the trim
marker. This is the one case where the output can be wider
than the requested width: when BOTH the input string and
also the trim marker are too wide.

• Since it passed the input string and trim marker through
decoding/encoding filters, when using "Quoted-Printable" as
the encoding, newlines could be inserted into the trim marker
to maintain the maximum line length for QP.

This is an extremely bizarre use case and I don't think there
is any point in worrying about it. QP will be removed from
mbstring in time, anyways.

PERFORMANCE:

• From micro-benchmarking with various input string lengths and
text encodings, it appears that the new implementation is 2-3x
faster for UTF-8 and UTF-16. For legacy Japanese text encodings
like ISO-2022-JP or SJIS, the new implementation is perhaps 25%
faster.

• Note that correctly implementing negative 'from' and 'width'
arguments imposes a small performance burden in such cases; one
which the old implementation did not pay. This slightly skews
benchmarking results in favor of the old implementation. However,
even so, the new implementation is faster in all cases which I
tested.
2022-08-02 11:07:06 +02:00
Christoph M. Becker
3e922bf08f Merge branch 'PHP-8.1'
* PHP-8.1:
  Fix GH-9008: mb_detect_encoding(): wrong results with null $encodings
2022-07-20 17:01:42 +02:00
Christoph M. Becker
c2bdaa48e1 Fix GH-9008: mb_detect_encoding(): wrong results with null $encodings
Passing `null` to `$encodings` is supposed to behave like passing the
result of `mb_detect_order()`.  Therefore, we need to remove the non-
encodings from the `elist` in this case as well.  Thus, we duplicate
the global `elist`, so we can modify it.

Closes GH-9063.
2022-07-20 16:58:55 +02:00
Alex Dowad
6d525a425e Fix legacy conversion filter for ISO-2022-KR 2022-07-20 07:44:20 +02:00
Alex Dowad
9ac49c0dd3 New implementation of mb_convert_kana
mb_convert_kana now uses the new text encoding conversion
filters. Microbenchmarking shows speed gains of 50%-150%
across various text encodings and input string lengths.

The behavior is the same as the old mb_convert_kana
except for one fix: if the 'zero codepoint' U+0000 appeared
in the input, the old implementation would sometimes drop
it, not passing it through to the output. This is now
fixed.
2022-07-20 07:44:19 +02:00
Eric Norris
09237f6126 Update request startup error messages 2022-07-18 23:19:59 +01:00