1
0
mirror of https://github.com/php/php-src.git synced 2026-03-26 01:02:25 +01:00
Commit Graph

717 Commits

Author SHA1 Message Date
Alex Dowad
2dc9026cbc Restore backwards-compatible mappings of 0x5C and 0x7E in SJIS
According to the relevant Japan Industrial Standards Committee standards,
SJIS 0x5C is a Yen sign, and 0x7E is an overline.

However, this conflicts with the implementation of SJIS in various legacy
software (notably Microsoft products), where SJIS 0x5C and 0x7E are taken
as equivalent to the same ASCII bytes.

Prior to PHP 8.1, mbstring's implementation of SJIS handled these bytes
compatibly with Microsoft products. This was changed in PHP 8.1.0, in an
attempt to comply with the JISC specifications. However, after discussion
with various concerned Japanese developers, it seems that the historical
behavior was more useful in the majority of applications which process
SJIS-encoded text.

Since we are now treating SJIS 0x5C as equivalent to U+005C and 0x7E as
equivalent to U+007E, it does not make sense to convert U+203E (OVERLINE)
to 0x7E, nor does it make sense to convert U+00A5 (YEN SIGN) to 0x5C. Restore
the mappings for those codepoints from before PHP 8.1.0.
2022-06-10 21:04:36 +02:00
Alex Dowad
58d0aad75c mb_detect_encoding recognizes all letters in Hungarian alphabet 2022-05-25 08:22:07 +02:00
Alex Dowad
6a4b6d2344 mb_detect_encoding recognizes all letters in Czech alphabet 2022-05-25 07:52:39 +02:00
Alex Dowad
9bb97ee8bc Fix mb_detect_encoding's recognition of Slavic names
Thanks to Côme Chilliet for reporting that mb_detect_encoding was not
detecting the desired text encoding for strings containing š or Ž.
These characters are used in Czech, Serbian, Croatian, Bosnian,
Macedonian, etc. names.
2022-05-24 15:32:20 +02:00
Alex Dowad
04e59c916f Error handling for UTF-8 complies with WHATWG specification
In 7502c86342, I adjusted the number of error markers emitted on
invalid UTF-8 text to be more consistent with mbstring's behavior on
other text encodings (generally, it emits one error marker for one
unexpected byte). I didn't expect that anybody would actually care one
way or the other, but felt that it was better to be consistent than
not.

Later, Martin Auswöger kindly pointed out that the WHATWG encoding
specification, which governs how various text encodings are handled
by web browsers, does actually specify how many error markers should
be generated for any given piece of invalid UTF-8 text.

Until now, we have never really paid much attention to the WHATWG
specification, but we do want to comply with as many relevant
specifications as possible. And since PHP is commonly used for web
applications, compatibility with the behavior of web browsers is
obviously a good thing.
2022-04-16 15:04:38 +02:00
Christoph M. Becker
5003831260 Merge branch 'PHP-8.0' into PHP-8.1
* PHP-8.0:
  Fix GH-8208: mb_encode_mimeheader: $indent functionality broken
2022-03-17 17:34:31 +01:00
Christoph M. Becker
d0417ebc93 Fix GH-8208: mb_encode_mimeheader: $indent functionality broken
We also need to factor in the indent, when getting the encoder result.

Closes GH-8213.
2022-03-17 17:31:58 +01:00
Alex Dowad
8a8533d263 mb_check_encoding($str, '7bit') rejects strings with bytes over 0x7F
This was the old behavior of mb_check_encoding() before 3e7acf901d,
but yours truly broke it. If only we had more thorough tests at that
time, this might not have slipped through the cracks.

Thanks to divinity76 for the report.
2022-02-22 23:56:56 +02:00
Christoph M. Becker
69f6b09b2a Merge branch 'PHP-8.0' into PHP-8.1
* PHP-8.0:
  Fix GH-7902: mb_send_mail may delimit headers with LF only
2022-01-18 13:09:52 +01:00
Christoph M. Becker
03816fba46 Fix GH-7902: mb_send_mail may delimit headers with LF only
Email headers are supposed to be separated with CRLF. Period.

We introduce a `CRLF` macro for better comprehensibility right away.

Closes GH-7907.
2022-01-18 13:08:08 +01:00
Alex Dowad
f07c193583 mb_convert_encoding will not auto-detect input string as UUEncode, Base64, QPrint
In a2bc57e0e5, mb_detect_encoding was modified to ensure it would never
return 'UUENCODE', 'QPrint', or other non-encodings as the "detected
text encoding". Before mb_detect_encoding was enhanced so that it could
detect any supported text encoding, those were never returned, and they
are not desired. Actually, we want to eventually remove them completely
from mbstring, since PHP already contains other implementations of
UUEncode, QPrint, Base64, and HTML entities.

For more clarity on why we need to suppress UUEncode, etc. from being
detected by mb_detect_encoding, the existing UUEncode implementation
in mbstring *never* treats any input as erroneous. It just accepts
everything. This means that it would *always* be treated as a valid
choice by mb_detect_encoding, and would be returned in many, many cases
where the input is obviously not UUEncoded.

It turns out that the form of mb_convert_encoding where the user passes
multiple candidate encodings (and mbstring auto-detects which one to
use) was also affected by the same issue. Apply the same fix.
2021-12-20 22:09:33 +02:00
Christoph M. Becker
929d847152 Fix #81693: mb_check_encoding(7bit) segfaults
`php_mb_check_encoding()` now uses conversion to `mbfl_encoding_wchar`.
Since `mbfl_encoding_7bit` has no `input_filter`, no filter can be
found.  Since we don't actually need to convert to wchar, we encode to
8bit.

Closes GH-7712.
2021-12-03 22:49:47 +01:00
Alex Dowad
1a2c608053 Add unit tests for mb_detect_encoding on Polish text 2021-11-26 17:42:53 +02:00
Alex Dowad
a2bc57e0e5 mb_detect_encoding will not return non-encodings
Among the text encodings supported by mbstring are several which are
not really 'text encodings'. These include Base64, QPrint, UUencode,
HTML entities, '7 bit', and '8 bit'.

Rather than providing an explicit list of text encodings which they are
interested in, users may pass the output of mb_list_encodings to
mb_detect_encoding. Since Base64, QPrint, and so on are included in
the output of mb_list_encodings, mb_detect_encoding can return one of
these as its 'detected encoding' (and in fact, this often happens).
Before mb_detect_encoding was enhanced so it could detect any of the
supported text encodings, this did not happen, and it is never desired.
2021-10-19 18:05:52 +02:00
Alex Dowad
28b346bc06 Improve detection accuracy of mb_detect_encoding
Originally, `mb_detect_encoding` essentially just checked all candidate
encodings to see which ones the input string was valid in. However, it
was only able to do this for a limited few of all the text encodings
which are officially supported by mbstring.

In 3e7acf901d, I modified it so it could 'detect' any text encoding
supported by mbstring. While this is arguably an improvement, if the
only text encodings one is interested in are those which
`mb_detect_encoding` could originally handle, the old
`mb_detect_encoding` may have been preferable. Because the new one has
more possible encodings which it can guess, it also has more chances to
get the answer wrong.

This commit adjusts the detection heuristics to provide accurate
detection in a wider variety of scenarios. While the previous detection
code would frequently confuse UTF-32BE with UTF-32LE or UTF-16BE with
UTF-16LE, the adjusted code is extremely accurate in those cases.
Detection for Chinese text in Chinese encodings like GB18030 or BIG5
and for Japanese text in Japanese encodings like EUC-JP or SJIS is
greatly improved. Detection of UTF-7 is also greatly improved. An 8KB
table, with one bit for each codepoint from U+0000 up to U+FFFF, is
used to achieve this.

One significant constraint is that the heuristics are completely based
on looking at each codepoint in a string in isolation, treating some
codepoints as 'likely' and others as 'unlikely'. It might still be
possible to achieve great gains in detection accuracy by looking at
sequences of codepoints rather than individual codepoints. However,
this might require huge tables. Further, we might need a huge corpus
of text in various languages to derive those tables.

Accuracy is still dismal when trying to distinguish single-byte
encodings like ISO-8859-1, ISO-8859-2, KOI8-R, and so on. This is
because the valid bytes in these encodings are basically all the same,
and all valid bytes decode to 'likely' codepoints, so our method of
detection (which is based on rating codepoints as likely or unlikely)
cannot tell any difference between the candidates at all. It just
selects the first encoding in the provided list of candidates.

Speaking of which, if one wants to get good results from
`mb_detect_encoding`, it is important to order the list of candidate
encodings according to your prior belief of which are more likely to
be correct. When the function cannot tell any difference between two
candidates, it returns whichever appeared earlier in the array.
2021-10-19 18:05:51 +02:00
Alex Dowad
c25a1ef8d0 Bug #81390: mb_detect_encoding should not prematurely stop processing input
As a performance optimization, mb_detect_encoding tries to stop
processing the input string early when there is only one 'candidate'
encoding which the input string is valid in. However, the code which
keeps count of how many candidate encodings have already been rejected
was buggy. This caused mb_detect_encoding to prematurely stop
processing the input when it should have continued.

As a result, it did not notice that in the test case provided by Alec,
the input string was not valid in UTF-16.
2021-09-20 11:21:39 +02:00
Alex Dowad
df32267494 Add more tests for UTF7-IMAP text conversion 2021-08-31 13:41:34 +02:00
Alex Dowad
16a1e0a219 In UTF7-IMAP, reject the 2nd part of surrogate pair if it appears unexpectedly 2021-08-31 13:41:34 +02:00
Alex Dowad
355464935d Add another test for UTF-7 text conversion 2021-08-31 13:41:34 +02:00
Alex Dowad
51b6c687db Add another test for GB18030 text conversion 2021-08-31 13:41:34 +02:00
Alex Dowad
a0415b22ab Add more tests for CP5022{0,1,2} text conversion 2021-08-31 13:41:34 +02:00
Alex Dowad
e3f6a9fbfe CP5022{0,1,2} supports 'IBM extension' codes from ku 115-119
mbstring has always had the conversion tables to support CP932 codes
in ku 115-119, and the conversion code for CP5022x has an 'if' clause
specifically to handle such characters... but that 'if' clause was dead
code, since a guard clause earlier in the same function prevented it
from accepting 2-byte characters with a starting byte of 0x93-0x97.

Adjust the guard clause so that these characters can be converted as
the original author apparently intended.

The code which handles ku 115-119 is the part which reads:

    } else if (s >= cp932ext3_ucs_table_min && s < cp932ext3_ucs_table_max) {
      w = cp932ext3_ucs_table[s - cp932ext3_ucs_table_min];
2021-08-31 13:41:34 +02:00
Alex Dowad
671dcee01e Add test for mb_str_split on UCS-2 text 2021-08-31 13:41:34 +02:00
Alex Dowad
776296e12f mbstring no longer provides 'long' substitutions for erroneous input bytes
Previously, mbstring had a special mode whereby it would convert
erroneous input byte sequences to output like "BAD+XXXX", where "XXXX"
would be the erroneous bytes expressed in hexadecimal. This mode could
be enabled by calling `mb_substitute_character("long")`.

However, accurately reproducing input byte sequences from the cached
state of a conversion filter is often tricky, and this significantly
complicates the implementation. Further, the means used for passing
the erroneous bytes through to where the "BAD+XXXX" text is generated
only allows for up to 3 bytes to be passed, meaning that some erroneous
byte sequences are truncated anyways.

More to the point, a search of publically available PHP code indicates
that nobody is really using this feature anyways.

Incidentally, this feature also provided error output like "JIS+XXXX"
if the input 'should have' represented a JISX 0208 codepoint, but it
decodes to a codepoint which does not exist in the JISX 0208 charset.
Similarly, specific error output was provided for non-existent
JISX 0212 codepoints, and likewise for JISX 0213, CP932, and a few
other charsets. All of that is now consigned to the flames.

However, "long" error markers also include a somewhat more useful
"U+XXXX" marker for Unicode codepoints which were successfully
decoded from the input text, but cannot be represented in the output
encoding. Those are still supported.

With this change, there is no need to use a variety of special values
in the high bits of a wchar to represent different types of error
values. We can (and will) just use a single error value. This will be
equal to -1.

One complicating factor: Text conversion functions return an integer to
indicate whether the conversion operation should be immediately
aborted, and the magic 'abort' marker is -1. Also, almost all of these
functions would return the received byte/codepoint to indicate success.
That doesn't work with the new error value; if an input filter detects
an error and passes -1 to the output filter, and the output filter
returns it back, that would be taken to mean 'abort'.

Therefore, amend all these functions to return 0 for success.
2021-08-31 13:41:34 +02:00
Alex Dowad
15ba73cee3 Add more tests for UTF-8 text conversion 2021-08-30 16:29:58 +02:00
Alex Dowad
51a32ccaf4 Add another test for UTF-16LE 2021-08-30 16:29:58 +02:00
Alex Dowad
7472c82c45 Add tests for UCS-4 text conversion 2021-08-30 16:29:58 +02:00
Alex Dowad
79015b23aa Add tests for UCS-2 text encoding 2021-08-30 16:29:58 +02:00
Alex Dowad
34ef8f3ca2 Add tests for '7bit' and '8bit' text encodings in mbstring 2021-08-30 16:29:58 +02:00
Alex Dowad
e6f1a72235 Add test suite for mobile variants of UTF-8 (and fix bugs) 2021-08-30 16:29:58 +02:00
Alex Dowad
1865576694 Add test suite for EUC-JP-WIN (or EUC-JP-MS) text encoding (and fix bugs) 2021-08-30 16:29:58 +02:00
Alex Dowad
0de4d6872e Add more tests for SJIS-2004 text conversion 2021-08-30 16:29:58 +02:00
Alex Dowad
c7d47cbb4c Add more tests for SJIS text conversion 2021-08-30 16:29:58 +02:00
Alex Dowad
299690a1cf Add more tests for ISO-2022-JP/JIS7/JIS8 text conversion 2021-08-30 16:29:58 +02:00
Alex Dowad
b2be85d11a Add more tests for ISO-2022-JP-MS text conversion 2021-08-30 16:29:58 +02:00
Alex Dowad
ae4c956089 Add more tests for ISO-2022-JP-KDDI text conversion 2021-08-30 16:29:58 +02:00
Alex Dowad
51e0d323e4 ISO-2022-JP-MS treats truncated multi-byte chars as error
Sigh. I included tests which were intended to check this case in the
test suite for ISO-2022-JP-MS, but those tests were faulty and didn't
actually test what they were supposed to.

Fixing the tests revealed that there were still bugs in this area.
2021-08-30 16:29:58 +02:00
Alex Dowad
57a81af041 ISO-2022-JP-KDDI text conversion doesn't swallow PUA codepoints
There was a bit of legacy code here which looks like the original author
of mbstring intended to allow conversion of Unicode Private Use Area
codepoints to ISO-2022-JP-KDDI. However, that code never worked.
It set the output variable to values which were not matched by any
of the 'if' clauses below, which meant that nothing was actually
emitted to the output. In other words, if one tried to convert Unicode
to ISO-2022-JP-KDDI, and the Unicode string contained PUA codepoints,
they would be quietly 'swallowed' and disappear.

I don't know what ISO-2022-JP-KDDI byte sequences the author wanted
to map those PUA codepoints to, and anyways, this use case is so obscure
that there is little point in worrying about it. However, it is better
to remove the non-functioning code than to leave it in.

This means that if now one tries to convert PUA codepoints to
ISO-2022-JP-KDDI, those codepoints will be treated as erroneous rather
than silently ignored.
2021-08-30 16:29:58 +02:00
Alex Dowad
51b9d7a5e1 Test behavior of 'long' illegal character markers
After mb_substitute_character("long"), mbstring will respond to
erroneous input by inserting 'long' error markers into the output.
Depending on the situation, these error markers will either look like
BAD+XXXX (for general bad input), U+XXXX (when the input is OK, but it
converts to Unicode codepoints which cannot be represented in the
output encoding), or an encoding-specific marker like JISX+XXXX or
W932+XXXX.

We have almost no tests for this feature. Add a bunch of tests to
ensure that all our legacy encoding handlers work in a reasonable
way when 'long' error markers are enabled.
2021-08-30 16:29:58 +02:00
Nikita Popov
43cb2548f7 Flush filter during non-strict encoding detection
If we reach the end of the string without reducing to a single
encoding, then we should flush to check whether the last character
is incomplete.
2021-08-27 14:48:32 +02:00
Nikita Popov
14173186db Add EXTENSIONS section 2021-08-11 14:03:18 +02:00
Nikita Popov
28500fe4ef Fixed bug #81349
The ascii to wchar was reporting errors using conv_illegal_output,
while it should have been using WCSGROUP_THROUGH. Effectively that
replaced illegal characters with '?' for the purpose of
identification.
2021-08-11 11:37:02 +02:00
Nikita Popov
89aa42c74b Add missing EXTENSIONS section 2021-07-28 12:38:41 +02:00
Nikita Popov
a1c1ee6a48 Don't use opaque for encoding detection score
opaque is used by the htmlentities filter, which means that we
end up trying to free the score value as a pointer. Don't try to
be overly tricky here and simply allocate a separate structure
to hold the number of illegal characters and the score.
2021-07-28 10:54:27 +02:00
Nikita Popov
9d0db2e98a Fixed bug #81298
Creation of the filter may fail for some special encodings, for
which detection is not supported.
2021-07-28 10:11:46 +02:00
Alex Dowad
13136a575d Fix conversion of GB18030 text (and add test suite)
- Truncated multi-byte characters are treated as an error
- Reject GB18030 4-byte codes which translate to (non-existent)
  Unicode codepoints above 0x10FFFF
- Add a number of missing mappings from the GB18030 standards
  (These mappings are supported by iconv. I don't know why they were
  missing from mbstring.)
2021-07-19 12:17:00 +02:00
Alex Dowad
73c6a5b89d Fix conversion of Big5 and CP950 text (and add test suite)
- Truncated multi-byte characters are treated as an error
- Follow recommended mappings from Unicode consortium
2021-07-19 12:17:00 +02:00
Nikita Popov
639015845f Deprecate calling mb_check_encoding() without argument
Part of https://wiki.php.net/rfc/deprecations_php_8_1.
2021-07-08 15:34:49 +02:00
Alex Dowad
b626e893ff Fix conversion of ISO-2022-KR text (and add test suite)
- Truncated multi-byte characters are treated as an error
- Truncated or unrecognized escape sequences are treated as an error
- ASCII control characters are not allowed to appear in the middle
  of a multi-byte character
2021-07-05 16:28:16 +02:00
Alex Dowad
0a8c00755d Fix conversion of EUC-JP-2004 text (and add test suite)
- Truncated multi-byte characters are treated as an error now
- Invalid multi-byte characters are treated as an error rather than
  being quietly swallowed
- ASCII control characters are not allowed to appear in the middle
  of a multi-byte character
2021-07-05 16:28:16 +02:00