Previously, mbstring used the same logic for encoding validation as for
encoding conversion.
However, there are cases where we want to use different logic for validation
and conversion. For example, if a string ends up with missing input
required by the encoding, or if a character is input that is invalid
as an encoding but can be converted, the conversion should succeed and
the validation should fail.
To achieve this, a function pointer mb_check_fn has been added to
struct mbfl_encoding to implement the logic used for validation.
Also, added implementation of validation logic for UTF-7, UTF7-IMAP,
ISO-2022-JP and JIS.
(The same change has already been made to PHP 8.2 and 8.3; see
6fc8d014df. This commit is backporting the change to PHP 8.1.)
Some places use an if check, which implicitly checks for a non-zero
value, and some places use > 0. The > 0 is the correct one because at
least some of those functions already use the CK() macro to return -1 on
error. Because -1 != 0 this is wrongly interpreted as a success instead
of a failure.
As a performance optimization, mbstring implements some functions using
tables which give the (byte) length of a multi-byte character using a
lookup based on the value of the first byte. These tables are called
`mblen_table`.
For many years, the mblen_table for SJIS has had '2' in position 0x80.
That is wrong; it should have been '1'. Reasons:
For SJIS, SJIS-2004, and mobile variants of SJIS, 0x80 has never been
treated as the first byte of a 2-byte character. It has always been
treated as a single erroneous byte. On the other hand, 0x80 is a valid
character in MacJapanese... but a 1-byte character, not a 2-byte one.
The same applies to bytes 0xFD-FF; these are 1-byte characters in
MacJapanese, and in other SJIS variants, they are not valid (as the
first byte of a character).
Thanks to the GitHub user 'youkidearitai' for finding this problem.
In b5ff87ca71, I made a number of adjustments to our conversion code
for CP1252. One of the adjustments was to make the mappings match those
published by the Unicode Consortium in the file CP1252.TXT. These do
not include mappings for the CP1252 bytes 0x81, 0x8D, 0x8F, 0x90, and
0x9D.
Rostyslav Gulka reported that this caused a problem. His application
stores binary JPEG data in an MS-SQL database. When they SELECT the
binary data out of the database, it is treated as CP1252 text and
automatically converted to UTF-8. To recover the original binary
data, they then do a conversion from UTF-8 to CP1252.
Obviously, that does not work if certain CP1252 bytes do not map to
any Unicode codepoint at all.
While this is a very unusual application of text encoding conversion,
and we might choose not to support it if there was no other basis for
including those mappings, it seems that Microsoft does actually include
them in the Win32 API as "best fit" mappings. These are extra mappings
from Unicode to other text encodings, which the Win32 API function
WideCharToMultiByte uses by default unless the WC_NO_BEST_FIT_CHARS
flag was passed.
A list of these "best fit" mappings for CP1252 can be found here:
https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt
The existing implementation of mb_strcut extracts part of a
multi-byte encoded string by pulling out raw bytes and then running
them through a conversion filter to ensure that the output is valid
in the requested encoding.
If the conversion filter emits error markers when doing the final
'flush' operation which ends the conversion of the extracted bytes,
these error markers may (in some cases) be included in the output.
The conversion operation does not respect the value of
mb_substitute_character; rather, it always uses '?' as an error marker.
So this issue manifests itself as unwanted '?' characters being
inserted into the output.
This issue has existed for a long time, but became noticeable in PHP
8.1 because for at least some of the supported text encodings, mbstring
is now more strict about emitting error markers when strings end in an
illegal state.
The simplest fix is to suppress error markers during the final flush
operation.
While working on a fix for this problem, another problem with mb_strcut
was discovered; since it decides when to stop consuming bytes from
the input by looking at the byte length of its OUTPUT, anything which
causes extra bytes to be emitted to the output may cause mb_strcut to
not consume all the bytes in the requested range.
The one case where we DO emit extra output bytes is for encodings
which have a selectable mode, like ISO-2022-JP; if a string in such
an encoding ends in a mode which is not the default, we emit an ending
escape sequence which changes back to the default mode. This is done
so that concatenating strings in such encodings is safe.
However, as mentioned, this can cause the output of mb_strcut to be
shorter than it logically should be. This bug has existed for a long
time, and fixing it now will be a BC break, so we may not fix it right
away.
Therefore, tests for THIS fix which don't pass because of that OTHER
bug have been split out into a separate test file (gh9535b.phpt), and
that file has been marked XFAIL.
Up until now, I believed that mbstring had been designed such
that (legacy) text conversion filter objects should not be
re-used after the 'flush' function is called to complete a
text conversion operation.
However, it turns out that the implementation of
_php_mb_encoding_handler_ex DID re-use filter objects
after flush. That means that functions which were based on
_php_mb_encoding_handler_ex, including mb_parse_str and
php_mb_post_handler, would break in some cases; state left
over from converting one substring (perhaps a variable name)
would affect the results of converting another substring
(perhaps the value of the same variable), and could cause
extraneous characters to get inserted into the output.
All this code should be deleted soon, but fixing it helps me
to avoid spurious failures when fuzzing the new/old code to
look for differences in behavior.
(This bug fix commit was originally applied to PHP-8.2 when fuzzing
the new mbstring text conversion code to check for differences with
the old code. Later, Kentaro Ohkouchi kindly reported a problem with
mb_encode_mimeheader under PHP 8.1 which was caused by the same issue.
Hence, this commit was backported to PHP-8.1.)
Fixes GH-9683.
In 0d0029d729 and 315d48b434, I changed the mappings used for Unicode
to Shift-JIS-2004, in an attempt to follow the JISC specification
more closely. However, feedback from Japanese PHP users indicates
that most users of SJIS-2004 expect 0x5C and 0x7E to be treated as
equivalent to the same ASCII bytes. This is due to a long history of
non-complying implementations which then became a de-facto standard.
Therefore, restore the earlier mappings for U+005C and U+007E.
Thanks to the GitHub user 'youkidearitai' for reporting this issue.
Fixes GH-9528.
In e2459857af, I combined mbstring's "SJIS-win" text encoding
into CP932. This was done after doing some testing which appeared
to show that the mappings for "SJIS-win" were the same as those
for "CP932".
Later, it was found that there was actually a small difference
prior to e2459857af when converting Unicode to CP932. The
mappings for the following two codepoints were different:
CP932 SJIS-win
U+203E 0x7E 0x81 0x50
U+00A5 0x5C 0x81 0x8F
As shown, mbstring's "CP932" mapped Unicode's 'OVERLINE' and
'YEN SIGN' to the ASCII bytes which have conflicting uses in
most legacy Japanese text encodings. "SJIS-win" mapped these
to equivalent JIS X 0208 fullwidth characters.
Since e2459867af was not intended to cause any user-visible
change in behavior, I am rolling back the merge of "CP932"
and "SJIS-win".
It seems doubtful whether these two text encodings should
be kept separate or merged in a future release. An extensive
discussion of the related historical background and
compatibility issues involved can be found in this
GitHub thread:
https://github.com/php/php-src/issues/8308
According to the relevant Japan Industrial Standards Committee standards,
SJIS 0x5C is a Yen sign, and 0x7E is an overline.
However, this conflicts with the implementation of SJIS in various legacy
software (notably Microsoft products), where SJIS 0x5C and 0x7E are taken
as equivalent to the same ASCII bytes.
Prior to PHP 8.1, mbstring's implementation of SJIS handled these bytes
compatibly with Microsoft products. This was changed in PHP 8.1.0, in an
attempt to comply with the JISC specifications. However, after discussion
with various concerned Japanese developers, it seems that the historical
behavior was more useful in the majority of applications which process
SJIS-encoded text.
Since we are now treating SJIS 0x5C as equivalent to U+005C and 0x7E as
equivalent to U+007E, it does not make sense to convert U+203E (OVERLINE)
to 0x7E, nor does it make sense to convert U+00A5 (YEN SIGN) to 0x5C. Restore
the mappings for those codepoints from before PHP 8.1.0.
In 7502c86342, I adjusted the number of error markers emitted on
invalid UTF-8 text to be more consistent with mbstring's behavior on
other text encodings (generally, it emits one error marker for one
unexpected byte). I didn't expect that anybody would actually care one
way or the other, but felt that it was better to be consistent than
not.
Later, Martin Auswöger kindly pointed out that the WHATWG encoding
specification, which governs how various text encodings are handled
by web browsers, does actually specify how many error markers should
be generated for any given piece of invalid UTF-8 text.
Until now, we have never really paid much attention to the WHATWG
specification, but we do want to comply with as many relevant
specifications as possible. And since PHP is commonly used for web
applications, compatibility with the behavior of web browsers is
obviously a good thing.
This was the old behavior of mb_check_encoding() before 3e7acf901d,
but yours truly broke it. If only we had more thorough tests at that
time, this might not have slipped through the cracks.
Thanks to divinity76 for the report.
`php_mb_check_encoding()` now uses conversion to `mbfl_encoding_wchar`.
Since `mbfl_encoding_7bit` has no `input_filter`, no filter can be
found. Since we don't actually need to convert to wchar, we encode to
8bit.
Closes GH-7712.
Originally, `mb_detect_encoding` essentially just checked all candidate
encodings to see which ones the input string was valid in. However, it
was only able to do this for a limited few of all the text encodings
which are officially supported by mbstring.
In 3e7acf901d, I modified it so it could 'detect' any text encoding
supported by mbstring. While this is arguably an improvement, if the
only text encodings one is interested in are those which
`mb_detect_encoding` could originally handle, the old
`mb_detect_encoding` may have been preferable. Because the new one has
more possible encodings which it can guess, it also has more chances to
get the answer wrong.
This commit adjusts the detection heuristics to provide accurate
detection in a wider variety of scenarios. While the previous detection
code would frequently confuse UTF-32BE with UTF-32LE or UTF-16BE with
UTF-16LE, the adjusted code is extremely accurate in those cases.
Detection for Chinese text in Chinese encodings like GB18030 or BIG5
and for Japanese text in Japanese encodings like EUC-JP or SJIS is
greatly improved. Detection of UTF-7 is also greatly improved. An 8KB
table, with one bit for each codepoint from U+0000 up to U+FFFF, is
used to achieve this.
One significant constraint is that the heuristics are completely based
on looking at each codepoint in a string in isolation, treating some
codepoints as 'likely' and others as 'unlikely'. It might still be
possible to achieve great gains in detection accuracy by looking at
sequences of codepoints rather than individual codepoints. However,
this might require huge tables. Further, we might need a huge corpus
of text in various languages to derive those tables.
Accuracy is still dismal when trying to distinguish single-byte
encodings like ISO-8859-1, ISO-8859-2, KOI8-R, and so on. This is
because the valid bytes in these encodings are basically all the same,
and all valid bytes decode to 'likely' codepoints, so our method of
detection (which is based on rating codepoints as likely or unlikely)
cannot tell any difference between the candidates at all. It just
selects the first encoding in the provided list of candidates.
Speaking of which, if one wants to get good results from
`mb_detect_encoding`, it is important to order the list of candidate
encodings according to your prior belief of which are more likely to
be correct. When the function cannot tell any difference between two
candidates, it returns whichever appeared earlier in the array.
As a performance optimization, mb_detect_encoding tries to stop
processing the input string early when there is only one 'candidate'
encoding which the input string is valid in. However, the code which
keeps count of how many candidate encodings have already been rejected
was buggy. This caused mb_detect_encoding to prematurely stop
processing the input when it should have continued.
As a result, it did not notice that in the test case provided by Alec,
the input string was not valid in UTF-16.
...By just testing the input codepoints if they are within a few fixed
ranges instead. This avoids hash lookups in property tables.
From (micro-)benchmarking on my PC, this looks to be a bit less than 4x
faster than the existing code.
mbstring has always had the conversion tables to support CP932 codes
in ku 115-119, and the conversion code for CP5022x has an 'if' clause
specifically to handle such characters... but that 'if' clause was dead
code, since a guard clause earlier in the same function prevented it
from accepting 2-byte characters with a starting byte of 0x93-0x97.
Adjust the guard clause so that these characters can be converted as
the original author apparently intended.
The code which handles ku 115-119 is the part which reads:
} else if (s >= cp932ext3_ucs_table_min && s < cp932ext3_ucs_table_max) {
w = cp932ext3_ucs_table[s - cp932ext3_ucs_table_min];
Previously, mbstring had a special mode whereby it would convert
erroneous input byte sequences to output like "BAD+XXXX", where "XXXX"
would be the erroneous bytes expressed in hexadecimal. This mode could
be enabled by calling `mb_substitute_character("long")`.
However, accurately reproducing input byte sequences from the cached
state of a conversion filter is often tricky, and this significantly
complicates the implementation. Further, the means used for passing
the erroneous bytes through to where the "BAD+XXXX" text is generated
only allows for up to 3 bytes to be passed, meaning that some erroneous
byte sequences are truncated anyways.
More to the point, a search of publically available PHP code indicates
that nobody is really using this feature anyways.
Incidentally, this feature also provided error output like "JIS+XXXX"
if the input 'should have' represented a JISX 0208 codepoint, but it
decodes to a codepoint which does not exist in the JISX 0208 charset.
Similarly, specific error output was provided for non-existent
JISX 0212 codepoints, and likewise for JISX 0213, CP932, and a few
other charsets. All of that is now consigned to the flames.
However, "long" error markers also include a somewhat more useful
"U+XXXX" marker for Unicode codepoints which were successfully
decoded from the input text, but cannot be represented in the output
encoding. Those are still supported.
With this change, there is no need to use a variety of special values
in the high bits of a wchar to represent different types of error
values. We can (and will) just use a single error value. This will be
equal to -1.
One complicating factor: Text conversion functions return an integer to
indicate whether the conversion operation should be immediately
aborted, and the magic 'abort' marker is -1. Also, almost all of these
functions would return the received byte/codepoint to indicate success.
That doesn't work with the new error value; if an input filter detects
an error and passes -1 to the output filter, and the output filter
returns it back, that would be taken to mean 'abort'.
Therefore, amend all these functions to return 0 for success.
This is to match the way that we handle UCS-2. When a BOM is found at
the beginning of a 'UCS-2' string (NOT 'UCS-2BE' or 'UCS-2LE'), we take
note of the intended byte order and handle the string accordingly, but
do NOT emit a BOM to the output. Rather, we just use the default byte
order for the requested output encoding.
Some might argue that if the input string used a BOM, and we are
emitting output in a text encoding where both big-endian and
little-endian byte orders are possible, we should include a BOM in the
output string. To such hypothetical debaters of minutiae, I can only
offer you a shoulder shrug. No reasonable program which handles UCS-2
and UCS-4 text should require a BOM.
Really, the concept of the BOM is a poor idea and should not have been
included in Unicode. Standardizing on a single byte order would have
been much better, similar to 'network byte order' for the Internet
Protocol. But this is not the place to speak at length of such things.
Sigh. I included tests which were intended to check this case in the
test suite for ISO-2022-JP-MS, but those tests were faulty and didn't
actually test what they were supposed to.
Fixing the tests revealed that there were still bugs in this area.
There was a bit of legacy code here which looks like the original author
of mbstring intended to allow conversion of Unicode Private Use Area
codepoints to ISO-2022-JP-KDDI. However, that code never worked.
It set the output variable to values which were not matched by any
of the 'if' clauses below, which meant that nothing was actually
emitted to the output. In other words, if one tried to convert Unicode
to ISO-2022-JP-KDDI, and the Unicode string contained PUA codepoints,
they would be quietly 'swallowed' and disappear.
I don't know what ISO-2022-JP-KDDI byte sequences the author wanted
to map those PUA codepoints to, and anyways, this use case is so obscure
that there is little point in worrying about it. However, it is better
to remove the non-functioning code than to leave it in.
This means that if now one tries to convert PUA codepoints to
ISO-2022-JP-KDDI, those codepoints will be treated as erroneous rather
than silently ignored.
After mb_substitute_character("long"), mbstring will respond to
erroneous input by inserting 'long' error markers into the output.
Depending on the situation, these error markers will either look like
BAD+XXXX (for general bad input), U+XXXX (when the input is OK, but it
converts to Unicode codepoints which cannot be represented in the
output encoding), or an encoding-specific marker like JISX+XXXX or
W932+XXXX.
We have almost no tests for this feature. Add a bunch of tests to
ensure that all our legacy encoding handlers work in a reasonable
way when 'long' error markers are enabled.
Some text encodings supported by mbstring (such as UCS-4) accept 4-byte
characters. When mbstring encounters an illegal byte sequence for the
encoding it is using, it should emit an 'illegal character' marker,
which can either be a single character like '?', an HTML hexadecimal
entity, or a marker string like 'BAD+XXXX'.
Because of the use of signed integers to hold 4-byte characters,
illegal 4-byte sequences with a 'negative' value (one with the high
bit set) were not handled correctly when emitting the illegal char
marker. The result is that such illegal sequences were just skipped
over (and the marker was not emitted to the output). Fix that.
The "wchar" encoding isn't really an encoding -- it's what we
internally use as the representation of decoded characters.
In practice, it tends to behave a lot like the 8bit encoding when
used from userland, because input code units end up being treated
as code points.
This patch removes the wchar encoding from the public encoding
list and reserves it for internal use only.
The ascii to wchar was reporting errors using conv_illegal_output,
while it should have been using WCSGROUP_THROUGH. Effectively that
replaced illegal characters with '?' for the purpose of
identification.
opaque is used by the htmlentities filter, which means that we
end up trying to free the score value as a pointer. Don't try to
be overly tricky here and simply allocate a separate structure
to hold the number of illegal characters and the score.
- Truncated multi-byte characters are treated as an error
- Reject GB18030 4-byte codes which translate to (non-existent)
Unicode codepoints above 0x10FFFF
- Add a number of missing mappings from the GB18030 standards
(These mappings are supported by iconv. I don't know why they were
missing from mbstring.)
- Truncated multi-byte characters are treated as an error
- Truncated or unrecognized escape sequences are treated as an error
- ASCII control characters are not allowed to appear in the middle
of a multi-byte character
- Truncated multi-byte characters are treated as an error now
- Invalid multi-byte characters are treated as an error rather than
being quietly swallowed
- ASCII control characters are not allowed to appear in the middle
of a multi-byte character