The previous version of the GB-18030 standard was published in 2005.
This commit adds support for the updated (2022) version of this text
encoding. The existing GB18030 implementation has been left unchanged
for backwards compatibility; users who want to use the new standard
must explicitly indicate the desired text encoding is 'GB18030-2022'.
The document which defines GB18030-2022, published by the government
of the People's Republic of China, defines three levels of standards
compliance. This implementation is intended to achieve Implementation
Level 3, which is the highest level of compliance.
Experts in the GB18030 standard are requested to assess this
implementation and report any deviation from the standard.
mbfl_name2encoding() uses a linear loop through the encodings, comparing
the name one by one, which is very slow. For the benchmark [1] just
looking up the name takes about 50% of run-time.
By using perfect hashing instead, we no longer have to loop over the
list, and the number of string comparisons is reduced to just a single
one. The perfect hashing table is generated using GNU gperf and amended
manually to fit in with mbstring and manually changed to reduce the
cache size.
[1] https://github.com/php/php-src/issues/12684#issuecomment-1813799924
The old implementation runs through the entire string to pick out the
part which should be returned by mb_strcut. This creates significant
performance overhead. The new specialized implementation of mb_strcut
for UTF-8 usually only examines a few bytes around the starting and
ending cut points, meaning it generally runs in constant time.
For UTF-8 strings just a few bytes long, the new implementation is
around 10% faster (according to microbenchmarks which I ran locally).
For strings around 10,000 bytes in length, it is 50-300x faster.
(Yes, that is 300x and not 300%.)
The new implementation behaves identically to the old one on VALID
UTF-8 strings; a fuzzer was used to help ensure this is the case.
On invalid UTF-8 strings, there is a difference: in some cases, the
old implementation will pass invalid byte sequences through unchanged,
while in others it will remove them. The new implementation has
behavior which is perhaps slightly more predictable: it simply backs
up the starting and ending cut points to the preceding "starter
byte" (one which is not a UTF-8 continuation byte).
This will make it easier to combine duplicated code between all the
CJK text encodings (a significant amount is already combined in this
commit, such as the repeated definitions of SJIS_DECODE and
SJIS_ENCODE), but I hope to remove even more redundancy in the future.
The table used to implement mb_strlen for CP932 has been changed to
the same table as "SJIS-win".
For mb_parse_str, when mbstring.http_input (INI parameter) is a list of
multiple possible text encodings (which is not the case by default),
this new implementation is about 25% faster.
When mbstring.http_input is a single value, then nothing is changed.
(No automatic encoding detection is done in that case.)
Previously, mbstring used the same logic for encoding validation as for
encoding conversion.
However, there are cases where we want to use different logic for validation
and conversion. For example, if a string ends up with missing input
required by the encoding, or if a character is input that is invalid
as an encoding but can be converted, the conversion should succeed and
the validation should fail.
To achieve this, a function pointer mb_check_fn has been added to
struct mbfl_encoding to implement the logic used for validation.
Also, added implementation of validation logic for UTF-7, UTF7-IMAP,
ISO-2022-JP and JIS.
The behavior of the new mb_encode_mimeheader implementation closely
follows the old implementation, except for three points:
• The old implementation was missing a call to the mbfl_convert_filter
flush function. So it would sometimes truncate the input string just
before its end.
• The old implementation would drop zero bytes when QPrint-encoding.
So for example, if you tried to QPrint-encode the UTF-32BE string
"\x00\x00\x12\x34", its QPrint-encoding would be "=12=34", which
does not decode to a valid UTF-32BE string. This is now fixed.
• In some rare corner cases, the new implementation will choose to
Base64-encode or QPrint-encode the input string, where the old
implementation would have just added newlines to it. Specifically,
this can happen when there is a non-space ASCII character, followed
by a large number of ASCII spaces, followed by a non-ASCII character.
The new implementation is around 2.5-8x faster than the old one,
depending on the text encoding and transfer encoding used. Performance
gains are greater with Base64 transfer encoding than with QPrint
transfer encoding; this is not because QPrint-encoding bytes is slow,
but because QPrint-encoded output is much bigger than Base64-encoded
output and takes more lines, so we have to go through the process of
finding the right place to break a line many more times.
The new implementation is 2.5x-3x faster.
If an invalid charset name was used, the old implementation would get
'stuck' trying to parse the charset name and would not interpret any
other MIME encoded words up to the end of the input string. The new
implementation fixes this bug.
If an (invalid) encoded word ends abruptly and a new (valid) encoded
word starts, the old implementation would not decode the valid encoded
word. The new implementation also fixes this.
Otherwise, the behavior of the new implementation has been designed to
closely match that of the old implementation.
We now have a couple of mbstring functions which have fast paths for
strings marked as 'valid UTF-8'. Later, we may likely have more. So
that these fast paths can be used more frequently, mark UTF-8 strings
emitted by mbstring as 'valid UTF-8'. This is always a correct thing
to do, because mbstring never returns invalid UTF-8 as the result of
a conversion (or similar) operation.
Internally, we do have a conversion mode which deliberately emits
invalid UTF-8 in some cases. (This is done to prevent unwanted matches
when we are converting strings to UTF-8 before performing matching
operations on them.) For such strings, don't set the 'valid UTF-8' flag.
It probably wouldn't hurt anything to set it, because strings generated
using that special conversion mode should *never* be returned to
userland, and I don't think we do anything with them which cares about
the IS_STR_VALID_UTF8 flag... but still, it would likely cause
confusion for developers.
Regarding the optional 3rd `strict` argument to mb_detect_encoding,
the documentation states:
Controls the behaviour when string is not valid in any of the listed encodings.
If strict is set to false, the closest matching encoding will be returned;
if strict is set to true, false will be returned.
(Ref: https://www.php.net/manual/en/function.mb-detect-encoding.php)
Because of bugs in the implementation, mb_detect_encoding did not always
behave according to this description when `strict` was false.
For example:
<?php
echo var_export(mb_detect_encoding("\xc0\x00", "UTF-8", false));
// Before this commit, prints: false
// After this commit, prints: 'UTF-8'
Because `strict` is false in the above example, mb_detect_encoding
should return the 'closest matching encoding', which is UTF-8, since
that is the only candidate encoding. (Incidentally, this example shows
that using mb_detect_encoding with a single candidate encoding in
non-strict mode is useless.)
The new implementation fixes this bug. It also fixes another problem
with the old implementation as regards non-strict detection mode:
The old implementation would stop processing of the input string using
a particular candidate encoding as soon as it saw an error in that
encoding, even in non-strict mode. This means that it could not really
detect the 'closest matching encoding'; rather, what it would return
in non-strict mode was 'the encoding in which the first decoding error
is furthest from the beginning of the input string'.
In non-strict mode, the new implementation continues trying to process
the input string to its end even after seeing an error. This makes it
possible to determine in which candidate encoding the string has the
smallest number of errors, i.e. the 'closest matching encoding'.
Rejecting candidate encodings as soon as it saw an error gave the old
implementation a marked performance advantage in non-strict mode;
however, the new implementation still beats it in most cases. Here are
a few sample microbenchmark results:
UTF-8, ~100 codepoints, strict mode
Old: 0.080s (100,000 calls)
New: 0.026s (" " )
UTF-8, ~100 codepoints, non-strict mode
Old: 0.079s (100,000 calls)
New: 0.033s (" " )
UTF-8, ~10000 codepoints, strict mode
Old: 6.708s (60,000 calls)
New: 1.383s (" " )
UTF-8, ~10000 codepoints, non-strict mode
Old: 6.705s (60,000 calls)
New: 3.044s (" " )
Notice that the old implementation had almost identical performance
between strict and non-strict mode, while the new suffers a significant
performance penalty for non-strict detection. This is the cost of
implementing the behavior specified in the documentation.
A couple more sample results:
SJIS, ~10000 codepoints, strict mode
Old: 4.563s
New: 1.084s
SJIS, ~10000 codepoints, non-strict mode
Old: 4.569s
New: 2.863s
This is the only case I found where the new implementation loses:
UTF-16LE, ~10000 codepoints, non-strict mode
Old: 1.514s
New: 2.813s
The reason is because the test strings happened to be invalid right from
the first few bytes for all the candidate encodings except for UTF-16LE;
so the old implementation would immediately reject all those encodings
and only process the entire string in UTF-16LE.
I believe mb_detect_encoding could be made much faster if we identified
good criteria for when to reject candidate encodings before reaching
the end of the input string.
The performance gain from this change depends on the text encoding and
input string size. For very small strings, other overheads tend to swamp
the performance gains to some extent, such that the speedup is less than
2x. For medium-length strings (~100 bytes or so), the speedup is
typically around 2.5x.
The greatest performance gains are for UTF-8 strings which have already
been marked as valid (using the GC flags on the zend_string object);
for those, the speedup is more than 10x in many cases.
The previous implementation first converted the haystack and needle to
wchars, then searched for matches between the two sequences of wchars.
Because we use -1 as an error marker when converting to wchars, error
markers from invalid byte sequences in the haystack would match error
markers from invalid byte sequences in the needle, even if the specific
invalid byte sequence was different. I am not sure whether this behavior
is really desirable or not, but anyways, this new implementation
follows the same behavior so as not to cause BC breaks.
This boosts the performance of mb_strpos, mb_stripos, mb_strrpos,
mb_strripos, mb_strstr, mb_stristr, mb_strrchr, and mb_strrichr when
used on non-UTF-8 strings. mb_substr is also faster.
With UTF-8 input, there is no appreciable difference in performance for
mb_strpos, mb_stripos, mb_strrpos, etc. This is expected, since the only
real difference here (aside from shorter and simpler code) is that the
new text conversion code is used when converting non-UTF-8 input strings
to UTF-8. (This is done because internally, mb_strpos, etc. work only
on UTF-8 text.)
For ASCII, speed is boosted by 30-65%. For other legacy text encodings,
the degree of performance improvement will depend on how slow the
legacy conversion code was.
One other minor, but notable difference is that strings encoded using
UTF-8 variants from Japanese mobile vendors (SoftBank, KDDI, Docomo)
will not undergo encoding conversion but will be processed "as is". It
is expected that this will result in a large performance boost for
such input strings; but realistically, the number of users who work
with such strings is probably minute.
I was not originally planning to include mb_substr in this commit, but
fuzzing of the reimplemented mb_strstr revealed that mb_substr needed
to be reimplemented, too; using the old mbfl_substr, which was based
on the old text conversion filters, in combination with functions which
use the new text conversion filters caused bugs.
The performance boost for mb_substr varies from 10%-500%, depending
on the encoding and input string used.
The existing implementation of mb_strcut extracts part of a
multi-byte encoded string by pulling out raw bytes and then running
them through a conversion filter to ensure that the output is valid
in the requested encoding.
If the conversion filter emits error markers when doing the final
'flush' operation which ends the conversion of the extracted bytes,
these error markers may (in some cases) be included in the output.
The conversion operation does not respect the value of
mb_substitute_character; rather, it always uses '?' as an error marker.
So this issue manifests itself as unwanted '?' characters being
inserted into the output.
This issue has existed for a long time, but became noticeable in PHP
8.1 because for at least some of the supported text encodings, mbstring
is now more strict about emitting error markers when strings end in an
illegal state.
The simplest fix is to suppress error markers during the final flush
operation.
While working on a fix for this problem, another problem with mb_strcut
was discovered; since it decides when to stop consuming bytes from
the input by looking at the byte length of its OUTPUT, anything which
causes extra bytes to be emitted to the output may cause mb_strcut to
not consume all the bytes in the requested range.
The one case where we DO emit extra output bytes is for encodings
which have a selectable mode, like ISO-2022-JP; if a string in such
an encoding ends in a mode which is not the default, we emit an ending
escape sequence which changes back to the default mode. This is done
so that concatenating strings in such encodings is safe.
However, as mentioned, this can cause the output of mb_strcut to be
shorter than it logically should be. This bug has existed for a long
time, and fixing it now will be a BC break, so we may not fix it right
away.
Therefore, tests for THIS fix which don't pass because of that OTHER
bug have been split out into a separate test file (gh9535b.phpt), and
that file has been marked XFAIL.
In e2459857af, I combined mbstring's "SJIS-win" text encoding
into CP932. This was done after doing some testing which appeared
to show that the mappings for "SJIS-win" were the same as those
for "CP932".
Later, it was found that there was actually a small difference
prior to e2459857af when converting Unicode to CP932. The
mappings for the following two codepoints were different:
CP932 SJIS-win
U+203E 0x7E 0x81 0x50
U+00A5 0x5C 0x81 0x8F
As shown, mbstring's "CP932" mapped Unicode's 'OVERLINE' and
'YEN SIGN' to the ASCII bytes which have conflicting uses in
most legacy Japanese text encodings. "SJIS-win" mapped these
to equivalent JIS X 0208 fullwidth characters.
Since e2459867af was not intended to cause any user-visible
change in behavior, I am rolling back the merge of "CP932"
and "SJIS-win".
It seems doubtful whether these two text encodings should
be kept separate or merged in a future release. An extensive
discussion of the related historical background and
compatibility issues involved can be found in this
GitHub thread:
https://github.com/php/php-src/issues/8308
The use of a special 'vtbl' for converting between '7bit' and
'8bit' text meant that '7bit' text would not be converted to
wchars before going to '8bit'. This meant that the special
value MBFL_BAD_INPUT, which we use to flag an erroneous byte
sequence in input text (and which is required by functions
like mb_check_encoding), would pass directly to the output,
instead of being converted to the error marker specified
by mb_substitute_character.
This issue dates back to the time when I removed the mbfl
'identify filters' and made encoding validity checking and
encoding detection rely only on the conversion filters.
Fuzzing revealed that something was missed here when making the new
encoding conversion code match the behavior of the old code. In the
next major release of PHP, support for these non-encodings will be
dropped, but in the meantime, it is better to match the legacy
behavior.
This new implementation of mb_strimwidth uses the new text
encoding conversion filters. Changes from the previous
implementation:
• mb_strimwidth allows a negative 'from' argument, which
should count backwards from the end of the string. However,
the implementation of this feature was buggy (starting right
from when it was first implemented).
It used the following code:
if ((from < 0) || (width < 0)) {
swidth = mbfl_strwidth(&string);
}
if (from < 0) {
from += swidth;
}
Do you see the bug? 'from' is a count of CODEPOINTS, but
'swidth' is a count of TERMINAL COLUMNS. Adding those two
together does not make sense. If there were no fullwidth
characters in the input string, then the two counts coincide
and the feature would work correctly. However, each
fullwidth character would throw the result off by one,
causing more characters to be skipped than was requested.
• mb_strimwidth also allows a negative 'width' argument,
which again counts backwards from the end of the string;
in this case, it is not determining the START of the portion
which we want to extract, but rather, the END of that portion.
Perhaps unsurprisingly, this feature was also buggy.
Code:
if (width < 0) {
width = swidth + width - from;
}
'swidth + width' is fine here; the problem is '- from'.
Again, that is subtracting a count of CODEPOINTS from a
count of TERMINAL COLUMNS. In this case, we really need
to count the terminal width of the string prefix skipped
over by 'from', and subtract that rather than the number
of codepoints which are being skipped.
As a result, if a 'from' count was passed along with a
negative 'width', for every fullwidth character in the
skipped prefix, the result of mb_strimwidth was one
terminal column wider than requested.
Since these situations were covered by unit tests, you
might wonder why the bugs were not caught. Well, as far as
I can see, it looks like the author of the 'tests' just
captured the actual output of mb_strimwidth and defined it
as 'correct'. The tests were written in such a way that it
was difficult to examine them and see whether they made
sense or not; but a careful examination of the inputs and
outputs clearly shows that the legacy tests did not conform
to the documented contract of mb_strimwidth.
• The old implementation would always pass the input string
through decoding/encoding filters before returning it to
the caller, even if it fit within the specified width. This
means that invalid byte sequences would be converted to
error markers. For performance, the new implementation
returns the very same string which was passed in if it
does not exceed the specified width. This means that
erroneous byte sequences are not converted to error markers
unless it is necessary to trim the string.
• The same applies to the 'trim marker' string.
• The old implementation was buggy in the (unusual)
case that the trim marker is wider than the requested
maximum width of the result. It did an unsigned subtraction
of the requested width and the width of the trim marker. If the
width of the trim marker was greater, that subtraction would
underflow and yield a huge number. As a result, mb_strimwidth
would then pass the input string through, even if it was
far wider than the requested maximum width.
In that case, since the input string is wider than the
requested width, and NONE of it will fit together with the
trim marker, the new implementation returns just the trim
marker. This is the one case where the output can be wider
than the requested width: when BOTH the input string and
also the trim marker are too wide.
• Since it passed the input string and trim marker through
decoding/encoding filters, when using "Quoted-Printable" as
the encoding, newlines could be inserted into the trim marker
to maintain the maximum line length for QP.
This is an extremely bizarre use case and I don't think there
is any point in worrying about it. QP will be removed from
mbstring in time, anyways.
PERFORMANCE:
• From micro-benchmarking with various input string lengths and
text encodings, it appears that the new implementation is 2-3x
faster for UTF-8 and UTF-16. For legacy Japanese text encodings
like ISO-2022-JP or SJIS, the new implementation is perhaps 25%
faster.
• Note that correctly implementing negative 'from' and 'width'
arguments imposes a small performance burden in such cases; one
which the old implementation did not pay. This slightly skews
benchmarking results in favor of the old implementation. However,
even so, the new implementation is faster in all cases which I
tested.
mb_convert_kana now uses the new text encoding conversion
filters. Microbenchmarking shows speed gains of 50%-150%
across various text encodings and input string lengths.
The behavior is the same as the old mb_convert_kana
except for one fix: if the 'zero codepoint' U+0000 appeared
in the input, the old implementation would sometimes drop
it, not passing it through to the output. This is now
fixed.
This new implementation uses the new encoding conversion filters.
Aside from fewer LOC and (hopefully) improved readability,
the differences are as follows:
BEHAVIOR CHANGES:
- The old implementation used signed arithmetic when operating
on the 'convmap'. This meant that results could be surprising when
using convmap entries with 1 in the MSB. Further, types like 'int'
were used rather than those with a specific bit width, such as
'int32_t'. This meant that results could also depend on the
platform width of an 'int'.
Now unsigned arithmetic is used, with explicit bit widths.
- Similarly, while converting decimal numeric entities, the
legacy implementation would ensure that the value never overflowed
INT_MAX, and if it did, the entity would be treated as invalid
and passed through unconverted.
However, that again means that results depend on the platform
size of an 'int'. So now, we use a value with explicit bit width
(32 bits) to hold the value of a deconverted decimal entity, and
ensure that the entity value does not overflow that.
Further, because we are using an UNSIGNED 32-bit value rather
than a signed one, the ceiling for how large a decimal entity
can be is higher now.
All of this will probably not affect anyone, since Unicode
codepoints above U+10FFFF are invalid anyways. To see the
difference, you need to be using a text encoding like UCS-4,
which allows huge 'codepoints'.
- If it saw something which looked like a hex entity, but
turned out not to be a valid numeric entity, the old
implementation would sometimes convert the hexadecimal
digits a-f to A-F (uppercase). The new implementation passes
invalid numeric entities through without performing case
conversion.
- The old implementation of mb_encode_numericentity was
limited in how many decimal/hex digits it could emit.
If a text encoding like UCS-4 was in use, where 'codepoints'
can have huge values (larger than the valid range
stipulated by the Unicode standard), it would not error
out on a 'codepoint' whose value was too large for it,
but would rather mangle the value and emit a numeric
entity which decoded to some other random codepoint.
The new implementation is able to emit enough digits to
express any value which fits in 32 bits.
PERFORMANCE:
Based on micro-benchmarks run on my development machine:
Decoding numeric HTML entities is about 4 times faster, for
both decimal and hexadecimal entities, across a variety of
input string lengths. Encoding is about 3 times faster.
Even for single-character strings, this is about 50% faster for
ASCII, UTF-8, and UTF-16. For long strings, the performance gain is
enormous, since the old code would convert the ENTIRE string, just
to pick out the first codepoint.
When converting text to/from wchars, mbstring makes one function call
for each and every byte or wchar to be converted. Typically, each of
these conversion functions contains a state machine, and its state has
to be restored and then saved for every single one of these calls.
It doesn't take much to see that this is grossly inefficient.
Instead of converting one byte or wchar on each call, the new
conversion functions will either fill up or drain a whole buffer of
wchars on each call. In benchmarks, this is about 3-10× faster.
Adding the new, faster conversion functions for all supported legacy
text encodings still needs some work. Also, all the code which uses
the old-style conversion functions needs to be converted to use the
new ones. After that, the old code can be dropped. (The mailparse
extension will also have to be fixed up so it will still compile.)
`php_mb_check_encoding()` now uses conversion to `mbfl_encoding_wchar`.
Since `mbfl_encoding_7bit` has no `input_filter`, no filter can be
found. Since we don't actually need to convert to wchar, we encode to
8bit.
Closes GH-7712.
Originally, `mb_detect_encoding` essentially just checked all candidate
encodings to see which ones the input string was valid in. However, it
was only able to do this for a limited few of all the text encodings
which are officially supported by mbstring.
In 3e7acf901d, I modified it so it could 'detect' any text encoding
supported by mbstring. While this is arguably an improvement, if the
only text encodings one is interested in are those which
`mb_detect_encoding` could originally handle, the old
`mb_detect_encoding` may have been preferable. Because the new one has
more possible encodings which it can guess, it also has more chances to
get the answer wrong.
This commit adjusts the detection heuristics to provide accurate
detection in a wider variety of scenarios. While the previous detection
code would frequently confuse UTF-32BE with UTF-32LE or UTF-16BE with
UTF-16LE, the adjusted code is extremely accurate in those cases.
Detection for Chinese text in Chinese encodings like GB18030 or BIG5
and for Japanese text in Japanese encodings like EUC-JP or SJIS is
greatly improved. Detection of UTF-7 is also greatly improved. An 8KB
table, with one bit for each codepoint from U+0000 up to U+FFFF, is
used to achieve this.
One significant constraint is that the heuristics are completely based
on looking at each codepoint in a string in isolation, treating some
codepoints as 'likely' and others as 'unlikely'. It might still be
possible to achieve great gains in detection accuracy by looking at
sequences of codepoints rather than individual codepoints. However,
this might require huge tables. Further, we might need a huge corpus
of text in various languages to derive those tables.
Accuracy is still dismal when trying to distinguish single-byte
encodings like ISO-8859-1, ISO-8859-2, KOI8-R, and so on. This is
because the valid bytes in these encodings are basically all the same,
and all valid bytes decode to 'likely' codepoints, so our method of
detection (which is based on rating codepoints as likely or unlikely)
cannot tell any difference between the candidates at all. It just
selects the first encoding in the provided list of candidates.
Speaking of which, if one wants to get good results from
`mb_detect_encoding`, it is important to order the list of candidate
encodings according to your prior belief of which are more likely to
be correct. When the function cannot tell any difference between two
candidates, it returns whichever appeared earlier in the array.
Rather than doing a linear search of a table of fullwidth codepoint
ranges for every input character,
1) Short-cut the search if the codepoint is below the first such range
2) Otherwise, do a binary (rather than linear) search