The purpose of mbstring is for working with Unicode and legacy text
encodings; but Base64, QPrint, etc. are not text encodings and don't
really belong in mbstring. PHP already contains separate implementations
of Base64, QPrint, and HTML entities. It will be better to eventually
remove these non-encodings from mbstring.
Regarding HTML entities... there is a bit more to say. mbstring's
implementation of HTML entities is different from the other built-in
implementation (htmlspecialchars and htmlentities). Those functions
convert <, >, and & to HTML entities, but mbstring does not.
It appears that the original author of mbstring intended for something
to be done with <, >, and &. He used a table to identify which
characters should be converted to HTML entities, and </>/& all have a
special value in that table. However, nothing ever checks for that
special value, so the characters are passed through unconverted.
This seems like a very useless implementation of HTML entities. The most
important characters which need to be expressed as entities in HTML
documents are those three!
We must not reuse per-request memory across multiple requests, so this
check triggered during RINIT makes no sense. As explained in the bug
report[1], it can be even harmful, if some request startup fails, and
the pointers refer to already freed memory in the next request.
[1] <https://bugs.php.net/76167>
Closes GH-7604.
Among the text encodings supported by mbstring are several which are
not really 'text encodings'. These include Base64, QPrint, UUencode,
HTML entities, '7 bit', and '8 bit'.
Rather than providing an explicit list of text encodings which they are
interested in, users may pass the output of mb_list_encodings to
mb_detect_encoding. Since Base64, QPrint, and so on are included in
the output of mb_list_encodings, mb_detect_encoding can return one of
these as its 'detected encoding' (and in fact, this often happens).
Before mb_detect_encoding was enhanced so it could detect any of the
supported text encodings, this did not happen, and it is never desired.
Originally, `mb_detect_encoding` essentially just checked all candidate
encodings to see which ones the input string was valid in. However, it
was only able to do this for a limited few of all the text encodings
which are officially supported by mbstring.
In 3e7acf901d, I modified it so it could 'detect' any text encoding
supported by mbstring. While this is arguably an improvement, if the
only text encodings one is interested in are those which
`mb_detect_encoding` could originally handle, the old
`mb_detect_encoding` may have been preferable. Because the new one has
more possible encodings which it can guess, it also has more chances to
get the answer wrong.
This commit adjusts the detection heuristics to provide accurate
detection in a wider variety of scenarios. While the previous detection
code would frequently confuse UTF-32BE with UTF-32LE or UTF-16BE with
UTF-16LE, the adjusted code is extremely accurate in those cases.
Detection for Chinese text in Chinese encodings like GB18030 or BIG5
and for Japanese text in Japanese encodings like EUC-JP or SJIS is
greatly improved. Detection of UTF-7 is also greatly improved. An 8KB
table, with one bit for each codepoint from U+0000 up to U+FFFF, is
used to achieve this.
One significant constraint is that the heuristics are completely based
on looking at each codepoint in a string in isolation, treating some
codepoints as 'likely' and others as 'unlikely'. It might still be
possible to achieve great gains in detection accuracy by looking at
sequences of codepoints rather than individual codepoints. However,
this might require huge tables. Further, we might need a huge corpus
of text in various languages to derive those tables.
Accuracy is still dismal when trying to distinguish single-byte
encodings like ISO-8859-1, ISO-8859-2, KOI8-R, and so on. This is
because the valid bytes in these encodings are basically all the same,
and all valid bytes decode to 'likely' codepoints, so our method of
detection (which is based on rating codepoints as likely or unlikely)
cannot tell any difference between the candidates at all. It just
selects the first encoding in the provided list of candidates.
Speaking of which, if one wants to get good results from
`mb_detect_encoding`, it is important to order the list of candidate
encodings according to your prior belief of which are more likely to
be correct. When the function cannot tell any difference between two
candidates, it returns whichever appeared earlier in the array.
mb_convert_kana is controlled by user-provided flags, which specify what it should convert
and to what. These flags come in inverse pairs, for example "fullwidth numerals to halfwidth
numerals" and "halfwidth numerals to fullwidth numerals". It does not make sense to combine
inverse flags.
But, clever reader of commit logs, you will surely say: What if I want all my halfwidth
numerals to become fullwidth, and all my fullwidth numerals to become halfwidth? Much too
clever, you are! Let's put aside the fact that this bizarre switch-up is ridiculous and
will never be used, and face up to another stark reality: mb_convert_kana does not work
for that case, and never has. This was probably never noticed because nobody ever tried.
Disallowing useless combinations of flags gives freedom to rearrange the kana conversion
code without changing behavior.
We can also reject unrecognized flags. This may help users to catch bugs.
Interestingly, the existing tests used a 'Z' flag, which is useless (it's not recognized
at all).
Rather than doing a linear search of a table of fullwidth codepoint
ranges for every input character,
1) Short-cut the search if the codepoint is below the first such range
2) Otherwise, do a binary (rather than linear) search
Headers should not be processed in a locale-depdendent fashion.
Switch from upper to lowercasing because that's the standard for
PHP and we provide an ASCII implementation of this operation.
This is adapted from GH-7506.
The 'fast path' in the uppercase/lowercase functions for Unicode text can be used
for a slightly greater range of characters. This is not expected to have a big
impact on performance, since the number of characters which will use the 'fast path'
is only increased by about 50-60, and these are not very commonly used characters...
but still, it doesn't cost anything.
Rather than using pointers to pointers to pointers (3 levels of indirection), what
makes sense is two levels. This reduces unnecessary pointer dereference operations.
Previously, when passed an empty string, and given an encoding which
uses a variable number of bytes per character (and which doesn't have
a 'character length table'), mb_str_split would return an array
containing a single empty string, rather than an empty array.
The ISO-2022 encodings are among those which were affected by this bug.
* PHP-8.1:
Bug #81390: mb_detect_encoding should not prematurely stop processing input
mb_detect_encoding with only one candidate encoding uses mb_check_encoding
Optimize text encoding detection for speed (eliminate Unicode property lookups)
As a performance optimization, mb_detect_encoding tries to stop
processing the input string early when there is only one 'candidate'
encoding which the input string is valid in. However, the code which
keeps count of how many candidate encodings have already been rejected
was buggy. This caused mb_detect_encoding to prematurely stop
processing the input when it should have continued.
As a result, it did not notice that in the test case provided by Alec,
the input string was not valid in UTF-16.
...By just testing the input codepoints if they are within a few fixed
ranges instead. This avoids hash lookups in property tables.
From (micro-)benchmarking on my PC, this looks to be a bit less than 4x
faster than the existing code.
mb_convert_kana is able to convert fullwidth katakana to fullwidth
hiragana (and vice versa). The constants referring to these modes had
names like MBFL_FILT_TL_ZEN2HAN_KANA2HIRA.
The "ZEN2HAN" part of the name is misleading, since these modes do not
convert fullwidth (zenkaku) kana to halfwidth (hankaku). The converted
characters are fullwidth both before and after the conversion. So...
let's name the constants accordingly.
mb_convert_kana has conversion modes selected using 'M'/'m', which
convert a few various punctuation and symbol characters between
'ordinary' and full-width forms. The constants which refer to these
modes have names ending with COMPAT1.
Internally, there are similar conversion modes with names ending in
COMPAT2. They are like COMPAT1 modes, but they operate on a smaller
set of characters. But... that is all just dead code, because there is
no way for user code to select the COMPAT2 modes.
I have no idea what the original author intended those COMPAT2 modes to
actually be used for. Guess it doesn't really matter, anyways. At this
point, it's just more food for the flames.