mirror of https://github.com/php/php-src.git synced 2026-04-04 06:32:49 +02:00

Files

Alex Dowad 28b346bc06 Improve detection accuracy of mb_detect_encoding

Originally, `mb_detect_encoding` essentially just checked all candidate
encodings to see which ones the input string was valid in. However, it
was only able to do this for a limited few of all the text encodings
which are officially supported by mbstring.

In 3e7acf901d, I modified it so it could 'detect' any text encoding
supported by mbstring. While this is arguably an improvement, if the
only text encodings one is interested in are those which
`mb_detect_encoding` could originally handle, the old
`mb_detect_encoding` may have been preferable. Because the new one has
more possible encodings which it can guess, it also has more chances to
get the answer wrong.

This commit adjusts the detection heuristics to provide accurate
detection in a wider variety of scenarios. While the previous detection
code would frequently confuse UTF-32BE with UTF-32LE or UTF-16BE with
UTF-16LE, the adjusted code is extremely accurate in those cases.
Detection for Chinese text in Chinese encodings like GB18030 or BIG5
and for Japanese text in Japanese encodings like EUC-JP or SJIS is
greatly improved. Detection of UTF-7 is also greatly improved. An 8KB
table, with one bit for each codepoint from U+0000 up to U+FFFF, is
used to achieve this.
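
The table described above can be pictured with a small sketch. This is not the actual libmbfl table: the set of 'likely' codepoints below (printable ASCII plus tab, LF and CR) is an assumption chosen only for illustration, and the names `build_table`, `is_likely` and `count_unlikely` are hypothetical. It shows how one bit per codepoint from U+0000 to U+FFFF fits in 8KB, and why such a table separates UTF-16BE from UTF-16LE.

```python
# Hypothetical sketch: one bit per codepoint from U+0000 to U+FFFF.
# 65536 bits / 8 = 8192 bytes, i.e. the 8KB mentioned in the commit message.
TABLE_SIZE = 0x10000 // 8


def build_table(likely_ranges):
    """Set one bit for every codepoint inside the given inclusive ranges."""
    table = bytearray(TABLE_SIZE)
    for lo, hi in likely_ranges:
        for cp in range(lo, hi + 1):
            table[cp >> 3] |= 1 << (cp & 7)
    return table


def is_likely(table, cp):
    """Look up one codepoint in the bitmap."""
    if cp > 0xFFFF:
        # Assumption of this sketch: codepoints outside the table are unlikely.
        return False
    return bool(table[cp >> 3] & (1 << (cp & 7)))


def count_unlikely(table, text):
    """Count the codepoints in text whose bit is not set."""
    return sum(1 for ch in text if not is_likely(table, ord(ch)))


# Assumed 'likely' set, for illustration only: tab, LF, CR, printable ASCII.
LIKELY = build_table([(0x09, 0x0A), (0x0D, 0x0D), (0x20, 0x7E)])

# The same bytes are valid in both byte orders, but only one reading
# consists of 'likely' codepoints.
data = "hello".encode("utf-16-be")
as_be = data.decode("utf-16-be")
as_le = data.decode("utf-16-le")
print(count_unlikely(LIKELY, as_be), count_unlikely(LIKELY, as_le))
```

Read in the wrong byte order, every code unit of the sample lands on a codepoint outside the assumed 'likely' set, so the two candidates get clearly different scores.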

One significant constraint is that the heuristics are completely based
on looking at each codepoint in a string in isolation, treating some
codepoints as 'likely' and others as 'unlikely'. It might still be
possible to achieve great gains in detection accuracy by looking at
sequences of codepoints rather than individual codepoints. However,
this might require huge tables. Further, we might need a huge corpus
of text in various languages to derive those tables.

Accuracy is still dismal when trying to distinguish single-byte
encodings like ISO-8859-1, ISO-8859-2, KOI8-R, and so on. This is
because the valid bytes in these encodings are basically all the same,
and all valid bytes decode to 'likely' codepoints, so our method of
detection (which is based on rating codepoints as likely or unlikely)
cannot tell any difference between the candidates at all. It just
selects the first encoding in the provided list of candidates.
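
The problem with single-byte encodings can be reproduced outside mbstring. The sketch below uses Python's built-in codecs as a stand-in for mbstring's conversion tables, which is an assumption: the two may differ in detail. It shows that every byte value decodes without error in all three encodings, so validity alone cannot separate them.

```python
# Stand-in for mbstring: Python's own codec tables (an assumption of this sketch).
CANDIDATES = ["iso-8859-1", "iso-8859-2", "koi8-r"]


def decodes_cleanly(data, encoding):
    """Return True if data is valid in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))
valid_everywhere = {enc: decodes_cleanly(all_bytes, enc) for enc in CANDIDATES}

# A Russian word encoded as KOI8-R is also valid, but different, text
# in the other two encodings.
sample = "привет".encode("koi8-r")
readings = {enc: sample.decode(enc) for enc in CANDIDATES}
for enc, text in readings.items():
    print(enc, text)
```

All three readings are made of ordinary letters and symbols, which is why a per-codepoint likely/unlikely rating gives them the same score.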

Speaking of which, to get good results from `mb_detect_encoding`, it is
important to order the list of candidate encodings according to your
prior belief of which are more likely to be correct. When the function
cannot tell any difference between two candidates, it returns whichever
appeared earlier in the array.
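
The tie-breaking rule can be sketched as follows. This is an illustration of the rule stated above, not mbstring's scoring: the `detect` and `count_unlikely` functions are hypothetical, and treating every codepoint in a Unicode "C" (control, format, unassigned) category as 'unlikely' is an assumption made only to keep the example short.

```python
import unicodedata


def count_unlikely(text):
    """Assumed rating: codepoints in any Unicode 'C' category are unlikely."""
    return sum(1 for ch in text if unicodedata.category(ch).startswith("C"))


def detect(data, candidates):
    """Return the valid candidate with the fewest unlikely codepoints.

    The comparison is strict, so when two candidates score the same,
    the one listed earlier is kept.
    """
    best, best_score = None, None
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # the input is not valid in this encoding
        score = count_unlikely(text)
        if best_score is None or score < best_score:
            best, best_score = enc, score
    return best


sample = "привет".encode("koi8-r")
print(detect(sample, ["iso-8859-1", "koi8-r"]))
print(detect(sample, ["koi8-r", "iso-8859-1"]))
```

Both candidates score the same on this input, so the result depends only on the order of the list. Putting the encoding believed to be most probable first is what makes the answer come out right.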

2021-10-19 18:05:51 +02:00

filters

In UTF7-IMAP, reject the 2nd part of surrogate pair if it appears unexpectedly

2021-08-31 13:41:34 +02:00

mbfl

Improve detection accuracy of mb_detect_encoding

2021-10-19 18:05:51 +02:00

nls

Remove redundant includes from mbstring (and make sure correct config.h is used)

2020-08-31 23:17:58 +02:00

config.h.w32

Remove unused symbol definition

2019-05-11 19:47:54 +02:00

LICENSE

Integrate libmbfl docs to README.md and LICENSE

2019-05-11 18:29:30 +02:00

README.md

Integrate libmbfl docs to README.md and LICENSE

2019-05-11 18:29:30 +02:00

README.md

libmbfl

This is libmbfl, a streamable multibyte character code filter and converter library, written by Shigeru Kanemoto.

The original version of libmbfl is developed and distributed at https://github.com/moriyoshi/libmbfl under the LGPL 2.1 license. See the LICENSE file for licensing information.

The libmbfl library is bundled with PHP as a fork of the original repository and is not in sync with the upstream. As such, the libmbfl directory is directly modified in the php-src repository.

Changelog

October 2017

Since 2017, libmbfl has been forked and bundled in the php-src repository. For the list of changes related to PHP, see the PHP NEWS change logs.

Version 1.3.2 August 20, 2011

Added JISX-0213:2004 based encodings: Shift_JIS-2004, EUC-JP-2004, ISO-2022-JP-2004 (rui).
Added gb18030 encoding (rui).
Added CP950 with user defined area based on Big5 (rui).
Added mapping for user defined character area to CP936 (rui).
Added UTF-8-Mobile to support the pictogram characters defined by mobile phone carriers in Japan (rui).

Version 1.3.1 August 5, 2011

Added check for invalid/obsolete utf-8 encoding (rui).

Version 1.3.0 August 1, 2011

Added encoding conversion between Shift_JIS and Unicode (6.0 or PUA) for pictogram characters defined by mobile phone carriers in Japan (rui).

Detailed info
Fixed encoding conversion of cp5022x for user defined area (rui).
Added MacJapanese (SJIS-mac) for legacy encoding support (rui).
Backport from PHP 5.2 (rui).

Version 1.1.0 March 02, 2010

Added cp5022x encoding (moriyoshi)
Added ISO-2022-JP-MS (moriyoshi)
Moved to github.com from sourceforge.jp (moriyoshi)

Earlier versions

1998/11/10 sgk Implementation in C++.
1999/4/25 sgk Rewritten in C.
1999/4/26 sgk Implemented input filter. Filters are added while estimating the kanji code.
1999/6 Unicode support.
1999/6/22 sgk Changed license to LGPL.

Credits

Marcus Boerger helly@php.net
Hayk Chamyan hamshen@gmail.com
Wez Furlong wez@thebrainroom.com
Rui Hirokawa hirokawa@php.net
Shigeru Kanemoto sgk@happysize.co.jp
U. Kenkichi kenkichi@axes.co.jp
Moriyoshi Koizumi moriyoshi@php.net
Hironori Sato satoh@jpnnet.com
Tsukada Takuya tsukada@fminn.nagano.nagano.jp
Tateyama tateyan@amy.hi-ho.ne.jp
Den V. Tsopa tdv@edisoft.ru
Maksym Veremeyenko verem@m1stereo.tv
Haluk AKIN halukakin@gmail.com