Regarding the optional 3rd `strict` argument to mb_detect_encoding, the documentation states:

    Controls the behaviour when string is not valid in any of the listed encodings. If strict is set to false, the closest matching encoding will be returned; if strict is set to true, false will be returned.

(Ref: https://www.php.net/manual/en/function.mb-detect-encoding.php)

Because of bugs in the implementation, mb_detect_encoding did not always behave according to this description when `strict` was false. For example:

    <?php
    echo var_export(mb_detect_encoding("\xc0\x00", "UTF-8", false));
    // Before this commit, prints: false
    // After this commit, prints: 'UTF-8'

Because `strict` is false in the above example, mb_detect_encoding should return the 'closest matching encoding', which is UTF-8, since that is the only candidate encoding. (Incidentally, this example shows that using mb_detect_encoding with a single candidate encoding in non-strict mode is useless.)

The new implementation fixes this bug. It also fixes another problem with the old implementation as regards non-strict detection mode:

The old implementation would stop processing of the input string using a particular candidate encoding as soon as it saw an error in that encoding, even in non-strict mode. This means that it could not really detect the 'closest matching encoding'; rather, what it would return in non-strict mode was 'the encoding in which the first decoding error is furthest from the beginning of the input string'.

In non-strict mode, the new implementation continues trying to process the input string to its end even after seeing an error. This makes it possible to determine in which candidate encoding the string has the smallest number of errors, i.e. the 'closest matching encoding'.

Rejecting candidate encodings as soon as it saw an error gave the old implementation a marked performance advantage in non-strict mode; however, the new implementation still beats it in most cases.
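To make the difference between the two non-strict strategies concrete, here is a minimal sketch in C. It is not the libmbfl code: the names `scan_result`, `scan_fn`, `scan_utf8`, `scan_utf16le`, `pick_furthest_first_error` and `pick_fewest_errors` are hypothetical, the UTF-8 check is deliberately simplified, and both pickers scan the whole input (a real early-rejecting implementation would stop at the first error). It only illustrates why 'first error furthest from the start' and 'fewest errors' can select different encodings.

```c
#include <assert.h>
#include <stddef.h>

/* Outcome of scanning the whole input in one candidate encoding. */
typedef struct {
    size_t errors;      /* number of invalid sequences seen */
    size_t first_error; /* byte offset of the first one, or len if none */
} scan_result;

/* Hypothetical scanner signature; not a libmbfl type. */
typedef scan_result (*scan_fn)(const unsigned char *s, size_t len);

static void note_error(scan_result *r, size_t offset)
{
    if (r->errors++ == 0)
        r->first_error = offset;
}

/* Simplified UTF-8 scan: checks lead/continuation byte structure only.
 * Overlong 3- and 4-byte forms and encoded surrogates are not rejected. */
scan_result scan_utf8(const unsigned char *s, size_t len)
{
    scan_result r = { 0, len };
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t need, j = 1;

        if (c < 0x80) { i++; continue; }
        if (c >= 0xC2 && c <= 0xDF) need = 1;
        else if (c >= 0xE0 && c <= 0xEF) need = 2;
        else if (c >= 0xF0 && c <= 0xF4) need = 3;
        else { note_error(&r, i); i++; continue; } /* invalid lead byte */

        while (j <= need && i + j < len && (s[i + j] & 0xC0) == 0x80)
            j++;
        if (j <= need)
            note_error(&r, i); /* sequence cut short */
        i += j;                /* resume at the first byte not consumed */
    }
    return r;
}

/* UTF-16LE scan: errors are unpaired surrogates and a dangling odd byte. */
scan_result scan_utf16le(const unsigned char *s, size_t len)
{
    scan_result r = { 0, len };
    size_t i = 0;

    while (i + 1 < len) {
        unsigned unit = s[i] | ((unsigned)s[i + 1] << 8);

        if (unit >= 0xD800 && unit <= 0xDBFF) {
            /* High surrogate: the next unit must be a low surrogate. */
            if (i + 3 < len) {
                unsigned next = s[i + 2] | ((unsigned)s[i + 3] << 8);
                if (next >= 0xDC00 && next <= 0xDFFF) { i += 4; continue; }
            }
            note_error(&r, i);
        } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
            note_error(&r, i); /* low surrogate without a preceding high one */
        }
        i += 2;
    }
    if (i < len)
        note_error(&r, i); /* odd trailing byte */
    return r;
}

/* Old non-strict result as described above: the candidate whose first
 * error lies furthest into the input. Ties go to the earlier candidate. */
size_t pick_furthest_first_error(const scan_fn *cands, size_t n,
                                 const unsigned char *s, size_t len)
{
    size_t best = 0, best_offset = 0, i;

    assert(n > 0);
    for (i = 0; i < n; i++) {
        scan_result r = cands[i](s, len);
        if (i == 0 || r.first_error > best_offset) {
            best = i;
            best_offset = r.first_error;
        }
    }
    return best;
}

/* New non-strict result: the candidate with the fewest errors overall.
 * Ties go to the earlier candidate. */
size_t pick_fewest_errors(const scan_fn *cands, size_t n,
                          const unsigned char *s, size_t len)
{
    size_t best = 0, best_errors = 0, i;

    assert(n > 0);
    for (i = 0; i < n; i++) {
        scan_result r = cands[i](s, len);
        if (i == 0 || r.errors < best_errors) {
            best = i;
            best_errors = r.errors;
        }
    }
    return best;
}
```

As a usage example, take the bytes 00 DC E9 00 E9 00 E9 00 with the candidates UTF-8 and UTF-16LE, in that order. The simplified UTF-8 scan finds 4 errors, the first at offset 1; the UTF-16LE scan finds a single error (a lone low surrogate) at offset 0. `pick_furthest_first_error` therefore selects UTF-8, while `pick_fewest_errors` selects UTF-16LE. With UTF-8 as the only candidate, `pick_fewest_errors` returns it for any input, including "\xc0\x00".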
Here are a few sample microbenchmark results:

    UTF-8, ~100 codepoints, strict mode
    Old: 0.080s (100,000 calls)
    New: 0.026s (100,000 calls)

    UTF-8, ~100 codepoints, non-strict mode
    Old: 0.079s (100,000 calls)
    New: 0.033s (100,000 calls)

    UTF-8, ~10000 codepoints, strict mode
    Old: 6.708s (60,000 calls)
    New: 1.383s (60,000 calls)

    UTF-8, ~10000 codepoints, non-strict mode
    Old: 6.705s (60,000 calls)
    New: 3.044s (60,000 calls)

Notice that the old implementation had almost identical performance between strict and non-strict mode, while the new suffers a significant performance penalty for non-strict detection. This is the cost of implementing the behavior specified in the documentation.

A couple more sample results:

    SJIS, ~10000 codepoints, strict mode
    Old: 4.563s
    New: 1.084s

    SJIS, ~10000 codepoints, non-strict mode
    Old: 4.569s
    New: 2.863s

This is the only case I found where the new implementation loses:

    UTF-16LE, ~10000 codepoints, non-strict mode
    Old: 1.514s
    New: 2.813s

The reason is that the test strings happened to be invalid right from the first few bytes for all the candidate encodings except UTF-16LE, so the old implementation would immediately reject all those encodings and only process the entire string in UTF-16LE.

I believe mb_detect_encoding could be made much faster if we identified good criteria for when to reject candidate encodings before reaching the end of the input string.
libmbfl
This is libmbfl, a streamable multibyte character code filter and converter library, written by Shigeru Kanemoto.
The original version of libmbfl is developed and distributed at https://github.com/moriyoshi/libmbfl under the LGPL 2.1 license. See the LICENSE file for licensing information.
The libmbfl library is bundled with PHP as a fork of the original repository and is not in sync with the upstream. As such, the libmbfl directory is directly modified in the php-src repository.
Changelog
October 2017
- Since 2017, libmbfl has been forked and bundled in the php-src repository. For the list of changes related to PHP, see the PHP NEWS change logs.
Version 1.3.2 August 20, 2011
- Added JISX-0213:2004 based encoding : Shift_JIS-2004, EUC-JP-2004, ISO-2022-JP-2004 (rui).
- Added gb18030 encoding (rui).
- Added CP950 with user-defined area based on Big5 (rui).
- Added mapping for user-defined character area to CP936 (rui).
- Added UTF-8-Mobile to support the pictogram characters defined by mobile phone carriers in Japan (rui).
Version 1.3.1 August 5, 2011
- Added check for invalid/obsolete utf-8 encoding (rui).
Version 1.3.0 August 1, 2011
- Added encoding conversion between Shift_JIS and Unicode (6.0 or PUA) for pictogram characters defined by mobile phone carriers in Japan (rui).
- Fixed encoding conversion of cp5022x for user-defined area (rui).
- Added MacJapanese (SJIS-mac) for legacy encoding support (rui).
- Backport from PHP 5.2 (rui).
Version 1.1.0 March 02, 2010
- Added cp5022x encoding (moriyoshi)
- Added ISO-2022-JP-MS (moriyoshi)
- Moved to github.com from sourceforge.jp (moriyoshi)
Earlier versions
- 1998/11/10 sgk: Implemented in C++.
- 1999/4/25 sgk: Rewritten in C.
- 1999/4/26 sgk: Implemented input filter. Filters are added while estimating the kanji code.
- 1999/6: Unicode support.
- 1999/6/22 sgk: Changed license to LGPL.
Credits
- Marcus Boerger helly@php.net
- Hayk Chamyan hamshen@gmail.com
- Wez Furlong wez@thebrainroom.com
- Rui Hirokawa hirokawa@php.net
- Shigeru Kanemoto sgk@happysize.co.jp
- U. Kenkichi kenkichi@axes.co.jp
- Moriyoshi Koizumi moriyoshi@php.net
- Hironori Sato satoh@jpnnet.com
- Tsukada Takuya tsukada@fminn.nagano.nagano.jp
- Tateyama tateyan@amy.hi-ho.ne.jp
- Den V. Tsopa tdv@edisoft.ru
- Maksym Veremeyenko verem@m1stereo.tv
- Haluk AKIN halukakin@gmail.com