mirror of https://github.com/php/php-src.git synced 2026-04-14 11:32:11 +02:00
archived-php-src/ext/mbstring/libmbfl
Alex Dowad 0e7160b836 Implement mb_detect_encoding using fast text conversion filters
Regarding the optional 3rd `strict` argument to mb_detect_encoding,
the documentation states:

  Controls the behaviour when string is not valid in any of the listed encodings.
  If strict is set to false, the closest matching encoding will be returned;
  if strict is set to true, false will be returned.

(Ref: https://www.php.net/manual/en/function.mb-detect-encoding.php)

Because of bugs in the implementation, mb_detect_encoding did not always
behave according to this description when `strict` was false.
For example:

  <?php
  var_export(mb_detect_encoding("\xc0\x00", "UTF-8", false));
  // Before this commit, prints: false
  // After this commit, prints: 'UTF-8'

Because `strict` is false in the above example, mb_detect_encoding
should return the 'closest matching encoding', which is UTF-8, since
that is the only candidate encoding. (Incidentally, this example shows
that using mb_detect_encoding with a single candidate encoding in
non-strict mode is useless.)

The new implementation fixes this bug. It also fixes another problem
with the old implementation as regards non-strict detection mode:

The old implementation would stop processing of the input string using
a particular candidate encoding as soon as it saw an error in that
encoding, even in non-strict mode. This means that it could not really
detect the 'closest matching encoding'; rather, what it would return
in non-strict mode was 'the encoding in which the first decoding error
is furthest from the beginning of the input string'.

In non-strict mode, the new implementation continues trying to process
the input string to its end even after seeing an error. This makes it
possible to determine in which candidate encoding the string has the
smallest number of errors, i.e. the 'closest matching encoding'.
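The difference between the two selection rules can be sketched in C. This is a simplified illustration, not libmbfl's actual code: the struct, the two scanners and the two pick functions are hypothetical names, the UTF-8 check is deliberately minimal, and "fewest errors" is assumed to be the only scoring criterion.

```c
#include <assert.h>
#include <stddef.h>

/* Outcome of scanning the whole input under one candidate encoding. */
struct scan_result {
    size_t errors;      /* number of invalid bytes or sequences seen */
    size_t first_error; /* offset of the first error, or len if none */
};

static void note_error(struct scan_result *r, size_t offset)
{
    if (r->errors++ == 0)
        r->first_error = offset;
}

/* Windows-1252: five byte values are unassigned. */
struct scan_result scan_cp1252(const unsigned char *s, size_t len)
{
    struct scan_result r = { 0, len };
    for (size_t i = 0; i < len; i++) {
        unsigned char c = s[i];
        if (c == 0x81 || c == 0x8D || c == 0x8F || c == 0x90 || c == 0x9D)
            note_error(&r, i);
    }
    return r;
}

/* Simplified UTF-8: checks lead and continuation bytes only. It does
 * not reject every overlong form or surrogate, unlike a real decoder. */
struct scan_result scan_utf8(const unsigned char *s, size_t len)
{
    struct scan_result r = { 0, len };
    size_t i = 0;
    while (i < len) {
        size_t start = i;
        unsigned char c = s[i++];
        int need; /* continuation bytes expected, -1 for a bad lead byte */
        if (c < 0x80) need = 0;
        else if (c >= 0xC2 && c <= 0xDF) need = 1;
        else if (c >= 0xE0 && c <= 0xEF) need = 2;
        else if (c >= 0xF0 && c <= 0xF4) need = 3;
        else need = -1;

        while (need > 0 && i < len && (s[i] & 0xC0) == 0x80) {
            i++;
            need--;
        }
        if (need != 0)
            note_error(&r, start); /* count it, then keep scanning */
    }
    return r;
}

/* Old rule as described above: the latest first error wins. */
size_t pick_latest_first_error(const struct scan_result *r, size_t n)
{
    assert(n > 0);
    size_t best = 0;
    for (size_t k = 1; k < n; k++)
        if (r[k].first_error > r[best].first_error)
            best = k;
    return best;
}

/* New rule as described above: the fewest errors wins. */
size_t pick_fewest_errors(const struct scan_result *r, size_t n)
{
    assert(n > 0);
    size_t best = 0;
    for (size_t k = 1; k < n; k++)
        if (r[k].errors < r[best].errors)
            best = k;
    return best;
}
```

For the bytes FF D0 90 D0 90 D0 90, the Windows-1252 scan finds three errors starting at offset 2, while the UTF-8 scan finds a single error at offset 0. The old rule therefore prefers Windows-1252 and the new rule prefers UTF-8.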

Rejecting candidate encodings as soon as it saw an error gave the old
implementation a marked performance advantage in non-strict mode;
however, the new implementation still beats it in most cases. Here are
a few sample microbenchmark results:

  UTF-8, ~100 codepoints, strict mode
  Old: 0.080s (100,000 calls)
  New: 0.026s (100,000 calls)

  UTF-8, ~100 codepoints, non-strict mode
  Old: 0.079s (100,000 calls)
  New: 0.033s (100,000 calls)

  UTF-8, ~10000 codepoints, strict mode
  Old: 6.708s (60,000 calls)
  New: 1.383s (60,000 calls)

  UTF-8, ~10000 codepoints, non-strict mode
  Old: 6.705s (60,000 calls)
  New: 3.044s (60,000 calls)

Notice that the old implementation had almost identical performance
between strict and non-strict mode, while the new suffers a significant
performance penalty for non-strict detection. This is the cost of
implementing the behavior specified in the documentation.

A couple more sample results:

  SJIS, ~10000 codepoints, strict mode
  Old: 4.563s
  New: 1.084s

  SJIS, ~10000 codepoints, non-strict mode
  Old: 4.569s
  New: 2.863s

This is the only case I found where the new implementation loses:

  UTF-16LE, ~10000 codepoints, non-strict mode
  Old: 1.514s
  New: 2.813s

The reason is that the test strings happened to be invalid right from
the first few bytes for all the candidate encodings except for UTF-16LE;
so the old implementation would immediately reject all those encodings
and only process the entire string in UTF-16LE.

I believe mb_detect_encoding could be made much faster if we identified
good criteria for when to reject candidate encodings before reaching
the end of the input string.
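One possible criterion, sketched here purely as an assumption and not taken from this commit: if the winner is simply the candidate with the fewest errors, a candidate can be abandoned as soon as its error count reaches the best count found so far, because it can no longer win. The function names are hypothetical, and only single-byte candidates are modelled to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-byte validity check for a single-byte encoding. */
typedef int (*invalid_fn)(unsigned char);

int invalid_in_ascii(unsigned char c) { return c >= 0x80; }

int invalid_in_cp1252(unsigned char c)
{
    return c == 0x81 || c == 0x8D || c == 0x8F || c == 0x90 || c == 0x9D;
}

/* Count invalid bytes, giving up once the count exceeds `limit`.
 * *examined receives the number of bytes actually inspected. */
size_t count_errors_bounded(invalid_fn invalid, const unsigned char *s,
                            size_t len, size_t limit, size_t *examined)
{
    size_t errors = 0, i = 0;
    while (i < len) {
        if (invalid(s[i++]) && ++errors > limit)
            break; /* cannot beat the best candidate any more */
    }
    *examined = i;
    return errors;
}

/* Return the index of the candidate with the fewest errors (first one
 * wins ties). Later candidates are scanned only until they are known
 * to be no better than the best one found so far. */
size_t detect_fewest_errors(const invalid_fn *cands, size_t n,
                            const unsigned char *s, size_t len,
                            size_t *total_examined)
{
    assert(n > 0);
    size_t best = 0, best_errors = (size_t)-1;
    *total_examined = 0;
    /* Stop entirely once some candidate decodes without any error. */
    for (size_t k = 0; k < n && best_errors > 0; k++) {
        size_t examined;
        size_t errors = count_errors_bounded(cands[k], s, len,
                                             best_errors - 1, &examined);
        *total_examined += examined;
        if (errors < best_errors) {
            best = k;
            best_errors = errors;
        }
    }
    return best;
}
```

This pruning never changes which candidate is selected under the fewest-errors rule; it only reduces the number of bytes examined, and how much it saves depends on the order in which candidates are tried. A detector that scores candidates on more than raw error counts would need a different bound.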
2023-01-03 09:10:10 +02:00

libmbfl

This is libmbfl, a streamable multibyte character code filter and converter library, written by Shigeru Kanemoto.

The original version of libmbfl is developed and distributed at https://github.com/moriyoshi/libmbfl under the LGPL 2.1 license. See the LICENSE file for licensing information.

The libmbfl library is bundled with PHP as a fork of the original repository and is not in sync with the upstream. As such, the libmbfl directory is directly modified in the php-src repository.

Changelog

October 2017

  • Since 2017, libmbfl has been forked and bundled in the php-src repository. For the list of changes related to PHP, see the PHP NEWS change logs.

Version 1.3.2 August 20, 2011

  • Added JISX-0213:2004 based encodings: Shift_JIS-2004, EUC-JP-2004, ISO-2022-JP-2004 (rui).
  • Added gb18030 encoding (rui).
  • Added CP950 with user-defined area based on Big5 (rui).
  • Added mapping for user-defined character area to CP936 (rui).
  • Added UTF-8-Mobile to support the pictogram characters defined by mobile phone carriers in Japan (rui).

Version 1.3.1 August 5, 2011

  • Added check for invalid/obsolete utf-8 encoding (rui).

Version 1.3.0 August 1, 2011

  • Added encoding conversion between Shift_JIS and Unicode (6.0 or PUA) for pictogram characters defined by mobile phone carriers in Japan (rui).

    Detailed info

  • Fixed encoding conversion of cp5022x for user defined area (rui).

  • Added MacJapanese (SJIS-mac) for legacy encoding support (rui).

  • Backport from PHP 5.2 (rui).

Version 1.1.0 March 02, 2010

  • Added cp5022x encoding (moriyoshi)
  • Added ISO-2022-JP-MS (moriyoshi)
  • Moved to github.com from sourceforge.jp (moriyoshi)

Earlier versions

  • 1998/11/10 sgk: Implementation in C++.
  • 1999/4/25 sgk: Rewritten in C.
  • 1999/4/26 sgk: Implemented input filter. Filters are added while estimating the kanji code.
  • 1999/6: Unicode support.
  • 1999/6/22 sgk: Changed license to LGPL.

Credits

  • Marcus Boerger helly@php.net
  • Hayk Chamyan hamshen@gmail.com
  • Wez Furlong wez@thebrainroom.com
  • Rui Hirokawa hirokawa@php.net
  • Shigeru Kanemoto sgk@happysize.co.jp
  • U. Kenkichi kenkichi@axes.co.jp
  • Moriyoshi Koizumi moriyoshi@php.net
  • Hironori Sato satoh@jpnnet.com
  • Tsukada Takuya tsukada@fminn.nagano.nagano.jp
  • Tateyama tateyan@amy.hi-ho.ne.jp
  • Den V. Tsopa tdv@edisoft.ru
  • Maksym Veremeyenko verem@m1stereo.tv
  • Haluk AKIN halukakin@gmail.com