1
0
mirror of https://github.com/php/php-src.git synced 2026-04-11 18:13:00 +02:00
Alex Dowad 0e7160b836 Implement mb_detect_encoding using fast text conversion filters
Regarding the optional 3rd `strict` argument to mb_detect_encoding,
the documentation states:

  Controls the behaviour when string is not valid in any of the listed encodings.
  If strict is set to false, the closest matching encoding will be returned;
  if strict is set to true, false will be returned.

(Ref: https://www.php.net/manual/en/function.mb-detect-encoding.php)

Because of bugs in the implementation, mb_detect_encoding did not always
behave according to this description when `strict` was false.
For example:

  <?php
  echo var_export(mb_detect_encoding("\xc0\x00", "UTF-8", false));
  // Before this commit, prints: false
  // After this commit, prints: 'UTF-8'

Because `strict` is false in the above example, mb_detect_encoding
should return the 'closest matching encoding', which is UTF-8, since
that is the only candidate encoding. (Incidentally, this example shows
that using mb_detect_encoding with a single candidate encoding in
non-strict mode is useless.)

The new implementation fixes this bug. It also fixes another problem
with the old implementation as regards non-strict detection mode:

The old implementation would stop processing of the input string using
a particular candidate encoding as soon as it saw an error in that
encoding, even in non-strict mode. This means that it could not really
detect the 'closest matching encoding'; rather, what it would return
in non-strict mode was 'the encoding in which the first decoding error
is furthest from the beginning of the input string'.

In non-strict mode, the new implementation continues trying to process
the input string to its end even after seeing an error. This makes it
possible to determine in which candidate encoding the string has the
smallest number of errors, i.e. the 'closest matching encoding'.

Rejecting candidate encodings as soon as it saw an error gave the old
implementation a marked performance advantage in non-strict mode;
however, the new implementation still beats it in most cases. Here are
a few sample microbenchmark results:

  UTF-8, ~100 codepoints, strict mode
  Old: 0.080s (100,000 calls)
  New: 0.026s ("       "    )

  UTF-8, ~100 codepoints, non-strict mode
  Old: 0.079s (100,000 calls)
  New: 0.033s ("       "    )

  UTF-8, ~10000 codepoints, strict mode
  Old: 6.708s (60,000 calls)
  New: 1.383s ("      "    )

  UTF-8, ~10000 codepoints, non-strict mode
  Old: 6.705s (60,000 calls)
  New: 3.044s ("      "    )

Notice that the old implementation had almost identical performance
between strict and non-strict mode, while the new suffers a significant
performance penalty for non-strict detection. This is the cost of
implementing the behavior specified in the documentation.

A couple more sample results:

  SJIS, ~10000 codepoints, strict mode
  Old: 4.563s
  New: 1.084s

  SJIS, ~10000 codepoints, non-strict mode
  Old: 4.569s
  New: 2.863s

This is the only case I found where the new implementation loses:

  UTF-16LE, ~10000 codepoints, non-strict mode
  Old: 1.514s
  New: 2.813s

The reason is because the test strings happened to be invalid right from
the first few bytes for all the candidate encodings except for UTF-16LE;
so the old implementation would immediately reject all those encodings
and only process the entire string in UTF-16LE.

I believe mb_detect_encoding could be made much faster if we identified
good criteria for when to reject candidate encodings before reaching
the end of the input string.
2023-01-03 09:10:10 +02:00
2022-12-16 17:44:26 +01:00
2023-01-01 10:49:39 +01:00
2022-12-13 19:29:29 -05:00
2022-12-30 06:54:29 +00:00
2022-12-13 19:43:47 +01:00
2022-12-16 17:44:26 +01:00
2022-12-16 17:44:26 +01:00
2022-12-16 17:44:26 +01:00
2022-08-09 16:22:14 +02:00
2019-07-21 11:40:23 +02:00
2022-12-16 17:44:26 +01:00
2022-08-30 11:17:15 -04:00
2022-03-19 18:25:29 +01:00
2022-11-03 14:40:35 +01:00
2022-12-16 18:12:28 +01:00

The PHP Interpreter

PHP is a popular general-purpose scripting language that is especially suited to web development. Fast, flexible and pragmatic, PHP powers everything from your blog to the most popular websites in the world. PHP is distributed under the PHP License v3.01.

Push Build status Build status Build Status Fuzzing Status

Documentation

The PHP manual is available at php.net/docs.

Installation

Prebuilt packages and binaries

Prebuilt packages and binaries can be used to get up and running fast with PHP.

For Windows, the PHP binaries can be obtained from windows.php.net. After extracting the archive the *.exe files are ready to use.

For other systems, see the installation chapter.

Building PHP source code

For Windows, see Build your own PHP on Windows.

For a minimal PHP build from Git, you will need autoconf, bison, and re2c. For a default build, you will additionally need libxml2 and libsqlite3.

On Ubuntu, you can install these using:

sudo apt install -y pkg-config build-essential autoconf bison re2c \
                    libxml2-dev libsqlite3-dev

On Fedora, you can install these using:

sudo dnf install re2c bison autoconf make libtool ccache libxml2-devel sqlite-devel

Generate configure:

./buildconf

Configure your build. --enable-debug is recommended for development, see ./configure --help for a full list of options.

# For development
./configure --enable-debug
# For production
./configure

Build PHP. To speed up the build, specify the maximum number of jobs using -j:

make -j4

The number of jobs should usually match the number of available cores, which can be determined using nproc.

Testing PHP source code

PHP ships with an extensive test suite, the command make test is used after successful compilation of the sources to run this test suite.

It is possible to run tests using multiple cores by setting -jN in TEST_PHP_ARGS:

make TEST_PHP_ARGS=-j4 test

Shall run make test with a maximum of 4 concurrent jobs: Generally the maximum number of jobs should not exceed the number of cores available.

The qa.php.net site provides more detailed info about testing and quality assurance.

Installing PHP built from source

After a successful build (and test), PHP may be installed with:

make install

Depending on your permissions and prefix, make install may need super user permissions.

PHP extensions

Extensions provide additional functionality on top of PHP. PHP consists of many essential bundled extensions. Additional extensions can be found in the PHP Extension Community Library - PECL.

Contributing

The PHP source code is located in the Git repository at github.com/php/php-src. Contributions are most welcome by forking the repository and sending a pull request.

Discussions are done on GitHub, but depending on the topic can also be relayed to the official PHP developer mailing list internals@lists.php.net.

New features require an RFC and must be accepted by the developers. See Request for comments - RFC and Voting on PHP features for more information on the process.

Bug fixes don't require an RFC. If the bug has a GitHub issue, reference it in the commit message using GH-NNNNNN. Use #NNNNNN for tickets in the old bugs.php.net bug tracker.

Fix GH-7815: php_uname doesn't recognise latest Windows versions
Fix #55371: get_magic_quotes_gpc() throws deprecation warning

See Git workflow for details on how pull requests are merged.

Guidelines for contributors

See further documents in the repository for more information on how to contribute:

Credits

For the list of people who've put work into PHP, please see the PHP credits page.

Description
⚠️ ARCHIVED: Original GitHub repository no longer exists. Preserved as backup on 2026-01-22T16:25:23.756Z
Readme 1,011 MiB
Languages
C 66%
PHP 31.3%
C++ 0.8%
Shell 0.5%
M4 0.4%
Other 0.8%