mirror of https://github.com/php/php-src.git synced 2026-04-08 00:22:52 +02:00
Alex Dowad 3ab10da758 Take order of candidate encodings into account when guessing text encoding
The documentation for mb_detect_encoding says that this function
"Detects the most likely character encoding for string `string` from an
ordered list of candidates".

Prior to 28b346bc06, mb_detect_encoding did not really attempt to
determine the "most likely" text encoding for the input string. It
would just return the first candidate encoding for which the string was
valid. In 28b346bc06, I amended this function so that it uses heuristics
to try to guess which candidate encoding is "most likely".

However, the caller did not have any way to indicate which candidate
text encoding(s) they consider to be more likely, in case the
heuristics applied are inconclusive. In the language of Bayesian
probability, there was no way for the caller to indicate their 'prior'
assignment of probabilities.

Further, the documentation for mb_detect_encoding also says that the
second parameter `encodings` is "a list of character encodings to try,
in order". The documentation clearly implies that the order of
the `encodings` argument should be significant.

Therefore, amend mb_detect_encoding so that while it still uses
heuristics to guess the most likely text encoding for the input string,
it favors those which are earlier in the list of candidate encodings.

One complication is that many callers of mb_detect_encoding use it
in this way:

    mb_detect_encoding($string, mb_list_encodings());

In a majority of cases, this is bad code; mb_detect_encoding will be
much slower, and its results less reliable, than if a smaller list of
candidates were used. However, since such code already exists and
people are using it in production, we should not unnecessarily break it.
The order of candidate encodings obviously does not express any prior
belief of which candidates are more likely in this case, and treating
it as if it did will degrade the accuracy of the result.

Since mb_list_encodings now returns a single, immutable array on each
call, we can avoid that problem by turning off the new behavior when
we receive the array of encodings returned by mb_list_encodings.
This implementation means that if the user does this:

    $a = mb_list_encodings();
    mb_detect_encoding($string, $a);

...then the order of candidate encodings will not be considered.
However, if the user explicitly initializes their own array of all
supported legacy text encodings, then the order *will* be considered.

The other functions which also follow this new behavior are:

• mb_convert_variables
• mb_convert_encoding (when multiple candidate input encodings are
  listed)

Other places where "detection" (or really "guessing") of text encoding
may be performed include:

• mb_send_mail
• Zend engine, when determining the encoding of a PHP script
• mbstring processing of HTTP request contents, when http_input INI
  parameter is set to a list

In these cases, the new logic based on order of candidate encodings
is *not* enabled. It *might* be logical to consider the order of
candidate encodings in some or all of these cases, but I'm not sure if
that is true, so it seems wiser to avoid more behavior changes than is
necessary. Further, ever since the new encoding detection heuristics
were implemented in 28b346bc06, we have not received any complaints of
user code being broken in these areas. So I am reluctant to "fix what
isn't broken".

Well, some might say that applying the new detection heuristics
to mb_send_mail, etc. in 28b346bc06 was "fixing what wasn't broken",
but (cough cough) I don't have any comment on that...
2023-05-16 07:01:07 -07:00

The PHP Interpreter

PHP is a popular general-purpose scripting language that is especially suited to web development. Fast, flexible and pragmatic, PHP powers everything from your blog to the most popular websites in the world. PHP is distributed under the PHP License v3.01.


Documentation

The PHP manual is available at php.net/docs.

Installation

Prebuilt packages and binaries

Prebuilt packages and binaries can be used to get up and running fast with PHP.

For Windows, the PHP binaries can be obtained from windows.php.net. After extracting the archive, the *.exe files are ready to use.

For other systems, see the installation chapter.

Building PHP source code

For Windows, see Build your own PHP on Windows.

For a minimal PHP build from Git, you will need autoconf, bison, and re2c. For a default build, you will additionally need libxml2 and libsqlite3.

On Ubuntu, you can install these using:

sudo apt install -y pkg-config build-essential autoconf bison re2c \
                    libxml2-dev libsqlite3-dev

On Fedora, you can install these using:

sudo dnf install re2c bison autoconf make libtool ccache libxml2-devel sqlite-devel

Generate configure:

./buildconf

Configure your build. --enable-debug is recommended for development; see ./configure --help for a full list of options.

# For development
./configure --enable-debug
# For production
./configure

Build PHP. To speed up the build, specify the maximum number of jobs using -j:

make -j4

The number of jobs should usually match the number of available cores, which can be determined using nproc.
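As a rough sketch of that advice, the job count can be derived from nproc instead of being hard-coded. The variable name njobs, the fallback to a single job when nproc is unavailable, and the Makefile guard are assumptions added for illustration; they are not part of the PHP build instructions.

```shell
# Derive the job count from the number of available cores.
# Assumption: fall back to 1 job if nproc (GNU coreutils) is not installed.
njobs="$(nproc 2>/dev/null || echo 1)"

# Assumption: only start the build once ./configure has produced a Makefile.
if [ -f Makefile ]; then
    make -j"$njobs"
else
    echo "No Makefile here; run ./buildconf and ./configure first (would use -j$njobs)."
fi
```

Run from the root of the source tree after ./configure, this starts the build with one job per available core.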

Testing PHP source code

PHP ships with an extensive test suite. After the sources have compiled successfully, run it with make test.

It is possible to run tests using multiple cores by setting -jN in TEST_PHP_ARGS:

make TEST_PHP_ARGS=-j4 test

This runs make test with a maximum of 4 concurrent jobs. Generally, the maximum number of jobs should not exceed the number of cores available.
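One way to respect that limit is to clamp the requested job count to the core count before invoking the test suite. The sketch below only prints the resulting command. REQUESTED_JOBS is a hypothetical environment variable introduced for this example, and the fallback to one core when nproc is missing is likewise an assumption.

```shell
# Hypothetical knob: desired number of test jobs (defaults to 4).
requested="${REQUESTED_JOBS:-4}"

# Number of available cores; assume 1 if nproc is not installed.
cores="$(nproc 2>/dev/null || echo 1)"

# Never use more jobs than there are cores.
if [ "$requested" -gt "$cores" ]; then
    test_jobs="$cores"
else
    test_jobs="$requested"
fi

# Print the command instead of running it, so the sketch is safe anywhere.
echo "make TEST_PHP_ARGS=-j$test_jobs test"
```

Printing rather than executing keeps the example harmless outside a configured source tree; drop the echo and the quotes to run the tests directly.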

The qa.php.net site provides more detailed info about testing and quality assurance.

Installing PHP built from source

After a successful build (and test), PHP may be installed with:

make install

Depending on your permissions and prefix, make install may need superuser permissions.

PHP extensions

Extensions provide additional functionality on top of PHP. PHP consists of many essential bundled extensions. Additional extensions can be found in the PHP Extension Community Library - PECL.

Contributing

The PHP source code is located in the Git repository at github.com/php/php-src. Contributions are most welcome: fork the repository and send a pull request.

Discussions are done on GitHub, but depending on the topic can also be relayed to the official PHP developer mailing list internals@lists.php.net.

New features require an RFC and must be accepted by the developers. See Request for comments - RFC and Voting on PHP features for more information on the process.

Bug fixes don't require an RFC. If the bug has a GitHub issue, reference it in the commit message using GH-NNNNNN. Use #NNNNNN for tickets in the old bugs.php.net bug tracker.

Fix GH-7815: php_uname doesn't recognise latest Windows versions
Fix #55371: get_magic_quotes_gpc() throws deprecation warning
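A contributor could check locally that a commit subject carries one of these references before pushing. The helper below is a hypothetical sketch, not a tool shipped with php-src: the name has_issue_ref and the regular expression are assumptions that simply mirror the GH-NNNNNN and #NNNNNN forms described above.

```shell
# Hypothetical helper: succeeds if the commit subject references a GitHub
# issue (GH-NNNN) or a ticket from the old bugs.php.net tracker (#NNNN).
has_issue_ref() {
    printf '%s\n' "$1" | grep -Eq '(GH-|#)[0-9]+'
}

# Try it on the two example subjects plus one without any reference.
for subject in \
    "Fix GH-7815: php_uname doesn't recognise latest Windows versions" \
    "Fix #55371: get_magic_quotes_gpc() throws deprecation warning" \
    "Tidy up whitespace"
do
    if has_issue_ref "$subject"; then
        echo "has reference: $subject"
    else
        echo "no reference:  $subject"
    fi
done
```

The pattern is deliberately loose: it only confirms that a reference is present somewhere in the subject, not that the referenced issue exists.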

See Git workflow for details on how pull requests are merged.

Guidelines for contributors

See the further documents in the repository for more information on how to contribute.

Credits

For the list of people who've put work into PHP, please see the PHP credits page.

Description
⚠️ ARCHIVED: Original GitHub repository no longer exists. Preserved as backup on 2026-01-22T16:25:23.756Z