GB18030-2022 is the current official standard, superseding the previous 2005 and 2000 versions. It is essential for modern Chinese text processing for the following reasons:
1. Superset Relationship: GB18030 is a strict superset of CP936 (GBK) and EUC-CN (GB2312). Using GB18030 as the detection target covers all characters in these older encodings while enabling support for a much wider range of characters.
2. Extended Character Coverage: The 2022 standard includes significant updates, covering over 87,000 characters. It adds support for CJK Extensions (C, D, E, F, G) and updates mappings for rare characters that were previously mapped to the Private Use Area (PUA) in the 2005 version. This is critical for correctly handling names containing rare characters (e.g., in banking or government data).
3. Backward Compatibility: It is safe to promote GB18030-2022 as the preferred encoding. Files encoded in EUC-CN or CP936 are valid GB18030 streams.
This PR adds GB18030-2022 to the default encoding list for CN.
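The backward-compatibility argument can be illustrated with a small structural check. This is a minimal sketch, not the detector touched by this PR: `gb18030_is_well_formed` is a hypothetical helper that only validates byte ranges (it does not check whether a code is assigned), and it assumes 0x80 and 0xFF are never valid lead bytes.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: checks GB18030 byte structure only, not code assignment. */
bool gb18030_is_well_formed(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b1 = s[i];

        if (b1 <= 0x7F) {
            i += 1; /* single byte: ASCII */
            continue;
        }
        /* Assumption of this sketch: 0x80 and 0xFF are rejected as lead bytes. */
        if (b1 < 0x81 || b1 > 0xFE || i + 1 >= len) {
            return false; /* bad lead byte or truncated sequence */
        }

        unsigned char b2 = s[i + 1];

        if (b2 >= 0x40 && b2 <= 0xFE && b2 != 0x7F) {
            i += 2; /* two-byte form: includes the GBK and EUC-CN ranges */
            continue;
        }
        if (b2 >= 0x30 && b2 <= 0x39) {
            /* four-byte form: lead, digit, 0x81..0xFE, digit */
            if (i + 3 >= len) {
                return false;
            }
            unsigned char b3 = s[i + 2];
            unsigned char b4 = s[i + 3];
            if (b3 < 0x81 || b3 > 0xFE || b4 < 0x30 || b4 > 0x39) {
                return false;
            }
            i += 4;
            continue;
        }
        return false;
    }
    return true;
}
```

EUC-CN double-byte sequences (both bytes in 0xA1..0xFE) fall inside the two-byte branch, so the EUC-CN bytes `D6 D0 CE C4` pass the check unchanged.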
This API can't handle references, yet callers keep forgetting that and
omit the upfront DEREF. Fix this whole class of bugs once and for all by
moving the reference handling into this Zend API.
Closes GH-18761.
This allows us to avoid a call to `zend_ini_str` which took 6% of the
profile on my i7-4790 for a call to `http_build_query`. Now we can just
grab the value from the globals.
In other files this can avoid some length recomputations.
Conversion of floating point to integer values is undefined if the
integral part of the float value cannot be represented by the integer
type. We need to cater to that explicitly (in a manner similar to
`zend_dval_to_lval_cap()`).
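A minimal sketch of such an explicit range check follows. The helper name `double_to_long_cap` is hypothetical and this is not the code from the fix; mapping NaN to 0 is an assumption of the sketch.

```c
#include <limits.h>

/* Hypothetical helper: saturating double -> long conversion. */
long double_to_long_cap(double d)
{
    if (d != d) {
        return 0; /* NaN; returning 0 is an assumption of this sketch */
    }
    /*
     * (double)LONG_MAX may round up to 2^(N-1), which does not fit in a
     * long, so the upper bound is written as -(double)LONG_MIN, which is
     * exactly representable, and compared with >=.
     */
    if (d >= -(double)LONG_MIN) {
        return LONG_MAX;
    }
    if (d <= (double)LONG_MIN) {
        return LONG_MIN;
    }
    return (long)d; /* in range: truncation toward zero is well defined */
}
```

The cast is only reached once the value is known to fit, so the undefined conversion never happens; infinities are caught by the two range comparisons.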
Closes GH-17689.
Besides not being needed, it is not valid standard C, and Clang warns
that "forward references to 'enum' types are a Microsoft extension"
(`-Wmicrosoft-enum-forward-reference`).
The behaviour is weird in the sense that the reference must get
unwrapped. What ended up happening is that when destroying the old
reference the sources list was not cleaned properly. We add handling for
that. Normally we would use ZEND_TRY_ASSIGN_STRINGL, but that doesn't
work here as it would keep the reference and change values through
references (see bug #26639).
Closes GH-16272.
* Pull zend_string* from INI directive
* Ensure that mail.force_extra_parameters INI directive does not have any nul bytes
* ext/standard: Make php_escape_shell_cmd() take a zend_string* instead of char*
This saves on an expensive strlen() computation
* Convert E_ERROR to ValueError in php_escape_shell_cmd()
* ext/standard: Make php_escape_shell_arg() take a zend_string* instead of char*
This saves on an expensive strlen() computation
* Convert E_ERROR to ValueError in php_escape_shell_arg()
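
To illustrate why a length-carrying string helps with both points above (no `strlen()` call, and a reliable NUL byte check), here is a minimal sketch. `counted_str` is a hypothetical stand-in for a string type that stores its length, not the real `zend_string` layout.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a string type that carries its own length. */
typedef struct {
    size_t len;
    const char *val;
} counted_str;

/* True if the string contains a NUL byte within its stored length. */
bool counted_str_has_nul(const counted_str *s)
{
    /* The stored length is used directly; strlen() would stop at the first NUL. */
    return memchr(s->val, '\0', s->len) != NULL;
}
```

With a plain `char*`, an embedded NUL silently truncates the value; with the stored length it can be detected and rejected.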
Because the default characters are defined in the stub file, and the
stub file is UTF-8 (typically), the characters are encoded in the string
as UTF-8. When using a different character encoding, there is a mismatch
between what mb_trim expects and the UTF-8 encoded string it gets.
One way of solving this is to make the characters argument nullable, so
that the default case always takes the internal code path in which the
default Unicode code points are stored as code point numbers rather than
in an encoded string.
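
The idea can be sketched as follows. This is only an illustration under stated assumptions, not the mbstring implementation: `count_leading_trim` and `default_trim_cps` are hypothetical names, the input is assumed to be already decoded into code points, and the default table is a small subset of the real list.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative subset of defaults, stored as code point numbers. */
static const uint32_t default_trim_cps[] = { 0x20, 0x09, 0x0A, 0x0D, 0xA0, 0x3000 };

static bool cp_in_set(uint32_t cp, const uint32_t *set, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (set[i] == cp) {
            return true;
        }
    }
    return false;
}

/*
 * Returns how many leading code points of text would be trimmed.
 * chars == NULL selects the built-in defaults, which never pass through
 * an encoded string and so cannot mismatch the caller's encoding.
 */
size_t count_leading_trim(const uint32_t *text, size_t len,
                          const uint32_t *chars, size_t nchars)
{
    if (chars == NULL) {
        chars = default_trim_cps;
        nchars = sizeof(default_trim_cps) / sizeof(default_trim_cps[0]);
    }

    size_t i = 0;
    while (i < len && cp_in_set(text[i], chars, nchars)) {
        i++;
    }
    return i;
}
```

Only an explicitly passed characters argument needs decoding from the caller's encoding; the NULL case skips that step entirely.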
Co-authored-by: @ranvis