This avoids a crash in cases where the list of candidate encodings is so huge
that alloca would fail. Such crashes have been observed when the list of
encodings was larger than around 208,000 entries.
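
One common way to avoid this kind of failure is to keep a small stack buffer for the usual case and fall back to the heap for large inputs, where an allocation failure can be reported instead of crashing. The sketch below illustrates that general pattern only; `sum_with_scratch` and `SMALL_LIST_MAX` are hypothetical names, and the actual patch may be structured differently.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical threshold: lists up to this many entries use the stack buffer. */
#define SMALL_LIST_MAX 64

/* Sums `n` values through a scratch copy; returns -1 if allocation fails. */
long sum_with_scratch(const int *values, size_t n)
{
    int stack_buf[SMALL_LIST_MAX];
    int *scratch = stack_buf;
    int on_heap = 0;

    if (n > SMALL_LIST_MAX) {
        /* Too large for the stack: use the heap, which can fail gracefully. */
        if (n > SIZE_MAX / sizeof(int)) {
            return -1;
        }
        scratch = malloc(n * sizeof(int));
        if (scratch == NULL) {
            return -1;
        }
        on_heap = 1;
    }

    memcpy(scratch, values, n * sizeof(int));

    long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += scratch[i];
    }

    if (on_heap) {
        free(scratch);
    }
    return total;
}
```

With this shape, a list of a few hundred thousand entries takes the heap path rather than growing the stack frame.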
The aim of this PR is twofold:
- Reduce the number of highly similar TMP|VAR handlers
- Avoid ZVAL_DEREF in most of these cases
This is achieved by guaranteeing that all zend_compile_expr() calls, as well as
all other compile calls with BP_VAR_{R,IS}, will result in a TMP variable. This
implies that the result will not contain an IS_INDIRECT or IS_REFERENCE value,
which was mostly already the case, with two exceptions:
- Calls to return-by-reference functions. Because return-by-reference functions
are quite rare, this is solved by delegating the DEREF to the RETURN_BY_REF
handler, which will examine the stack to check whether the caller expects a
VAR or TMP to understand whether the DEREF is needed. Internal functions will
also need to adjust by calling the zend_return_unwrap_ref() function.
- By-reference assignments, including both $a = &$b, as well as $a = [&$b]. When
the result of these expressions is used in a BP_VAR_R context, the reference
is unwrapped via a ZEND_QM_ASSIGN opcode beforehand. This is exceptionally
rare.
Closes GH-20628
Thanks to the GitHub user vi3tL0u1s (Viet Hoang Luu) for reporting this issue.
The MacJapanese legacy text encoding has a very unusual property; it is possible for a string
to encode more codepoints than it has bytes. In some corner cases, this resulted in a situation
where the implementation code for mb_substr() would allocate a buffer of size -1. As you can
probably imagine, that doesn't end well.
Fixes GH-20832.
GB18030-2022 is the current official standard, superseding the previous 2005 and 2000 versions. It is essential for modern Chinese text processing for the following reasons:
1. Superset Relationship: GB18030 is a strict superset of CP936 (GBK) and EUC-CN (GB2312). Using GB18030 as the detection target covers all characters in these older encodings while enabling support for a much wider range of characters.
2. Extended Character Coverage: The 2022 standard includes significant updates, covering over 87,000 characters. It adds support for CJK Extensions (C, D, E, F, G) and updates mappings for rare characters that were previously mapped to the Private Use Area (PUA) in the 2005 version. This is critical for correctly handling names containing rare characters (e.g., in banking or government data).
3. Backward Compatibility: It is safe to promote GB18030-2022 as the preferred encoding. Files encoded in EUC-CN or CP936 are valid GB18030 streams.
This PR adds GB18030-2022 to the default encoding list for CN.
For tests that exercise other mbstring/iconv functions rather than `mb_internal_encoding()` itself, set the internal encoding in the test's INI section instead of calling `mb_internal_encoding()`.
We prevent signed overflow by making the count unsigned. The actual
interpretation of the count doesn't matter as it's just used to denote a
limit.
The test output for some limit values looks strange though, so that may
need extra investigation. However, that's orthogonal to this fix.
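
As a rough illustration of why an unsigned count is safe here: decrementing an unsigned value past zero wraps around with well-defined behaviour, whereas decrementing a signed value past its minimum is undefined. The function below is a hypothetical stand-in, not the patched code; `count_pieces` and its comma-splitting logic are assumptions made for the example.

```c
#include <stddef.h>

/*
 * Counts comma-separated pieces in `s`, stopping once `limit` pieces exist.
 * `limit` is unsigned, so `--limit` on 0 wraps to SIZE_MAX (well-defined)
 * and effectively means "no limit", as does a negative value cast to size_t.
 */
size_t count_pieces(const char *s, size_t limit)
{
    size_t pieces = 1;

    for (; *s != '\0'; s++) {
        if (*s == ',') {
            if (--limit == 0) {
                break; /* limit reached: the rest stays in the last piece */
            }
            pieces++;
        }
    }
    return pieces;
}
```

Here `count_pieces("a,b,c", 2)` yields 2, while a limit of 0 or `(size_t)-1` yields 3 because the counter never reaches zero within the input.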
Closes GH-18906.
This API can't handle references, yet everyone keeps forgetting that it
can't and that you should DEREF upfront. Fix every type of this issue
once and for all by moving the reference handling to this Zend API.
Closes GH-18761.
Conversion of floating point to integer values is undefined if the
integral part of the float value cannot be represented by the integer
type. We need to cater to that explicitly (in a manner similar to
`zend_dval_to_lval_cap()`).
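
A minimal sketch of such an explicit range check, written against plain `long` rather than the Zend types. `double_to_long_capped` is a hypothetical helper; mapping NaN to 0 and saturating out-of-range values are assumptions for illustration and not necessarily what the patch does.

```c
#include <limits.h>

/* Converts `d` to long without undefined behaviour for out-of-range input. */
long double_to_long_capped(double d)
{
    if (d != d) {
        return 0; /* NaN compares unequal to itself */
    }
    /* (double)LONG_MAX may round up to 2^63, hence >= rather than >. */
    if (d >= (double)LONG_MAX) {
        return LONG_MAX;
    }
    if (d <= (double)LONG_MIN) {
        return LONG_MIN;
    }
    return (long)d; /* in range: truncates toward zero */
}
```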
Closes GH-17689.
The behaviour is weird in the sense that the reference must get
unwrapped. What ended up happening is that when destroying the old
reference the sources list was not cleaned properly. We add handling for
that. Normally we would use ZEND_TRY_ASSIGN_STRINGL, but that doesn't
work here as it would keep the reference and change values through
references (see bug #26639).
Closes GH-16272.
Updates UCD to Unicode 16.0 (released 2024 Sept).
Previously: 0fdffc18, #7502, #14680
Unicode 16 adds several new character sets and case folding rules.
However, the existing ucgendat script can still parse them.
This also adds a couple test cases to make sure the new rules for
East Asian Wide characters and case folding work correctly. These
tests fail on Unicode 15.1 and older because those versions do not
contain those rules.
The comparison was changed from strcasecmp to strncasecmp.
However, strncasecmp takes the number of bytes to compare as its third
parameter, so a length check is also added for the MIME name and aliases.
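
The need for the extra length check can be sketched as follows: strncasecmp alone only compares a prefix, so a shorter input would match any candidate that starts with it. `name_matches` is a hypothetical helper that assumes the input is a pointer-plus-length string and the candidate is NUL-terminated; the real code may differ.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>

/*
 * Returns nonzero if the first `name_len` bytes of `name` equal `candidate`,
 * ignoring ASCII case. The length comparison rejects prefix-only matches.
 */
int name_matches(const char *name, size_t name_len, const char *candidate)
{
    return strlen(candidate) == name_len
        && strncasecmp(name, candidate, name_len) == 0;
}
```

Without the `strlen` comparison, an input of `"utf"` with length 3 would be accepted as a match for `"UTF-8"`.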
Co-authored-by: Niels Dossche <7771979+nielsdos@users.noreply.github.com>
Updates UCD to Unicode 15.1 (released 2023 Sept). The upcoming
Unicode 16 version is expected to be released around 2024 Sept.
Previously: 0fdffc18, #7502
UCD 15.1 `DerivedNormalizationProps` contains multiple properties in
the same line, which breaks the parser. This also updates the
`ucgendat.php` script to allow two or three fields on each line, and to
look for the `Cased` and `Case_Ignorable` properties in either of the
fields to mimic the previous behavior.
Because the default characters are defined in the stub file, and the
stub file is UTF-8 (typically), the characters are encoded in the string
as UTF-8. When using a different character encoding, there is a mismatch
between what mb_trim expects and the UTF-8 encoded string it gets.
One way of solving this is by making the characters argument nullable,
which would mean that it always uses the internal code path that has the
Unicode codepoints that are defaulted actually stored as codepoint
numbers instead of in a string.
Co-authored-by: @ranvis