The aim of this PR is twofold:
- Reduce the number of highly similar TMP|VAR handlers
- Avoid ZVAL_DEREF in most of these cases
This is achieved by guaranteeing that all zend_compile_expr() calls, as well as
all other compile calls with BP_VAR_{R,IS}, will result in a TMP variable. This
implies that the result will not contain an IS_INDIRECT or IS_REFERENCE value,
which was mostly already the case, with two exceptions:
- Calls to return-by-reference functions. Because return-by-reference functions
are quite rare, this is solved by delegating the DEREF to the RETURN_BY_REF
handler, which examines the stack to determine whether the caller expects a
VAR or a TMP, and hence whether the DEREF is needed. Internal functions also
need to adjust, by calling the zend_return_unwrap_ref() function.
- By-reference assignments, both $a = &$b and $a = [&$b]. When the result of
such an expression is used in a BP_VAR_R context, the reference is unwrapped
beforehand via a ZEND_QM_ASSIGN opcode. This is exceptionally rare (see the
sketch after this list).
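For illustration, here is a minimal PHP sketch of the two exceptional cases above; the function name is made up and the comments describe the intended behaviour rather than the exact opcodes emitted:
```php
<?php
// 1) Return-by-reference function whose result is consumed as a plain value
//    (BP_VAR_R): the reference has to be unwrapped for the caller.
function &counter(): int {
    static $n = 0;
    $n++;
    return $n;
}
$copy = counter();       // reads the value, not the reference

// 2) A by-reference assignment whose result is itself read:
//    the reference is unwrapped beforehand (ZEND_QM_ASSIGN).
$b = 1;
$c = ($a = &$b);         // $c holds int(1); it is not a reference to $b

var_dump($copy, $c);     // int(1), int(1)
```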
Closes GH-20628
Thanks to the GitHub user vi3tL0u1s (Viet Hoang Luu) for reporting this issue.
The MacJapanese legacy text encoding has a very unusual property: it is possible for a string
to encode more codepoints than it has bytes. In some corner cases, this could cause the
implementation of mb_substr() to allocate a buffer of size -1. As you can probably imagine,
that doesn't end well.
Fixes GH-20832.
Over the last few years, I refactored mbstring to perform encoding conversion
a buffer at a time, rather than a single byte at a time. This resulted in a
huge performance increase.
After the refactoring, the old "byte-at-a-time" code was retained for two
reasons:
1) It was used by the mailparse PECL extension.
2) It was used to implement mb_strcut for some text encodings.
However, after reviewing mailparse's use of mbstring, it is clear that
mailparse only relies on mbstring for decoding of QPrint, and possibly
Base64. It does not use the byte-at-a-time conversion code for any
other encoding.
Further, mb_strcut only relies on the byte-at-a-time conversion code
for a limited number of legacy text encodings, such as ISO-2022-JP,
HZ, UTF-7, etc.
Hence, we can remove over 5000 lines of unused code without breaking
anything. This will help to reduce binary size, and make the mbstring
codebase easier to navigate for new contributors.
The legacy mbfl_strcut function is only used to implement mb_strcut
for legacy text encodings which 1) do not use a fixed number of bytes
per codepoint, 2) do not have an 'mblen_table' which can be used to
quickly determine the byte length of a character from its leading byte,
and 3) do not have a specialized 'mb_cut' function which implements
mb_strcut for that text encoding.
Remove unused code from mbfl_strcut, and leave only what is currently
needed for the implementation of mb_strcut.
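For context, this is the byte-oriented contract that mb_strcut exposes to userland. UTF-8 is used here only for readability; UTF-8 has an mblen_table and therefore does not itself go through mbfl_strcut:
```php
<?php
// "日本語" is 9 bytes in UTF-8 (3 bytes per character). Asking for 7 bytes
// would split the third character, so mb_strcut() shortens the cut to 6 bytes.
var_dump(mb_strcut("日本語", 0, 7, "UTF-8"));  // string(6) "日本"
```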
mbstring's Unicode case conversion is table-driven, using Minimal Perfect Hash tables.
However, for small codepoint values, we bypass the hashtable lookup and just use
hard-coded conversion logic (i.e. adding or subtracting 0x20 from the appropriate
ASCII range).
For uppercasing and lowercasing, we had already tuned the conditional which sends
execution down this fast path so that it covers as many codepoint values as
possible. However, for case folding, this had not yet been done.
This will give a small performance boost for case-folding Unicode text which
includes non-breaking spaces, symbols like ¥ or ™, or accented Latin
characters (used in many European languages).
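As a userland illustration only (the fast path itself lives in the C implementation), case folding is reached via mb_convert_case() with MB_CASE_FOLD:
```php
<?php
// The input mixes ASCII with the kinds of characters mentioned above:
// accented Latin letters, '¥' (U+00A5) and '™' (U+2122).
echo mb_convert_case("Crème Brûlée costs ¥500™", MB_CASE_FOLD, "UTF-8");
// -> crème brûlée costs ¥500™
```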
GB18030-2022 is the current official standard, superseding the previous 2005 and 2000 versions. It is essential for modern Chinese text processing for the following reasons:
1. Superset Relationship: GB18030 is a strict superset of CP936 (GBK) and EUC-CN (GB2312). Using GB18030 as the detection target covers all characters in these older encodings while enabling support for a much wider range of characters.
2. Extended Character Coverage: The 2022 standard includes significant updates, covering over 87,000 characters. It adds support for CJK Extensions (C, D, E, F, G) and updates mappings for rare characters that were previously mapped to the Private Use Area (PUA) in the 2005 version. This is critical for correctly handling names containing rare characters (e.g., in banking or government data).
3. Backward Compatibility: It is safe to promote GB18030-2022 as the preferred encoding. Files encoded in EUC-CN or CP936 are valid GB18030 streams.
This PR adds GB18030-2022 to the default encoding list for CN.
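A small sketch of the superset point from userland; the encoding names are those exposed by mbstring, and the detection list is passed explicitly here rather than relying on the defaults this PR changes:
```php
<?php
// Bytes produced for the older CP936 (GBK) encoding are also a valid
// GB18030 stream, so strict detection against GB18030 still matches them.
$gbk = mb_convert_encoding("汉字", "CP936", "UTF-8");
var_dump(mb_detect_encoding($gbk, ["ASCII", "GB18030"], true));
// -> string(7) "GB18030"
```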
The issue is specific to SLES15.
Arguably this should be reported to them, as it seems they meddled with the
oniguruma source code.
The definition in oniguruma.h on that platform looks like this (same as upstream):
```c
ONIG_EXTERN
int onig_error_code_to_str PV_((OnigUChar* s, int err_code, ...));
```
Where `PV_` is defined as follows (this part differs from upstream):
```c
#ifndef PV_
#ifdef HAVE_STDARG_PROTOTYPES
# define PV_(args) args
#else
# define PV_(args) ()
#endif
#endif
```
So that means that `HAVE_STDARG_PROTOTYPES` is unset.
It gets defined if we define `HAVE_STDARG_H`, which we can safely do because
PHP requires at least C99, in which <stdarg.h> is always available.
We could also add an autoconf check for the header, but that isn't really
necessary, as it would always succeed.
Moves the usage of `mb_internal_encoding()` into the INI section for tests that are not testing the encoding handling itself, but other mbstring/iconv functions.
We prevent signed overflow by making the count unsigned. The actual
interpretation of the count doesn't matter as it's just used to denote a
limit.
The test output for some limit values looks strange though, so that may
need extra investigation. However, that's orthogonal to this fix.
Closes GH-18906.
This API can't handle references, yet everyone keeps forgetting that it can't
and that the caller must DEREF upfront. Fix this whole class of issue once and
for all by moving the reference handling into the Zend API itself.
Closes GH-18761.