This happens because config test does not shutdown SAPI.
In addition this commit also fixes few failures when running FPM tests
under root.
Closes GH-10296
We now have a couple of mbstring functions which have fast paths for
strings marked as 'valid UTF-8'. Later, we may likely have more. So
that these fast paths can be used more frequently, mark UTF-8 strings
emitted by mbstring as 'valid UTF-8'. This is always a correct thing
to do, because mbstring never returns invalid UTF-8 as the result of
a conversion (or similar) operation.
Internally, we do have a conversion mode which deliberately emits
invalid UTF-8 in some cases. (This is done to prevent unwanted matches
when we are converting strings to UTF-8 before performing matching
operations on them.) For such strings, don't set the 'valid UTF-8' flag.
It probably wouldn't hurt anything to set it, because strings generated
using that special conversion mode should *never* be returned to
userland, and I don't think we do anything with them which cares about
the IS_STR_VALID_UTF8 flag... but still, it would likely cause
confusion for developers.
* random: Randomizer::getFloat(): Fix check for empty open intervals
The check for invalid parameters for the IntervalBoundary::OpenOpen variant was
not correct: If two consecutive doubles are passed as parameters, the resulting
interval is empty, resulting in an uint64 underflow in the γ-section
implementation.
Instead of checking whether `$min < $max`, we must check that there is at least
one more double between `$min` and `$max`, i.e. it must hold that:
nextafter($min, $max) != $max
Instead of duplicating the comparatively complicated and expensive `nextafter`
logic for a rare error case we instead return `NAN` from the γ-section
implementation when the parameters result in an empty interval and thus underflow.
This allows us to reliably detect this specific error case *after* the fact,
but without modifying the engine state. It also provides reliable error
reporting for other internal functions that might use the γ-section
implementation.
* random: γ-section: Also check that that min is smaller than max
This extends the empty-interval check in the γ-section implementation with a
check that min is actually the smaller of the two parameters.
* random: Use PHP_FLOAT_EPSILON in getFloat_error.phpt
Co-authored-by: Christoph M. Becker <cmbecker69@gmx.de>
This version replaces SPACEs before the meridian with NARROW NO-BREAK
SPACEs. Thus, we split the affected test cases as usual.
(cherry picked from commit 8dd51b462d)
Fixes GH-10262.
A copy of this piece of code exists in zend_jit_compile_side_trace(),
but there, the leak bug does not exist.
This bug exists since both copies of this piece of code were added in
commit 4bf2d09ede
One small piece of this was obtained from Stack Overflow. According to
Stack Overflow's Terms of Service, all user-contributed code on SO is
provided under a Creative Commons license. I believe this license is
compatible with the code being included in PHP.
Benchmarking results (UTF-8 only, for strings which have already been
checked using mb_check_encoding):
For very short (0-5 byte) strings, mb_strlen is 12% faster.
The speedup gets greater and greater on longer input strings; for
strings around 100KB, mb_strlen is 23 times faster.
Currently the 'fast' code is gated behind a GC flag check which ensures
it is only used on strings which have already been checked for UTF-8
validity. This is because the accelerated code will return different
results on some invalid UTF-8 strings.
zend_get_property_guard previously assumed that at least "str" has a
pre-computed hash. This is not always the case, for example when a
string is created by bitwise operations, its hash is not set. Instead of
forcing a computation of the hashes, drop the hash comparison.
Closes GH-10254
Co-authored-by: Changochen <changochen1@gmail.com>
Signed-off-by: George Peter Banyard <girgias@php.net>
I like the asm which gcc -O3 generates on this modified code...
and guess what: my CPU likes it too!
(The asm is noticeably tighter, without any extra operations in the
path which dispatches to the code for decoding a 1-byte, 2-byte,
3-byte, or 4-byte character. It's just CMP, conditional jump, CMP,
conditional jump, CMP, conditional jump.
...Though I was admittedly impressed to see gcc could implement the
boolean expression `c >= 0xC2 && c <= 0xDF` with just 3 instructions:
add, CMP, then conditional jump. Pretty slick stuff there, guys.)
Benchmark results:
UTF-8, short - to UTF-16LE faster by 7.36% (0.0001 vs 0.0002)
UTF-8, short - to UTF-16BE faster by 6.24% (0.0001 vs 0.0002)
UTF-8, medium - to UTF-16BE faster by 4.56% (0.0003 vs 0.0003)
UTF-8, medium - to UTF-16LE faster by 4.00% (0.0003 vs 0.0003)
UTF-8, long - to UTF-16BE faster by 1.02% (0.0215 vs 0.0217)
UTF-8, long - to UTF-16LE faster by 1.01% (0.0209 vs 0.0211)
MacJapanese has a somewhat unusual feature that when mapped to
Unicode, many characters map to sequences of several codepoints.
Add test cases demonstrating how mb_str_split and mb_substr behave in
this situation.
When adding these tests, I found the behavior of mb_substr was wrong
due to an inconsistency between the string "length" as measured by
mb_strlen and the number of native MacJapanese characters which
mb_substr would count when iterating over the string using the
mblen_table. This has been fixed.
I believe that mb_strstr will also return wrong results in some cases
for MacJapanese. I still need to come up with unit tests which
demonstrate the problem and figure out how to fix it.
Various mbstring legacy text encodings have what is called an 'mblen_table';
a table which gives the length of a multi-byte character using a lookup on
the first byte value. Several mbstring functions have a 'fast path' which uses
this table when it is available.
However, it turns out that iterating through a string using the mblen_table
is surprisingly slow. I found that by deleting this 'fast path' from mb_strlen,
while mb_strlen becomes a few percent slower on very small strings (0-5 bytes),
very large performance gains can be achieved on medium to long input strings.
Part of the reason for this is because our text decoding filters are so much
faster now.
Here are some benchmarks:
EUC-KR, short (0-5 chars) - master faster by 11.90% (0.0000 vs 0.0000)
EUC-JP, short (0-5 chars) - master faster by 10.88% (0.0000 vs 0.0000)
BIG-5, short (0-5 chars) - master faster by 10.66% (0.0000 vs 0.0000)
UTF-8, short (0-5 chars) - master faster by 8.91% (0.0000 vs 0.0000)
CP936, short (0-5 chars) - master faster by 6.27% (0.0000 vs 0.0000)
UHC, short (0-5 chars) - master faster by 5.38% (0.0000 vs 0.0000)
SJIS, short (0-5 chars) - master faster by 5.20% (0.0000 vs 0.0000)
UTF-8, medium (~100 chars) - new faster by 127.51% (0.0004 vs 0.0002)
UTF-8, long (~10000 chars) - new faster by 87.94% (0.0319 vs 0.0170)
UTF-8, very long (~100000 chars) - new faster by 88.25% (0.3199 vs 0.1699)
SJIS, medium (~100 chars) - new faster by 208.89% (0.0004 vs 0.0001)
SJIS, long (~10000 chars) - new faster by 253.57% (0.0319 vs 0.0090)
CP936, medium (~100 chars) - new faster by 126.08% (0.0004 vs 0.0002)
CP936, long (~10000 chars) - new faster by 200.48% (0.0319 vs 0.0106)
EUC-KR, medium (~100 chars) - new faster by 146.71% (0.0004 vs 0.0002)
EUC-KR, long (~10000 chars) - new faster by 212.05% (0.0319 vs 0.0102)
EUC-JP, medium (~100 chars) - new faster by 186.68% (0.0004 vs 0.0001)
EUC-JP, long (~10000 chars) - new faster by 295.37% (0.0320 vs 0.0081)
BIG-5, medium (~100 chars) - new faster by 173.07% (0.0004 vs 0.0001)
BIG-5, long (~10000 chars) - new faster by 269.19% (0.0319 vs 0.0086)
UHC, medium (~100 chars) - new faster by 196.99% (0.0004 vs 0.0001)
UHC, long (~10000 chars) - new faster by 256.39% (0.0323 vs 0.0091)
This does raise the question: is using the 'mblen_table' worthwhile for
other mbstring functions, such as mb_str_split? The answer is yes, it
is worthwhile; you see, while mb_strlen only needs to decode the input
string but not re-encode it, when mb_str_split is implemented using
the conversion filters, it needs to both decode the string and then
re-encode it. This means that there is more potential to gain
performance by using the 'mblen_table'. Benchmarking shows that in a
few cases, mb_str_split becomes faster when the 'mblen_table fast path'
is deleted, but in the majority of cases, it becomes slower.
ctx can never be zero in these functions because they are dispatched
virtually by looking up their entries in ctx. Furthermore, 2 of these
checks never actually worked because ctx was dereferenced before ctx was
NULL-checked.