The new SSE2-based implementation of mb_check_encoding for UTF-8 is
about 10% faster for 0-5 byte strings, more than 3 times faster for
~100-byte strings, and just under 4 times faster for ~10,000-byte
strings.
I believe it may be possible to make this function much faster again.
Some possible directions for further performance optimization include:
• If other ISA extensions like AVX or AVX-512 are available, use a
similar algorithm, but process text in blocks of 32 or 64 bytes
(instead of 16 bytes).
• If other SIMD ISA extensions are available, use the greater variety
of available instructions to make some of the checks tighter.
• Even if only SSE/SSE2 are available, find clever ways to squeeze
instructions out of the hot path. This would probably require a lot
of perusing instruction mauals and thinking hard about which SIMD
instructions could be used to perform the same checks with fewer
instructions.
• Find a better algorithm, possibly one where more checks could be
combined (just as the current algorithm combines the checks for
certain overlong code units and reserved codepoints).
This code path was only triggered if inst->cd == NULL. But the freeing
only happens if inst->cd != NULL. There is nothing to free here, so
remove this code. In fact, let's get rid of the goto too to make the
code more clear to read.
* PHP-8.2:
Fix wrong flags check for compression method in phar_object.c
Fix missing check for xmlTextWriterEndElement
Fix substr_replace with slots in repl_ht being UNDEF
* PHP-8.1:
Fix wrong flags check for compression method in phar_object.c
Fix missing check for xmlTextWriterEndElement
Fix substr_replace with slots in repl_ht being UNDEF
I found this issue using static analysis tools, it reported that the condition was always false.
We can see that flags is assigned in the switch statement above, but a mistake was made in the comparison.
Closes GH-10328
Signed-off-by: George Peter Banyard <girgias@php.net>
xmlTextWriterEndElement returns -1 if the call fails. There was already
a check for retval, but the return value wasn't assigned to retval. The
other caller of xmlTextWriterEndElement is in
xmlwriter_write_element_ns, which does the check correctly.
Closes GH-10324
Signed-off-by: George Peter Banyard <girgias@php.net>
The check that was supposed to check whether the array slot was UNDEF
was wrong and never triggered. This resulted in a replacement with the
empty string or the wrong string instead of the correct one. The correct
check pattern can be observed higher up in the function's code.
Closes GH-10323
Signed-off-by: George Peter Banyard <girgias@php.net>
Remove array_pad's arbitrary length restriction
The error message was wrong; it *is* possible to use a larger length.
Furthermore, there is an arbitrary restriction on the new array's
length.
Fix both by checking the length against HT_MAX_SIZE.
These are mandatory in C99, so it's a pointless waste of time to check
for them.
(Actually, the fixed-size integer types are not mandatory, but if they
are really not available on some theoretical system, PHP's fallbacks
won't work either, so nothing is gained from this check.)
Cheaper than fcntl(F_SETLK). The same is done already on Windows, so
if that works, why not use it everywhere? (Of course, only if the
compiler supports this C11 feature.)
As a bonus, the code in this commit also works on C++ via C++11
std::atomic, just in case somebody adds some C++ code to the opcache
extension one day.
* unserialize: Strictly check for `:{` at object start
* unserialize: Update CVE tests
It's unlikely that the object syntax error contributed to the actual CVE. The
CVE is rather caused by the incorrect object serialization data of the `C`
format. Add a second string without such a syntax error to ensure that path is
still executed as well to ensure the CVE is absent.
* Fix test expectation in gmp/tests/bug74670.phpt
No changes to the input required, because the test actually is intended to
verify the behavior for a missing `}`, it's just that the report position changed.
* NEWS
* UPGRADING
This make sure the tests do not fail if they are not run from the
repository root.
Closes GH-10266
Signed-off-by: George Peter Banyard <girgias@php.net>
Instead of checking the 'encoding number' to see if we are converting
case for ISO-8859-9 text, compare pointers instead.
This should free up 1 register in php_unicode_convert_case.
The capital Greek letter sigma (Σ) should be lowercased as σ except
when it appears at the end of a word; in that case, it should be
lowercased as the special form ς.
This rule is included in the Unicode data file SpecialCasing.txt.
The condition for applying the rule is called "Final_Sigma" and is
defined in Unicode technical report 21. The rule is:
• For the special casing form to apply, the capital letter sigma must
be preceded by 0 or more "case-ignorable" characters, preceded by
at least 1 "cased" character.
• Further, capital sigma must NOT be followed by 0 or more
case-ignorable characters and then at least 1 cased character.
"Case-ignorable" characters include certain punctuation marks, like
the apostrophe, as well as various accent marks. There are actually
close to 500 different case-ignorable characters, including accent marks
from Cyrillic, Hebrew, Armenian, Arabic, Syriac, Bengali, Gujarati,
Telugu, Tibetan, and many other alphabets. This category also includes
zero-width spaces, codepoints which indicate RTL/LTR text direction,
certain musical symbols, etc.
Since the rule involves scanning over "0 or more" of such
case-ignorable characters, it may be necessary to scan arbitrarily far
to the left and right of capital sigma to determine whether the special
lowercase form should be used or not. However, since we are trying to
be both memory-efficient and CPU-efficient, this implementation limits
how far to the left we will scan. Generally, we scan up to 63 characters
to the left looking for a "cased" character, but not more.
When scanning to the right, we go up to the end of the string if
necessary, even if it means scanning over thousands of characters.
Anyways, it is almost impossible to imagine that natural text will
include "words" with more than 63 successive apostrophes (for example)
followed by a capital sigma.
Closes GH-8096.
The name "new" happens to be a C++ keyword, which was the my reason to
rethink those names.
The "xlat_table" is not only used to translate pointers for persisting
scripts to shared memory, but is also used to annoate pointers
(e.g. by the JIT to associate an op_array with its jit_extension).
The names "old" and "new" aren't good for that; often, there's nothing
"old" or "new" about them. It's actually a generic lookup table, and
"old" shall be named "key" (which it is called internally already),
and "new" is renamed to simply "value".
We now have a couple of mbstring functions which have fast paths for
strings marked as 'valid UTF-8'. Later, we may likely have more. So
that these fast paths can be used more frequently, mark UTF-8 strings
emitted by mbstring as 'valid UTF-8'. This is always a correct thing
to do, because mbstring never returns invalid UTF-8 as the result of
a conversion (or similar) operation.
Internally, we do have a conversion mode which deliberately emits
invalid UTF-8 in some cases. (This is done to prevent unwanted matches
when we are converting strings to UTF-8 before performing matching
operations on them.) For such strings, don't set the 'valid UTF-8' flag.
It probably wouldn't hurt anything to set it, because strings generated
using that special conversion mode should *never* be returned to
userland, and I don't think we do anything with them which cares about
the IS_STR_VALID_UTF8 flag... but still, it would likely cause
confusion for developers.