Instead of checking the 'encoding number' to see if we are converting
case for ISO-8859-9 text, compare pointers instead.
This should free up 1 register in php_unicode_convert_case.
The capital Greek letter sigma (Σ) should be lowercased as σ except
when it appears at the end of a word; in that case, it should be
lowercased as the special form ς.
This rule is included in the Unicode data file SpecialCasing.txt.
The condition for applying the rule is called "Final_Sigma" and is
defined in Unicode technical report 21. The rule is:
• For the special casing form to apply, the capital letter sigma must
be preceded by 0 or more "case-ignorable" characters, preceded by
at least 1 "cased" character.
• Further, capital sigma must NOT be followed by 0 or more
case-ignorable characters and then at least 1 cased character.
"Case-ignorable" characters include certain punctuation marks, like
the apostrophe, as well as various accent marks. There are actually
close to 500 different case-ignorable characters, including accent marks
from Cyrillic, Hebrew, Armenian, Arabic, Syriac, Bengali, Gujarati,
Telugu, Tibetan, and many other alphabets. This category also includes
zero-width spaces, codepoints which indicate RTL/LTR text direction,
certain musical symbols, etc.
Since the rule involves scanning over "0 or more" of such
case-ignorable characters, it may be necessary to scan arbitrarily far
to the left and right of capital sigma to determine whether the special
lowercase form should be used or not. However, since we are trying to
be both memory-efficient and CPU-efficient, this implementation limits
how far to the left we will scan. Generally, we scan up to 63 characters
to the left looking for a "cased" character, but not more.
When scanning to the right, we go up to the end of the string if
necessary, even if it means scanning over thousands of characters.
Anyways, it is almost impossible to imagine that natural text will
include "words" with more than 63 successive apostrophes (for example)
followed by a capital sigma.
Closes GH-8096.
The name "new" happens to be a C++ keyword, which was the my reason to
rethink those names.
The "xlat_table" is not only used to translate pointers for persisting
scripts to shared memory, but is also used to annoate pointers
(e.g. by the JIT to associate an op_array with its jit_extension).
The names "old" and "new" aren't good for that; often, there's nothing
"old" or "new" about them. It's actually a generic lookup table, and
"old" shall be named "key" (which it is called internally already),
and "new" is renamed to simply "value".
This happens because config test does not shutdown SAPI.
In addition this commit also fixes few failures when running FPM tests
under root.
Closes GH-10296
We now have a couple of mbstring functions which have fast paths for
strings marked as 'valid UTF-8'. Later, we may likely have more. So
that these fast paths can be used more frequently, mark UTF-8 strings
emitted by mbstring as 'valid UTF-8'. This is always a correct thing
to do, because mbstring never returns invalid UTF-8 as the result of
a conversion (or similar) operation.
Internally, we do have a conversion mode which deliberately emits
invalid UTF-8 in some cases. (This is done to prevent unwanted matches
when we are converting strings to UTF-8 before performing matching
operations on them.) For such strings, don't set the 'valid UTF-8' flag.
It probably wouldn't hurt anything to set it, because strings generated
using that special conversion mode should *never* be returned to
userland, and I don't think we do anything with them which cares about
the IS_STR_VALID_UTF8 flag... but still, it would likely cause
confusion for developers.