In GitHub issue 9613, it was reported that mb_strpos wrongly matches the
character '?' against any invalid string, even when the character '?'
clearly does not appear in the invalid string. This behavior has existed
at least since PHP 5.2.
The reason for the behavior is that mb_strpos internally converts the
haystack and needle to UTF-8 before performing a search. When converting
to UTF-8, regardless of the setting of mb_substitute_character, libmbfl
would use '?' as an error marker for invalid byte sequences. Once those
invalid input sequences were replaced with '?', then naturally, they
would match against occurrences of the actual character '?' (when it
appeared as a 'normal' character, not as an error marker). This would
happen regardless of whether the error was in the haystack and '?' was
used in the needle, or whether the error was in the needle and '?' was
used in the haystack.
Why would libmbfl use '?' rather than the mb_substitute_character set
by the user? Remember that libmbfl was originally a separate library
which was imported into the PHP codebase. mb_substitute_character is an
mbstring API function, not something built into libmbfl. When mbstring
would call into libmbfl, it would provide the error replacement
character to libmbfl as a parameter. However, when libmbfl would perform
conversion operations internally, and not because of a direct call from
mbstring, it would use its own error replacement character.
Example:
<?php
$questionMark = "\x00?";
$badUTF16 = "\xDB\x00"; // half of a surrogate pair
echo mb_strpos($questionMark, $badUTF16, 0, 'UTF-16BE'), "\n";
echo mb_strpos($badUTF16, $questionMark, 0, 'UTF-16BE'), "\n";
Incidentally, this behavior does not occur if the text encoding is
UTF-8, because no conversion is needed in that case.
mb_stripos had a similar issue, but instead of always using '?' as an
error marker internally, it would use the selected
mb_substitute_character. So, for example, if the mb_substitute_character
was '%', then occurrences of '%' in the haystack would match invalid
bytes in the needle, and vice versa.
Example:
<?php
mb_substitute_character(0x25); // '%'
$percent = "\x00%";
$badUTF16 = "\xDB\x00"; // half of a surrogate pair
echo mb_stripos($percent, $badUTF16, 0, 'UTF-16BE'), "\n";
echo mb_stripos($badUTF16, $percent, 0, 'UTF-16BE'), "\n";
This behavior (of mb_stripos) still occurs even if the text encoding is
UTF-8, because case folding is still needed to make the search
case-insensitive.
It is not hard to think of scenarios where these strange and unintuitive
behaviors could cause security vulnerabilities. In the discussion on
GH issue 9613, Christoph Becker suggested that mb_str{i,}pos should
simply refuse to operate on invalid strings. However, this would almost
certainly break existing production code.
This commit mitigates the problem in a less intrusive way: it ensures
that while invalid haystacks can match invalid needles (even if the
specific invalid bytes are different), invalid bytes in the haystack
will never match '?' OR occurrences of the mb_substitute_character in
the needle, and vice versa.
This does represent a backwards compatibility break, but a small one.
Since it mitigates a potential security problem, I believe this is
appropriate.
Closes GH-9613.
Instead of case-folding a string and then converting it to UTF-8 as a
separate operation, why not convert it to UTF-8 at the same time as
we fold case?
For non-UTF-8 encodings, this typically makes mb_stripos about 2x
faster.
The check checks whether p is non-NULL. But if it were NULL the function
would crash in later code, so the check is useless.
It seems like *p was intended, but that is redundant as well because
isspace would return false on '\0'.
&& and || should always evaluate to a boolean instead of the lhs/rhs.
This optimization never gets triggered for any of our tests.
Additionally, even if triggered this instruction gets optimized away
because the else branch of the JMP instruction will overwrite the tmp
value.
* PHP-8.2:
Replace another root XML element format to the "canonical" one
Remove the superfluous closing parentheses from class synopsis page includes
Always include the constructor on the class manual pages
Backport methodsynopsis role attributes changes from master
The performance gain from this change depends on the text encoding and
input string size. For very small strings, other overheads tend to swamp
the performance gains to some extent, such that the speedup is less than
2x. For medium-length strings (~100 bytes or so), the speedup is
typically around 2.5x.
The greatest performance gains are for UTF-8 strings which have already
been marked as valid (using the GC flags on the zend_string object);
for those, the speedup is more than 10x in many cases.
The previous implementation first converted the haystack and needle to
wchars, then searched for matches between the two sequences of wchars.
Because we use -1 as an error marker when converting to wchars, error
markers from invalid byte sequences in the haystack would match error
markers from invalid byte sequences in the needle, even if the specific
invalid byte sequence was different. I am not sure whether this behavior
is really desirable or not, but anyways, this new implementation
follows the same behavior so as not to cause BC breaks.
* random: Add Randomizer::nextFloat()
* random: Check that doubles are IEEE-754 in Randomizer::nextFloat()
* random: Add Randomizer::nextFloat() tests
* random: Add Randomizer::getFloat() implementing the y-section algorithm
The algorithm is published in:
Drawing Random Floating-Point Numbers from an Interval. Frédéric
Goualard, ACM Trans. Model. Comput. Simul., 32:3, 2022.
https://doi.org/10.1145/3503512
* random: Implement getFloat_gamma() optimization
see https://github.com/php/php-src/pull/9679/files#r994668327
* random: Add Random\IntervalBoundary
* random: Split the implementation of γ-section into its own file
* random: Add tests for Randomizer::getFloat()
* random: Fix γ-section for 32-bit systems
* random: Replace check for __STDC_IEC_559__ by compile-time check for DBL_MANT_DIG
* random: Drop nextFloat_spacing.phpt
* random: Optimize Randomizer::getFloat() implementation
* random: Reject non-finite parameters in Randomizer::getFloat()
* random: Add NEWS/UPGRADING for Randomizer’s float functionality
The recently committed fix for GH-9944 did only indirectly cater to
that, namely because in this case `CreateFileMapping()` with a zero
size couldn't be created. As of PHP 8.2.0, the mappings of the actual
SHM and the info segment have been merged, so creating a zero size SHM
would be possible unless we explicitly prohibit this.
`ap_get_brigade()` may fail for different reasons, and we must not
pretend that a partially read POST payload is fine; instead we report
a content length of zero what matches all other `read_post()` callbacks
of bundled SAPIs.
Closes GH-10059.
The code checks if stack is a NULL pointer. Below that if the
stack->next pointer is updated unconditionally. Therefore a call with a
NULL pointer will crash, even though the if (stack) check seems to show
the intent that it is valid to call the function with NULL.
The function is not meant to be called with NULL, so just ZEND_ASSERT
instead.
On longer MacJapanese strings, conversion speed is boosted by 60-80%.
On medium-length strings, conversion speed is boosted around 20-30%.
For very short strings, there is no appreciable difference.
While benchmarking the new implementation of mb_substr, I found it was
slower than the old one only when the selected encoding was SJIS.
Investigation showed that the new text conversion filter for SJIS
was a touch slower than the old one.
With this optimization, the new SJIS decoder is about 20% faster than
the old one.
This boosts the performance of mb_strpos, mb_stripos, mb_strrpos,
mb_strripos, mb_strstr, mb_stristr, mb_strrchr, and mb_strrichr when
used on non-UTF-8 strings. mb_substr is also faster.
With UTF-8 input, there is no appreciable difference in performance for
mb_strpos, mb_stripos, mb_strrpos, etc. This is expected, since the only
real difference here (aside from shorter and simpler code) is that the
new text conversion code is used when converting non-UTF-8 input strings
to UTF-8. (This is done because internally, mb_strpos, etc. work only
on UTF-8 text.)
For ASCII, speed is boosted by 30-65%. For other legacy text encodings,
the degree of performance improvement will depend on how slow the
legacy conversion code was.
One other minor, but notable difference is that strings encoded using
UTF-8 variants from Japanese mobile vendors (SoftBank, KDDI, Docomo)
will not undergo encoding conversion but will be processed "as is". It
is expected that this will result in a large performance boost for
such input strings; but realistically, the number of users who work
with such strings is probably minute.
I was not originally planning to include mb_substr in this commit, but
fuzzing of the reimplemented mb_strstr revealed that mb_substr needed
to be reimplemented, too; using the old mbfl_substr, which was based
on the old text conversion filters, in combination with functions which
use the new text conversion filters caused bugs.
The performance boost for mb_substr varies from 10%-500%, depending
on the encoding and input string used.