These macros are designed only for literals. This is a
standards compliant trick to ensure they are literals.
For example, these are the same:
"Hello" " world"
"Hello world"
Shift header include
In the C file, include the header first so missing #includes are
detected by the compiler, and use lighter header dependencies in the
header, to speed up compile times.
For many years, the code has contained a TODO comment indicating
that the original author had wanted to do this.
Using smart_str makes the code shorter and cleaner, and it is another
step towards removing a bunch of legacy mbstring code which will soon
be unneeded.
Regarding the optional 3rd `strict` argument to mb_detect_encoding,
the documentation states:
Controls the behaviour when string is not valid in any of the listed encodings.
If strict is set to false, the closest matching encoding will be returned;
if strict is set to true, false will be returned.
(Ref: https://www.php.net/manual/en/function.mb-detect-encoding.php)
Because of bugs in the implementation, mb_detect_encoding did not always
behave according to this description when `strict` was false.
For example:
<?php
echo var_export(mb_detect_encoding("\xc0\x00", "UTF-8", false));
// Before this commit, prints: false
// After this commit, prints: 'UTF-8'
Because `strict` is false in the above example, mb_detect_encoding
should return the 'closest matching encoding', which is UTF-8, since
that is the only candidate encoding. (Incidentally, this example shows
that using mb_detect_encoding with a single candidate encoding in
non-strict mode is useless.)
The new implementation fixes this bug. It also fixes another problem
with the old implementation as regards non-strict detection mode:
The old implementation would stop processing of the input string using
a particular candidate encoding as soon as it saw an error in that
encoding, even in non-strict mode. This means that it could not really
detect the 'closest matching encoding'; rather, what it would return
in non-strict mode was 'the encoding in which the first decoding error
is furthest from the beginning of the input string'.
In non-strict mode, the new implementation continues trying to process
the input string to its end even after seeing an error. This makes it
possible to determine in which candidate encoding the string has the
smallest number of errors, i.e. the 'closest matching encoding'.
Rejecting candidate encodings as soon as it saw an error gave the old
implementation a marked performance advantage in non-strict mode;
however, the new implementation still beats it in most cases. Here are
a few sample microbenchmark results:
UTF-8, ~100 codepoints, strict mode
Old: 0.080s (100,000 calls)
New: 0.026s (" " )
UTF-8, ~100 codepoints, non-strict mode
Old: 0.079s (100,000 calls)
New: 0.033s (" " )
UTF-8, ~10000 codepoints, strict mode
Old: 6.708s (60,000 calls)
New: 1.383s (" " )
UTF-8, ~10000 codepoints, non-strict mode
Old: 6.705s (60,000 calls)
New: 3.044s (" " )
Notice that the old implementation had almost identical performance
between strict and non-strict mode, while the new suffers a significant
performance penalty for non-strict detection. This is the cost of
implementing the behavior specified in the documentation.
A couple more sample results:
SJIS, ~10000 codepoints, strict mode
Old: 4.563s
New: 1.084s
SJIS, ~10000 codepoints, non-strict mode
Old: 4.569s
New: 2.863s
This is the only case I found where the new implementation loses:
UTF-16LE, ~10000 codepoints, non-strict mode
Old: 1.514s
New: 2.813s
The reason is because the test strings happened to be invalid right from
the first few bytes for all the candidate encodings except for UTF-16LE;
so the old implementation would immediately reject all those encodings
and only process the entire string in UTF-16LE.
I believe mb_detect_encoding could be made much faster if we identified
good criteria for when to reject candidate encodings before reaching
the end of the input string.
Aside from being used by the pcre extension, various functions in
mbstring are faster if the input strings are known to be valid UTF-8.
We might as well mark the default interned strings (which are
initialized when PHP starts up) as valid UTF-8 where appropriate.
There is no great difference between the old and new code for text
encodings which either 1) use a fixed number of bytes per codepoint or
2) for which we have an 'mblen' table which enables us to find the
length of a multi-byte character using a table lookup indexed by the
first byte value.
The big difference is for other text encodings, where we have to
actually decode the string to split it. For such text encodings,
such as ISO-2022-JP and UTF-16, I measured a speedup of 50%-120% over
the previous implementation.
The issue was that passwd was empty for the issue reporter, but the test
expected passwd to be non-empty. An empty passwd can occur if there is
no (encrypted) group password set up.
This occurs because the array of properties is a single element with an
integer key, not an associative array. Therefore it is a packed array
and thus the assumption the iteration macro makes is invalid.
This restores the behaviour of PHP<8.2.
Closes GH-10209
Co-authored-by: Deltik <deltik@gmx.com>
Signed-off-by: George Peter Banyard <girgias@php.net>
This occurs because the array of properties is a single element with an
integer key, not an associative array. Therefore it is a packed array
and thus the assumption the iteration macro makes is invalid.
This restores the behaviour of PHP<8.2.
Closes GH-10209
Co-authored-by: Deltik <deltik@gmx.com>
Signed-off-by: George Peter Banyard <girgias@php.net>
Until https://github.com/php/php-src/pull/10098 default constructors were sometimes documented, sometimes omitted. The above PR made adding documentation for constructor of all non-abstract classes required. However, this is not desired when such a class inherits a constructor from its parent. The current PR fixes the behavior the following way:
- documents inherited constructors (along with destructors)
- makes it possible to generate/replace the methodsynopsis of implicit default constructors which don't have a stub counterpart
Each test should use its own temporary filenames to avoid issues when
the tests are executed in parallel[1]. We also silence the `unlink()`
calls in the CLEAN section just in case.
And while we're at it, we also remove the erroneous comment; there is
no symlinking involved for the Windows test variants.
[1] <https://github.com/php/php-src/pull/10175#issuecomment-1366809933>
Closes GH-10189.
Add 4 codepoints commonly used to write Turkish text to our table
of 'commonly used' Unicode codepoints. These are:
• U+011F LATIN SMALL LETTER G WITH BREVE
• U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
• U+0131 LATIN SMALL LETTER DOTLESS I
• U+015F LATIN SMALL LETTER S WITH CEDILLA
When the validation logic for param->type was added, the logic did not
account for the case where param could be NULL. The existing code did
take that into account as can be seen in the `if (param)` check below.
Furthermore, phpdbg_set_breakpoint_expression even calls
phpdbg_create_conditional_break with param == NULL.
Fix it by placing the validation logic inside a NULL check.
The 'h' flag makes mb_convert_kana convert zenkaku hiragana to hankaku
katakana; 'k' makes it convert zenkaku katakana to hankaku katakana.
When working on the implementation of mb_convert_kana, I added some
additional checks to catch combinations of flags which do not make
sense; but there is no conflict between 'h' and 'k' (they control
conversions for two disjoint ranges of codepoints) and this combination
should not have been restricted.
Thanks to the GitHub user 'akira345' for reporting this problem.
Closes GH-10174.
Commit 6c25413183 added the flag ZEND_JIT_EXIT_INVALIDATE which
resets the trace handlers in zend_jit_trace_exit(), but forgot to
lock the shared memory section.
This could cause another worker process who still saw the
ZEND_JIT_TRACE_JITED flag to schedule ZEND_JIT_TRACE_STOP_LINK, but
when it arrived at the ZEND_JIT_DEBUG_TRACE_STOP, the handler was
already reverted by the first worker process and thus
zend_jit_find_trace() fails.
This in turn generated a bogus jump offset in the JITed code, crashing
the PHP process.