In b5ff87ca71, I made a number of adjustments to our conversion code
for CP1252. One of the adjustments was to make the mappings match those
published by the Unicode Consortium in the file CP1252.TXT. These do
not include mappings for the CP1252 bytes 0x81, 0x8D, 0x8F, 0x90, and
0x9D.
Rostyslav Gulka reported that this caused a problem. His application
stores binary JPEG data in an MS-SQL database. When they SELECT the
binary data out of the database, it is treated as CP1252 text and
automatically converted to UTF-8. To recover the original binary
data, they then do a conversion from UTF-8 to CP1252.
Obviously, that does not work if certain CP1252 bytes do not map to
any Unicode codepoint at all.
While this is a very unusual application of text encoding conversion,
and we might choose not to support it if there was no other basis for
including those mappings, it seems that Microsoft does actually include
them in the Win32 API as "best fit" mappings. These are extra mappings
from Unicode to other text encodings, which the Win32 API function
WideCharToMultiByte uses by default unless the WC_NO_BEST_FIT_CHARS
flag was passed.
A list of these "best fit" mappings for CP1252 can be found here:
https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt
A previous fix[1] was not sufficient to catch all potential file URIs,
because the patch did not cater to URL encoding. Properly parsing and
decoding the URI may yield a different result than the handling of
SQLite3, so we play it safe, and reject any file URIs if open_basedir
is configured.
[1] <https://bugs.php.net/bug.php?id=77967>
Closes GH-10018.
There are two issues to resolve:
1. The FCC is not refetch when trying to unregister a trampoline
2. Comparing the function pointer of trampolines is meaningless as they are reallocated, thus we need to compare the name of the function
Found while working on GH-8294
Closes GH-10033
Dialect 1 databases store and transfer `NUMERIC(15,2)` values as
doubles, which we need to cater to in `firebird_stmt_get_col()` to
avoid `ZEND_ASSUME(0)` to ever be triggered, since that may result
in undefined behavior.
Since adding a regression test would require to create a dialect 1
database, we go without it.
Closes GH-10021.
For JIS encoding, hiragana and katakana can be input in multiple forms.
One form uses JISX 0201 escape sequences. Another is called 'GR-invoked'
kana.
In the context of ISO-2022 encoding, bytes with a zero bit in the MSB
are called "GL" (or "graphics left") and those with the MSB set are
called "GR" (or "graphics right"). Regarding the variants of
ISO-2022-JP which are called "JIS7" and "JIS8", Wikipedia states:
"Other, older variants known as JIS7 and JIS8 build directly on the
7-bit and 8-bit encodings defined by JIS X 0201 and allow use of JIS X
0201 kana from G1 without escape sequences, using Shift Out and Shift
In or setting the eighth bit (GR-invoked), respectively."
In harmony with this, we have always accepted bytes from 0xA3-0xDF and
decoded them to the corresponding hiragana/katakana. However, at some
point I accidentally broke output for these kana. You can see the
problem in 3v4l.org by running this program:
<?php
echo bin2hex(mb_convert_encoding("\xA3", 'JIS', 'JIS'));
The results are:
Output for 8.2rc1 - rc3
1b244200231b2842
Output for 7.4.0 - 7.4.33, 8.0.1 - 8.0.25, 8.1.12
1b2849231b2842
Output for 8.1.0 - 8.1.11
1b284923
You can see that from 8.1.0 - 8.1.11, there was a missing escape
sequence at the end. That was caused because the flush functions were
not being called properly, and has already been fixed. However, this
also shows that the output for 8.2rc1-rc3 is completely invalid.
It is trying to output a JISX 0208 sequence, but with 0x00 as one of
the JISX 0208 bytes, which is illegal.
Add the missing code which will make the new text conversion filters
behave the same as the old ones when outputting hiragana/katakana in
JIS encoding.
We need to overwrite the __toString magic method for SplFileObject, similarly to how DirectoryIterator overwrites it
Moreover, the custom cast handler is useless as we define __toString methods, so use the standard one instead.
Closes GH-9912
Closure::call() makes a temporary copy of original closure function, modifies its
scope, resets ZEND_ACC_CLOSURE flag and call it through zend_call_function().
As result the same function may be called with and without
ZEND_ACC_CLOSURE flag, that confuses JIT and may lead to memory leak or
even worse memory errors.
The patch allocates "fake" closure object and keep ZEND_ACC_CLOSURE flag
to always behave in the same way.
This bug was found when I was fuzzing a patch related to mb_strpos.
In some cases, the legacy text conversion code for UTF-7 (and
UTF7-IMAP) would correctly recognize an error for a Base64-encoded
section which was not correctly padded with zero bits, but the new
(and faster) text conversion code would not.
Specifically, if the input string ended abruptly after the 4th or 7th
byte of a Base64-encoded section, the new conversion code would
confirm that the trailing padding bits from the previous byte (3rd or
6th) were zeroes, but would not check whether the 4th or 7th byte
itself encoded any non-zero bits. The legacy conversion code did
perform this check and would treat the input string as invalid.
Actually, even if the 4th or 7th byte does encode only (padding) zero
bits, this is still a problem, because there is no reason to have a
4th (or 7th) byte in that case. The UTF-7 string should have ended
on the previous byte instead.
Apply the same fix for both UTF-7 and UTF7-IMAP.
Some of the legacy text encodings which were used in this regression
test are deprecated in PHP-8.2+. The deprecation warnings break the
expected output. Since using these encodings in mbstring is now
deprecated, I think there is little point in keeping them in this test.
So they are now removed from it.
Further, in 219fff376b, I made a change to avoid a situation where the
legacy UTF7-IMAP conversion code gets stuck in a wrong state when its
attempt to emit a character fails. When a Base64-encoded section of
input ended with -, the previous code would FIRST emit a character if
necessary (using the CK or "check" macro, which causes the function to
return immediately if the downstream filter function returns an error
code), and THEN update its own state to indicate that it is now in
ASCII rather than Base64 mode.
If the downstream filter function returned an error code, the CK macro
would then cause the UTF7-IMAP filter function to return immediately
WITHOUT setting its own state to indicate that the Base64-encoded
section was done.
I fixed this by updating the filter state as needed BEFORE calling CK...
but I missed updating the filter state in the case where the Base64
section ends normally and there is no need to emit anything.
Again, in 6d525a425e, I modified the legacy conversion code for
ISO-2022-KR to try to comply more closely with the RFC for this
text encoding. The RFC states that before any occurrence of 'Shift In'
or 'Shift Out' codes in a ISO-2022-KR string, a special escape
sequence must appear at least ONCE, at the beginning of a line.
The previous code did not comply with this requirement. I made it
comply by always emitting this escape sequence at the beginning of
the first line.
Since mb_strcut (wrongly) determines when it has consumed enough of
the input string by looking at the length of its output in bytes, this
extra escape sequence makes mb_strcut consume 4 bytes less of an
ISO-2022-KR string than would otherwise be the case. When this
strange behavior of mb_strcut is fixed, this test will have to be
adjusted to restore the previous expected outputs for ISO-2022-KR.
The existing implementation of mb_strcut extracts part of a
multi-byte encoded string by pulling out raw bytes and then running
them through a conversion filter to ensure that the output is valid
in the requested encoding.
If the conversion filter emits error markers when doing the final
'flush' operation which ends the conversion of the extracted bytes,
these error markers may (in some cases) be included in the output.
The conversion operation does not respect the value of
mb_substitute_character; rather, it always uses '?' as an error marker.
So this issue manifests itself as unwanted '?' characters being
inserted into the output.
This issue has existed for a long time, but became noticeable in PHP
8.1 because for at least some of the supported text encodings, mbstring
is now more strict about emitting error markers when strings end in an
illegal state.
The simplest fix is to suppress error markers during the final flush
operation.
While working on a fix for this problem, another problem with mb_strcut
was discovered; since it decides when to stop consuming bytes from
the input by looking at the byte length of its OUTPUT, anything which
causes extra bytes to be emitted to the output may cause mb_strcut to
not consume all the bytes in the requested range.
The one case where we DO emit extra output bytes is for encodings
which have a selectable mode, like ISO-2022-JP; if a string in such
an encoding ends in a mode which is not the default, we emit an ending
escape sequence which changes back to the default mode. This is done
so that concatenating strings in such encodings is safe.
However, as mentioned, this can cause the output of mb_strcut to be
shorter than it logically should be. This bug has existed for a long
time, and fixing it now will be a BC break, so we may not fix it right
away.
Therefore, tests for THIS fix which don't pass because of that OTHER
bug have been split out into a separate test file (gh9535b.phpt), and
that file has been marked XFAIL.
This occurs when the handle is different from the current handle (e.g. copy of the .so file), hence the existing test did not catch that particular case.