Previously, mbstring would accept a lot of things which were not valid
UHC text. No more.
- Don't allow single-byte control characters to appear where the 2nd
byte of a multi-byte character should be.
- Validate that the 2nd byte of a multi-byte character is in the
expected range.
- Treat it as an error if a multi-byte character is truncated.
Also add a test suite to confirm that UHC conversion (both to and from
Unicode) works according to spec.
Sigh. Double sigh. After fruitlessly searching the Internet for information on
this mysterious text encoding called "SJIS-open", I wrote a script to try
converting every Unicode codepoint from 0-0xFFFF and compare the results from
different variants of Shift-JIS, to see which one "SJIS-open" would be most
similar to.
The result? It's just CP932. There is no difference at all. So why do we have
two implementations of CP932 in mbstring?
In case somebody, somewhere is using "SJIS-open" (or its aliases "SJIS-win" or
"SJIS-ms"), add these as aliases to CP932 so existing code will continue to
work.
Also fix a couple small problems with UTF-32 and UTF-8 support:
- UTF-32 would pass very large codepoints (>= 0x80000000), which are
not valid.
- UTF-8 would sometimes emit two error marker characters for a single
bad input byte.
Updates the deprecation message for implicit incompatible float to int conversion from:
```
Implicit conversion from non-compatible float %.*H to int in %s on line %d
```
to
```
Implicit conversion from float %.*H to int loses precision in %s on line %d
```
Related: #6661
1. Update: http://www.php.net/license/3_01.txt to https, as there is anyway server header "Location:" to https.
2. Update few license 3.0 to 3.01 as 3.0 states "php 5.1.1, 4.1.1, and earlier".
3. In some license comments is "at through the world-wide-web" while most is without "at", so deleted.
4. fixed indentation in some files before |
- Treat it as error if multi-byte string or escape sequence is truncated
- Don't allow 'control' characters or escape sequences to appear in the middle
of a multi-byte char
As with ISO-2022-JP-KDDI, the main reference used to develop the tests was
the behavior of the existing code. It would have been better to have some
independent reference which we could cross-check our code against, but I
couldn't find one.
- Treat it as an error if a multi-byte character or escape sequence is truncated
- When converting other encodings to ISO-2022-JP-KDDI, don't swallow trailing
hash characters or digits
- Don't allow 'control' characters to appear in the middle of a multi-byte char
Note: I was not able to find any kind of official or even semi-official
specification for this legacy encoding. Therefore, the test suite for
ISO-2022-JP-KDDI is based largely on the behavior of the existing code.
Verifying the correctness of program code in this way is very questionable.
In a sense, all you are proving is that the code "does what it does". However,
the test suite will still expose any unintended _changes_ to behavior.
To detect errors in conversion from Unicode to another text encoding, each
mbstring conversion filter object maintains a count of 'bad' characters. After
a conversion operation finishes, this count is checked to see if there was any
error.
The problem with CP50220 was that mbstring used a chain of two conversion filter
objects. The 'bad character count' would be incremented on the second object in
the chain, but this didn't do anything, as only the count on the first such
object is ever checked.
Fix this by implementing the conversion using a single conversion filter object,
rather than a chain of two. This is possible because of the recent refactoring,
which pulled out the needed logic for CP50220 conversion into a helper function.
There's no need to dynamically allocate a struct to hold the 'mode' parameter;
just store it directly in `filt->opaque`. Some other things were also being done
in an unnecessarily roundabout way.
Also, the 'copy' function for CP50220 conversion filters was *both* broken
and unnecessary. Broken, because it malloc'd memory which was never freed by
anything. Unnecessary, because the point of the copy is so that various
algorithms can try running bytes through a conversion filter and see how many
output bytes or characters result, and then back out by restoring the filters
to their previous state. But here's the thing; CP50220 conversion filters don't
hold cached bytes, which is the main thing which would need to be restored to a
previous state.
This function pointer is only called when initializing the struct. After that
nothing is done with it. Therefore, there is no need to keep it in the struct.
This constructor function doesn't do anything different than the generic one.
There's no need to invoke it, either, when initializing a CP50220 conversion
filter.
This deprecates passing null to non-nullable scale arguments of
internal functions, with the eventual goal of making the behavior
consistent with userland functions, where null is never accepted
for non-nullable arguments.
This change is expected to cause quite a lot of fallout. In most
cases, calling code should be adjusted to avoid passing null. In
some cases, PHP should be adjusted to make some function arguments
nullable. I have already fixed a number of functions before landing
this, but feel free to file a bug if you encounter a function that
doesn't accept null, but probably should. (The rule of thumb for
this to be applicable is that the function must have special behavior
for 0 or "", which is distinct from the natural behavior of the
parameter.)
RFC: https://wiki.php.net/rfc/deprecate_null_to_scalar_internal_arg
Closes GH-6475.
Instead of manually maintaining the data in eaw_table.h, it is now automatically
generated by ucgendat/ucgendat.php, using the EastAsianWidth.txt file from
the Unicode Consortium.
Something must be said about the deleted test case. Back in 2004, someone
noticed that `mb_strwidth` didn't comply with Unicode 4.0. A test case was
added to expose the problem. Well, time keeps moving on, and with the changing
years, new Unicodes are born and old Unicodes die. Some characters which were
counted as double-width in Unicode 4.0 are no longer such in Unicode 13.0,
which renders the test case obsolete.
At the same time, make a couple of spelling/grammar fixes in ucgendat.php.
This flag indicated that an encoding was 'multi-byte'; it can use a variable
number of bytes to encode each character. As it turns out, we don't actually
need to check this flag anywhere, so it's better to remove it.
MicroSoft invented three encodings very similar to ISO-2022-JP/JIS7/JIS8, called
CP50220, CP50221, and CP50222. All three are supported by mbstring.
Since these encodings are very similar, some code can be shared. Actually,
conversion of CP50220/1/2 to Unicode is exactly the same operation; it's when
converting from Unicode to CP50220/1/2 that some small differences arise in how
certain katakana are handled.
The most important common code was a function called `mbfl_filt_wchar_jis_ms`.
The `jis_ms` part doubtless refers to the fact that these encodings are modified
versions of 'JIS' invented by 'MS'. mbstring also went a step further and exported
'JIS-ms' to userland as a separate encoding from CP50220/1/2. If users requested
'JIS-ms' conversion, they got something like CP50220/1/2, minus their special
ways of handling half-width katakana when converting from Unicode.
But... that 'encoding' is not something which actually exists in the world outside
of mbstring. CP50220/1/2 do exist in MicroSoft software, but not 'JIS-ms'.
For a text encoding conversion library, inventing new variant encodings and
implementing them is not very productive. Our interest is in handling text
encodings which real people actually use for... you know, storing actual text
and things like that.
CP50220 is a variant of ISO-2022-JP invented by MicroSoft, which handles some
Unicode characters which are not representable in ISO-2022-JP by converting
them to similar characters which are representable.
What, then, is CP50220-raw? An Internet search turns up absolutely nothing.
Reference works which I consulted don't say anything about it. Other text
conversion libraries don't support it.
From looking at the code: It's just the same as CP50220, but it accepts
unmapped JIS X 0208 characters passed through from other Japanese encodings
and silently encodes them using the usual ISO-2022-JP escape sequence and
representation for JIS X 0208 characters.
It's hard to see how this could be useful. OK, let me come out and say it:
it's _not_ useful. We can confidently jettison this (mis)feature.
We're starting to see a mix between uses of zend_bool and bool.
Replace all usages with the standard bool type everywhere.
Of course, zend_bool is retained as an alias.
debug_zval_dump() currently prints refcount 1 for interned strings
and arrays, which does not really reflect the truth. These values
are not refcounted, so the refcount is misleading. Instead print
an "interned" tag.
Closes GH-6598.