opaque is used by the htmlentities filter, which means that we
end up trying to free the score value as a pointer. Don't try to
be overly tricky here and simply allocate a separate structure
to hold the number of illegal characters and the score.
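
A minimal sketch of the "separate structure" idea, for illustration only. The names `detect_data`, `detect_data_new` and `detect_data_add_illegal` are hypothetical and are not taken from the mbstring source; the point is only that both counters live in one heap-allocated struct, so that whatever is stored in the filter's opaque pointer really is a pointer that can be freed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bookkeeping for one candidate encoding during detection. */
typedef struct {
    size_t num_illegalchars; /* invalid sequences seen so far */
    size_t score;            /* heuristic score for this candidate */
} detect_data;

/* Allocate zero-initialised bookkeeping. The caller would store the result
 * in the filter's opaque pointer and release it later with free(). */
detect_data *detect_data_new(void)
{
    return calloc(1, sizeof(detect_data));
}

/* Record one illegal character. */
void detect_data_add_illegal(detect_data *d)
{
    assert(d != NULL);
    d->num_illegalchars++;
}
```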
- Truncated multi-byte characters are treated as an error
- Reject GB18030 4-byte codes which translate to (non-existent)
Unicode codepoints above 0x10FFFF
- Add a number of missing mappings from the GB18030 standards
(These mappings are supported by iconv. I don't know why they were
missing from mbstring.)
- Truncated multi-byte characters are treated as an error
- Truncated or unrecognized escape sequences are treated as an error
- ASCII control characters are not allowed to appear in the middle
of a multi-byte character
- Truncated multi-byte characters are treated as an error now
- Invalid multi-byte characters are treated as an error rather than
being quietly swallowed
- ASCII control characters are not allowed to appear in the middle
of a multi-byte character
- Treat text which ends abruptly in the middle of a multi-byte
character as erroneous.
- Don't allow ASCII control characters to appear in the middle of a
multi-byte character.
- If an illegal byte appears in the middle of a multi-byte character,
go back to the initial state rather than trying to finish the
multi-byte character.
- There was a bug in the file with the conversion tables, which set the
'maximum codepoint which can be converted using table A2' using the
size of table A1, not table A2. This meant that several hundred
Unicode codepoints which should have been able to be converted to
EUC-TW were flagged as erroneous instead.
- When a sequence which cannot possibly be a prefix of a valid
multi-byte character is found, immediately flag it as an error, rather
than waiting to read more bytes first.
- Allow characters in CNS-11643 plane 1 to be encoded as 4-byte
sequences (although they can also be encoded as 2-byte sequences).
This is allowed by the standard for EUC-TW text.
- Flag truncated multi-byte characters as erroneous.
- Don't allow ASCII control characters to appear in the middle of a
multi-byte character.
- There was a bug whereby some unrecognized Unicode codepoints would be
passed through unchanged to the output when converting Unicode to
EUC-CN.
- Stick to the original EUC-CN standard, rather than CP936 (an extended
version invented by MS).
- Treat truncated multi-byte characters as an error.
- Don't allow ASCII control characters to appear in the middle of a
multi-byte character.
- There was also a bug whereby some unrecognized Unicode codepoints
would be passed through to the output unchanged when converting
Unicode to EUC-KR.
- Treat truncated multi-byte characters as an error.
- Don't allow ASCII control characters to appear in the middle of a
multi-byte character.
- Adjust some mappings to match recommendations in conversion table
from Unicode Consortium.
- Treat truncated multi-byte characters as an error.
- Don't allow ASCII control characters to appear in the middle of a
multi-byte character.
- Handle ~ escapes according to the HZ standard (RFC 1843).
- Treat unrecognized ~ escapes as an error.
- Multi-byte characters (between ~{ ~} escapes) are GB2312, not CP936.
(CP936 is an extended version from Microsoft, but the RFC does not
state that this extended version of GB should be used.)
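
The ~ escape rules for HZ can be sketched roughly as follows. This is based on a reading of RFC 1843, not on the mbstring code: in ASCII mode `~~` is a literal tilde, `~{` switches to GB mode and `~` followed by a newline is a line continuation; in GB mode only `~}` is accepted. The names `hz_mode` and `hz_escape` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { HZ_ASCII, HZ_GB } hz_mode;

/* Handle the byte that follows a '~'.
 * Returns  1 if a literal '~' should be emitted,
 *          0 if the escape was consumed with no output,
 *         -1 if the escape is not recognized (treated as an error). */
int hz_escape(hz_mode *mode, unsigned char next)
{
    assert(mode != NULL);
    if (*mode == HZ_ASCII) {
        switch (next) {
        case '~':  return 1;                 /* "~~" is a literal tilde */
        case '{':  *mode = HZ_GB; return 0;  /* "~{" enters GB mode */
        case '\n': return 0;                 /* "~\n" is a line continuation */
        default:   return -1;
        }
    }
    /* In GB mode, the only escape in this sketch is "~}" (back to ASCII). */
    if (next == '}') {
        *mode = HZ_ASCII;
        return 0;
    }
    return -1;
}
```

Whether a truncated escape (input ending right after `~`) is reported by this function or by the caller is a design choice; here it would be the caller's job.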
Previously, mbstring would accept a lot of things which were not valid
UHC text. No more.
- Don't allow single-byte control characters to appear where the 2nd
byte of a multi-byte character should be.
- Validate that the 2nd byte of a multi-byte character is in the
expected range.
- Treat it as an error if a multi-byte character is truncated.
Also add a test suite to confirm that UHC conversion (both to and from
Unicode) works according to spec.
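
As an illustration of the three checks listed above, here is a small sketch of validating one 2-byte UHC character. The byte ranges are an assumption based on the usual description of CP949 (lead byte 0x81-0xFE; trail byte 0x41-0x5A, 0x61-0x7A or 0x81-0xFE) and should be checked against the spec; `uhc_pair_ok` is a hypothetical name, not an mbstring function.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if p[0..1] looks like a well-formed 2-byte UHC character,
 * 0 if it is truncated or either byte is out of range.
 * Ranges are assumed from the common CP949 layout. */
int uhc_pair_ok(const unsigned char *p, size_t len)
{
    assert(p != NULL);
    if (len < 2)
        return 0;                          /* truncated character */
    if (p[0] < 0x81 || p[0] > 0xFE)
        return 0;                          /* not a lead byte */

    unsigned char t = p[1];
    if (t >= 0x41 && t <= 0x5A) return 1;
    if (t >= 0x61 && t <= 0x7A) return 1;
    if (t >= 0x81 && t <= 0xFE) return 1;
    return 0;                              /* control char or other bad trail byte */
}
```

A pair that passes this check may still have no mapping to Unicode; that is a separate table lookup.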
Sigh. Double sigh. After fruitlessly searching the Internet for information on
this mysterious text encoding called "SJIS-open", I wrote a script to try
converting every Unicode codepoint from 0-0xFFFF and compare the results from
different variants of Shift-JIS, to see which one "SJIS-open" would be most
similar to.
The result? It's just CP932. There is no difference at all. So why do we have
two implementations of CP932 in mbstring?
In case somebody, somewhere is using "SJIS-open" (or its aliases "SJIS-win" or
"SJIS-ms"), add these as aliases to CP932 so existing code will continue to
work.
Also fix a couple of small problems with UTF-32 and UTF-8 support:
- UTF-32 would pass very large codepoints (>= 0x80000000), which are
not valid.
- UTF-8 would sometimes emit two error marker characters for a single
bad input byte.
Updates the deprecation message for implicit incompatible float to int conversion from:
```
Implicit conversion from non-compatible float %.*H to int in %s on line %d
```
to
```
Implicit conversion from float %.*H to int loses precision in %s on line %d
```
Related: #6661
1. Update http://www.php.net/license/3_01.txt to https, since the server already redirects to https via a "Location:" header.
2. Update a few references to license 3.0 to 3.01, as 3.0 states "php 5.1.1, 4.1.1, and earlier".
3. Some license comments say "at through the world-wide-web" while most omit the "at", so remove it.
4. Fix indentation before | in some files.
- Treat it as an error if a multi-byte string or escape sequence is truncated
- Don't allow 'control' characters or escape sequences to appear in the middle
of a multi-byte char
As with ISO-2022-JP-KDDI, the main reference used to develop the tests was
the behavior of the existing code. It would have been better to have some
independent reference which we could cross-check our code against, but I
couldn't find one.
- Treat it as an error if a multi-byte character or escape sequence is truncated
- When converting other encodings to ISO-2022-JP-KDDI, don't swallow trailing
hash characters or digits
- Don't allow 'control' characters to appear in the middle of a multi-byte char
Note: I was not able to find any kind of official or even semi-official
specification for this legacy encoding. Therefore, the test suite for
ISO-2022-JP-KDDI is based largely on the behavior of the existing code.
Verifying the correctness of program code in this way is very questionable.
In a sense, all you are proving is that the code "does what it does". However,
the test suite will still expose any unintended _changes_ to behavior.
To detect errors in conversion from Unicode to another text encoding, each
mbstring conversion filter object maintains a count of 'bad' characters. After
a conversion operation finishes, this count is checked to see if there was any
error.
The problem with CP50220 was that mbstring used a chain of two conversion filter
objects. The 'bad character count' would be incremented on the second object in
the chain, but this didn't do anything, as only the count on the first such
object is ever checked.
Fix this by implementing the conversion using a single conversion filter object,
rather than a chain of two. This is possible because of the recent refactoring,
which pulled out the needed logic for CP50220 conversion into a helper function.
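
The failure mode can be shown with a stripped-down model of a filter chain. The struct and function names below (`conv_filter`, `head_reports_error`, `chain_has_error`) are hypothetical and much simpler than the real mbstring types; the sketch only shows why an error counted on the second object is invisible to a caller that looks at the first.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified conversion filter: a link to the next filter
 * in the chain plus a count of 'bad' characters. */
typedef struct conv_filter {
    struct conv_filter *next;
    int num_illegalchar;
} conv_filter;

/* What the caller effectively did: inspect only the head of the chain. */
int head_reports_error(const conv_filter *head)
{
    assert(head != NULL);
    return head->num_illegalchar > 0;
}

/* What would be needed to see an error anywhere in the chain. */
int chain_has_error(const conv_filter *head)
{
    assert(head != NULL);
    for (const conv_filter *f = head; f != NULL; f = f->next) {
        if (f->num_illegalchar > 0)
            return 1;
    }
    return 0;
}
```

With a single filter object there is only one counter, so the head-only check is sufficient and no chain walk is needed.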
There's no need to dynamically allocate a struct to hold the 'mode' parameter;
just store it directly in `filt->opaque`. Some other things were also being done
in an unnecessarily roundabout way.
Also, the 'copy' function for CP50220 conversion filters was *both* broken
and unnecessary. Broken, because it malloc'd memory which was never freed by
anything. Unnecessary, because the point of the copy is so that various
algorithms can try running bytes through a conversion filter and see how many
output bytes or characters result, and then back out by restoring the filters
to their previous state. But here's the thing: CP50220 conversion filters don't
hold cached bytes, which is the main thing which would need to be restored to a
previous state.
This function pointer is only called when initializing the struct. After that
nothing is done with it. Therefore, there is no need to keep it in the struct.