- Truncated multi-byte characters are treated as an error
- Reject GB18030 4-byte codes which translate to (non-existent)
Unicode codepoints above 0x10FFFF
- Add a number of missing mappings from the GB18030 standards
(These mappings are supported by iconv. I don't know why they were
missing from mbstring.)
- Truncated multi-byte characters are treated as an error
- Truncated or unrecognized escape sequences are treated as an error
- ASCII control characters are not allowed to appear in the middle
of a multi-byte character
- Truncated multi-byte characters are treated as an error now
- Invalid multi-byte characters are treated as an error rather than
being quietly swallowed
- ASCII control characters are not allowed to appear in the middle
of a multi-byte character
- Treat text which ends abruptly in the middle of a multi-byte
character as erroneous.
- Don't allow ASCII control characters to appear in the middle of a
multi-byte character.
- If an illegal byte appears in the middle of a multi-byte character,
go back to the initial state rather than trying to finish the
multi-byte character.
- There was a bug in the file with the conversion tables, which set the
'maximum codepoint which can be converted using table A2' using the
size of table A1, not table A2. This meant that several hundred
Unicode codepoints which should have been able to be converted to
EUC-TW were flagged as erroneous instead.
- When a sequence which cannot possibly be a prefix of a valid
multi-byte character is found, immediately flag it as an error, rather
than waiting to read more bytes first.
- Allow characters in CNS-11643 plane 1 to be encoded as 4-byte
sequences (although they can also be encoded as 2-byte sequences).
This is allowed by the standard for EUC-TW text.
- Flag truncated multi-byte characters as erroneous.
- Don't allow ASCII control characters to appear in the middle of a
multi-byte character.
- There was a bug whereby some unrecognized Unicode codepoints would be
passed through unchanged to the output when converting Unicode to
EUC-CN.
- Stick to the original EUC-CN standard, rather than CP936 (an extended
version invented by MS).
- Treat truncated multi-byte characters as an error.
- Don't allow ASCII control characters to appear in the middle of a
multi-byte character.
- There was also a bug whereby some unrecognized Unicode codepoints
would be passed through to the output unchanged when converting
Unicode to EUC-KR.
- Treat truncated multi-byte characters as an error.
- Don't allow ASCII control characters to appear in the middle of a
multi-byte character.
- Adjust some mappings to match the recommendations in the conversion table
from the Unicode Consortium.
- Treat truncated multi-byte characters as an error.
- Don't allow ASCII control characters to appear in the middle of a
multi-byte character.
- Handle ~ escapes according to the HZ standard (RFC 1843).
- Treat unrecognized ~ escapes as an error.
- Multi-byte characters (between ~{ ~} escapes) are GB2312, not CP936.
(CP936 is an extended version from Microsoft, but the RFC does not
state that this extended version of GB should be used.)
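As a rough illustration of the `~` escape rules in the last three items, the sketch below tokenizes HZ text according to my reading of RFC 1843: `~~` is a literal tilde, `~{` and `~}` switch into and out of two-byte GB mode, `~` followed by a newline is a line continuation, and any other `~` escape is an error. It is Python rather than mbstring's C, `hz_tokens` and the token names are made up for the example, and it does not look anything up in GB2312.

```python
def hz_tokens(data: bytes):
    """Split HZ-encoded bytes into (kind, payload) tokens.

    Illustrative only. Kinds are 'ascii', 'gb' (a raw 2-byte pair that is
    NOT checked against the GB2312 table) and 'error'.
    """
    tokens = []
    gb_mode = False
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if not gb_mode:
            if b != 0x7E:                # anything but '~' is plain ASCII here
                tokens.append(("ascii", bytes([b])))
                i += 1
                continue
            if i + 1 >= n:               # text ends right after '~'
                tokens.append(("error", data[i:]))
                break
            nxt = data[i + 1]
            if nxt == 0x7E:              # '~~' encodes a literal '~'
                tokens.append(("ascii", b"~"))
            elif nxt == 0x7B:            # '~{' enters GB mode
                gb_mode = True
            elif nxt == 0x0A:            # '~' + newline: continuation, emits nothing
                pass
            else:                        # unrecognized escape
                tokens.append(("error", data[i:i + 2]))
            i += 2
        else:
            if b == 0x7E and i + 1 < n and data[i + 1] == 0x7D:
                gb_mode = False          # '~}' leaves GB mode
                i += 2
            elif i + 1 >= n:             # truncated two-byte character
                tokens.append(("error", data[i:]))
                break
            elif 0x21 <= b <= 0x7D and 0x21 <= data[i + 1] <= 0x7E:
                tokens.append(("gb", data[i:i + 2]))
                i += 2
            else:                        # control byte etc. inside a character
                tokens.append(("error", data[i:i + 1]))
                i += 1                   # rescan from the next byte
    return tokens
```

For example, `hz_tokens(b"~x")` reports the unrecognized escape as an error instead of passing it through.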
Previously, mbstring would accept a lot of things which were not valid
UHC text. No more.
- Don't allow single-byte control characters to appear where the 2nd
byte of a multi-byte character should be.
- Validate that the 2nd byte of a multi-byte character is in the
expected range.
- Treat it as an error if a multi-byte character is truncated.
Also add a test suite to confirm that UHC conversion (both to and from
Unicode) works according to spec.
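The three checks listed above can be sketched as a purely structural validator. This is Python, not the mbstring implementation, and the byte ranges are an assumption based on my understanding of the UHC (CP949) layout: lead bytes 0x81-0xFE, trail bytes 0x41-0x5A, 0x61-0x7A and 0x81-0xFE. `uhc_errors` is a hypothetical name, and the sketch does not check whether a well-formed pair actually maps to a character.

```python
def uhc_errors(data: bytes):
    """Return (offset, reason) pairs for structurally invalid UHC input."""

    def is_trail(b: int) -> bool:
        # Assumed trail-byte ranges; see the note above.
        return 0x41 <= b <= 0x5A or 0x61 <= b <= 0x7A or 0x81 <= b <= 0xFE

    errors = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                      # single-byte character
            i += 1
        elif b == 0x80 or b == 0xFF:      # cannot start a character
            errors.append((i, "invalid lead byte"))
            i += 1
        elif i + 1 == len(data):          # input ends after the lead byte
            errors.append((i, "truncated character"))
            i += 1
        elif not is_trail(data[i + 1]):   # e.g. a control character
            errors.append((i, "invalid trail byte"))
            i += 1                        # rescan the offending byte on its own
        else:
            i += 2
    return errors
```

Rescanning the offending byte means a control character that interrupts a multi-byte character is still seen as a normal single-byte character afterwards.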
Previously, in ISO-2022-JP/JIS7/JIS8, if an escape sequence (starting with 0x1B)
appeared where the 2nd byte of a multibyte character should have been, mbstring
would forget all about the truncated multibyte character and happily accept the
escape sequence. However, such sequences are not legal and should be flagged as
errors.
Also, any other illegal bytes appearing where the 2nd byte of a multibyte
character was expected were just passed through quietly to the output. Fix that.
Also add a test suite for both ISO-2022-JP and JIS7/JIS8. (These are extremely
similar encodings; JIS7 and JIS8 are variants of ISO-2022-JP. mbstring's 'JIS'
is actually a combination of JIS7 _and_ JIS8, since the extensions which each
one adds to ISO-2022-JP are disjoint.)
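To make the escape-sequence case concrete, here is a minimal Python sketch of a scanner that reports a lone lead byte as an error when an escape sequence (or any other byte outside 0x21-0x7E) turns up where the 2nd byte should be. It is not mbstring's code: `scan_iso2022jp` is a hypothetical name, only the `ESC $ B` (JIS X 0208) and `ESC ( B` (ASCII) escapes are recognized, and no character table lookup is done.

```python
ESC = 0x1B

def scan_iso2022jp(data: bytes):
    """Return ('char', bytes) and ('error', bytes) events for the input."""
    events = []
    two_byte = False
    i = 0
    while i < len(data):
        if data[i] == ESC:
            seq = data[i:i + 3]
            if seq == b"\x1b$B":          # switch to JIS X 0208 (two-byte mode)
                two_byte = True
                i += 3
            elif seq == b"\x1b(B":        # switch back to ASCII
                two_byte = False
                i += 3
            else:                         # truncated or unrecognized escape
                events.append(("error", data[i:i + 1]))
                i += 1
            continue
        if not two_byte:
            events.append(("char", data[i:i + 1]))
            i += 1
            continue
        pair = data[i:i + 2]
        if len(pair) == 2 and all(0x21 <= b <= 0x7E for b in pair):
            events.append(("char", pair))
            i += 2
        else:
            # Truncated, or the next byte (ESC, control, ...) cannot be a
            # 2nd byte. Report the lead byte, then rescan from the next byte
            # so a following escape sequence is still honored.
            events.append(("error", data[i:i + 1]))
            i += 1
    return events
```

For simplicity the sketch also rejects control characters in lead position while in two-byte mode, which is stricter than the text above requires.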
Lots of problems here.
- Don't pass 'control' characters through silently in the middle of a
multi-byte character.
- Treat it as an error if a multi-byte character is truncated.
- For ESC sequences used to encode emoji on earlier Softbank phones, if an
invalid ESC sequence is found, don't pass it through. Rather, handle it as
an error and respect `mb_substitute_character`.
- In ranges used by mobile vendors for emoji, if a certain byte sequence
doesn't map to any emoji, don't emit a mangled value (actually a raw
(ku*94)+ten value, which may not even be a valid Unicode codepoint at all).
- When converting Unicode to SJIS-Mobile, don't mangle codepoints which fall
in the 2nd range of Microsoft vendor extensions.
Some vendor-specific emoji have been mapped to standard Unicode codepoints
now, rather than 'private use area' codepoints. When the legacy code was
written, these codepoints may not have existed yet in the Unicode standard
which was current at that time.
Also do a major code cleanup -- remove dead code, rearrange what is left,
use some new macros and helper functions to make the code clearer...
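The "raw (ku*94)+ten value" problem from the list above comes down to what happens on a table miss. A minimal Python sketch, assuming a zero-based row (`ku`) and cell (`ten`) and using a deliberately tiny, made-up table (the single entry is a placeholder, not a real vendor mapping):

```python
# Hypothetical stand-in for a vendor emoji table:
# linear kuten index -> Unicode codepoint. The entry is a placeholder.
EXAMPLE_TABLE = {96: 0x2600}

def kuten_to_codepoint(ku: int, ten: int, table=EXAMPLE_TABLE):
    """Map a (ku, ten) pair to a codepoint, or None if there is no mapping."""
    index = ku * 94 + ten
    # Returning the raw index on a miss would reproduce the mangled output
    # described above; None lets the caller raise a conversion error instead.
    return table.get(index)
```

A caller would treat `None` as an invalid character and apply whatever `mb_substitute_character` specifies.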
- Don't allow control characters to appear in the middle of a multi-byte
character. (This was a strange feature of mbstring; it doesn't make much
sense, and iconv doesn't allow it.)
- Treat truncated multi-byte characters as an error.
- Don't allow control characters to appear in the middle of a multi-byte
character. (A strange feature, or perhaps misfeature, of mbstring which is
not present in other libraries such as iconv.)
- When checking whether a string is valid, reject kuten codes which do not
map to any character, whether converting from EUC-JP to another encoding,
or converting another encoding which uses JIS X 0208/0212 charsets to
EUC-JP.
- Truncated multi-byte characters are treated as an error.
- Reject otherwise valid kuten codes which don't map to anything in JIS X 0208.
- Handle truncated multi-byte characters as an error.
- Convert Shift-JIS 0x7E to Unicode 0x203E (overline) as recommended by the
Unicode Consortium, and as iconv does.
- Convert Shift-JIS 0x5C to Unicode 0xA5 (yen sign) as recommended by the
Unicode Consortium, and as iconv does.
(NOTE: This will affect PHP scripts which use an internal encoding of
Shift-JIS! PHP assigns a special meaning to 0x5C, the backslash. For example,
it is used for escapes in double-quoted strings. Mapping the Shift-JIS yen
sign to the Unicode yen sign means the yen sign will not be usable for
C escapes in double-quoted strings. Japanese PHP programmers who want to
write their source code in Shift-JIS for some strange reason will have to
use the JIS X 0208 backslash or 'REVERSE SOLIDUS' character for their C
escapes.)
- Convert Unicode 0x5C (backslash) to Shift-JIS 0x815F (reverse solidus).
- Immediately flag an error if the first Shift-JIS byte is over 0xEF, rather than
waiting to see the next byte. (Previously, the value used was 0xFC, which is
the limit for the 2nd byte and not the 1st byte of a multi-byte character.)
- Don't allow 'control characters' to appear in the middle of a multi-byte
character.
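The byte-range and single-byte mapping rules in the list above can be summarized in a short Python sketch. The 0xEF and 0xFC limits and the 0x5C/0x7E mappings come from the items above; the remaining ranges (lead bytes 0x81-0x9F and 0xE0-0xEF, trail bytes 0x40-0x7E and 0x80-0xFC, half-width katakana at 0xA1-0xDF, and 0x80/0xA0 treated as errors) are my assumptions about plain Shift-JIS, and the function names are hypothetical.

```python
# Single-byte overrides named in the list above (yen sign and overline).
SINGLE_BYTE_TO_UNICODE = {0x5C: 0x00A5, 0x7E: 0x203E}

def classify_first_byte(b: int) -> str:
    """Classify a byte that appears at the start of a character."""
    if b <= 0x7F or 0xA1 <= b <= 0xDF:
        return "single"
    if 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF:
        return "lead"
    # Anything over 0xEF is rejected at once, without reading a 2nd byte.
    return "error"

def is_valid_trail_byte(b: int) -> bool:
    """0xFC is the upper limit for the 2nd byte, not for the 1st."""
    return 0x40 <= b <= 0x7E or 0x80 <= b <= 0xFC

def decode_single(b: int) -> int:
    """Convert a single-byte Shift-JIS character to a Unicode codepoint."""
    if 0xA1 <= b <= 0xDF:                 # half-width katakana block
        return 0xFF61 + (b - 0xA1)
    return SINGLE_BYTE_TO_UNICODE.get(b, b)
```

Keeping the two limits in separate functions makes it harder to confuse the 1st-byte bound with the 2nd-byte bound again.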
The test case for bug 47399 is now obsolete. That test assumed that a number
of Shift-JIS byte sequences which don't map to any character were 'valid'
(because the byte values were within the legal ranges).
Also remove a bogus test (bug62545.phpt) which wrongly assumed that all invalid
characters in CP1251 and CP1252 should map to Unicode 0xFFFD (REPLACEMENT
CHARACTER).
mbstring has an interface to specify what invalid characters should be
replaced with; it's called `mb_substitute_character`. If a user wants to see
the Unicode 'replacement character', they can specify that using
`mb_substitute_character`. But if they specify something else, we should
follow that.
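The same policy can be illustrated outside PHP. The Python sketch below is only an analogy, not mbstring code: it registers two error handlers with `codecs.register_error`, so the caller, not the decoder, decides what replaces an invalid byte. The handler names `sub_question` and `sub_fffd` are made up, and the example relies on byte 0x81 being unassigned in Python's `cp1252` codec.

```python
import codecs

def make_substitute_handler(substitute: str):
    """Build a decode error handler that inserts a caller-chosen string."""
    def handler(exc):
        if isinstance(exc, UnicodeDecodeError):
            # Replace the offending bytes and resume right after them.
            return (substitute, exc.end)
        raise exc
    return handler

# Two policies: the user's own choice, and the Unicode replacement character.
codecs.register_error("sub_question", make_substitute_handler("?"))
codecs.register_error("sub_fffd", make_substitute_handler("\ufffd"))

def decode_cp1252(data: bytes, policy: str) -> str:
    """Decode CP1252, substituting invalid bytes according to `policy`."""
    return data.decode("cp1252", errors=policy)
```

Decoding `b"a\x81b"` with `sub_question` inserts `?`, and with `sub_fffd` inserts U+FFFD. The removed test in effect assumed the second policy was always in force.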