1
0
mirror of https://github.com/php/php-src.git synced 2026-04-02 21:52:36 +02:00
Commit Graph

13 Commits

Author SHA1 Message Date
Alex Dowad
1cf12c02f0 Add test suite for SJIS-mac encoding 2020-11-11 11:18:58 +02:00
Alex Dowad
d40f9cf735 Add test suite for SJIS-2004 encoding 2020-11-11 11:18:58 +02:00
Alex Dowad
8f6889b20d Fix mbstring support for EUC-JP text encoding
- Don't allow control characters to appear in the middle of a multi-byte
  character. (A strange feature, or perhaps misfeature, of mbstring which is
  not present in other libraries such as iconv.)
- When checking whether string is valid, reject kuten codes which do not
  map to any character, whether converting from EUC-JP to another encoding,
  or converting another encoding which uses JIS X 0208/0212 charsets to
  EUC-JP.
- Truncated multi-byte characters are treated as an error.
2020-11-09 13:45:17 +02:00
Alex Dowad
ad7e0f16cc Fix mbstring support for Shift-JIS
- Reject otherwise valid kuten codes which don't map to anything in JIS X 0208.
- Handle truncated multi-byte characters as an error.
- Convert Shift-JIS 0x7E to Unicode 0x203E (overline) as recommended by the
  Unicode Consortium, and as iconv does.
- Convert Shift-JIS 0x5C to Unicode 0xA5 (yen sign) as recommended by the
  Unicode Consortium, and as iconv does.
  (NOTE: This will affect PHP scripts which use an internal encoding of
  Shift-JIS! PHP assigns a special meaning to 0x5C, the backslash. For example,
  it is used for escapes in double-quoted strings. Mapping the Shift-JIS yen
  sign to the Unicode yen sign means the yen sign will not be usable for
  C escapes in double-quoted strings. Japanese PHP programmers who want to
  write their source code in Shift-JIS for some strange reason will have to
  use the JIS X 0208 backlash or 'REVERSE SOLIDUS' character for their C
  escapes.)
- Convert Unicode 0x5C (backslash) to Shift-JIS 0x815F (reverse solidus).
- Immediately handle error if first Shift-JIS byte is over 0xEF, rather than
  waiting to see the next byte. (Previously, the value used was 0xFC, which is
  the limit for the 2nd byte and not the 1st byte of a multi-byte character.)
- Don't allow 'control characters' to appear in the middle of a multi-byte
  character.

The test case for bug 47399 is now obsolete. That test assumed that a number
of Shift-JIS byte sequences which don't map to any character were 'valid'
(because the byte values were within the legal ranges).
2020-11-09 13:45:16 +02:00
Alex Dowad
ff953f254c Add test suite for ARMSCII-8 encoding 2020-11-02 21:31:06 +02:00
Alex Dowad
335c1b98c2 Add test suite for KOI8-U encoding 2020-11-02 21:31:06 +02:00
Alex Dowad
9db4387f14 Add test suite for KOI8-R encoding 2020-11-02 21:31:06 +02:00
Alex Dowad
9980534a4e Add test suite for CP850 encoding 2020-11-02 21:31:06 +02:00
Alex Dowad
0485bed4c7 Add test suite for CP866 encoding 2020-11-02 21:31:06 +02:00
Alex Dowad
0b13305ccc Add test suite for CP1254 encoding 2020-11-02 21:31:05 +02:00
Alex Dowad
eb4151e89e Add test suite for CP1251 encoding 2020-11-02 21:31:05 +02:00
Alex Dowad
831abe2d90 Add test suite for CP1252 encoding
Also remove a bogus test (bug62545.phpt) which wrongly assumed that all invalid
characters in CP1251 and CP1252 should map to Unicode 0xFFFD (REPLACEMENT
CHARACTER).

mbstring has an interface to specify what invalid characters should be
replaced with; it's called `mb_substitute_character`. If a user wants to see
the Unicode 'replacement character', they can specify that using
`mb_substitute_character`. But if they specify something else, we should
follow that.
2020-10-30 22:13:27 +02:00
Alex Dowad
84c180d88b Add test suite for ISO-8859-x encoding verification and conversion 2020-10-16 22:25:48 +02:00