1
0
mirror of https://github.com/php/php-src.git synced 2026-04-29 03:03:26 +02:00
Commit Graph

336 Commits

Author SHA1 Message Date
Alex Dowad 3e7acf901d Remove mbstring identify filters
mbstring had an 'identify filter' for almost every supported text encoding
which was used when auto-detecting the most likely encoding for a string.
It would run over the string and set a 'flag' if it saw anything which
did not appear likely to be the encoding in question.

One problem with this scheme was that encodings which merely appeared
less likely to be the correct one were completely rejected, even if there
was no better candidate. Another problem was that the 'identify filters'
had a huge amount of code duplication with the 'conversion filters'.

Eliminate the identify filters. Instead, when auto-detecting text
encoding, use conversion filters to see whether the input string is valid
in candidate encodings or not. At the same type, watch the type of
codepoints which the string decodes to and mark it as less likely if
non-printable characters (ESC, form feed, bell, etc.) or 'private use
area' codepoints are seen.

Interestingly, one old test case in which JIS text was misidentified
as UTF-8 (and this wrong behavior was enshrined in the test) was 'fixed'
and the JIS string is now auto-detected as JIS.
2020-11-09 13:45:17 +02:00
Alex Dowad a416f938f3 Treat non-ASCII characters as erroneous when converting ASCII text 2020-11-09 13:45:17 +02:00
Alex Dowad 8f6889b20d Fix mbstring support for EUC-JP text encoding
- Don't allow control characters to appear in the middle of a multi-byte
  character. (A strange feature, or perhaps misfeature, of mbstring which is
  not present in other libraries such as iconv.)
- When checking whether string is valid, reject kuten codes which do not
  map to any character, whether converting from EUC-JP to another encoding,
  or converting another encoding which uses JIS X 0208/0212 charsets to
  EUC-JP.
- Truncated multi-byte characters are treated as an error.
2020-11-09 13:45:17 +02:00
Alex Dowad ad7e0f16cc Fix mbstring support for Shift-JIS
- Reject otherwise valid kuten codes which don't map to anything in JIS X 0208.
- Handle truncated multi-byte characters as an error.
- Convert Shift-JIS 0x7E to Unicode 0x203E (overline) as recommended by the
  Unicode Consortium, and as iconv does.
- Convert Shift-JIS 0x5C to Unicode 0xA5 (yen sign) as recommended by the
  Unicode Consortium, and as iconv does.
  (NOTE: This will affect PHP scripts which use an internal encoding of
  Shift-JIS! PHP assigns a special meaning to 0x5C, the backslash. For example,
  it is used for escapes in double-quoted strings. Mapping the Shift-JIS yen
  sign to the Unicode yen sign means the yen sign will not be usable for
  C escapes in double-quoted strings. Japanese PHP programmers who want to
  write their source code in Shift-JIS for some strange reason will have to
  use the JIS X 0208 backlash or 'REVERSE SOLIDUS' character for their C
  escapes.)
- Convert Unicode 0x5C (backslash) to Shift-JIS 0x815F (reverse solidus).
- Immediately handle error if first Shift-JIS byte is over 0xEF, rather than
  waiting to see the next byte. (Previously, the value used was 0xFC, which is
  the limit for the 2nd byte and not the 1st byte of a multi-byte character.)
- Don't allow 'control characters' to appear in the middle of a multi-byte
  character.

The test case for bug 47399 is now obsolete. That test assumed that a number
of Shift-JIS byte sequences which don't map to any character were 'valid'
(because the byte values were within the legal ranges).
2020-11-09 13:45:16 +02:00
Alex Dowad cc03c54c36 Remove useless byte{2,4}{be,le} encodings from mbstring
There is no meaningful difference between these and UCS-{2,4}. They are
just a little bit more lax about passing errors silently. They also have
no known use.

Alias to UCS-{2,4} in case someone, somewhere is using them.
2020-11-09 13:45:16 +02:00
Alex Dowad 9f5a4b3bd9 Fix mbstring support for ARMSCII-8
- Identify filter was completely wrong.
- Respect `mb_substitute_character` rather than converting invalid bytes to
  Unicode 0xFFFD (generic replacement character).
- Don't convert Unicode 0xFFFD to a valid ARMSCII-8 character.
- When converting ARMSCII-8 to ARMSCII-8, don't pass invalid bytes through
  silently.
2020-11-02 21:31:06 +02:00
Alex Dowad e81458862b Remove dead code from mbfilter_koi8u.c (and do general code cleanup) 2020-11-02 21:31:06 +02:00
Alex Dowad f9826fba46 All bytes are valid in KOI8-U encoding 2020-11-02 21:31:06 +02:00
Alex Dowad fde7794556 Remove dead code from mbfilter_iso8859_{2,4,5,9,10,13,14,15,16}.c
...Plus some dead code related to ISO-8859-1.
2020-11-02 21:31:06 +02:00
Alex Dowad 0a8ebb36a5 Remove dead code from mbfilter_koi8r.c 2020-11-02 21:31:06 +02:00
Alex Dowad 7b97789ec0 All bytes are valid in KOI8-R encoding 2020-11-02 21:31:06 +02:00
Alex Dowad b6e75265d0 Remove dead code from mbfilter_cp850.c (and do general code cleanup)
Since there are no invalid bytes in CP850, these `if` conditions will never
be true.
2020-11-02 21:31:06 +02:00
Alex Dowad 8926252ee8 All bytes are valid in CP850 encoding 2020-11-02 21:31:06 +02:00
Alex Dowad 20a404f765 Remove dead code from mbfilter_cp866.c (and do general code cleanup)
Since there are no invalid bytes in CP866, these `if` conditions will never
be true.
2020-11-02 21:31:06 +02:00
Alex Dowad bc04e0cc6d All bytes are valid in CP866 encoding 2020-11-02 21:31:05 +02:00
Alex Dowad e6d17cfe44 Fix mbstring support for CP1254 encoding
One funny thing: while the original author used Unicode 0xFFFD (generic
replacement character) for invalid bytes in CP1251 and CP1252, for CP1254
they used 0xFFFE, which is not a valid Unicode codepoint at all, but is a
reversed byte-order mark. Probably this was by mistake.

Anyways,

- Fixed identify filter, which was completely wrong.
- Don't convert Unicode 0xFFFE to a random (but valid) CP1254 byte.
- When converting CP1254 to CP1254, don't pass invalid bytes through silently.
2020-11-02 21:31:05 +02:00
Alex Dowad 44bd5804b0 Fix mbstring support for CP1251 encoding
- Identify filter was as wrong as wrong can be.
- Invalid CP1251 byte 0x98 was converted to Unicode 0xFFFD (generic
  replacement character), rather than respecting `mb_substitute_character`.
- Unicode 0xFFFD was converted to some random CP1251 byte.
- When converting CP1251 to CP1251, don't pass invalid bytes through silently.
2020-11-02 21:31:05 +02:00
Alex Dowad b5ff87ca71 Fix mbstring support for CP1252 encoding
It's a bit surprising how much was broken here.

- Identify filter was utterly and completely wrong.
- Instead of handling invalid CP1252 bytes as specified by
  `mb_substitute_character`, it would convert them to Unicode 0xFFFD
  (generic replacement character).
- When converting ISO-8859-1 to CP1252, invalid ISO-8859-1 bytes would
  be passed through silently.
- Unicode codepoints from 0x80-0x9F were converted to CP1252 bytes 0x80-0x9F,
  which is wrong.
- Unicode codepoint 0xFFFD was converted to CP1252 0x9F, which is very wrong.

Also clean up some unneeded code, and make the conversion table consistent with
others by using zero as a 'invalid' marker, rather than 0xFFFD.
2020-10-30 22:13:27 +02:00
Alex Dowad e26234a044 UTF-32 conversion treats truncated characters as illegal 2020-10-27 10:19:01 +02:00
Alex Dowad 7047e5d2c4 Add identify filter for UTF-32{,BE,LE} 2020-10-27 10:19:01 +02:00
Alex Dowad d8895cd054 Improve error handling for UTF-16{,BE,LE}
Catch various errors such as the first part of a surrogate pair not being
followed by a proper second part, the first part of a surrogate pair appearing
at the end of a string, the second part of a surrogate pair appearing out
of place, and so on.
2020-10-27 10:19:01 +02:00
Alex Dowad d9ddeb6e85 UTF-16 text conversion handles truncated characters as illegal
This broke one old test (Zend/tests/multibyte_encoding_003.phpt), which used
a PHP script encoded as UTF-16. The problem was that to terminate the test
script, we need the text: "\n--EXPECT--". Out of that text, the terminating
newline (0x0A byte) becomes part of the resulting test script; but a bare
0x0A byte with no 0x00 is not valid UTF-16.

Since we now treat truncated UTF-16 characters as erroneous, an extra '?' is
appended to the output as an 'illegal character' marker.

Really, if we are running PHP scripts which are treated as encoded in UTF-16
or some other arbitrary text encoding (not ASCII), and the script is not
actually a valid string in that encoding, inserting '?' characters into the
code which the PHP interpreter runs is a bad thing to do. In such cases, the
script shouldn't be treated as UTF-16 (or whatever) at all.

I wonder if mbstring's encoding detection is being used in 'non-strict' mode?
2020-10-27 10:19:00 +02:00
Alex Dowad bc18e32690 Do not pass invalid ISO-8859-{3,6,7,8} characters through silently
mbstring has a bad habit of passing invalid characters through silently
when converting to the same (or a "compatible") encoding.

For example, if you give it an invalid JIS X 0208 kuten code encoded with SJIS,
and try to convert that to EUC-JP, mbstring will just quietly re-encode the
invalid code in the EUC-JP representation.

At the same, some parts of the code (like `mb_check_encoding`) assume that
invalid characters will be treated as... well, invalid. Let's unbreak things
by actually catching errors and reporting them, instead of swallowing them.
2020-10-16 22:17:45 +02:00
Alex Dowad 5c6b2a7ad2 Add identify filter for ISO-8859-8 (Latin/Hebrew) 2020-10-16 22:17:45 +02:00
Alex Dowad ea687018cd Add identify filter for ISO-8859-7 (Latin/Greek) 2020-10-16 22:17:45 +02:00
Alex Dowad a6603b60f7 Add identify filter for ISO-8859-6 (Latin/Arabic)
Note that some text encoding conversion libraries, such as Solaris iconv
and FreeBSD iconv, map 0x30-0x39 to the Arabic script numerals rather than
the 'regular' Roman numerals. (That is, to Unicode codepoints 0x660-0x669.)

Further, Windows CP28596 adds more mappings to use the unused bytes in
ISO-8859-6.
2020-10-16 22:17:45 +02:00
Alex Dowad 23270d7f9e Add identify filter for ISO-8859-3 (Latin-3)
There are some bytes in this encoding which are not mapped to any character.
Notably, MicroSoft added their own mappings for these 'unused' bits in their
version of Latin-3, called CP28593.
2020-10-16 22:17:45 +02:00
Alex Dowad 7b9bed0150 Add identify filter for ISO-8859-16 (Latin-10) encoding
Interestingly, it looks like the original author intended to add an identify filter
for this encoding, but never did so. The needed struct is there, but was never added
to the list of identify filters in mbfl_ident.c.
2020-10-16 20:56:45 +02:00
Alex Dowad 7bb5b435af mUTF-7 (UTF7-IMAP) conversion: handle illegal (non-RFC-compliant) input correctly
Instead of looking the other way and letting things slide, report errors when
the input does not follow the RFC.
2020-10-13 20:26:14 +02:00
Alex Dowad b43a7deacf Add 'mUTF-7' alias for UTF7-IMAP encoding 2020-10-13 20:26:14 +02:00
Alex Dowad b975817265 Add comment explaining mUTF-7 to mbfilter_utf7imap.c 2020-10-13 20:26:14 +02:00
Alex Dowad 648c1cb51e Add identify filter for UCS-2, UCS-2BE, and UCS-2LE encodings 2020-10-13 20:26:14 +02:00
Alex Dowad 374f31e364 Add mbstring identify filter for 'binary' encoding 2020-10-13 20:26:13 +02:00
Alex Dowad 97beecc251 Add identify filter for UTF-16, UTF-16LE, UTF-16BE
There was one faulty test in the suite which only passed before because UTF-16 had no
identify filter. After this was fixed, it exposed the problem with the test.
2020-10-13 20:26:13 +02:00
Alex Dowad a98838e3b6 Handle illegal bytes properly when converting to '7bit' encoding
Previously, mbstring would silently drop illegal bytes when converting a
string to '7bit' encoding.
2020-10-13 06:12:38 +02:00
Alex Dowad 4aa7430f68 Add mbstring identify filter for '7bit' encoding 2020-10-13 06:12:38 +02:00
Alex Dowad 0ffc1f55b3 Refactor mbfl_ident.c, mbfl_encoding.c, mbfl_memory_device.c, mbfl_string.c
- Make everything less gratuitously verbose
- Don't litter the code with lots of unneeded NULL checks (for things which
  will never be NULL)
- Don't return success/failure code from functions which can never fail
- For encoding structs, don't use pointers to pointers to pointers for the
  list of alias strings. Pointers to pointers (2 levels of indirection)
  is what actually makes sense. This gets rid of some extraneous
  dereference operations.
2020-10-13 06:12:38 +02:00
Alex Dowad e8b8ecbd4e Remove useless constants MBFL_CHP_{CTL,DIGIT,UALPHA,LALPHA,MSPECIAL} 2020-10-13 06:12:37 +02:00
Alex Dowad aabbee2318 Remove useless validity check when converting UTF-16LE -> wchar
The check ensures that the decoded codepoint is between 0x10000-0x10FFFF,
which is the valid range which can be encoded in a UTF-16 surrogate pair.
However, just looking at the code, it's obvious that this will be true.
First of all, 0x10000 is added to the decoded codepoint on the previous
line, so how could it be less than 0x10000?

Further, even if the 20 data bits already decoded were 0xFFFFF (all ones),
when you add 0x10000, it comes to 0x10FFFF, which is the very top of the
valid range. So how could the decoded codepoint be more than 0x10FFFF?
It can't.
2020-10-13 06:12:37 +02:00
Alex Dowad f474e5502c Refactor UTF-16LE -> wchar conversion code 2020-10-13 06:12:37 +02:00
Alex Dowad 3f1851dec2 Avoid compiler warnings related to mbstring flush functions 2020-10-13 06:12:37 +02:00
George Peter Banyard fd1672a7f3 Fix [-Wduplicated-cond] in MBString extension 2020-10-09 20:54:23 +01:00
Remi Collet b1c5532ad1 fix mbfl function prototypes
re-add mbfl_convert_filter_feed API
re-add pointer cast
2020-09-15 15:15:06 +02:00
Alex Dowad a81061d36c Use symbolic constants in Japanese kana conversion code (not magic numbers)
Also correct misspelling of 'hiragana' as 'hirangana' at the same time.
2020-09-03 15:56:29 +02:00
Alex Dowad ec609916dc Remove unused 'from' field from mbfl_buffer_converter struct 2020-09-03 15:56:29 +02:00
Alex Dowad f699d65391 Add comment to mbfilter_tl_jisx0201_jisx0208.h
Explain the 'ZEN' and 'HAN' in symbolic constant names.
2020-09-03 15:56:29 +02:00
Alex Dowad a2b40ee9a5 Remove unneeded function mbfl_filt_ident_common_dtor
This was the default destructor for mbfl_identify_filter structs, but there's nothing
we actually need to do to those structs before freeing them.
2020-09-03 15:56:29 +02:00
Alex Dowad dcd6c6043e Remove unneeded function mbfl_filt_conv_common_dtor
This is a default destructor for mbfl_convert_filter structs. The thing is: there
isn't really anything that needs to be done to those structs before freeing them.
The default destructor just zeroed out some fields, but there's no reason why
we should actually do that.
2020-09-03 15:56:29 +02:00
Alex Dowad 409aa20ab0 Refactor mbfl_convert.c 2020-09-03 15:56:29 +02:00
Alex Dowad cdc664049c Comment constants in mbfl_consts.h, remove unused ones
These were unused, and almost certainly will never be used:

- MBFL_ENCTYPE_MWC4BE
- MBFL_ENCTYPE_MWC4LE
- MBFL_ENCTYPE_SHFTCODE
- MBFL_ENCTYPE_ENC_STRM

For the latter two, there were some encodings which were marked with these flags;
but nothing ever _checked_ these particular flags.
2020-08-31 23:18:56 +02:00