1
0
mirror of https://github.com/php/php-src.git synced 2026-04-25 00:48:25 +02:00
Commit Graph

2134 Commits

Author SHA1 Message Date
Alex Dowad 9e1447dbf3 Rename KANA2HIRA and HIRA2KANA constants (for mb_convert_kana)
mb_convert_kana is able to convert fullwidth katakana to fullwidth
hiragana (and vice versa). The constants referring to these modes had
names like MBFL_FILT_TL_ZEN2HAN_KANA2HIRA.

The "ZEN2HAN" part of the name is misleading, since these modes do not
convert fullwidth (zenkaku) kana to halfwidth (hankaku). The converted
characters are fullwidth both before and after the conversion. So...
let's name the constants accordingly.
2021-09-06 13:16:23 +02:00
Alex Dowad c8e65c9d74 Remove COMPAT2 conversion modes for mb_convert_kana
mb_convert_kana has conversion modes selected using 'M'/'m', which
convert a few various punctuation and symbol characters between
'ordinary' and full-width forms. The constants which refer to these
modes have names ending with COMPAT1.

Internally, there are similar conversion modes with names ending in
COMPAT2. They are like COMPAT1 modes, but they operate on a smaller
set of characters. But... that is all just dead code, because there is
no way for user code to select the COMPAT2 modes.

I have no idea what the original author intended those COMPAT2 modes to
actually be used for. Guess it doesn't really matter, anyways. At this
point, it's just more food for the flames.
2021-09-06 13:16:23 +02:00
Alex Dowad d7eb442993 Add more tests for ISO-2022-JP-2004 text conversion 2021-09-06 13:16:23 +02:00
Alex Dowad 907d0c3248 Add more tests for UTF7-IMAP text conversion 2021-09-06 13:16:23 +02:00
Alex Dowad bf940a13ff Add another test for SJIS-Mobile text conversion 2021-09-06 13:16:23 +02:00
Alex Dowad 32df61c558 Add more tests for UTF-7 text conversion 2021-09-06 13:16:23 +02:00
Alex Dowad ae71bfdee7 Add more tests for UCS-4 text conversion 2021-09-06 13:16:23 +02:00
Alex Dowad fd0e0c7390 Add another test for UCS-2 text conversion 2021-09-06 13:16:23 +02:00
Alex Dowad edf2bd95d9 Add more tests for ISO-2022-JP and JIS7/8 text conversion 2021-09-06 13:16:23 +02:00
Alex Dowad 6a2dca3420 Add more tests for ISO-2022-JP-KDDI text conversion 2021-09-06 13:16:23 +02:00
Alex Dowad d2f5a8b328 Add more tests for SJIS-mac text conversion 2021-09-06 13:16:23 +02:00
Alex Dowad 0957f54eb1 Treat truncated escape sequences for CP5022{0,1,2} as error 2021-09-06 13:16:23 +02:00
Alex Dowad 64e379d81e Declare CP50222 flush function as 'static' 2021-09-06 13:16:23 +02:00
Alex Dowad a312620607 Remove redundant NULL checks in mbstring
Whoever originally wrote mbstring seems to have a deathly fear of NULL
pointers lurking behind every corner. A common pattern is that one
function will check if a pointer is NULL, then pass it to another
function, which will again check if it is NULL, then pass to yet another
function, which will yet again check if it is NULL... it's NULL checks
all the way down.

Remove all the NULL checks in places where pointers could not possibly
be NULL.
2021-09-06 13:16:23 +02:00
Alex Dowad 626f0fec54 Remove some dead code from mbstring
mbstring has a great deal of dead code. Some common types are:

- Default switch clauses which will never be taken
- If clauses intended to convert codepoints which were not present in
  a conversion table... but the codepoint in question *is* in the table,
  so the if clause is not needed.
- Bounds checks in places where it is not possible for a value to ever
  be out of bounds.
- Checks to see if an unmatched Unicode codepoint is in CP932 extension
  range 3... but every codepoint in range 3 is also in range 2, so no
  codepoint will ever be matched and converted by that code.
2021-09-06 13:16:23 +02:00
Alex Dowad df32267494 Add more tests for UTF7-IMAP text conversion 2021-08-31 13:41:34 +02:00
Alex Dowad 16a1e0a219 In UTF7-IMAP, reject the 2nd part of surrogate pair if it appears unexpectedly 2021-08-31 13:41:34 +02:00
Alex Dowad 355464935d Add another test for UTF-7 text conversion 2021-08-31 13:41:34 +02:00
Alex Dowad 51b6c687db Add another test for GB18030 text conversion 2021-08-31 13:41:34 +02:00
Alex Dowad a0415b22ab Add more tests for CP5022{0,1,2} text conversion 2021-08-31 13:41:34 +02:00
Alex Dowad e3f6a9fbfe CP5022{0,1,2} supports 'IBM extension' codes from ku 115-119
mbstring has always had the conversion tables to support CP932 codes
in ku 115-119, and the conversion code for CP5022x has an 'if' clause
specifically to handle such characters... but that 'if' clause was dead
code, since a guard clause earlier in the same function prevented it
from accepting 2-byte characters with a starting byte of 0x93-0x97.

Adjust the guard clause so that these characters can be converted as
the original author apparently intended.

The code which handles ku 115-119 is the part which reads:

    } else if (s >= cp932ext3_ucs_table_min && s < cp932ext3_ucs_table_max) {
      w = cp932ext3_ucs_table[s - cp932ext3_ucs_table_min];
2021-08-31 13:41:34 +02:00
Alex Dowad 671dcee01e Add test for mb_str_split on UCS-2 text 2021-08-31 13:41:34 +02:00
Alex Dowad f303fc8a9b Use bool in mbfl_filt_conv_output_hex (rather than int) 2021-08-31 13:41:34 +02:00
Alex Dowad 776296e12f mbstring no longer provides 'long' substitutions for erroneous input bytes
Previously, mbstring had a special mode whereby it would convert
erroneous input byte sequences to output like "BAD+XXXX", where "XXXX"
would be the erroneous bytes expressed in hexadecimal. This mode could
be enabled by calling `mb_substitute_character("long")`.

However, accurately reproducing input byte sequences from the cached
state of a conversion filter is often tricky, and this significantly
complicates the implementation. Further, the means used for passing
the erroneous bytes through to where the "BAD+XXXX" text is generated
only allows for up to 3 bytes to be passed, meaning that some erroneous
byte sequences are truncated anyways.

More to the point, a search of publically available PHP code indicates
that nobody is really using this feature anyways.

Incidentally, this feature also provided error output like "JIS+XXXX"
if the input 'should have' represented a JISX 0208 codepoint, but it
decodes to a codepoint which does not exist in the JISX 0208 charset.
Similarly, specific error output was provided for non-existent
JISX 0212 codepoints, and likewise for JISX 0213, CP932, and a few
other charsets. All of that is now consigned to the flames.

However, "long" error markers also include a somewhat more useful
"U+XXXX" marker for Unicode codepoints which were successfully
decoded from the input text, but cannot be represented in the output
encoding. Those are still supported.

With this change, there is no need to use a variety of special values
in the high bits of a wchar to represent different types of error
values. We can (and will) just use a single error value. This will be
equal to -1.

One complicating factor: Text conversion functions return an integer to
indicate whether the conversion operation should be immediately
aborted, and the magic 'abort' marker is -1. Also, almost all of these
functions would return the received byte/codepoint to indicate success.
That doesn't work with the new error value; if an input filter detects
an error and passes -1 to the output filter, and the output filter
returns it back, that would be taken to mean 'abort'.

Therefore, amend all these functions to return 0 for success.
2021-08-31 13:41:34 +02:00
Alex Dowad 15ba73cee3 Add more tests for UTF-8 text conversion 2021-08-30 16:29:58 +02:00
Alex Dowad 51a32ccaf4 Add another test for UTF-16LE 2021-08-30 16:29:58 +02:00
Alex Dowad 7472c82c45 Add tests for UCS-4 text conversion 2021-08-30 16:29:58 +02:00
Alex Dowad 79015b23aa Add tests for UCS-2 text encoding 2021-08-30 16:29:58 +02:00
Alex Dowad 34ef8f3ca2 Add tests for '7bit' and '8bit' text encodings in mbstring 2021-08-30 16:29:58 +02:00
Alex Dowad 97f8495e0f UCS-4 conversion does not pass BOM through to output
This is to match the way that we handle UCS-2. When a BOM is found at
the beginning of a 'UCS-2' string (NOT 'UCS-2BE' or 'UCS-2LE'), we take
note of the intended byte order and handle the string accordingly, but
do NOT emit a BOM to the output. Rather, we just use the default byte
order for the requested output encoding.

Some might argue that if the input string used a BOM, and we are
emitting output in a text encoding where both big-endian and
little-endian byte orders are possible, we should include a BOM in the
output string. To such hypothetical debaters of minutiae, I can only
offer you a shoulder shrug. No reasonable program which handles UCS-2
and UCS-4 text should require a BOM.

Really, the concept of the BOM is a poor idea and should not have been
included in Unicode. Standardizing on a single byte order would have
been much better, similar to 'network byte order' for the Internet
Protocol. But this is not the place to speak at length of such things.
2021-08-30 16:29:58 +02:00
Alex Dowad e6f1a72235 Add test suite for mobile variants of UTF-8 (and fix bugs) 2021-08-30 16:29:58 +02:00
Alex Dowad 1865576694 Add test suite for EUC-JP-WIN (or EUC-JP-MS) text encoding (and fix bugs) 2021-08-30 16:29:58 +02:00
Alex Dowad 6a693d2d33 Remove useless variable: mbfl_encoding_utf8_kddi_a_aliases 2021-08-30 16:29:58 +02:00
Alex Dowad d4561894ea Extraneous trailing UCS-4 bytes are treated as error 2021-08-30 16:29:58 +02:00
Alex Dowad 0de4d6872e Add more tests for SJIS-2004 text conversion 2021-08-30 16:29:58 +02:00
Alex Dowad c7d47cbb4c Add more tests for SJIS text conversion 2021-08-30 16:29:58 +02:00
Alex Dowad 299690a1cf Add more tests for ISO-2022-JP/JIS7/JIS8 text conversion 2021-08-30 16:29:58 +02:00
Alex Dowad b2be85d11a Add more tests for ISO-2022-JP-MS text conversion 2021-08-30 16:29:58 +02:00
Alex Dowad ae4c956089 Add more tests for ISO-2022-JP-KDDI text conversion 2021-08-30 16:29:58 +02:00
Alex Dowad 51e0d323e4 ISO-2022-JP-MS treats truncated multi-byte chars as error
Sigh. I included tests which were intended to check this case in the
test suite for ISO-2022-JP-MS, but those tests were faulty and didn't
actually test what they were supposed to.

Fixing the tests revealed that there were still bugs in this area.
2021-08-30 16:29:58 +02:00
Alex Dowad 57a81af041 ISO-2022-JP-KDDI text conversion doesn't swallow PUA codepoints
There was a bit of legacy code here which looks like the original author
of mbstring intended to allow conversion of Unicode Private Use Area
codepoints to ISO-2022-JP-KDDI. However, that code never worked.
It set the output variable to values which were not matched by any
of the 'if' clauses below, which meant that nothing was actually
emitted to the output. In other words, if one tried to convert Unicode
to ISO-2022-JP-KDDI, and the Unicode string contained PUA codepoints,
they would be quietly 'swallowed' and disappear.

I don't know what ISO-2022-JP-KDDI byte sequences the author wanted
to map those PUA codepoints to, and anyways, this use case is so obscure
that there is little point in worrying about it. However, it is better
to remove the non-functioning code than to leave it in.

This means that if now one tries to convert PUA codepoints to
ISO-2022-JP-KDDI, those codepoints will be treated as erroneous rather
than silently ignored.
2021-08-30 16:29:58 +02:00
Alex Dowad 51b9d7a5e1 Test behavior of 'long' illegal character markers
After mb_substitute_character("long"), mbstring will respond to
erroneous input by inserting 'long' error markers into the output.
Depending on the situation, these error markers will either look like
BAD+XXXX (for general bad input), U+XXXX (when the input is OK, but it
converts to Unicode codepoints which cannot be represented in the
output encoding), or an encoding-specific marker like JISX+XXXX or
W932+XXXX.

We have almost no tests for this feature. Add a bunch of tests to
ensure that all our legacy encoding handlers work in a reasonable
way when 'long' error markers are enabled.
2021-08-30 16:29:58 +02:00
Alex Dowad f6f0506c84 Correct comment in mbfilter_ucs4.c 2021-08-30 16:29:58 +02:00
Alex Dowad 03392ecd50 Simplify code for converting UHC to Unicode 2021-08-30 16:29:58 +02:00
Alex Dowad 9363b0b5a7 Declare ARMSCII-8 conversion functions as 'static' 2021-08-30 16:29:58 +02:00
Alex Dowad 97b7fc893c Output illegal character marker for 4-byte illegal characters > 0x7FFFFFFF
Some text encodings supported by mbstring (such as UCS-4) accept 4-byte
characters. When mbstring encounters an illegal byte sequence for the
encoding it is using, it should emit an 'illegal character' marker,
which can either be a single character like '?', an HTML hexadecimal
entity, or a marker string like 'BAD+XXXX'.

Because of the use of signed integers to hold 4-byte characters,
illegal 4-byte sequences with a 'negative' value (one with the high
bit set) were not handled correctly when emitting the illegal char
marker. The result is that such illegal sequences were just skipped
over (and the marker was not emitted to the output). Fix that.
2021-08-30 16:29:58 +02:00
Máté Kocsis 8e6e9838b0 Add support for generating MAY_BE_ARRAY_OF_REF func info flag (#7416) 2021-08-30 13:50:34 +02:00
Nikita Popov 634f2e21d3 Don't expose wchar encoding to users (#7415)
The "wchar" encoding isn't really an encoding -- it's what we
internally use as the representation of decoded characters.

In practice, it tends to behave a lot like the 8bit encoding when
used from userland, because input code units end up being treated
as code points.

This patch removes the wchar encoding from the public encoding
list and reserves it for internal use only.
2021-08-30 11:11:33 +02:00
Nikita Popov 43cb2548f7 Flush filter during non-strict encoding detection
If we reach the end of the string without reducing to a single
encoding, then we should flush to check whether the last character
is incomplete.
2021-08-27 14:48:32 +02:00
Máté Kocsis fdc6082902 Generate optimizer func info from stubs for various extensions (#7409)
ext/hash, ext/iconv, ext/mbstring, ext/xml, ext/zlib
2021-08-26 19:52:11 +02:00