mirror of https://github.com/php/php-src.git synced 2026-04-01 13:12:16 +02:00

662 Commits

Author SHA1 Message Date
Alex Dowad
caeaa662ab Strict conversion of UHC text to Unicode
Previously, mbstring would accept a lot of things which were not valid
UHC text. No more.

- Don't allow single-byte control characters to appear where the 2nd
  byte of a multi-byte character should be.
- Validate that the 2nd byte of a multi-byte character is in the
  expected range.
- Treat it as an error if a multi-byte character is truncated.

Also add a test suite to confirm that UHC conversion (both to and from
Unicode) works according to spec.
2021-06-17 13:12:40 +02:00
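Note: the three validation rules listed in the UHC commit above can be sketched roughly as follows. This is a minimal illustration, not mbstring's C code. The byte ranges are the commonly cited CP949 ranges and are an assumption here, and the function names and the resynchronisation policy after an error are hypothetical.

```python
# Assumed UHC (CP949) byte ranges; illustrative only, not taken from mbstring.
def is_lead_byte(b):
    return 0x81 <= b <= 0xFE


def is_trail_byte(b):
    return 0x41 <= b <= 0x5A or 0x61 <= b <= 0x7A or 0x81 <= b <= 0xFE


def find_uhc_errors(data):
    """Return a list of (offset, reason) pairs for malformed sequences."""
    errors = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Single-byte character, nothing to validate.
            i += 1
            continue
        if not is_lead_byte(b):
            errors.append((i, "invalid lead byte"))
            i += 1
            continue
        if i + 1 >= len(data):
            # Lead byte at end of input: the character is truncated.
            errors.append((i, "truncated character"))
            break
        if not is_trail_byte(data[i + 1]):
            # Covers control characters and any other out-of-range 2nd byte.
            errors.append((i, "invalid trail byte"))
            # Hypothetical policy: re-examine the offending byte on its own.
            i += 1
            continue
        i += 2
    return errors
```

A pair that passes these range checks may still be unassigned in the real conversion table, so a range check alone is weaker than the strict conversion the commit describes.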
Alex Dowad
9868c17368 Mark CP932 and CP51932 encoding tests as 'slow tests' 2021-06-17 13:12:40 +02:00
Alex Dowad
e2459857af Remove duplicate implementation of CP932 from mbstring
Sigh. Double sigh. After fruitlessly searching the Internet for information on
this mysterious text encoding called "SJIS-open", I wrote a script to try
converting every Unicode codepoint from 0-0xFFFF and compare the results from
different variants of Shift-JIS, to see which one "SJIS-open" would be most
similar to.

The result? It's just CP932. There is no difference at all. So why do we have
two implementations of CP932 in mbstring?

In case somebody, somewhere is using "SJIS-open" (or its aliases "SJIS-win" or
"SJIS-ms"), add these as aliases to CP932 so existing code will continue to
work.
2021-06-17 13:12:40 +02:00
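Note: the comparison script mentioned in the commit above is not part of this log. A comparable experiment can be sketched in Python using its own codecs rather than mbstring, so the results say nothing about mbstring itself. The function names are hypothetical.

```python
def encode_or_none(ch, encoding):
    """Encode one character, or return None if the codec cannot represent it."""
    try:
        return ch.encode(encoding)
    except UnicodeEncodeError:
        return None


def diff_encodings(enc_a, enc_b, limit=0x10000):
    """List the codepoints below `limit` that the two codecs encode differently."""
    differing = []
    for cp in range(limit):
        ch = chr(cp)
        if encode_or_none(ch, enc_a) != encode_or_none(ch, enc_b):
            differing.append(cp)
    return differing
```

An empty result for two names, for example `diff_encodings("cp932", "mskanji")` in Python where the second name is a documented alias of the first, is the same kind of evidence the commit used to conclude that "SJIS-open" is CP932.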
Alex Dowad
7502c86342 Add test suite for UTF-{7,8,16,32}
Also fix a couple small problems with UTF-32 and UTF-8 support:

- UTF-32 would pass very large codepoints (>= 0x80000000), which are
  not valid.
- UTF-8 would sometimes emit two error marker characters for a single
  bad input byte.
2021-06-17 13:12:40 +02:00
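Note: a rough sketch of the UTF-32 check described above, written as a standalone decoder. It assumes that anything above U+10FFFF and any surrogate is rejected, and it emits exactly one marker per bad 4-byte unit. The function name and the `?` marker are hypothetical, and this is not mbstring's implementation.

```python
import struct


def decode_utf32be(data, marker="?"):
    """Decode big-endian UTF-32, replacing each invalid unit with one marker."""
    out = []
    usable = len(data) - len(data) % 4
    for (cp,) in struct.iter_unpack(">I", data[:usable]):
        # Assumption: values above U+10FFFF and surrogates are both invalid.
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            out.append(marker)
        else:
            out.append(chr(cp))
    if usable != len(data):
        # Trailing 1 to 3 bytes form a truncated unit: one marker, not several.
        out.append(marker)
    return "".join(out)
```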
Nikita Popov
a06d015e61 Remove unnecessary mbstring skipifs
These functions are always available (if the extension is available
at all).
2021-06-14 15:27:28 +02:00
Nikita Popov
6600ad6067 Add some missing EXTENSIONS sections to misc tests 2021-06-14 14:52:44 +02:00
Nikita Popov
4083600bd5 Port mbstring to use EXTENSIONS 2021-06-11 14:00:43 +02:00
Nikita Popov
39131219e8 Migrate more SKIPIF -> EXTENSIONS (#7139)
This is a mix of more automated and manual migration. It should remove all applicable extension_loaded() checks outside of skipif.inc files.
2021-06-11 12:58:44 +02:00
Nikita Popov
7485978339 Migrate SKIPIF -> EXTENSIONS (#7138)
This is an automated migration of most SKIPIF extension_loaded checks.
2021-06-11 11:57:42 +02:00
Ayesh Karunaratne
b8e380ab09 Update deprecation message for incompatible float to int conversion
Updates the deprecation message for implicit incompatible float to int conversion from:

```
Implicit conversion from non-compatible float %.*H to int in %s on line %d
```

to

```
Implicit conversion from float %.*H to int loses precision in %s on line %d
```

Related: #6661
2021-06-07 14:36:11 +02:00
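Note: the deprecation message above fires for floats that are not "integer-compatible". As a rough illustration of that predicate, assuming a 64-bit PHP integer and the RFC's definition (finite, no fractional part, within integer range), one could write the following. The names are hypothetical and this is not the engine's code.

```python
import math

# Assumption: 64-bit PHP build.
PHP_INT_MIN = -2**63
PHP_INT_MAX = 2**63 - 1


def is_integer_compatible(value):
    """True if converting `value` to an integer would lose no information."""
    if not math.isfinite(value):
        return False  # NaN and infinities never qualify
    if not value.is_integer():
        return False  # fractional part would be dropped
    # Python compares float and int exactly, so this range check is precise.
    return PHP_INT_MIN <= value <= PHP_INT_MAX
```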
George Peter Banyard
b6958bb847 Implement "Deprecate implicit non-integer-compatible float to int conversions" RFC. (#6661)
RFC: https://wiki.php.net/rfc/implicit-float-int-deprecate

Co-authored-by: Nikita Popov <nikita.ppv@gmail.com>
2021-05-31 15:48:45 +01:00
Christoph M. Becker
592cfa309e Merge branch 'PHP-8.0'
* PHP-8.0:
  Fix #81011: mb_convert_encoding removes references from arrays
2021-05-04 18:40:23 +02:00
Christoph M. Becker
d1c0cbdcb1 Merge branch 'PHP-7.4' into PHP-8.0
* PHP-7.4:
  Fix #81011: mb_convert_encoding removes references from arrays
2021-05-04 18:39:39 +02:00
Christoph M. Becker
0cafd53d18 Fix #81011: mb_convert_encoding removes references from arrays
We need to dereference references.

Closes GH-6938.
2021-05-04 18:37:40 +02:00
Alex Dowad
7159907d30 Fix mbstring support for ISO-2022-JP-MS encoding
- Treat it as an error if a multi-byte string or escape sequence is truncated
- Don't allow 'control' characters or escape sequences to appear in the middle
  of a multi-byte char

As with ISO-2022-JP-KDDI, the main reference used to develop the tests was
the behavior of the existing code. It would have been better to have some
independent reference which we could cross-check our code against, but I
couldn't find one.
2021-04-15 15:52:31 +02:00
Alex Dowad
570e89a9f3 Fix mbstring support for ISO-2022-JP-KDDI encoding
- Treat it as an error if a multi-byte character or escape sequence is truncated
- When converting other encodings to ISO-2022-JP-KDDI, don't swallow trailing
  hash characters or digits
- Don't allow 'control' characters to appear in the middle of a multi-byte char

Note: I was not able to find any kind of official or even semi-official
specification for this legacy encoding. Therefore, the test suite for
ISO-2022-JP-KDDI is based largely on the behavior of the existing code.

Verifying the correctness of program code in this way is very questionable.
In a sense, all you are proving is that the code "does what it does". However,
the test suite will still expose any unintended _changes_ to behavior.
2021-04-15 15:52:31 +02:00
Alex Dowad
f5f3ee7aee Add test suite for mUTF-7 (IMAP) encoding 2021-04-15 15:52:31 +02:00
Alex Dowad
ebe6500a0b Fix error reporting bug for Unicode -> CP50220 conversion
To detect errors in conversion from Unicode to another text encoding, each
mbstring conversion filter object maintains a count of 'bad' characters. After
a conversion operation finishes, this count is checked to see if there was any
error.

The problem with CP50220 was that mbstring used a chain of two conversion filter
objects. The 'bad character count' would be incremented on the second object in
the chain, but this didn't do anything, as only the count on the first such
object is ever checked.

Fix this by implementing the conversion using a single conversion filter object,
rather than a chain of two. This is possible because of the recent refactoring,
which pulled out the needed logic for CP50220 conversion into a helper function.
2021-04-15 15:52:31 +02:00
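Note: the bug described above can be modelled with a toy filter chain. Everything here (the `Filter` class, `bad_count`, the "non-ASCII is bad" rule) is a hypothetical stand-in for mbstring's conversion filter objects, meant only to show why a count kept on the second filter goes unnoticed.

```python
class Filter:
    """Toy conversion filter that counts characters it cannot convert."""

    def __init__(self, is_bad, downstream=None):
        self.is_bad = is_bad
        self.downstream = downstream
        self.bad_count = 0
        self.output = []

    def feed(self, ch):
        if self.is_bad(ch):
            self.bad_count += 1
            return
        if self.downstream is not None:
            self.downstream.feed(ch)
        else:
            self.output.append(ch)


def reported_errors_chained(text):
    # Two filters: the second one detects errors, but only the first is checked.
    second = Filter(lambda ch: ord(ch) > 0x7F)
    first = Filter(lambda ch: False, downstream=second)
    for ch in text:
        first.feed(ch)
    return first.bad_count


def reported_errors_single(text):
    # One filter does all the work, so its count is the one that gets checked.
    only = Filter(lambda ch: ord(ch) > 0x7F)
    for ch in text:
        only.feed(ch)
    return only.bad_count
```

With input containing one unconvertible character, the chained version reports no errors while the single-filter version reports one.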
Max Semenik
b11771271e Remove stray mentions of mbstring.func_overload
This feature has been completely removed.

Closes GH-6688.
2021-02-15 09:47:28 +01:00
Nikita Popov
b10416a652 Deprecate passing null to non-nullable arg of internal function
This deprecates passing null to non-nullable scalar arguments of
internal functions, with the eventual goal of making the behavior
consistent with userland functions, where null is never accepted
for non-nullable arguments.

This change is expected to cause quite a lot of fallout. In most
cases, calling code should be adjusted to avoid passing null. In
some cases, PHP should be adjusted to make some function arguments
nullable. I have already fixed a number of functions before landing
this, but feel free to file a bug if you encounter a function that
doesn't accept null, but probably should. (The rule of thumb for
this to be applicable is that the function must have special behavior
for 0 or "", which is distinct from the natural behavior of the
parameter.)

RFC: https://wiki.php.net/rfc/deprecate_null_to_scalar_internal_arg

Closes GH-6475.
2021-02-11 21:46:13 +01:00
Alex Dowad
d8c785b894 Update 'East Asian Width' table to comply with Unicode 13.0
Instead of manually maintaining the data in eaw_table.h, it is now automatically
generated by ucgendat/ucgendat.php, using the EastAsianWidth.txt file from
the Unicode Consortium.

Something must be said about the deleted test case. Back in 2004, someone
noticed that `mb_strwidth` didn't comply with Unicode 4.0. A test case was
added to expose the problem. Well, time keeps moving on, and with the changing
years, new Unicodes are born and old Unicodes die. Some characters which were
counted as double-width in Unicode 4.0 are no longer such in Unicode 13.0,
which renders the test case obsolete.

At the same time, make a couple of spelling/grammar fixes in ucgendat.php.
2021-01-19 20:38:44 +02:00
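Note: generating a width table from EastAsianWidth.txt amounts to collecting the ranges marked W (wide) or F (fullwidth). The sketch below shows one way to do that. It is not ucgendat.php, the function name is hypothetical, and it assumes the input lines are sorted by codepoint as in the published file.

```python
def parse_wide_ranges(lines):
    """Collect (first, last) codepoint ranges whose width class is W or F."""
    ranges = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop trailing comments
        if not data:
            continue
        span, width = [part.strip() for part in data.split(";")[:2]]
        if width not in ("W", "F"):
            continue
        first, _, last = span.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        if ranges and ranges[-1][1] + 1 == start:
            # Merge with the previous range when the two are adjacent.
            ranges[-1] = (ranges[-1][0], end)
        else:
            ranges.append((start, end))
    return ranges
```

Usage, with a few hand-written sample lines in the file's format:

```python
sample = ["# comment", "1100..115F;W # Lo", "231C;N", "3000;F", "3001..3003;W"]
wide = parse_wide_ranges(sample)
```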
Alex Dowad
888f5d7729 CP5022{0,1,2}: treat truncated multibyte characters as error 2021-01-15 21:55:41 +02:00
Alex Dowad
2a93a8bb8c Add test suite for CP5022{0,1,2} 2021-01-15 21:55:41 +02:00
Nikita Popov
e2c8ab7c33 Print "interned" instead of fake refcount in debug_zval_dump()
debug_zval_dump() currently prints refcount 1 for interned strings
and arrays, which does not really reflect the truth. These values
are not refcounted, so the refcount is misleading. Instead print
an "interned" tag.

Closes GH-6598.
2021-01-15 12:21:24 +01:00
Alex Dowad
4299e2de42 JIS7/JIS8 encoding: treat truncated multibyte characters as error 2021-01-14 22:34:16 +02:00
Alex Dowad
b67e358e75 JIS7/JIS8 encoding: handle invalid 2nd byte for Kanji correctly
Previously, in ISO-2022-JP/JIS7/JIS8, if an escape sequence (starting with 0x1B)
appeared where the 2nd byte of a multibyte character should have been, mbstring
would forget all about the truncated multibyte character and happily accept the
escape sequence. However, such sequences are not legal and should be flagged as
errors.

Also, any other illegal bytes appearing where the 2nd byte of a multibyte
character was expected were just passed through quietly to the output. Fix that.

Also add a test suite for both ISO-2022-JP and JIS7/JIS8. (These are extremely
similar encodings; JIS7 and JIS8 are variants of ISO-2022-JP. mbstring's 'JIS'
is actually a combination of JIS7 _and_ JIS8, since the extensions which each
one adds to ISO-2022-JP are disjoint.)
2021-01-14 22:31:31 +02:00
Alex Dowad
4b95fdf2ca ISO-2022-JP-2004 conversion: handle invalid characters correctly 2021-01-14 22:26:24 +02:00
Alex Dowad
c9fea7db72 Convert U+00AF (MACRON) to 0x8150 (FULLWIDTH MACRON) in some SJIS variants
Except for vanilla Shift-JIS, where 0x7E is a halfwidth overline/macron.
As for Shift-JIS-2004, it has an added character (byte sequence 0x854A)
which was defined as a halfwidth macron in JIS X 0213:2000, so we use that.
2020-11-25 20:51:45 +02:00
Alex Dowad
ecf718470b Convert U+FF5E (FULLWIDTH TILDE) to 0x8160 (WAVE DASH) in SJIS variants
By entering this character in the JIS X 0208 conversion table, we can
remove a bunch of explicit `if` clauses in different conversion filters.
It also means that U+FF5E can be converted into SJIS-mac now; I don't
know why this one SJIS variant rejected U+FF5E before, since 0x8160
means the same thing in SJIS-mac as the others.
2020-11-25 20:51:45 +02:00
Alex Dowad
4f3bd2e235 Convert U+203E (OVERLINE) to 0x8150 (FULLWIDTH MACRON) in some SJIS variants
Converting U+203E to 0x7E was especially wrong for CP932, where 0x7E
represents a tilde.

For vanilla Shift-JIS and Shift-JIS-2004, converting to 0x7E is acceptable,
since 0x7E does represent an overline/macron in those encodings.

Follow the same principle in CP51932, which is closely related to CP932.
2020-11-25 20:51:45 +02:00
Alex Dowad
0d0029d729 0x7E is not a tilde in Shift-JIS{,-2004} 2020-11-25 20:51:45 +02:00
Alex Dowad
e4ee979111 0x5C is not a Yen sign in CP932 (or CP51932)
When Microsoft created CP932 (their version of Shift-JIS), they explicitly
used bytes 0-0x7F to represent ASCII characters rather than JIS X 0201
characters.

So when converting Unicode to CP932, it is not correct to convert U+00A5
to CP932 0x5C. Fortunately, CP932 does have a multi-byte FULLWIDTH YEN SIGN
character which we can use instead.

CP51932 uses the same extended character set as CP932; while CP932 is
Microsoft's extended version of Shift-JIS, CP51932 is their extended version
of EUC-JP. So the same reasoning applies to CP51932.
2020-11-25 20:51:45 +02:00
Alex Dowad
315d48b434 0x5C is not a backslash in Shift-JIS-2004
Shift-JIS-2004 is an extension of Shift-JIS, which uses 0x5C for the Yen
sign. Therefore, it is not correct to convert ASCII 0x5C (backslash) to
Shift-JIS-2004 0x5C (yen sign). JIS X 0208 does have a backslash, so we
can convert ASCII backslash to SJIS-2004 backslash instead.

From time immemorial, there has been confusion around the treatment
of 0x5C bytes on systems using legacy Japanese encodings. JIS X 0201
specified that 0x5C means a yen sign, and thus fonts on Japanese systems,
including early versions of Windows, displayed a 0x5C byte as a yen sign.
This meant that when ASCII text files were displayed on such systems,
what were meant to be backslashes would appear as yen signs. Japanese C
programmers could write character escapes using yen signs, and C compilers
built on the assumption that the input was ASCII would interpret these
escapes as desired. Likewise for shell scripts. Et cetera, et cetera...

Therefore, if the input to `mb_convert_encoding` is (for example) a C
program, and after converting to Shift-JIS-2004, the user wishes to feed
the output into a C compiler, *then* perhaps ASCII 0x5C should be mapped
to SJIS 0x5C. However, this scenario is ridiculous and will never happen.

A more realistic scenario might be: an article written in SJIS-2004 has
embedded Windows file paths (like 'C:\Program Files'), with yen signs used
as a path separator. If we convert SJIS-2004 0x5C to ASCII 0x5C, then the
path separators will be 'fixed' by the conversion.

For general written texts, it is much better to convert backslashes to...
backslashes. And yen signs, to yen signs.
2020-11-25 20:51:44 +02:00
Alex Dowad
5c805655db Enhance handling of CP51932 encoding
- Don't pass 'control' characters through in the middle of a multi-byte char
- Treat truncated multi-byte characters as an error
2020-11-25 20:51:44 +02:00
Alex Dowad
beef597124 Fix mbstring support for SJIS-Mobile (DoCoMo, KDDI, and Softbank variants of Shift-JIS)
Lots of problems here.

- Don't pass 'control' characters through silently in the middle of a
  multi-byte character.
- Treat it as an error if a multi-byte character is truncated.
- For ESC sequences used to encode emoji on earlier Softbank phones, if an
  invalid ESC sequence is found, don't pass it through. Rather, handle it as
  an error and respect `mb_substitute_character`.
- In ranges used by mobile vendors for emoji, if a certain byte sequence
  doesn't map to any emoji, don't emit a mangled value (actually a raw
  (ku*94)+ten value, which may not even be a valid Unicode codepoint at all).
- When converting Unicode to SJIS-Mobile, don't mangle codepoints which fall
  in the 2nd range of Microsoft vendor extensions.

Some vendor-specific emoji have been mapped to standard Unicode codepoints
now, rather than 'private use area' codepoints. When the legacy code was
written, these codepoints may not have existed yet in the Unicode standard
which was current at that time.

Also do a major code cleanup -- remove dead code, rearrange what is left,
use some new macros and helper functions to make the code clearer...
2020-11-25 20:51:44 +02:00
Alex Dowad
2759874a42 Enhance handling of CP932 text encoding
- Don't allow control characters to appear in the middle of a multi-byte
  character. (This was a strange feature of mbstring; it doesn't make much
  sense, and iconv doesn't allow it.)
- Treat truncated multi-byte characters as an error.
2020-11-25 19:52:19 +02:00
Alex Dowad
b489c1bc4d Bugfixes for findInvalidChars (helper for mbstring test suite) 2020-11-25 19:52:19 +02:00
Alex Dowad
6dd75478d5 Leading BOM is stripped for UTF-32
For consistency with UTF-16 and UCS-4.

Also, do some code cleanup.
2020-11-11 11:18:59 +02:00
Alex Dowad
1cf12c02f0 Add test suite for SJIS-mac encoding 2020-11-11 11:18:58 +02:00
Alex Dowad
d40f9cf735 Add test suite for SJIS-2004 encoding 2020-11-11 11:18:58 +02:00
Alex Dowad
d1d50c2b7a Test EUC-JP and Shift-JIS more thoroughly
Previously, the unit tests for these text encodings covered all mappings
from legacy -> Unicode, and all _reversible_ mappings from Unicode -> legacy.
However, we should also test the few Unicode -> legacy mappings which
are not reversible.
2020-11-11 11:18:58 +02:00
Alex Dowad
3e7acf901d Remove mbstring identify filters
mbstring had an 'identify filter' for almost every supported text encoding
which was used when auto-detecting the most likely encoding for a string.
It would run over the string and set a 'flag' if it saw anything which
did not appear likely to be the encoding in question.

One problem with this scheme was that encodings which merely appeared
less likely to be the correct one were completely rejected, even if there
was no better candidate. Another problem was that the 'identify filters'
had a huge amount of code duplication with the 'conversion filters'.

Eliminate the identify filters. Instead, when auto-detecting text
encoding, use conversion filters to see whether the input string is valid
in candidate encodings or not. At the same time, watch the type of
codepoints which the string decodes to and mark it as less likely if
non-printable characters (ESC, form feed, bell, etc.) or 'private use
area' codepoints are seen.

Interestingly, one old test case in which JIS text was misidentified
as UTF-8 (and this wrong behavior was enshrined in the test) was 'fixed'
and the JIS string is now auto-detected as JIS.
2020-11-09 13:45:17 +02:00
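Note: the detection approach described above (validate with the converter, then penalise unlikely codepoints) can be sketched with Python's codecs standing in for mbstring's conversion filters. The scoring weights and function names are hypothetical, and mbstring's real heuristics may differ.

```python
def penalty(text):
    """Count codepoints that make a candidate decoding look less likely."""
    score = 0
    for ch in text:
        cp = ord(ch)
        if cp < 0x20 and ch not in "\t\r\n":
            score += 1  # non-printable control character (ESC, bell, ...)
        elif 0xE000 <= cp <= 0xF8FF:
            score += 1  # 'private use area' codepoint
    return score


def detect_encoding(data, candidates):
    """Return the valid candidate with the lowest penalty, or None."""
    best = None
    for encoding in candidates:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue  # invalid in this encoding: rejected outright
        score = penalty(text)
        if best is None or score < best[1]:
            best = (encoding, score)
    return best[0] if best else None
```

ISO-2022-JP text is made of ASCII-range bytes and is therefore also valid UTF-8, but read as UTF-8 it contains raw ESC characters, so the penalty favours the JIS candidate. This mirrors the old test case mentioned in the commit.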
Alex Dowad
8f6889b20d Fix mbstring support for EUC-JP text encoding
- Don't allow control characters to appear in the middle of a multi-byte
  character. (A strange feature, or perhaps misfeature, of mbstring which is
  not present in other libraries such as iconv.)
- When checking whether a string is valid, reject kuten codes which do not
  map to any character, whether converting from EUC-JP to another encoding,
  or converting another encoding which uses JIS X 0208/0212 charsets to
  EUC-JP.
- Truncated multi-byte characters are treated as an error.
2020-11-09 13:45:17 +02:00
Alex Dowad
ad7e0f16cc Fix mbstring support for Shift-JIS
- Reject otherwise valid kuten codes which don't map to anything in JIS X 0208.
- Handle truncated multi-byte characters as an error.
- Convert Shift-JIS 0x7E to Unicode 0x203E (overline) as recommended by the
  Unicode Consortium, and as iconv does.
- Convert Shift-JIS 0x5C to Unicode 0xA5 (yen sign) as recommended by the
  Unicode Consortium, and as iconv does.
  (NOTE: This will affect PHP scripts which use an internal encoding of
  Shift-JIS! PHP assigns a special meaning to 0x5C, the backslash. For example,
  it is used for escapes in double-quoted strings. Mapping the Shift-JIS yen
  sign to the Unicode yen sign means the yen sign will not be usable for
  C escapes in double-quoted strings. Japanese PHP programmers who want to
  write their source code in Shift-JIS for some strange reason will have to
  use the JIS X 0208 backslash or 'REVERSE SOLIDUS' character for their C
  escapes.)
- Convert Unicode 0x5C (backslash) to Shift-JIS 0x815F (reverse solidus).
- Immediately handle error if first Shift-JIS byte is over 0xEF, rather than
  waiting to see the next byte. (Previously, the value used was 0xFC, which is
  the limit for the 2nd byte and not the 1st byte of a multi-byte character.)
- Don't allow 'control characters' to appear in the middle of a multi-byte
  character.

The test case for bug 47399 is now obsolete. That test assumed that a number
of Shift-JIS byte sequences which don't map to any character were 'valid'
(because the byte values were within the legal ranges).
2020-11-09 13:45:16 +02:00
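Note: several of the Shift-JIS rules listed above concern byte classification and the single-byte mappings. The following is a rough sketch of those rules only. The function names are hypothetical, the halfwidth katakana range is an assumption not stated in the commit, and no JIS X 0208 table lookup is attempted.

```python
def classify_first_byte(b):
    """Classify a byte seen at the start of a Shift-JIS character."""
    if b < 0x80:
        return "single"  # JIS X 0201 Roman
    if 0xA1 <= b <= 0xDF:
        return "single"  # assumed: JIS X 0201 halfwidth katakana
    if 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF:
        return "lead"  # 0xEF is the upper limit for a first byte
    return "error"  # includes everything over 0xEF: fail immediately


def is_valid_second_byte(b):
    # 0xFC is the limit for the 2nd byte, not the 1st.
    return 0x40 <= b <= 0xFC and b != 0x7F


def decode_single_byte(b):
    """Map a single-byte Shift-JIS character to Unicode."""
    if b == 0x5C:
        return "\u00a5"  # yen sign, not backslash
    if b == 0x7E:
        return "\u203e"  # overline, not tilde
    if b < 0x80:
        return chr(b)
    if 0xA1 <= b <= 0xDF:
        return chr(0xFF61 + (b - 0xA1))
    raise ValueError("not a single-byte character: 0x%02X" % b)


def encode_backslash():
    # U+005C goes to the two-byte JIS X 0208 reverse solidus.
    return b"\x81\x5f"
```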
Alex Dowad
cc03c54c36 Remove useless byte{2,4}{be,le} encodings from mbstring
There is no meaningful difference between these and UCS-{2,4}. They are
just a little bit more lax about passing errors silently. They also have
no known use.

Alias to UCS-{2,4} in case someone, somewhere is using them.
2020-11-09 13:45:16 +02:00
Alex Dowad
3eb8828d1a Fix issues with mbstring encoding tests
I made some mistakes on this code, which meant that not everything which
should be tested was actually being tested.
2020-11-09 13:45:16 +02:00
Alex Dowad
ff953f254c Add test suite for ARMSCII-8 encoding 2020-11-02 21:31:06 +02:00
Alex Dowad
335c1b98c2 Add test suite for KOI8-U encoding 2020-11-02 21:31:06 +02:00
Alex Dowad
9db4387f14 Add test suite for KOI8-R encoding 2020-11-02 21:31:06 +02:00
Alex Dowad
9980534a4e Add test suite for CP850 encoding 2020-11-02 21:31:06 +02:00