mirror of https://github.com/php/php-src.git synced 2026-04-02 13:43:02 +02:00

625 Commits

Author SHA1 Message Date
Alex Dowad
6dd75478d5 Leading BOM is stripped for UTF-32
For consistency with UTF-16 and UCS-4.

Also, do some code cleanup.
2020-11-11 11:18:59 +02:00
Alex Dowad
1cf12c02f0 Add test suite for SJIS-mac encoding 2020-11-11 11:18:58 +02:00
Alex Dowad
d40f9cf735 Add test suite for SJIS-2004 encoding 2020-11-11 11:18:58 +02:00
Alex Dowad
d1d50c2b7a Test EUC-JP and Shift-JIS more thoroughly
Previously, the unit tests for these text encodings covered all mappings
from legacy -> Unicode, and all _reversible_ mappings from Unicode -> legacy.
However, we should also test the few Unicode -> legacy mappings which
are not reversible.
2020-11-11 11:18:58 +02:00
Alex Dowad
3e7acf901d Remove mbstring identify filters
mbstring had an 'identify filter' for almost every supported text encoding
which was used when auto-detecting the most likely encoding for a string.
It would run over the string and set a 'flag' if it saw anything which
did not appear likely to be the encoding in question.

One problem with this scheme was that encodings which merely appeared
less likely to be the correct one were completely rejected, even if there
was no better candidate. Another problem was that the 'identify filters'
had a huge amount of code duplication with the 'conversion filters'.

Eliminate the identify filters. Instead, when auto-detecting text
encoding, use conversion filters to see whether the input string is valid
in candidate encodings or not. At the same time, watch the type of
codepoints which the string decodes to and mark it as less likely if
non-printable characters (ESC, form feed, bell, etc.) or 'private use
area' codepoints are seen.

Interestingly, one old test case in which JIS text was misidentified
as UTF-8 (and this wrong behavior was enshrined in the test) was 'fixed'
and the JIS string is now auto-detected as JIS.
2020-11-09 13:45:17 +02:00
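
The detection scheme described in 3e7acf901d can be sketched roughly as follows. This is an illustrative Python model, not mbstring's C implementation: the function names `penalty` and `detect_encoding` are hypothetical, Python's built-in codecs stand in for mbstring's conversion filters, and the scoring (one point per suspicious codepoint) is an assumption made for the example.

```python
import unicodedata
from typing import List, Optional


def penalty(text: str) -> int:
    """Count decoded codepoints that make a candidate look less likely."""
    score = 0
    for ch in text:
        # Control characters such as ESC, form feed and bell; common
        # whitespace (tab, newline, carriage return) is not penalized here.
        is_control = ord(ch) < 0x20 and ch not in "\t\n\r"
        # 'Co' is the Unicode general category for private use codepoints.
        is_private_use = unicodedata.category(ch) == "Co"
        if is_control or is_private_use:
            score += 1
    return score


def detect_encoding(data: bytes, candidates: List[str]) -> Optional[str]:
    """Return the valid candidate with the lowest penalty, or None."""
    best_name = None
    best_score = None
    for name in candidates:
        try:
            text = data.decode(name)  # strict decoding rejects invalid input
        except UnicodeDecodeError:
            continue  # not valid in this encoding at all
        score = penalty(text)
        if best_score is None or score < best_score:
            best_name, best_score = name, score
    return best_name
```

With this model, JIS (ISO-2022-JP) text is also valid UTF-8 byte-wise, but its escape sequences decode to ESC characters under UTF-8, so the JIS candidate wins. A less likely candidate is still returned when no better one exists, which is the behavior the commit message argues for.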
Alex Dowad
8f6889b20d Fix mbstring support for EUC-JP text encoding
- Don't allow control characters to appear in the middle of a multi-byte
  character. (A strange feature, or perhaps misfeature, of mbstring which is
  not present in other libraries such as iconv.)
- When checking whether a string is valid, reject kuten codes which do not
  map to any character, whether converting from EUC-JP to another encoding,
  or converting another encoding which uses JIS X 0208/0212 charsets to
  EUC-JP.
- Truncated multi-byte characters are treated as an error.
2020-11-09 13:45:17 +02:00
Alex Dowad
ad7e0f16cc Fix mbstring support for Shift-JIS
- Reject otherwise valid kuten codes which don't map to anything in JIS X 0208.
- Handle truncated multi-byte characters as an error.
- Convert Shift-JIS 0x7E to Unicode 0x203E (overline) as recommended by the
  Unicode Consortium, and as iconv does.
- Convert Shift-JIS 0x5C to Unicode 0xA5 (yen sign) as recommended by the
  Unicode Consortium, and as iconv does.
  (NOTE: This will affect PHP scripts which use an internal encoding of
  Shift-JIS! PHP assigns a special meaning to 0x5C, the backslash. For example,
  it is used for escapes in double-quoted strings. Mapping the Shift-JIS yen
  sign to the Unicode yen sign means the yen sign will not be usable for
  C escapes in double-quoted strings. Japanese PHP programmers who want to
  write their source code in Shift-JIS for some strange reason will have to
  use the JIS X 0208 backslash or 'REVERSE SOLIDUS' character for their C
  escapes.)
- Convert Unicode 0x5C (backslash) to Shift-JIS 0x815F (reverse solidus).
- Immediately handle error if first Shift-JIS byte is over 0xEF, rather than
  waiting to see the next byte. (Previously, the value used was 0xFC, which is
  the limit for the 2nd byte and not the 1st byte of a multi-byte character.)
- Don't allow 'control characters' to appear in the middle of a multi-byte
  character.

The test case for bug 47399 is now obsolete. That test assumed that a number
of Shift-JIS byte sequences which don't map to any character were 'valid'
(because the byte values were within the legal ranges).
2020-11-09 13:45:16 +02:00
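
The single-byte mappings and the lead-byte check listed in ad7e0f16cc can be written down as a small table. The Python below is only a sketch of those stated rules: the names are hypothetical, and it assumes every other 7-bit byte maps to the identical ASCII codepoint. It does not reproduce mbstring's actual conversion tables.

```python
# Shift-JIS bytes whose Unicode mapping differs from ASCII, per the commit.
SJIS_SINGLE_BYTE_OVERRIDES = {
    0x5C: 0x00A5,  # yen sign
    0x7E: 0x203E,  # overline
}

# Unicode codepoints whose Shift-JIS form changes as a consequence.
UNICODE_TO_SJIS_OVERRIDES = {
    0x5C: bytes([0x81, 0x5F]),  # backslash becomes double-byte REVERSE SOLIDUS
}


def decode_sjis_7bit(byte: int) -> int:
    """Map a 7-bit Shift-JIS byte to a Unicode codepoint."""
    if not 0x00 <= byte <= 0x7F:
        raise ValueError("expected a 7-bit byte value")
    # Assumption: bytes not listed above keep their ASCII meaning.
    return SJIS_SINGLE_BYTE_OVERRIDES.get(byte, byte)


def first_byte_is_invalid(byte: int) -> bool:
    """True if the byte can be rejected immediately, before the next byte."""
    return byte > 0xEF
```

For example, `decode_sjis_7bit(0x5C)` gives `0xA5`, while `first_byte_is_invalid(0xF0)` is true and `first_byte_is_invalid(0xEF)` is false, matching the corrected limit of 0xEF rather than 0xFC.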
Alex Dowad
cc03c54c36 Remove useless byte{2,4}{be,le} encodings from mbstring
There is no meaningful difference between these and UCS-{2,4}. They are
just a little bit more lax about passing errors silently. They also have
no known use.

Alias to UCS-{2,4} in case someone, somewhere is using them.
2020-11-09 13:45:16 +02:00
Alex Dowad
3eb8828d1a Fix issues with mbstring encoding tests
I made some mistakes in this code, which meant that not everything which
should be tested was actually being tested.
2020-11-09 13:45:16 +02:00
Alex Dowad
ff953f254c Add test suite for ARMSCII-8 encoding 2020-11-02 21:31:06 +02:00
Alex Dowad
335c1b98c2 Add test suite for KOI8-U encoding 2020-11-02 21:31:06 +02:00
Alex Dowad
9db4387f14 Add test suite for KOI8-R encoding 2020-11-02 21:31:06 +02:00
Alex Dowad
9980534a4e Add test suite for CP850 encoding 2020-11-02 21:31:06 +02:00
Alex Dowad
0485bed4c7 Add test suite for CP866 encoding 2020-11-02 21:31:06 +02:00
Alex Dowad
0b13305ccc Add test suite for CP1254 encoding 2020-11-02 21:31:05 +02:00
Alex Dowad
eb4151e89e Add test suite for CP1251 encoding 2020-11-02 21:31:05 +02:00
Alex Dowad
b18b9c9ef6 Test cases for mbstring encodings are less repetitive 2020-11-02 21:31:05 +02:00
Alex Dowad
831abe2d90 Add test suite for CP1252 encoding
Also remove a bogus test (bug62545.phpt) which wrongly assumed that all invalid
characters in CP1251 and CP1252 should map to Unicode 0xFFFD (REPLACEMENT
CHARACTER).

mbstring has an interface to specify what invalid characters should be
replaced with; it's called `mb_substitute_character`. If a user wants to see
the Unicode 'replacement character', they can specify that using
`mb_substitute_character`. But if they specify something else, we should
follow that.
2020-10-30 22:13:27 +02:00
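
The point made in 831abe2d90 about substitution can be illustrated with a toy converter. This is a Python analogy, not PHP: `decode_with_substitute` is a hypothetical helper that only handles single-byte encodings, it relies on Python's `cp1252` codec leaving byte 0x81 undefined, and the `"?"` default is an assumption chosen for the example.

```python
def decode_with_substitute(data: bytes, encoding: str, substitute: str = "?") -> str:
    """Decode a single-byte encoding, replacing invalid bytes as configured."""
    out = []
    for value in data:
        try:
            out.append(bytes([value]).decode(encoding))
        except UnicodeDecodeError:
            # Use whatever the caller configured, not a hard-coded U+FFFD.
            out.append(substitute)
    return "".join(out)
```

A caller who wants the Unicode replacement character asks for it explicitly, for instance `decode_with_substitute(b"a\x81b", "cp1252", "\ufffd")`; any other choice is honored as given.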
Alex Dowad
84c180d88b Add test suite for ISO-8859-x encoding verification and conversion 2020-10-16 22:25:48 +02:00
Nikita Popov
bd2488bc49 Merge branch 'PHP-8.0'
* PHP-8.0:
  Normalize mb_ereg() return value
2020-10-13 20:41:33 +02:00
Nikita Popov
5582490bf2 Normalize mb_ereg() return value
mb_ereg()/mb_eregi() currently have an inconsistent return value
based on whether the $matches parameter is passed or not:

> Returns the byte length of the matched string if a match for
> pattern was found in string, or FALSE if no matches were found
> or an error occurred.
>
> If the optional parameter regs was not passed or the length of
> the matched string is 0, this function returns 1.

Coupling this behavior to the $matches parameter doesn't make sense
-- we know the match length either way, there is no technical
reason to distinguish them. However, returning the match length
is not particularly useful either, especially due to the need to
convert 0-length into 1-length to satisfy "truthy" checks. We
could always return 1, which would kind of match the behavior of
preg_match() -- however, preg_match() actually returns the number
of matches, which is 0 or 1 for preg_match(), while false signals
an error. However, mb_ereg() returns false both for no match and
for an error. This would result in an odd 1|false return value.

The patch canonicalizes mb_ereg() to always return a boolean,
where true indicates a match and false indicates no match or error.
This also matches the behavior of the mb_ereg_match() and
mb_ereg_search() functions.

This fixes the default value integrity violation in PHP 8.

Closes GH-6331.
2020-10-13 20:40:55 +02:00
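
The return-value change in 5582490bf2 can be modeled as below. This is a hedged Python sketch of the semantics described in the commit message only: it uses Python's `re` module rather than mbstring's regex engine, measures the match in UTF-8 bytes as an assumption, and the function names are hypothetical.

```python
import re


def old_style_result(pattern: str, string: str, want_matches: bool):
    """Previous behavior: byte length, 1, or False depending on the call."""
    match = re.search(pattern, string)
    if match is None:
        return False  # no match (the real function also used this for errors)
    length = len(match.group(0).encode("utf-8"))
    if not want_matches or length == 0:
        return 1  # forced to 1 so that the result stays truthy
    return length


def new_style_result(pattern: str, string: str) -> bool:
    """Normalized behavior: True for a match, False otherwise."""
    return re.search(pattern, string) is not None
```

Under the old rules the same successful match yields 3 or 1 depending only on whether the matches argument was requested; the normalized version gives `True` in both cases.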
Alex Dowad
97beecc251 Add identify filter for UTF-16, UTF-16LE, UTF-16BE
There was one faulty test in the suite which only passed before because UTF-16 had no
identify filter. After this was fixed, it exposed the problem with the test.
2020-10-13 20:26:13 +02:00
Nikita Popov
cafceea742 Update mbstring parameter names
Closes GH-6207.
2020-09-28 09:51:58 +02:00
Larry Garfield
94854e0dff Standardize mbstring and string on using 'string' as a parameter name.
Closes GH-6171.
2020-09-21 12:06:50 +02:00
Máté Kocsis
e950ca13ea Consolidate the usage of "either" and "one of" in error messages
Closes GH-6173
2020-09-20 19:41:47 +02:00
Nikita Popov
c5401854fc Run tidy
This should fix most of the remaining issues with tabs and spaces
being mixed in tests.
2020-09-18 14:28:32 +02:00
Máté Kocsis
c37a1cd650 Promote a few remaining errors in ext/standard
Closes GH-6110
2020-09-15 14:26:16 +02:00
Nikita Popov
f33fd9b7fe Throw ValueError on null bytes in mb_send_mail()
Instead of silently replacing with spaces.
2020-09-11 10:46:59 +02:00
George Peter Banyard
0444158529 Promote some warnings in MBString Regexes
Closes GH-5341
2020-09-09 14:55:07 +02:00
Nikita Popov
623bf96e7e Throw on invalid mb_http_input() type 2020-09-07 09:59:51 +02:00
Nikita Popov
d57f9e5ea4 Handle null encoding in mb_http_input() 2020-09-04 17:15:35 +02:00
Alex Dowad
73dcfb6faa Fix typos in mbstring tests
Man, I can be pedantic sometimes. Tiny little things like misspelled words just
hurt me inside. So while it's not really a big deal, I couldn't leave these typos
alone...
2020-09-02 20:48:22 +02:00
Alex Dowad
dc98c1346d Additional tests for mbstring extension 2020-08-31 23:15:57 +02:00
Máté Kocsis
7aacc705d0 Add many missing closing PHP tags to tests
Closes GH-5958
2020-08-09 22:03:36 +02:00
Nikita Popov
52047addc7 Only force log startup errors if display_startup_errors disabled
Otherwise this results in duplicate errors.

Closes GH-5941.
2020-08-05 18:17:00 +02:00
Nikita Popov
d65d3f5298 Fix bug #79108
Don't expose references in debug_backtrace() or exception traces.
This is regardless of whether the argument is by-reference or not.

As a side-effect of this change, exception traces may now acquire
the interior value of a reference, which may be unexpected for
some internal functions. This is what necessitated the change in
the spl_array sort implementation.
2020-07-24 12:23:34 +02:00
Máté Kocsis
d30cd7d7e7 Review the usage of apostrophes in error messages
Closes GH-5590
2020-07-10 21:05:28 +02:00
Nikita Popov
0e71446e7a Merge branch 'PHP-7.4'
* PHP-7.4:
  Fix bug #79787
2020-07-08 11:22:47 +02:00
Nikita Popov
77a8a709da Merge branch 'PHP-7.3' into PHP-7.4
* PHP-7.3:
  Fix bug #79787
2020-07-08 11:22:18 +02:00
XXiang
3d5de7d746 Fix bug #79787
Closes GH-5807.
2020-07-08 11:20:58 +02:00
Fabien Villepinte
0c6d06ecfa Replace EXPECTF when possible
Closes GH-5779
2020-06-29 21:31:44 +02:00
Máté Kocsis
b5c7a83dca Remove unnecessary PHPDoc-alike blocks from tests
Closes GH-5759
2020-06-24 13:13:44 +02:00
Nikita Popov
5aa649cf51 Merge branch 'PHP-7.4' 2020-06-17 09:35:19 +02:00
Nikita Popov
3d6199db8a Add mbregex skipif 2020-06-17 09:35:02 +02:00
Nikita Popov
bbe74a6e3a Merge branch 'PHP-7.4' 2020-06-16 14:32:33 +02:00
Nikita Popov
3f2f36d5d4 Fix non-default syntax in mb_ereg_search() 2020-06-16 14:31:29 +02:00
Máté Kocsis
fbe30592d6 Improve type error messages when an object is given
From now on, we always display the given object's type instead of just reporting "object".
Additionally, make the format of return type errors match the format of argument errors.

Closes GH-5625
2020-05-26 19:06:19 +02:00
George Peter Banyard
7dd332f110 Refactor mb_substitute_character()
Using the new Fast ZPP API for string|int|null

This also fixes Bug #79448 which was too disruptive to fix in PHP 7.x
2020-05-11 17:30:01 +02:00
Nikita Popov
d38f819647 Fix test file encoding
The mb_http_input_pass.phpt was intended to use the same encoding
as mb_http_input.phpt, not UTF-8.
2020-05-07 21:18:49 +02:00
Nikita Popov
481b7421f3 Throw warning if invalid internal_encoding ini is specified 2020-05-07 14:44:13 +02:00