mirror of https://github.com/php/php-src.git synced 2026-03-29 03:32:20 +02:00
1936 Commits

Author SHA1 Message Date
Alex Dowad
84c180d88b Add test suite for ISO-8859-x encoding verification and conversion 2020-10-16 22:25:48 +02:00
Alex Dowad
bc18e32690 Do not pass invalid ISO-8859-{3,6,7,8} characters through silently
mbstring has a bad habit of passing invalid characters through silently
when converting to the same (or a "compatible") encoding.

For example, if you give it an invalid JIS X 0208 kuten code encoded with SJIS,
and try to convert that to EUC-JP, mbstring will just quietly re-encode the
invalid code in the EUC-JP representation.

At the same time, some parts of the code (like `mb_check_encoding`) assume that
invalid characters will be treated as... well, invalid. Let's unbreak things
by actually catching errors and reporting them, instead of swallowing them.
2020-10-16 22:17:45 +02:00
Alex Dowad
5c6b2a7ad2 Add identify filter for ISO-8859-8 (Latin/Hebrew) 2020-10-16 22:17:45 +02:00
Alex Dowad
ea687018cd Add identify filter for ISO-8859-7 (Latin/Greek) 2020-10-16 22:17:45 +02:00
Alex Dowad
a6603b60f7 Add identify filter for ISO-8859-6 (Latin/Arabic)
Note that some text encoding conversion libraries, such as Solaris iconv
and FreeBSD iconv, map 0x30-0x39 to the Arabic script numerals rather than
the 'regular' Roman numerals. (That is, to Unicode codepoints 0x660-0x669.)

Further, Windows CP28596 adds more mappings to use the unused bytes in
ISO-8859-6.
2020-10-16 22:17:45 +02:00
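The two digit mappings mentioned in the ISO-8859-6 commit above can be illustrated with a small sketch. The function name, the `NOT_A_DIGIT` sentinel, and the boolean switch are hypothetical and exist only for this illustration; they are not part of mbstring or of any iconv implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sentinel for bytes outside 0x30-0x39. */
#define NOT_A_DIGIT UINT32_MAX

/*
 * Decode one ISO-8859-6 digit byte to a Unicode codepoint.
 * arabic_indic == false: map to the ASCII digits U+0030..U+0039.
 * arabic_indic == true:  map to U+0660..U+0669, the behavior the
 *                        commit message attributes to Solaris and
 *                        FreeBSD iconv.
 */
static uint32_t decode_iso8859_6_digit(unsigned char byte, bool arabic_indic)
{
    if (byte < 0x30 || byte > 0x39) {
        return NOT_A_DIGIT;
    }
    return arabic_indic ? 0x0660u + (byte - 0x30u) : byte;
}
```

With this sketch, byte 0x35 decodes to U+0035 in the first mode and to U+0665 in the second, which is why round-tripping between libraries that disagree on this point can change the digits in a string.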
Alex Dowad
23270d7f9e Add identify filter for ISO-8859-3 (Latin-3)
There are some bytes in this encoding which are not mapped to any character.
Notably, Microsoft added their own mappings for these 'unused' bits in their
version of Latin-3, called CP28593.
2020-10-16 22:17:45 +02:00
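A rough sketch of what a validity check for the unmapped bytes in the ISO-8859-3 commit above might look like. This is not the identify filter from the commit: the function names are hypothetical, and the seven byte values are the positions left unassigned in the ISO-8859-3 code chart as far as I know, so they should be checked against the official mapping table before being relied on.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if the byte has an assigned character in ISO-8859-3
 * (assumed list of unassigned positions; see the note above). */
static bool iso8859_3_byte_is_mapped(unsigned char byte)
{
    switch (byte) {
    case 0xA5: case 0xAE: case 0xBE: case 0xC3:
    case 0xD0: case 0xE3: case 0xF0:
        return false;
    default:
        return true;
    }
}

/* True if every byte of the buffer is a mapped ISO-8859-3 byte. */
static bool iso8859_3_is_valid(const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (!iso8859_3_byte_is_mapped(s[i])) {
            return false;
        }
    }
    return true;
}
```

A decoder for CP28593 would accept more input than this check does, since the commit message notes that Microsoft assigned mappings to the positions that ISO-8859-3 leaves empty.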
Alex Dowad
7b9bed0150 Add identify filter for ISO-8859-16 (Latin-10) encoding
Interestingly, it looks like the original author intended to add an identify filter
for this encoding, but never did so. The needed struct is there, but was never added
to the list of identify filters in mbfl_ident.c.
2020-10-16 20:56:45 +02:00
Alex Dowad
7dc16374b4 Remove unused IS_SJIS1 and IS_SJIS2 macros 2020-10-14 08:31:51 +02:00
Nikita Popov
bd2488bc49 Merge branch 'PHP-8.0'
* PHP-8.0:
  Normalize mb_ereg() return value
2020-10-13 20:41:33 +02:00
Nikita Popov
5582490bf2 Normalize mb_ereg() return value
mb_ereg()/mb_eregi() currently have an inconsistent return value
based on whether the $matches parameter is passed or not:

> Returns the byte length of the matched string if a match for
> pattern was found in string, or FALSE if no matches were found
> or an error occurred.
>
> If the optional parameter regs was not passed or the length of
> the matched string is 0, this function returns 1.

Coupling this behavior to the $matches parameter doesn't make sense
-- we know the match length either way, there is no technical
reason to distinguish them. However, returning the match length
is not particularly useful either, especially due to the need to
convert 0-length into 1-length to satisfy "truthy" checks. We
could always return 1, which would kind of match the behavior of
preg_match() -- however, preg_match() actually returns the number
of matches, which is 0 or 1 for preg_match(), while false signals
an error. However, mb_ereg() returns false both for no match and
for an error. This would result in an odd 1|false return value.

The patch canonicalizes mb_ereg() to always return a boolean,
where true indicates a match and false indicates no match or error.
This also matches the behavior of the mb_ereg_match() and
mb_ereg_search() functions.

This fixes the default value integrity violation in PHP 8.

Closes GH-6331.
2020-10-13 20:40:55 +02:00
Alex Dowad
7bb5b435af mUTF-7 (UTF7-IMAP) conversion: handle illegal (non-RFC-compliant) input correctly
Instead of looking the other way and letting things slide, report errors when
the input does not follow the RFC.
2020-10-13 20:26:14 +02:00
Alex Dowad
b43a7deacf Add 'mUTF-7' alias for UTF7-IMAP encoding 2020-10-13 20:26:14 +02:00
Alex Dowad
b975817265 Add comment explaining mUTF-7 to mbfilter_utf7imap.c 2020-10-13 20:26:14 +02:00
Alex Dowad
648c1cb51e Add identify filter for UCS-2, UCS-2BE, and UCS-2LE encodings 2020-10-13 20:26:14 +02:00
Alex Dowad
374f31e364 Add mbstring identify filter for 'binary' encoding 2020-10-13 20:26:13 +02:00
Alex Dowad
97beecc251 Add identify filter for UTF-16, UTF-16LE, UTF-16BE
There was one faulty test in the suite which only passed before because UTF-16 had no
identify filter. After this was fixed, it exposed the problem with the test.
2020-10-13 20:26:13 +02:00
Nikita Popov
4371a4b241 Merge branch 'PHP-8.0'
* PHP-8.0:
  Fix incorrect zpp parameter count in mb_substr() / mb_strcut()
2020-10-13 17:47:11 +02:00
Nikita Popov
9b4094c3d7 Fix incorrect zpp parameter count in mb_substr() / mb_strcut()
These functions only accept 4 params.
2020-10-13 17:46:56 +02:00
Nikita Popov
40e920ebd9 Merge branch 'PHP-8.0'
* PHP-8.0:
  Fix argument nullability in mbstring
2020-10-13 16:03:29 +02:00
Nikita Popov
124bce3c7a Fix argument nullability in mbstring
These arguments were declared nullable in stubs (and should be
nullable), but didn't accept null in zpp.
2020-10-13 16:03:04 +02:00
Alex Dowad
a98838e3b6 Handle illegal bytes properly when converting to '7bit' encoding
Previously, mbstring would silently drop illegal bytes when converting a
string to '7bit' encoding.
2020-10-13 06:12:38 +02:00
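For the '7bit' commit above, the underlying rule is simply that any byte above 0x7F is illegal. A minimal sketch of reporting such a byte instead of dropping it could look like the following; the function name and the choice to return an index are assumptions made for illustration, not how mbstring reports the error.

```c
#include <stddef.h>

/*
 * Return the index of the first byte that does not fit in 7 bits,
 * or len if the whole buffer is 7-bit clean.
 */
static size_t first_non_7bit(const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (s[i] > 0x7F) {
            return i;
        }
    }
    return len;
}
```

A caller can compare the result with `len` to decide whether to treat the input as valid, which keeps the offending position available for an error message rather than silently losing the byte.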
Alex Dowad
4aa7430f68 Add mbstring identify filter for '7bit' encoding 2020-10-13 06:12:38 +02:00
Alex Dowad
0ffc1f55b3 Refactor mbfl_ident.c, mbfl_encoding.c, mbfl_memory_device.c, mbfl_string.c
- Make everything less gratuitously verbose
- Don't litter the code with lots of unneeded NULL checks (for things which
  will never be NULL)
- Don't return success/failure code from functions which can never fail
- For encoding structs, don't use pointers to pointers to pointers for the
  list of alias strings. Pointers to pointers (2 levels of indirection)
  is what actually makes sense. This gets rid of some extraneous
  dereference operations.
2020-10-13 06:12:38 +02:00
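The alias-list point in the refactoring commit above can be sketched as follows. The struct here is a hypothetical, stripped-down stand-in (the real encoding struct has more fields), and the lookup uses a case-sensitive `strcmp` purely to keep the example short.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, reduced encoding descriptor for illustration only. */
typedef struct {
    const char *name;
    const char **aliases; /* NULL-terminated array of strings, or NULL */
} example_encoding;

/* Two levels of indirection: an array of pointers to strings. */
static const char *utf7imap_aliases[] = { "mUTF-7", NULL };

static const example_encoding example_utf7imap = {
    "UTF7-IMAP",
    utf7imap_aliases
};

/* Return 1 if name equals the encoding name or one of its aliases. */
static int example_encoding_matches(const example_encoding *enc,
                                    const char *name)
{
    if (strcmp(enc->name, name) == 0) {
        return 1;
    }
    if (enc->aliases != NULL) {
        for (const char **alias = enc->aliases; *alias != NULL; alias++) {
            if (strcmp(*alias, name) == 0) {
                return 1;
            }
        }
    }
    return 0;
}
```

Storing the array directly as `const char **` means the loop dereferences once per alias; a third level of indirection would add a dereference on every access without giving the struct anything in return.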
Alex Dowad
e8b8ecbd4e Remove useless constants MBFL_CHP_{CTL,DIGIT,UALPHA,LALPHA,MSPECIAL} 2020-10-13 06:12:37 +02:00
Alex Dowad
aabbee2318 Remove useless validity check when converting UTF-16LE -> wchar
The check ensures that the decoded codepoint is between 0x10000-0x10FFFF,
which is the valid range which can be encoded in a UTF-16 surrogate pair.
However, just looking at the code, it's obvious that this will be true.
First of all, 0x10000 is added to the decoded codepoint on the previous
line, so how could it be less than 0x10000?

Further, even if the 20 data bits already decoded were 0xFFFFF (all ones),
when you add 0x10000, it comes to 0x10FFFF, which is the very top of the
valid range. So how could the decoded codepoint be more than 0x10FFFF?
It can't.
2020-10-13 06:12:37 +02:00
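The range argument in the UTF-16LE commit above can be checked with a short sketch of surrogate pair decoding. The function name is hypothetical and this is not the mbstring code; it only reproduces the arithmetic described in the message.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Combine a high and a low surrogate into one codepoint.
 * Each surrogate contributes 10 data bits, 20 bits in total.
 */
static uint32_t decode_surrogate_pair(uint16_t high, uint16_t low)
{
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);

    /* 20 data bits: always in the range 0x00000..0xFFFFF. */
    uint32_t bits = ((uint32_t)(high & 0x3FF) << 10) | (uint32_t)(low & 0x3FF);

    /* Adding 0x10000 gives a result in 0x10000..0x10FFFF by construction. */
    return bits + 0x10000;
}
```

The smallest pair (0xD800, 0xDC00) yields 0x10000 and the largest pair (0xDBFF, 0xDFFF) yields 0x10FFFF, so a separate range check on the result can never fail once both inputs are known to be surrogates.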
Alex Dowad
f474e5502c Refactor UTF-16LE -> wchar conversion code 2020-10-13 06:12:37 +02:00
Alex Dowad
3f1851dec2 Avoid compiler warnings related to mbstring flush functions 2020-10-13 06:12:37 +02:00
George Peter Banyard
fd1672a7f3 Fix [-Wduplicated-cond] in MBString extension 2020-10-09 20:54:23 +01:00
Nikita Popov
cafceea742 Update mbstring parameter names
Closes GH-6207.
2020-09-28 09:51:58 +02:00
Larry Garfield
94854e0dff Standardize mbstring and string on using 'string' as a parameter name.
Closes GH-6171.
2020-09-21 12:06:50 +02:00
Máté Kocsis
e950ca13ea Consolidate the usage of "either" and "one of" in error messages
Closes GH-6173
2020-09-20 19:41:47 +02:00
Nikita Popov
c5401854fc Run tidy
This should fix most of the remaining issues with tabs and spaces
being mixed in tests.
2020-09-18 14:28:32 +02:00
Remi Collet
b1c5532ad1 fix mbfl function prototypes
re-add mbfl_convert_filter_feed API
re-add pointer cast
2020-09-15 15:15:06 +02:00
Máté Kocsis
c37a1cd650 Promote a few remaining errors in ext/standard
Closes GH-6110
2020-09-15 14:26:16 +02:00
Máté Kocsis
1c81a34563 Make mb_send_mail() consistent with mail()
The $additional_headers parameter shouldn't accept null.
2020-09-14 11:52:33 +02:00
Máté Kocsis
c98d47696f Consolidate new union type ZPP macro names
They will now follow the canonical order of types. Older macros are
left intact to maintain BC.

Closes GH-6112
2020-09-11 11:00:18 +02:00
Nikita Popov
f33fd9b7fe Throw ValueError on null bytes in mb_send_mail()
Instead of silently replacing with spaces.
2020-09-11 10:46:59 +02:00
George Peter Banyard
0444158529 Promote some warnings in MBString Regexes
Closes GH-5341
2020-09-09 14:55:07 +02:00
Alex Dowad
5b78d76ec8 mb_str_split is already documented on php.net
So remove TODO comment which implies that it's not.
2020-09-08 20:09:45 +02:00
Nikita Popov
2386f655d8 Always use PCRE for mbstring.http_output_conv_mimetypes
Instead of using either oniguruma or pcre depending on which is
available. We always have PCRE, so use it. This ensures consistent
behavior.
2020-09-08 15:02:15 +02:00
Nikita Popov
623bf96e7e Throw on invalid mb_http_input() type 2020-09-07 09:59:51 +02:00
Nikita Popov
d57f9e5ea4 Handle null encoding in mb_http_input() 2020-09-04 17:15:35 +02:00
Alex Dowad
a81061d36c Use symbolic constants in Japanese kana conversion code (not magic numbers)
Also correct misspelling of 'hiragana' as 'hirangana' at the same time.
2020-09-03 15:56:29 +02:00
Alex Dowad
ec609916dc Remove unused 'from' field from mbfl_buffer_converter struct 2020-09-03 15:56:29 +02:00
Alex Dowad
f699d65391 Add comment to mbfilter_tl_jisx0201_jisx0208.h
Explain the 'ZEN' and 'HAN' in symbolic constant names.
2020-09-03 15:56:29 +02:00
Alex Dowad
a2b40ee9a5 Remove unneeded function mbfl_filt_ident_common_dtor
This was the default destructor for mbfl_identify_filter structs, but there's nothing
we actually need to do to those structs before freeing them.
2020-09-03 15:56:29 +02:00
Alex Dowad
dcd6c6043e Remove unneeded function mbfl_filt_conv_common_dtor
This is a default destructor for mbfl_convert_filter structs. The thing is: there
isn't really anything that needs to be done to those structs before freeing them.
The default destructor just zeroed out some fields, but there's no reason why
we should actually do that.
2020-09-03 15:56:29 +02:00
Alex Dowad
409aa20ab0 Refactor mbfl_convert.c 2020-09-03 15:56:29 +02:00
Alex Dowad
73dcfb6faa Fix typos in mbstring tests
Man, I can be pedantic sometimes. Tiny little things like misspelled words just
hurt me inside. So while it's not really a big deal, I couldn't leave these typos
alone...
2020-09-02 20:48:22 +02:00
Máté Kocsis
3e800e997b Move custom type checks to ZPP
Closes GH-6034
2020-09-02 11:11:38 +02:00