mbstring has a bad habit of passing invalid characters through silently
when converting to the same (or a "compatible") encoding.
For example, if you give it an invalid JIS X 0208 kuten code encoded with SJIS,
and try to convert that to EUC-JP, mbstring will just quietly re-encode the
invalid code in the EUC-JP representation.
At the same, some parts of the code (like `mb_check_encoding`) assume that
invalid characters will be treated as... well, invalid. Let's unbreak things
by actually catching errors and reporting them, instead of swallowing them.
Note that some text encoding conversion libraries, such as Solaris iconv
and FreeBSD iconv, map 0x30-0x39 to the Arabic script numerals rather than
the 'regular' Roman numerals. (That is, to Unicode codepoints 0x660-0x669.)
Further, Windows CP28596 adds more mappings to use the unused bytes in
ISO-8859-6.
There are some bytes in this encoding which are not mapped to any character.
Notably, MicroSoft added their own mappings for these 'unused' bits in their
version of Latin-3, called CP28593.
Interestingly, it looks like the original author intended to add an identify filter
for this encoding, but never did so. The needed struct is there, but was never added
to the list of identify filters in mbfl_ident.c.
mb_ereg()/mb_eregi() currently have an inconsistent return value
based on whether the $matches parameter is passed or not:
> Returns the byte length of the matched string if a match for
> pattern was found in string, or FALSE if no matches were found
> or an error occurred.
>
> If the optional parameter regs was not passed or the length of
> the matched string is 0, this function returns 1.
Coupling this behavior to the $matches parameter doesn't make sense
-- we know the match length either way, there is no technical
reason to distinguish them. However, returning the match length
is not particularly useful either, especially due to the need to
convert 0-length into 1-length to satisfy "truthy" checks. We
could always return 1, which would kind of match the behavior of
preg_match() -- however, preg_match() actually returns the number
of matches, which is 0 or 1 for preg_match(), while false signals
an error. However, mb_ereg() returns false both for no match and
for an error. This would result in an odd 1|false return value.
The patch canonicalizes mb_ereg() to always return a boolean,
where true indicates a match and false indicates no match or error.
This also matches the behavior of the mb_ereg_match() and
mb_ereg_search() functions.
This fixes the default value integrity violation in PHP 8.
Closes GH-6331.
There was one faulty test in the suite which only passed before because UTF-16 had no
identify filter. After this was fixed, it exposed the problem with the test.
- Make everything less gratuitously verbose
- Don't litter the code with lots of unneeded NULL checks (for things which
will never be NULL)
- Don't return success/failure code from functions which can never fail
- For encoding structs, don't use pointers to pointers to pointers for the
list of alias strings. Pointers to pointers (2 levels of indirection)
is what actually makes sense. This gets rid of some extraneous
dereference operations.
The check ensures that the decoded codepoint is between 0x10000-0x10FFFF,
which is the valid range which can be encoded in a UTF-16 surrogate pair.
However, just looking at the code, it's obvious that this will be true.
First of all, 0x10000 is added to the decoded codepoint on the previous
line, so how could it be less than 0x10000?
Further, even if the 20 data bits already decoded were 0xFFFFF (all ones),
when you add 0x10000, it comes to 0x10FFFF, which is the very top of the
valid range. So how could the decoded codepoint be more than 0x10FFFF?
It can't.
This is a default destructor for mbfl_convert_filter structs. The thing is: there
isn't really anything that needs to be done to those structs before freeing them.
The default destructor just zeroed out some fields, but there's no reason why
we should actually do that.
Man, I can be pedantic sometimes. Tiny little things like misspelled words just
hurt me inside. So while it's not really a big deal, I couldn't leave these typos
alone...