mirror of https://github.com/php/php-src.git synced 2026-04-17 13:01:02 +02:00
318 Commits

Author SHA1 Message Date
Alex Dowad
e26234a044 UTF-32 conversion treats truncated characters as illegal 2020-10-27 10:19:01 +02:00
Alex Dowad
7047e5d2c4 Add identify filter for UTF-32{,BE,LE} 2020-10-27 10:19:01 +02:00
Alex Dowad
d8895cd054 Improve error handling for UTF-16{,BE,LE}
Catch various errors such as the first part of a surrogate pair not being
followed by a proper second part, the first part of a surrogate pair appearing
at the end of a string, the second part of a surrogate pair appearing out
of place, and so on.
2020-10-27 10:19:01 +02:00
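
The error cases listed in this commit message can be illustrated with a minimal sketch. It is not the libmbfl implementation: `utf16_count_errors` is a hypothetical helper that assumes the input has already been assembled into 16-bit code units, and it only counts malformed sequences rather than emitting replacement characters.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not libmbfl code): counts malformed surrogate
 * sequences in a buffer of UTF-16 code units. */
static size_t utf16_count_errors(const uint16_t *units, size_t len)
{
    size_t errors = 0;
    for (size_t i = 0; i < len; i++) {
        uint16_t u = units[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            /* First part of a surrogate pair: a second part must follow */
            if (i + 1 < len && units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
                i++; /* well-formed pair, consume both units */
            } else {
                errors++; /* end of string, or followed by something else */
            }
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            errors++; /* second part of a pair appearing out of place */
        }
    }
    return errors;
}
```

A lone first half at the end of the buffer, a first half followed by an ordinary character, and a stray second half each count as one error in this sketch.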
Alex Dowad
d9ddeb6e85 UTF-16 text conversion handles truncated characters as illegal
This broke one old test (Zend/tests/multibyte_encoding_003.phpt), which used
a PHP script encoded as UTF-16. The problem was that to terminate the test
script, we need the text: "\n--EXPECT--". Out of that text, the terminating
newline (0x0A byte) becomes part of the resulting test script; but a bare
0x0A byte with no 0x00 is not valid UTF-16.

Since we now treat truncated UTF-16 characters as erroneous, an extra '?' is
appended to the output as an 'illegal character' marker.

Really, if we are running PHP scripts which are treated as encoded in UTF-16
or some other arbitrary text encoding (not ASCII), and the script is not
actually a valid string in that encoding, inserting '?' characters into the
code which the PHP interpreter runs is a bad thing to do. In such cases, the
script shouldn't be treated as UTF-16 (or whatever) at all.

I wonder if mbstring's encoding detection is being used in 'non-strict' mode?
2020-10-27 10:19:00 +02:00
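
To make the truncation case concrete, here is a rough sketch of how a dangling byte turns into a '?' marker. It assumes big-endian input and ignores surrogate pairs; `decode_utf16be_bmp` is a hypothetical name and not a function in mbstring.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: decodes UTF-16BE (BMP only, surrogates not handled)
 * into out[] and returns the number of entries written. The caller must
 * provide room for (len + 1) / 2 entries. */
static size_t decode_utf16be_bmp(const unsigned char *in, size_t len, uint32_t *out)
{
    size_t n = 0, i = 0;
    /* Consume complete two-byte units */
    for (; i + 1 < len; i += 2) {
        out[n++] = ((uint32_t)in[i] << 8) | in[i + 1];
    }
    /* One byte left over: a truncated character, reported as '?' */
    if (i < len) {
        out[n++] = '?';
    }
    return n;
}
```

Feeding it the bytes 0x00 0x41 0x0A yields 'A' followed by '?', which mirrors the bare 0x0A byte described in the message above.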
Alex Dowad
bc18e32690 Do not pass invalid ISO-8859-{3,6,7,8} characters through silently
mbstring has a bad habit of passing invalid characters through silently
when converting to the same (or a "compatible") encoding.

For example, if you give it an invalid JIS X 0208 kuten code encoded with SJIS,
and try to convert that to EUC-JP, mbstring will just quietly re-encode the
invalid code in the EUC-JP representation.

At the same time, some parts of the code (like `mb_check_encoding`) assume that
invalid characters will be treated as... well, invalid. Let's unbreak things
by actually catching errors and reporting them, instead of swallowing them.
2020-10-16 22:17:45 +02:00
Alex Dowad
5c6b2a7ad2 Add identify filter for ISO-8859-8 (Latin/Hebrew) 2020-10-16 22:17:45 +02:00
Alex Dowad
ea687018cd Add identify filter for ISO-8859-7 (Latin/Greek) 2020-10-16 22:17:45 +02:00
Alex Dowad
a6603b60f7 Add identify filter for ISO-8859-6 (Latin/Arabic)
Note that some text encoding conversion libraries, such as Solaris iconv
and FreeBSD iconv, map 0x30-0x39 to the Arabic script numerals rather than
the 'regular' Roman numerals. (That is, to Unicode codepoints 0x660-0x669.)

Further, Windows CP28596 adds more mappings to use the unused bytes in
ISO-8859-6.
2020-10-16 22:17:45 +02:00
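
The two digit mappings mentioned in this message differ only in the 0x30-0x39 range. A simplified sketch, with hypothetical function names and covering only that range (all other bytes are passed through unchanged here, which is not a full ISO-8859-6 table):

```c
#include <stdint.h>

/* Mapping where 0x30-0x39 stay the 'regular' digits U+0030-U+0039 */
static uint32_t digit_as_ascii(unsigned char b)
{
    return b;
}

/* Mapping described for Solaris and FreeBSD iconv: 0x30-0x39 become the
 * Arabic script numerals U+0660-U+0669. Other bytes are left untouched in
 * this sketch. */
static uint32_t digit_as_arabic_indic(unsigned char b)
{
    if (b >= 0x30 && b <= 0x39) {
        return 0x0660 + (uint32_t)(b - 0x30);
    }
    return b;
}
```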
Alex Dowad
23270d7f9e Add identify filter for ISO-8859-3 (Latin-3)
There are some bytes in this encoding which are not mapped to any character.
Notably, Microsoft added their own mappings for these 'unused' bits in their
version of Latin-3, called CP28593.
2020-10-16 22:17:45 +02:00
Alex Dowad
7b9bed0150 Add identify filter for ISO-8859-16 (Latin-10) encoding
Interestingly, it looks like the original author intended to add an identify filter
for this encoding, but never did so. The needed struct is there, but was never added
to the list of identify filters in mbfl_ident.c.
2020-10-16 20:56:45 +02:00
Alex Dowad
7bb5b435af mUTF-7 (UTF7-IMAP) conversion: handle illegal (non-RFC-compliant) input correctly
Instead of looking the other way and letting things slide, report errors when
the input does not follow the RFC.
2020-10-13 20:26:14 +02:00
Alex Dowad
b43a7deacf Add 'mUTF-7' alias for UTF7-IMAP encoding 2020-10-13 20:26:14 +02:00
Alex Dowad
b975817265 Add comment explaining mUTF-7 to mbfilter_utf7imap.c 2020-10-13 20:26:14 +02:00
Alex Dowad
648c1cb51e Add identify filter for UCS-2, UCS-2BE, and UCS-2LE encodings 2020-10-13 20:26:14 +02:00
Alex Dowad
374f31e364 Add mbstring identify filter for 'binary' encoding 2020-10-13 20:26:13 +02:00
Alex Dowad
97beecc251 Add identify filter for UTF-16, UTF-16LE, UTF-16BE
There was one faulty test in the suite which only passed before because UTF-16 had no
identify filter. After this was fixed, it exposed the problem with the test.
2020-10-13 20:26:13 +02:00
Alex Dowad
a98838e3b6 Handle illegal bytes properly when converting to '7bit' encoding
Previously, mbstring would silently drop illegal bytes when converting a
string to '7bit' encoding.
2020-10-13 06:12:38 +02:00
Alex Dowad
4aa7430f68 Add mbstring identify filter for '7bit' encoding 2020-10-13 06:12:38 +02:00
Alex Dowad
0ffc1f55b3 Refactor mbfl_ident.c, mbfl_encoding.c, mbfl_memory_device.c, mbfl_string.c
- Make everything less gratuitously verbose
- Don't litter the code with lots of unneeded NULL checks (for things which
  will never be NULL)
- Don't return success/failure code from functions which can never fail
- For encoding structs, don't use pointers to pointers to pointers for the
  list of alias strings. Pointers to pointers (2 levels of indirection)
  is what actually makes sense. This gets rid of some extraneous
  dereference operations.
2020-10-13 06:12:38 +02:00
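
As an illustration of the last bullet, a list of alias strings needs only two levels of indirection. The struct below is a hypothetical stand-in, not the real encoding struct from libmbfl, and the lookup is case-sensitive for brevity.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical encoding descriptor, for illustration only */
typedef struct {
    const char *name;
    const char **aliases; /* NULL-terminated array of strings, or NULL */
} example_encoding;

static const char *example_utf7imap_aliases[] = { "mUTF-7", NULL };

static const example_encoding example_utf7imap = {
    "UTF7-IMAP",
    example_utf7imap_aliases
};

/* Walks the alias list with a single dereference per entry */
static int example_has_alias(const example_encoding *enc, const char *alias)
{
    if (enc->aliases == NULL) {
        return 0;
    }
    for (const char **p = enc->aliases; *p != NULL; p++) {
        if (strcmp(*p, alias) == 0) {
            return 1;
        }
    }
    return 0;
}
```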
Alex Dowad
e8b8ecbd4e Remove useless constants MBFL_CHP_{CTL,DIGIT,UALPHA,LALPHA,MSPECIAL} 2020-10-13 06:12:37 +02:00
Alex Dowad
aabbee2318 Remove useless validity check when converting UTF-16LE -> wchar
The check ensures that the decoded codepoint is between 0x10000-0x10FFFF,
which is the valid range which can be encoded in a UTF-16 surrogate pair.
However, just looking at the code, it's obvious that this will be true.
First of all, 0x10000 is added to the decoded codepoint on the previous
line, so how could it be less than 0x10000?

Further, even if the 20 data bits already decoded were 0xFFFFF (all ones),
when you add 0x10000, it comes to 0x10FFFF, which is the very top of the
valid range. So how could the decoded codepoint be more than 0x10FFFF?
It can't.
2020-10-13 06:12:37 +02:00
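
The range argument in this message can be checked with a few lines of arithmetic. This is a sketch of the standard surrogate-pair formula, not the code from the commit; `combine_surrogates` is a hypothetical name and assumes both inputs are already known to be valid surrogates.

```c
#include <stdint.h>

/* Combines a high and a low surrogate into a codepoint. Each surrogate
 * contributes 10 data bits, so the intermediate value is at most 0xFFFFF. */
static uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
{
    uint32_t bits = ((uint32_t)(hi & 0x3FF) << 10) | (uint32_t)(lo & 0x3FF);
    return bits + 0x10000; /* result is always in 0x10000..0x10FFFF */
}
```

With all 20 data bits zero the result is 0x10000, and with all 20 bits set it is 0xFFFFF + 0x10000 = 0x10FFFF, so no further range check can fail.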
Alex Dowad
f474e5502c Refactor UTF-16LE -> wchar conversion code 2020-10-13 06:12:37 +02:00
Alex Dowad
3f1851dec2 Avoid compiler warnings related to mbstring flush functions 2020-10-13 06:12:37 +02:00
George Peter Banyard
fd1672a7f3 Fix [-Wduplicated-cond] in MBString extension 2020-10-09 20:54:23 +01:00
Remi Collet
b1c5532ad1 fix mbfl function prototypes
re-add mbfl_convert_filter_feed API
re-add pointer cast
2020-09-15 15:15:06 +02:00
Alex Dowad
a81061d36c Use symbolic constants in Japanese kana conversion code (not magic numbers)
Also correct misspelling of 'hiragana' as 'hirangana' at the same time.
2020-09-03 15:56:29 +02:00
Alex Dowad
ec609916dc Remove unused 'from' field from mbfl_buffer_converter struct 2020-09-03 15:56:29 +02:00
Alex Dowad
f699d65391 Add comment to mbfilter_tl_jisx0201_jisx0208.h
Explain the 'ZEN' and 'HAN' in symbolic constant names.
2020-09-03 15:56:29 +02:00
Alex Dowad
a2b40ee9a5 Remove unneeded function mbfl_filt_ident_common_dtor
This was the default destructor for mbfl_identify_filter structs, but there's nothing
we actually need to do to those structs before freeing them.
2020-09-03 15:56:29 +02:00
Alex Dowad
dcd6c6043e Remove unneeded function mbfl_filt_conv_common_dtor
This is a default destructor for mbfl_convert_filter structs. The thing is: there
isn't really anything that needs to be done to those structs before freeing them.
The default destructor just zeroed out some fields, but there's no reason why
we should actually do that.
2020-09-03 15:56:29 +02:00
Alex Dowad
409aa20ab0 Refactor mbfl_convert.c 2020-09-03 15:56:29 +02:00
Alex Dowad
cdc664049c Comment constants in mbfl_consts.h, remove unused ones
These were unused, and almost certainly will never be used:

- MBFL_ENCTYPE_MWC4BE
- MBFL_ENCTYPE_MWC4LE
- MBFL_ENCTYPE_SHFTCODE
- MBFL_ENCTYPE_ENC_STRM

For the latter two, there were some encodings which were marked with these flags;
but nothing ever _checked_ these particular flags.
2020-08-31 23:18:56 +02:00
Alex Dowad
3a100cd7ac Add comment on mbstring East Asian Width table 2020-08-31 23:18:45 +02:00
Alex Dowad
62317d592f Remove redundant includes from mbstring (and make sure correct config.h is used)
Very interesting... it turns out that when Valgrind support was enabled,
`#include "config.h"` from within mbstring was actually including the file "config.h"
from Valgrind, and not the one from mbstring!!

This is because -I/usr/include/valgrind was added to the compiler invocation _before_
-Iext/mbstring/libmbfl.

Make sure we actually include the file which was intended.
2020-08-31 23:17:58 +02:00
Alex Dowad
b7808d02e8 Remove useless definition of NULL in mbfl_string.h
If NULL is not defined by the platform, mbfl_defs.h already defines it.
2020-08-31 23:17:49 +02:00
Alex Dowad
a64241b540 Remove unused functions from mbstring
- mbfl_buffer_converter_reset
- mbfl_buffer_converter_strncat
- mbfl_buffer_converter_getbuffer
- mbfl_oddlen
- mbfl_filter_output_pipe_flush
- mbfl_memory_device_output2
- mbfl_memory_device_output4
- mbfl_is_support_encoding
- mbfl_buffer_converter_feed2
- _php_mb_regex_globals_dtor
- mime_header_encoder_feed
- mime_header_decoder_feed
- mbfl_convert_filter_feed
2020-08-31 23:16:57 +02:00
Alex Dowad
d4ef7ef11d Inline unneeded indirection for mbstring memory management
All memory allocation and deallocation for mbstring bounces through a table of
function pointers before going to emalloc/efree/etc. But this is unnecessary.
The allocators are never swapped out. Better to just call them directly.
2020-08-31 23:16:09 +02:00
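
The difference between the two styles can be sketched as follows. The table and function names are hypothetical, and plain `malloc`/`free` stand in for `emalloc`/`efree`, which are only available inside PHP.

```c
#include <stdlib.h>

/* Indirect style: every call bounces through a table of function pointers */
typedef struct {
    void *(*alloc)(size_t size);
    void (*release)(void *ptr);
} example_allocators;

static example_allocators example_table = { malloc, free };

static void *alloc_indirect(size_t size)
{
    return example_table.alloc(size);
}

static void release_indirect(void *ptr)
{
    example_table.release(ptr);
}

/* Direct style: since the table is never swapped out, call the allocator
 * itself and drop the extra pointer load. */
static void *alloc_direct(size_t size)
{
    return malloc(size);
}
```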
Nikita Popov
0e71446e7a Merge branch 'PHP-7.4'
* PHP-7.4:
  Fix bug #79787
2020-07-08 11:22:47 +02:00
Nikita Popov
77a8a709da Merge branch 'PHP-7.3' into PHP-7.4
* PHP-7.3:
  Fix bug #79787
2020-07-08 11:22:18 +02:00
XXiang
3d5de7d746 Fix bug #79787
Closes GH-5807.
2020-07-08 11:20:58 +02:00
Christoph M. Becker
3516a9c8f0 Replace ISO_8859-* with ISO8859-* aliases for MBString
We also remove the mbregex ISO 8859 aliases with underscores.
2020-06-30 18:43:40 +02:00
Nikita Popov
217f6013b3 Remove no_language from mbfl_string
This is not actually used for anything and just causes confusion.
2020-05-07 11:36:57 +02:00
Nikita Popov
226d9dd30a Only allow "pass" as input/output encoding
"pass" is not a real encoding, it just means "don't perform any
conversion". Using it as an internal encoding or passing it to
any of the mbstring functions will not work (and on master will commonly
trigger an assertion).
2020-05-07 11:19:14 +02:00
Nikita Popov
901417f0ae Fix mbfl default allocators
Forgot to remove the persistent allocators from here.
2020-05-04 23:41:39 +02:00
Nikita Popov
a0cae937c5 Spec mbfl allocators as infallible
And remove all NULL checks.
2020-05-04 23:19:07 +02:00
Nikita Popov
7d4ff8443e Remove persistent allocators from libmbfl
These functions are not used, and I don't think we have any plans
to ever use them.
2020-05-04 23:19:07 +02:00
Christoph M. Becker
17d4e66204 Fix #68690: Hypothetical off-by-one condition
We fix this, even though `filter->cache == jisx0213_u2_tbl_len` can
never be true here.
2020-04-03 14:20:37 +02:00
George Peter Banyard
363d87f256 Fix [-Wmissing-field-initializers] compiler warning in mbstring
Add missing NULL pointer for mbfl_convert_vtbl struct.
2020-02-21 13:19:09 +01:00
Nikita Popov
7d170eb295 Merge branch 'PHP-7.4'
* PHP-7.4:
  Fix shift ub in mbstring
  Restore digit check in mb_decode_numericentity()
2020-01-30 10:08:21 +01:00
Nikita Popov
43465768f1 Fix shift ub in mbstring
Ideally "c" would be an unsigned integer...
2020-01-30 10:07:25 +01:00
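
The message does not show the affected code, so the following is only a general sketch of the kind of problem it hints at: left-shifting a signed `int` is undefined when the value is negative or the result does not fit, while the same shift on an unsigned value is well defined. `shift_left_safe` is a hypothetical helper.

```c
#include <stdint.h>

/* Converts to unsigned before shifting, so the operation is defined for any
 * value of c. The shift count n must be less than 32. */
static uint32_t shift_left_safe(int c, unsigned n)
{
    return (uint32_t)c << n;
}
```

For example, `0xFF << 24` overflows a 32-bit signed `int`, whereas `shift_left_safe(0xFF, 24)` simply produces 0xFF000000.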