archived-php-src

mirror of https://github.com/php/php-src.git synced 2026-04-17 13:01:02 +02:00

Author	SHA1	Message	Date
Alex Dowad	e26234a044	UTF-32 conversion treats truncated characters as illegal	2020-10-27 10:19:01 +02:00
Alex Dowad	7047e5d2c4	Add identify filter for UTF-32{,BE,LE}	2020-10-27 10:19:01 +02:00
Alex Dowad	d8895cd054	Improve error handling for UTF-16{,BE,LE} Catch various errors such as the first part of a surrogate pair not being followed by a proper second part, the first part of a surrogate pair appearing at the end of a string, the second part of a surrogate pair appearing out of place, and so on.	2020-10-27 10:19:01 +02:00
Alex Dowad	d9ddeb6e85	UTF-16 text conversion handles truncated characters as illegal This broke one old test (Zend/tests/multibyte_encoding_003.phpt), which used a PHP script encoded as UTF-16. The problem was that to terminate the test script, we need the text: "\n--EXPECT--". Out of that text, the terminating newline (0x0A byte) becomes part of the resulting test script; but a bare 0x0A byte with no 0x00 is not valid UTF-16. Since we now treat truncated UTF-16 characters as erroneous, an extra '?' is appended to the output as an 'illegal character' marker. Really, if we are running PHP scripts which are treated as encoded in UTF-16 or some other arbitrary text encoding (not ASCII), and the script is not actually a valid string in that encoding, inserting '?' characters into the code which the PHP interpreter runs is a bad thing to do. In such cases, the script shouldn't be treated as UTF-16 (or whatever) at all. I wonder if mbstring's encoding detection is being used in 'non-strict' mode?	2020-10-27 10:19:00 +02:00
Alex Dowad	bc18e32690	Do not pass invalid ISO-8859-{3,6,7,8} characters through silently mbstring has a bad habit of passing invalid characters through silently when converting to the same (or a "compatible") encoding. For example, if you give it an invalid JIS X 0208 kuten code encoded with SJIS, and try to convert that to EUC-JP, mbstring will just quietly re-encode the invalid code in the EUC-JP representation. At the same time, some parts of the code (like `mb_check_encoding`) assume that invalid characters will be treated as... well, invalid. Let's unbreak things by actually catching errors and reporting them, instead of swallowing them.	2020-10-16 22:17:45 +02:00
Alex Dowad	5c6b2a7ad2	Add identify filter for ISO-8859-8 (Latin/Hebrew)	2020-10-16 22:17:45 +02:00
Alex Dowad	ea687018cd	Add identify filter for ISO-8859-7 (Latin/Greek)	2020-10-16 22:17:45 +02:00
Alex Dowad	a6603b60f7	Add identify filter for ISO-8859-6 (Latin/Arabic) Note that some text encoding conversion libraries, such as Solaris iconv and FreeBSD iconv, map 0x30-0x39 to the Arabic script numerals rather than the 'regular' Roman numerals. (That is, to Unicode codepoints 0x660-0x669.) Further, Windows CP28596 adds more mappings to use the unused bytes in ISO-8859-6.	2020-10-16 22:17:45 +02:00
Alex Dowad	23270d7f9e	Add identify filter for ISO-8859-3 (Latin-3) There are some bytes in this encoding which are not mapped to any character. Notably, Microsoft added their own mappings for these 'unused' bits in their version of Latin-3, called CP28593.	2020-10-16 22:17:45 +02:00
Alex Dowad	7b9bed0150	Add identify filter for ISO-8859-16 (Latin-10) encoding Interestingly, it looks like the original author intended to add an identify filter for this encoding, but never did so. The needed struct is there, but was never added to the list of identify filters in mbfl_ident.c.	2020-10-16 20:56:45 +02:00
Alex Dowad	7bb5b435af	mUTF-7 (UTF7-IMAP) conversion: handle illegal (non-RFC-compliant) input correctly Instead of looking the other way and letting things slide, report errors when the input does not follow the RFC.	2020-10-13 20:26:14 +02:00
Alex Dowad	b43a7deacf	Add 'mUTF-7' alias for UTF7-IMAP encoding	2020-10-13 20:26:14 +02:00
Alex Dowad	b975817265	Add comment explaining mUTF-7 to mbfilter_utf7imap.c	2020-10-13 20:26:14 +02:00
Alex Dowad	648c1cb51e	Add identify filter for UCS-2, UCS-2BE, and UCS-2LE encodings	2020-10-13 20:26:14 +02:00
Alex Dowad	374f31e364	Add mbstring identify filter for 'binary' encoding	2020-10-13 20:26:13 +02:00
Alex Dowad	97beecc251	Add identify filter for UTF-16, UTF-16LE, UTF-16BE There was one faulty test in the suite which only passed before because UTF-16 had no identify filter. After this was fixed, it exposed the problem with the test.	2020-10-13 20:26:13 +02:00
Alex Dowad	a98838e3b6	Handle illegal bytes properly when converting to '7bit' encoding Previously, mbstring would silently drop illegal bytes when converting a string to '7bit' encoding.	2020-10-13 06:12:38 +02:00
Alex Dowad	4aa7430f68	Add mbstring identify filter for '7bit' encoding	2020-10-13 06:12:38 +02:00
Alex Dowad	0ffc1f55b3	Refactor mbfl_ident.c, mbfl_encoding.c, mbfl_memory_device.c, mbfl_string.c - Make everything less gratuitously verbose - Don't litter the code with lots of unneeded NULL checks (for things which will never be NULL) - Don't return success/failure code from functions which can never fail - For encoding structs, don't use pointers to pointers to pointers for the list of alias strings. Pointers to pointers (2 levels of indirection) is what actually makes sense. This gets rid of some extraneous dereference operations.	2020-10-13 06:12:38 +02:00
Alex Dowad	e8b8ecbd4e	Remove useless constants MBFL_CHP_{CTL,DIGIT,UALPHA,LALPHA,MSPECIAL}	2020-10-13 06:12:37 +02:00
Alex Dowad	aabbee2318	Remove useless validity check when converting UTF-16LE -> wchar The check ensures that the decoded codepoint is between 0x10000-0x10FFFF, which is the valid range which can be encoded in a UTF-16 surrogate pair. However, just looking at the code, it's obvious that this will be true. First of all, 0x10000 is added to the decoded codepoint on the previous line, so how could it be less than 0x10000? Further, even if the 20 data bits already decoded were 0xFFFFF (all ones), when you add 0x10000, it comes to 0x10FFFF, which is the very top of the valid range. So how could the decoded codepoint be more than 0x10FFFF? It can't.	2020-10-13 06:12:37 +02:00
Alex Dowad	f474e5502c	Refactor UTF-16LE -> wchar conversion code	2020-10-13 06:12:37 +02:00
Alex Dowad	3f1851dec2	Avoid compiler warnings related to mbstring flush functions	2020-10-13 06:12:37 +02:00
George Peter Banyard	fd1672a7f3	Fix [-Wduplicated-cond] in MBString extension	2020-10-09 20:54:23 +01:00
Remi Collet	b1c5532ad1	fix mbfl function prototypes re-add mbfl_convert_filter_feed API re-add pointer cast	2020-09-15 15:15:06 +02:00
Alex Dowad	a81061d36c	Use symbolic constants in Japanese kana conversion code (not magic numbers) Also correct misspelling of 'hiragana' as 'hirangana' at the same time.	2020-09-03 15:56:29 +02:00
Alex Dowad	ec609916dc	Remove unused 'from' field from mbfl_buffer_converter struct	2020-09-03 15:56:29 +02:00
Alex Dowad	f699d65391	Add comment to mbfilter_tl_jisx0201_jisx0208.h Explain the 'ZEN' and 'HAN' in symbolic constant names.	2020-09-03 15:56:29 +02:00
Alex Dowad	a2b40ee9a5	Remove unneeded function mbfl_filt_ident_common_dtor This was the default destructor for mbfl_identify_filter structs, but there's nothing we actually need to do to those structs before freeing them.	2020-09-03 15:56:29 +02:00
Alex Dowad	dcd6c6043e	Remove unneeded function mbfl_filt_conv_common_dtor This is a default destructor for mbfl_convert_filter structs. The thing is: there isn't really anything that needs to be done to those structs before freeing them. The default destructor just zeroed out some fields, but there's no reason why we should actually do that.	2020-09-03 15:56:29 +02:00
Alex Dowad	409aa20ab0	Refactor mbfl_convert.c	2020-09-03 15:56:29 +02:00
Alex Dowad	cdc664049c	Comment constants in mbfl_consts.h, remove unused ones These were unused, and almost certainly will never be used: - MBFL_ENCTYPE_MWC4BE - MBFL_ENCTYPE_MWC4LE - MBFL_ENCTYPE_SHFTCODE - MBFL_ENCTYPE_ENC_STRM For the latter two, there were some encodings which were marked with these flags; but nothing ever _checked_ these particular flags.	2020-08-31 23:18:56 +02:00
Alex Dowad	3a100cd7ac	Add comment on mbstring East Asian Width table	2020-08-31 23:18:45 +02:00
Alex Dowad	62317d592f	Remove redundant includes from mbstring (and make sure correct config.h is used) Very interesting... it turns out that when Valgrind support was enabled, `#include "config.h"` from within mbstring was actually including the file "config.h" from Valgrind, and not the one from mbstring!! This is because -I/usr/include/valgrind was added to the compiler invocation _before_ -Iext/mbstring/libmbfl. Make sure we actually include the file which was intended.	2020-08-31 23:17:58 +02:00
Alex Dowad	b7808d02e8	Remove useless definition of NULL in mbfl_string.h If NULL is not defined by the platform, mbfl_defs.h already defines it.	2020-08-31 23:17:49 +02:00
Alex Dowad	a64241b540	Remove unused functions from mbstring - mbfl_buffer_converter_reset - mbfl_buffer_converter_strncat - mbfl_buffer_converter_getbuffer - mbfl_oddlen - mbfl_filter_output_pipe_flush - mbfl_memory_device_output2 - mbfl_memory_device_output4 - mbfl_is_support_encoding - mbfl_buffer_converter_feed2 - _php_mb_regex_globals_dtor - mime_header_encoder_feed - mime_header_decoder_feed - mbfl_convert_filter_feed	2020-08-31 23:16:57 +02:00
Alex Dowad	d4ef7ef11d	Inline unneeded indirection for mbstring memory management All memory allocation and deallocation for mbstring bounces through a table of function pointers before going to emalloc/efree/etc. But this is unnecessary. The allocators are never swapped out. Better to just call them directly.	2020-08-31 23:16:09 +02:00
Nikita Popov	0e71446e7a	Merge branch 'PHP-7.4' * PHP-7.4: Fix bug #79787	2020-07-08 11:22:47 +02:00
Nikita Popov	77a8a709da	Merge branch 'PHP-7.3' into PHP-7.4 * PHP-7.3: Fix bug #79787	2020-07-08 11:22:18 +02:00
XXiang	3d5de7d746	Fix bug #79787 Closes GH-5807.	2020-07-08 11:20:58 +02:00
Christoph M. Becker	3516a9c8f0	Replace ISO_8859-* with ISO8859-* aliases for MBString We also remove the mbregex ISO 8859 aliases with underscores.	2020-06-30 18:43:40 +02:00
Nikita Popov	217f6013b3	Remove no_language from mbfl_string This is not actually used for anything and just causes confusion.	2020-05-07 11:36:57 +02:00
Nikita Popov	226d9dd30a	Only allow "pass" as input/output encoding "pass" is not a real encoding, it just means "don't perform any conversion". Using it as an internal encoding or passing it to any of the mbstring functions will not work (and on master commonly assert).	2020-05-07 11:19:14 +02:00
Nikita Popov	901417f0ae	Fix mbfl default allocators Forgot to remove the persistent allocators from here.	2020-05-04 23:41:39 +02:00
Nikita Popov	a0cae937c5	Spec mbfl allocators as infallible And remove all NULL checks.	2020-05-04 23:19:07 +02:00
Nikita Popov	7d4ff8443e	Remove persistent allocators from libmbfl These functions are not used, and I don't think we have any plans to ever use them.	2020-05-04 23:19:07 +02:00
Christoph M. Becker	17d4e66204	Fix #68690 : Hypothetical off-by-one condition We fix this, even though `filter->cache == jisx0213_u2_tbl_len` can never be true here.	2020-04-03 14:20:37 +02:00
George Peter Banyard	363d87f256	Fix [-Wmissing-field-initializers] compiler warning in mbstring Add missing NULL pointer for mbfl_convert_vtbl struct.	2020-02-21 13:19:09 +01:00
Nikita Popov	7d170eb295	Merge branch 'PHP-7.4' * PHP-7.4: Fix shift ub in mbstring Restore digit check in mb_decode_numericentity()	2020-01-30 10:08:21 +01:00
Nikita Popov	43465768f1	Fix shift ub in mbstring Ideally "c" would be an unsigned integer...	2020-01-30 10:07:25 +01:00
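
The range argument in commit aabbee2318 (a decoded surrogate pair always lands in 0x10000-0x10FFFF) can be checked with a small sketch. This is an illustration in Python, not the mbstring C code; the function name `decode_surrogate_pair` is hypothetical, and the surrogate ranges used are the standard UTF-16 ones.

```python
def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point.

    Hypothetical helper for illustration; not part of mbstring.
    """
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate")
    # Each surrogate carries 10 data bits, giving a 20-bit value
    # in the range 0x00000..0xFFFFF.
    data = ((high & 0x3FF) << 10) | (low & 0x3FF)
    # Adding 0x10000 shifts that range to 0x10000..0x10FFFF, so no
    # further range check on the result can ever fail.
    return data + 0x10000


# Smallest and largest possible results (all data bits 0, all data bits 1).
lowest = decode_surrogate_pair(0xD800, 0xDC00)
highest = decode_surrogate_pair(0xDBFF, 0xDFFF)
print(hex(lowest), hex(highest))
```

With all 20 data bits set to zero the result is 0x10000, and with all of them set to one it is 0xFFFFF + 0x10000 = 0x10FFFF, which matches the reasoning in the commit message.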

318 Commits