archived-php-src

mirror of https://github.com/php/php-src.git synced 2026-04-04 22:52:40 +02:00

Author	SHA1	Message	Date
Alex Dowad	30bfeef48d	mbfl_strwidth does not need to use legacy conversion filters now ...Because we have the new (faster) conversion filters now for ALL text encodings supported by mbstring.	2022-07-18 15:11:32 +02:00
Alex Dowad	40f5048aa7	Fix new conversion filter for UUEncode This code (written by yours truly) was very broken on input strings long enough to require processing in multiple chunks. Fuzzing revealed this very quickly; after initial rework, further fuzzing also found a couple of very obscure bugs in corner cases.	2022-07-18 15:11:32 +02:00
Alex Dowad	5fee30b630	Fix new conversion filter for QPrint (same order of check as legacy code) Because of checking for maximum line length before certain other checks, the new conversion filter for QPrint could produce different results from the old one in some cases. This was discovered while fuzzing the new implementation of mb_decode_numericentity.	2022-07-18 15:11:32 +02:00
Alex Dowad	3cf432798e	Fix new conversion filter for CP50220 (multi-codepoint kana at end of buffer) If two codepoints which needed to be collapsed into a single kuten code were separated, with one at the end of one buffer and the other at the beginning of the next buffer, they were not converted correctly. This was discovered while fuzzing the new implementation of mb_decode_numericentity.	2022-07-18 15:11:31 +02:00
Alex Dowad	7559bf77d2	Fix new conversion filters for mobile SJIS variants ('0' at end of buffer) Previously, I had adjusted this code so that if a character which could be part of a special Docomo/Softbank/KDDI 'keypad' emoji appeared at the end of one buffer, and the 'keypad' character appeared at the beginning of the next, they would still be combined. However, this broke the handling of such a character appearing at the end of one buffer, and a character which is NOT 'keypad' appearing at the beginning of the next. This was found while fuzzing the new implementation of mb_decode_numericentity.	2022-07-18 15:11:31 +02:00
Alex Dowad	fa83a8e15e	Fix new conversion filter for HTML entities While fuzzing the new mb_decode_numericentity implementation, I discovered that the fast conversion filter for 'HTML-ENTITIES' did not correctly handle an empty named entity ('&;'), nor did it correctly handle invalid named entities whose names were a prefix of a valid entity. Also, it did not correctly handle the case where a named entity is truncated and another named entity starts abruptly.	2022-07-18 15:11:31 +02:00
Alex Dowad	9c3972fb3d	Fix legacy conversion filter for HZ	2022-07-18 15:11:31 +02:00
Alex Dowad	1526bab6d0	Fix legacy conversion filter for GB18030	2022-07-18 15:11:31 +02:00
Alex Dowad	6938e35122	Fix legacy conversion filter for CP50220	2022-07-18 15:11:31 +02:00
Alex Dowad	1662f7f79f	Fix legacy conversion filter for UTF-7	2022-07-18 15:11:31 +02:00
Alex Dowad	c8e4f313fa	Fix legacy conversion filter for ISO-2022-KR When I was working on this code before, it really, really looked like the index into `uhc3_ucs_table` could never overrun the size of the table. Why did I get this wrong? Don't know. Anyways, libfuzzer tore away my illusions and unequivocally demonstrated that the index CAN be larger than the size of the table.	2022-07-18 15:11:31 +02:00
Alex Dowad	cebb8009c6	Fix legacy conversion filters for... almost all 8-bit text encodings	2022-07-18 15:11:31 +02:00
Alex Dowad	2eff19e38f	Fix legacy conversion filter for HTML entities	2022-07-18 15:11:31 +02:00
Alex Dowad	87b71595ba	Fix legacy conversion filter for Base64	2022-07-18 15:11:31 +02:00
Alex Dowad	7ece8f18b0	Fix legacy conversion filter for MacJapanese	2022-07-18 15:11:31 +02:00
Alex Dowad	d7bab66135	Fix legacy conversion filter for SJIS-2004	2022-07-18 15:11:31 +02:00
Alex Dowad	31cbb7a3a5	Fix legacy conversion filter for QPrint	2022-07-18 15:11:30 +02:00
Alex Dowad	048f6cbcde	Fix legacy conversion filter for JIS	2022-07-18 15:11:30 +02:00
Alex Dowad	91969e908f	New implementation of mb_{de,en}code_numericentity This new implementation uses the new encoding conversion filters. Aside from fewer LOC and (hopefully) improved readability, the differences are as follows: BEHAVIOR CHANGES: - The old implementation used signed arithmetic when operating on the 'convmap'. This meant that results could be surprising when using convmap entries with 1 in the MSB. Further, types like 'int' were used rather than those with a specific bit width, such as 'int32_t'. This meant that results could also depend on the platform width of an 'int'. Now unsigned arithmetic is used, with explicit bit widths. - Similarly, while converting decimal numeric entities, the legacy implementation would ensure that the value never overflowed INT_MAX, and if it did, the entity would be treated as invalid and passed through unconverted. However, that again means that results depend on the platform size of an 'int'. So now, we use a value with explicit bit width (32 bits) to hold the value of a deconverted decimal entity, and ensure that the entity value does not overflow that. Further, because we are using an UNSIGNED 32-bit value rather than a signed one, the ceiling for how large a decimal entity can be is higher now. All of this will probably not affect anyone, since Unicode codepoints above U+10FFFF are invalid anyways. To see the difference, you need to be using a text encoding like UCS-4, which allows huge 'codepoints'. - If it saw something which looked like a hex entity, but turned out not to be a valid numeric entity, the old implementation would sometimes convert the hexadecimal digits a-f to A-F (uppercase). The new implementation passes invalid numeric entities through without performing case conversion. - The old implementation of mb_encode_numericentity was limited in how many decimal/hex digits it could emit. If a text encoding like UCS-4 was in use, where 'codepoints' can have huge values (larger than the valid range stipulated by the Unicode standard), it would not error out on a 'codepoint' whose value was too large for it, but would rather mangle the value and emit a numeric entity which decoded to some other random codepoint. The new implementation is able to emit enough digits to express any value which fits in 32 bits. PERFORMANCE: Based on micro-benchmarks run on my development machine: Decoding numeric HTML entities is about 4 times faster, for both decimal and hexadecimal entities, across a variety of input string lengths. Encoding is about 3 times faster.	2022-07-18 15:11:30 +02:00
Alex Dowad	880803a21e	Use fast conversion filters to implement php_mb_ord Even for single-character strings, this is about 50% faster for ASCII, UTF-8, and UTF-16. For long strings, the performance gain is enormous, since the old code would convert the ENTIRE string, just to pick out the first codepoint.	2022-06-12 15:24:41 +02:00
Alex Dowad	9468fa7ff2	mbfl_strlen does not need to use old conversion filters any more	2022-06-12 15:24:41 +02:00
Alex Dowad	8533fccd63	Assert minimum size of wchar buffer in text conversion filters In all text conversion filters which require the wchar buffer used for output to have some minimum size, it's better to include an assertion; this will help us to catch bugs, and will also help future readers to understand what we expect of the function arguments. For UTF-7 and UTF7-IMAP, these assertions were already there, but I have added comments explaining why the minimum size is what it is.	2022-06-12 15:24:40 +02:00
Alex Dowad	871e61f942	Fully use available buffer space when converting Base64 I didn't think this through carefully enough when first writing this code, but it's not necessary to reserve space for the 1-2 wchars which may be emitted before exiting the function. Why? Well, we are guaranteed that when we enter the function, there are at least 3 spaces in the wchar buffer. The only way those can be consumed is if wchars are emitted in the main 'while' loop, but if it does emit any wchars, it will set 'bits' to zero at the same time, which means the final part will not emit anything. 'bits' can be incremented again by the main loop, but the main loop only runs while there are still at least 3 spaces in the buffer. So basically, we are guaranteed that when the main loop terminates, either there are 3 or more spaces remaining in the wchar buffer, or else 'bits' is zero, or both.	2022-06-12 15:24:40 +02:00
Alex Dowad	13479ee2bd	Restore backwards-compatible mappings for 0x5C/0x7E in SJIS (for fast conversion filter) In `d62f535caa`, the legacy mbstring conversion filters for Shift-JIS were updated to restore backwards-compatible mappings for 0x5C/0x7E. Make the same change to the newer fast conversion filters.	2022-06-11 17:09:16 +02:00
Christoph M. Becker	85a95a2982	Merge branch 'PHP-8.1' * PHP-8.1: Restore backwards-compatible mappings of 0x5C and 0x7E in SJIS	2022-06-11 16:32:33 +02:00
Alex Dowad	d62f535caa	Restore backwards-compatible mappings of 0x5C and 0x7E in SJIS According to the relevant Japan Industrial Standards Committee standards, SJIS 0x5C is a Yen sign, and 0x7E is an overline. However, this conflicts with the implementation of SJIS in various legacy software (notably Microsoft products), where SJIS 0x5C and 0x7E are taken as equivalent to the same ASCII bytes. Prior to PHP 8.1, mbstring's implementation of SJIS handled these bytes compatibly with Microsoft products. This was changed in PHP 8.1.0, in an attempt to comply with the JISC specifications. However, after discussion with various concerned Japanese developers, it seems that the historical behavior was more useful in the majority of applications which process SJIS-encoded text. Since we are now treating SJIS 0x5C as equivalent to U+005C and 0x7E as equivalent to U+007E, it does not make sense to convert U+203E (OVERLINE) to 0x7E, nor does it make sense to convert U+00A5 (YEN SIGN) to 0x5C. Restore the mappings for those codepoints from before PHP 8.1.0. Fixes GH-8281.	2022-06-11 16:31:47 +02:00
Alex Dowad	e2c4fc5755	Fix buffer overflow bugs in CP50222 text conversion code	2022-05-28 21:53:39 +02:00
Alex Dowad	1f17b5468f	Fix buffer overflow bug in HZ text conversion code	2022-05-28 21:53:39 +02:00
Alex Dowad	0154a5ac9f	Use fast text conversion filters to implement php_mb_convert_encoding_ex	2022-05-28 21:53:38 +02:00
Alex Dowad	8dddd3cfad	Fix buffer overflow bugs in UTF-7 text conversion After Nikita Popov found a buffer overrun bug in one of my pull requests, I was prompted to add more assertions in a38c7e5703 to help me catch such bugs myself more easily in testing. Wouldn't you just know it... as soon as I added those assertions, the mbstring test suite caught another buffer overrun bug in my UTF-7 conversion code, which I wrote the better part of a year ago. Then, when I started fuzzing the code with libfuzzer, I found and fixed another buffer overflow: If we enter the main loop, which normally outputs 3 decoded Base64 characters, where the first half of a surrogate pair had appeared at the end of the previous run, but the second half does not appear on this run, we need to output one error marker. Then, at the end of the main loop, if the Base64 input ends at an unexpected position AND the last character was not a legal Base64-encoded character, we need to output two error markers for that. The three error markers plus two valid, decoded bytes can push us over the available space in our wchar buffer.	2022-05-28 21:53:38 +02:00
Alex Dowad	e4b9aa1870	Add assertions to help catch buffer overflows in mbstring text conversion code	2022-05-28 21:53:38 +02:00
Alex Dowad	b2f963f91c	For JIS/ISO-2022-JP, treat a truncated escape sequence as error	2022-05-28 21:53:37 +02:00
Alex Dowad	4afa72126e	Implement fast text conversion interface for QPrint	2022-05-28 21:53:37 +02:00
Alex Dowad	3fda9f5095	Implement fast text conversion interface for HTML-ENTITIES	2022-05-28 21:53:37 +02:00
Alex Dowad	dc1ba61d09	Simplify code for converting UTF-8 An overly complex boolean test was used to check if a 3-byte code unit was valid. Convert it to an equivalent test with fewer terms.	2022-05-28 21:53:37 +02:00
Alex Dowad	85690ae26d	Implement fast text conversion interface for Base64	2022-05-28 21:53:37 +02:00
Alex Dowad	7c2587b1f6	Implement fast text conversion interface for UUENCODE	2022-05-28 21:53:37 +02:00
Alex Dowad	06a15e6395	Implement fast text conversion interface for '8bit'	2022-05-28 21:53:37 +02:00
Alex Dowad	073a88f34c	Implement fast text conversion interface for ISO-2022-JP-KDDI One bug in the previous implementation; when it saw a sequence of codepoints which looked like they might need to be emitted as a special KDDI emoji, it would totally forget whether it was in ASCII mode, JISX 0208 mode, or something else. So it could not reliably emit the correct escape sequence to switch to the right mode. Further, if the input ends with a codepoint which looks like it could be part of a special KDDI emoji, then the legacy code did not emit an escape sequence to switch back to ASCII mode at the end of the string. This means that the emitted ISO-2022-JP-KDDI strings could not always be safely concatenated.	2022-05-28 21:53:36 +02:00
Alex Dowad	c96ad91014	Implement fast text conversion interface for ISO-2022-JP-MS	2022-05-28 21:53:36 +02:00
Alex Dowad	a08f062cad	Implement fast text conversion interface for mobile variants of UTF-8	2022-05-28 21:53:36 +02:00
Alex Dowad	321dbd0413	Implement fast text conversion interface for ISO-2022-JP-2004 There were bugs in the legacy implementation. Lots of them. It did not properly track whether it had switched to JISX 0213 plane 1 or plane 2. If it processed a character in plane 1 and then immediately one in plane 2, it failed to emit the escape code to switch to plane 2. Further, when converting codepoints from 0x80-0xFF to ISO-2022-JP-2004, the legacy implementation would totally disregard which mode it was operating in. Such codepoints would pass through directly to the output without any escape sequences being emitted. If that was not enough, all the legacy implementations of JISX 0213:2004 encodings had another common bug; their 'flush function' did not call the next flush function in the chain of conversion filters. So if any of these encodings were converted to an encoding where the flush function was needed to finish the output string, then the output would be truncated.	2022-05-28 21:53:36 +02:00
Alex Dowad	29e21c0e6f	Implement fast text conversion interface for SJIS-2004 All the legacy implementations of JISX 0213:2004 encodings had a common bug; their 'flush function' did not call the next flush function in the chain of conversion filters. So if any of these encodings were converted to an encoding where the flush function was needed to finish the output string, then the output would be truncated.	2022-05-28 21:53:36 +02:00
Alex Dowad	e5fdd5cef2	Implement fast text conversion interface for EUC-JP-2004 All the legacy implementations of JISX 0213:2004 encodings had a common bug; their 'flush function' did not call the next flush function in the chain of conversion filters. So if any of these encodings were converted to an encoding where the flush function was needed to finish the output string, then the output would be truncated.	2022-05-28 21:53:36 +02:00
Alex Dowad	67d83f57c1	Implement fast text conversion interface for mobile SJIS variants	2022-05-28 21:53:36 +02:00
Alex Dowad	0d635d93f5	Implement fast text conversion interface for UTF7-IMAP The old code would convert a 0x00 byte in the input to 0x00 in the output, but this clearly violates the RFC which defines UTF7-IMAP.	2022-05-28 21:53:35 +02:00
Alex Dowad	6cf30356e0	Implement fast text conversion interface for SJIS-mac	2022-05-28 21:53:35 +02:00
Alex Dowad	c9479899c6	Implement fast text conversion interface for ISO-2022-KR When working on this, I read RFC 1557 again and realized that the comment at the top of the file was totally mistaken. Further, the legacy code did not obey the RFC. (It would emit the "ESC $ ) C" sequence anywhere, not just at the beginning of a line as the RFC requires.) The new code obeys the RFC; one quirk is that it always emits the escape sequence at the beginning of each output string, even if the string is completely ASCII (in which case the escape sequence is allowed, but not required). The new code doesn't always generate the same number of error markers for invalid escapes as the old code did. The old code could not emit the special KDDI emoji for national flags. Further, there was a bug in the test which the old code used to determine whether an 0xF byte should be emitted at the end of a string (to switch back to ASCII mode). As a result, it would not always switch back to ASCII mode, meaning that it was not always safe to concatenate the resulting strings.	2022-05-28 21:53:35 +02:00
Alex Dowad	763284a531	Implement fast text conversion interface for '7bit'	2022-05-28 21:53:35 +02:00
Alex Dowad	3f12d26e3a	Merge branch 'PHP-8.1' * PHP-8.1: Error handling for UTF-8 complies with WHATWG specification	2022-04-16 20:32:12 +02:00
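
Commit 91969e908f above describes holding the value of a decoded decimal entity in an unsigned 32-bit variable and rejecting entities that would overflow it. The following is a minimal sketch of that idea only, not the code from php-src; the function name `parse_decimal_entity` and its signature are hypothetical and chosen for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not the php-src function): parse the decimal digits
 * of a numeric entity, i.e. the part between "&#" and ";", into a value with
 * an explicit 32-bit unsigned width.
 * Returns false if there are no digits, if a non-digit appears, or if the
 * value would not fit in uint32_t. A caller could then pass the entity
 * through unconverted, as the commit message describes. */
static bool parse_decimal_entity(const char *digits, size_t len, uint32_t *out)
{
    if (len == 0)
        return false;

    uint32_t value = 0;
    for (size_t i = 0; i < len; i++) {
        if (digits[i] < '0' || digits[i] > '9')
            return false;
        uint32_t d = (uint32_t)(digits[i] - '0');
        /* Test before multiplying, so the arithmetic itself never wraps */
        if (value > (UINT32_MAX - d) / 10)
            return false;
        value = value * 10 + d;
    }

    *out = value;
    return true;
}
```

Because the limit is UINT32_MAX rather than the platform's INT_MAX, a digit string such as "2147483648" is accepted in this sketch, while "4294967296" is rejected on every platform.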


532 Commits
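
Several of the JISX 0213:2004 commits listed above (321dbd0413, 29e21c0e6f, e5fdd5cef2) mention a flush function that did not call the next flush function in the filter chain, which truncated output. The sketch below illustrates that class of bug with a deliberately simplified, hypothetical `filter` struct; the real mbfl structures and function signatures differ.

```c
#include <stddef.h>

/* Hypothetical, simplified filter type for illustration only */
typedef struct filter filter;
struct filter {
    void (*flush)(filter *self);
    filter *next;   /* next filter in the chain, or NULL at the end */
    int pending;    /* buffered output units not yet emitted */
    int emitted;    /* output units emitted so far */
};

/* Correct pattern: emit our own buffered output, then forward the flush */
static void flush_and_forward(filter *self)
{
    self->emitted += self->pending;
    self->pending = 0;
    if (self->next != NULL && self->next->flush != NULL)
        self->next->flush(self->next);
}

/* Buggy pattern: the flush stops here, so anything still buffered in a
 * later filter is never written and the output is truncated */
static void flush_without_forward(filter *self)
{
    self->emitted += self->pending;
    self->pending = 0;
}
```

With a two-filter chain, flushing the first filter through `flush_without_forward` leaves the second filter's `pending` count untouched, whereas `flush_and_forward` drains both.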