archived-php-src

mirror of https://github.com/php/php-src.git synced 2026-04-18 21:41:22 +02:00

Author	SHA1	Message	Date
Alex Dowad	880803a21e	Use fast conversion filters to implement php_mb_ord Even for single-character strings, this is about 50% faster for ASCII, UTF-8, and UTF-16. For long strings, the performance gain is enormous, since the old code would convert the ENTIRE string, just to pick out the first codepoint.	2022-06-12 15:24:41 +02:00
Alex Dowad	9468fa7ff2	mbfl_strlen does not need to use old conversion filters any more	2022-06-12 15:24:41 +02:00
Alex Dowad	8533fccd63	Assert minimum size of wchar buffer in text conversion filters In all text conversion filters which require the wchar buffer used for output to have some minimum size, it's better to include an assertion; this will help us to catch bugs, and will also help future readers to understand what we expect of the function arguments. For UTF-7 and UTF7-IMAP, these assertions were already there, but I have added comments explaining why the minimum size is what it is.	2022-06-12 15:24:40 +02:00
Alex Dowad	871e61f942	Fully use available buffer space when converting Base64 I didn't think this through carefully enough when first writing this code, but it's not necessary to reserve space for the 1-2 wchars which may be emitted before exiting the function. Why? Well, we are guaranteed that when we enter the function, there are at least 3 spaces in the wchar buffer. The only way those can be consumed is if wchars are emitted in the main 'while' loop, but if it does emit any wchars, it will set 'bits' to zero at the same time, which means the final part will not emit anything. 'bits' can be incremented again by the main loop, but the main loop only runs while there are still at least 3 spaces in the buffer. So basically, we are guaranteed that when the main loop terminates, either there are 3 or more spaces remaining in the wchar buffer, or else 'bits' is zero, or both.	2022-06-12 15:24:40 +02:00
Alex Dowad	13479ee2bd	Restore backwards-compatible mappings for 0x5C/0x7E in SJIS (for fast conversion filter) In `d62f535caa`, the legacy mbstring conversion filters for Shift-JIS were updated to restore backwards-compatible mappings for 0x5C/0x7E. Make the same change to the newer fast conversion filters.	2022-06-11 17:09:16 +02:00
Christoph M. Becker	85a95a2982	Merge branch 'PHP-8.1' * PHP-8.1: Restore backwards-compatible mappings of 0x5C and 0x7E in SJIS	2022-06-11 16:32:33 +02:00
Alex Dowad	d62f535caa	Restore backwards-compatible mappings of 0x5C and 0x7E in SJIS According to the relevant Japan Industrial Standards Committee standards, SJIS 0x5C is a Yen sign, and 0x7E is an overline. However, this conflicts with the implementation of SJIS in various legacy software (notably Microsoft products), where SJIS 0x5C and 0x7E are taken as equivalent to the same ASCII bytes. Prior to PHP 8.1, mbstring's implementation of SJIS handled these bytes compatibly with Microsoft products. This was changed in PHP 8.1.0, in an attempt to comply with the JISC specifications. However, after discussion with various concerned Japanese developers, it seems that the historical behavior was more useful in the majority of applications which process SJIS-encoded text. Since we are now treating SJIS 0x5C as equivalent to U+005C and 0x7E as equivalent to U+007E, it does not make sense to convert U+203E (OVERLINE) to 0x7E, nor does it make sense to convert U+00A5 (YEN SIGN) to 0x5C. Restore the mappings for those codepoints from before PHP 8.1.0. Fixes GH-8281.	2022-06-11 16:31:47 +02:00
Alex Dowad	e2c4fc5755	Fix buffer overflow bugs in CP50222 text conversion code	2022-05-28 21:53:39 +02:00
Alex Dowad	1f17b5468f	Fix buffer overflow bug in HZ text conversion code	2022-05-28 21:53:39 +02:00
Alex Dowad	0154a5ac9f	Use fast text conversion filters to implement php_mb_convert_encoding_ex	2022-05-28 21:53:38 +02:00
Alex Dowad	8dddd3cfad	Fix buffer overflow bugs in UTF-7 text conversion After Nikita Popov found a buffer overrun bug in one of my pull requests, I was prompted to add more assertions in a38c7e5703 to help me catch such bugs myself more easily in testing. Wouldn't you just know it... as soon as I added those assertions, the mbstring test suite caught another buffer overrun bug in my UTF-7 conversion code, which I wrote the better part of a year ago. Then, when I started fuzzing the code with libfuzzer, I found and fixed another buffer overflow: If we enter the main loop, which normally outputs 3 decoded Base64 characters, where the first half of a surrogate pair had appeared at the end of the previous run, but the second half does not appear on this run, we need to output one error marker. Then, at the end of the main loop, if the Base64 input ends at an unexpected position AND the last character was not a legal Base64-encoded character, we need to output two error markers for that. The three error markers plus two valid, decoded bytes can push us over the available space in our wchar buffer.	2022-05-28 21:53:38 +02:00
Alex Dowad	e4b9aa1870	Add assertions to help catch buffer overflows in mbstring text conversion code	2022-05-28 21:53:38 +02:00
Alex Dowad	b2f963f91c	For JIS/ISO-2022-JP, treat a truncated escape sequence as error	2022-05-28 21:53:37 +02:00
Alex Dowad	4afa72126e	Implement fast text conversion interface for QPrint	2022-05-28 21:53:37 +02:00
Alex Dowad	3fda9f5095	Implement fast text conversion interface for HTML-ENTITIES	2022-05-28 21:53:37 +02:00
Alex Dowad	dc1ba61d09	Simplify code for converting UTF-8 An overly complex boolean test was used to check if a 3-byte code unit was valid. Convert it to an equivalent test with fewer terms.	2022-05-28 21:53:37 +02:00
Alex Dowad	85690ae26d	Implement fast text conversion interface for Base64	2022-05-28 21:53:37 +02:00
Alex Dowad	7c2587b1f6	Implement fast text conversion interface for UUENCODE	2022-05-28 21:53:37 +02:00
Alex Dowad	06a15e6395	Implement fast text conversion interface for '8bit'	2022-05-28 21:53:37 +02:00
Alex Dowad	073a88f34c	Implement fast text conversion interface for ISO-2022-JP-KDDI One bug in the previous implementation; when it saw a sequence of codepoints which looked like they might need to be emitted as a special KDDI emoji, it would totally forget whether it was in ASCII mode, JISX 0208 mode, or something else. So it could not reliably emit the correct escape sequence to switch to the right mode. Further, if the input ends with a codepoint which looks like it could be part of a special KDDI emoji, then the legacy code did not emit an escape sequence to switch back to ASCII mode at the end of the string. This means that the emitted ISO-2022-JP-KDDI strings could not always be safely concatenated.	2022-05-28 21:53:36 +02:00
Alex Dowad	c96ad91014	Implement fast text conversion interface for ISO-2022-JP-MS	2022-05-28 21:53:36 +02:00
Alex Dowad	a08f062cad	Implement fast text conversion interface for mobile variants of UTF-8	2022-05-28 21:53:36 +02:00
Alex Dowad	321dbd0413	Implement fast text conversion interface for ISO-2022-JP-2004 There were bugs in the legacy implementation. Lots of them. It did not properly track whether it had switched to JISX 0213 plane 1 or plane 2. If it processed a character in plane 1 and then immediately one in plane 2, it failed to emit the escape code to switch to plane 2. Further, when converting codepoints from 0x80-0xFF to ISO-2022-JP-2004, the legacy implementation would totally disregard which mode it was operating in. Such codepoints would pass through directly to the output without any escape sequences being emitted. If that was not enough, all the legacy implementations of JISX 0213:2004 encodings had another common bug; their 'flush function' did not call the next flush function in the chain of conversion filters. So if any of these encodings were converted to an encoding where the flush function was needed to finish the output string, then the output would be truncated.	2022-05-28 21:53:36 +02:00
Alex Dowad	29e21c0e6f	Implement fast text conversion interface for SJIS-2004 All the legacy implementations of JISX 0213:2004 encodings had a common bug; their 'flush function' did not call the next flush function in the chain of conversion filters. So if any of these encodings were converted to an encoding where the flush function was needed to finish the output string, then the output would be truncated.	2022-05-28 21:53:36 +02:00
Alex Dowad	e5fdd5cef2	Implement fast text conversion interface for EUC-JP-2004 All the legacy implementations of JISX 0213:2004 encodings had a common bug; their 'flush function' did not call the next flush function in the chain of conversion filters. So if any of these encodings were converted to an encoding where the flush function was needed to finish the output string, then the output would be truncated.	2022-05-28 21:53:36 +02:00
Alex Dowad	67d83f57c1	Implement fast text conversion interface for mobile SJIS variants	2022-05-28 21:53:36 +02:00
Alex Dowad	0d635d93f5	Implement fast text conversion interface for UTF7-IMAP The old code would convert a 0x00 byte in the input to 0x00 in the output, but this clearly violates the RFC which defines UTF7-IMAP.	2022-05-28 21:53:35 +02:00
Alex Dowad	6cf30356e0	Implement fast text conversion interface for SJIS-mac	2022-05-28 21:53:35 +02:00
Alex Dowad	c9479899c6	Implement fast text conversion interface for ISO-2022-KR When working on this, I read RFC 1557 again and realized that the comment at the top of the file was totally mistaken. Further, the legacy code did not obey the RFC. (It would emit the "ESC $ ) C" sequence anywhere, not just at the beginning of a line as the RFC requires.) The new code obeys the RFC; one quirk is that it always emits the escape sequence at the beginning of each output string, even if the string is completely ASCII (in which case the escape sequence is allowed, but not required). The new code doesn't always generate the same number of error markers for invalid escapes as the old code did. The old code could not emit the special KDDI emoji for national flags. Further, there was a bug in the test which the old code used to determine whether an 0xF byte should be emitted at the end of a string (to switch back to ASCII mode). As a result, it would not always switch back to ASCII mode, meaning that it was not always safe to concatenate the resulting strings.	2022-05-28 21:53:35 +02:00
Alex Dowad	763284a531	Implement fast text conversion interface for '7bit'	2022-05-28 21:53:35 +02:00
Alex Dowad	3f12d26e3a	Merge branch 'PHP-8.1' * PHP-8.1: Error handling for UTF-8 complies with WHATWG specification	2022-04-16 20:32:12 +02:00
Alex Dowad	04e59c916f	Error handling for UTF-8 complies with WHATWG specification In `7502c86342`, I adjusted the number of error markers emitted on invalid UTF-8 text to be more consistent with mbstring's behavior on other text encodings (generally, it emits one error marker for one unexpected byte). I didn't expect that anybody would actually care one way or the other, but felt that it was better to be consistent than not. Later, Martin Auswöger kindly pointed out that the WHATWG encoding specification, which governs how various text encodings are handled by web browsers, does actually specify how many error markers should be generated for any given piece of invalid UTF-8 text. Until now, we have never really paid much attention to the WHATWG specification, but we do want to comply with as many relevant specifications as possible. And since PHP is commonly used for web applications, compatibility with the behavior of web browsers is obviously a good thing.	2022-04-16 15:04:38 +02:00
Christoph M. Becker	20c0eb47df	Merge branch 'PHP-8.1' * PHP-8.1: Fix GH-8208: mb_encode_mimeheader: $indent functionality broken	2022-03-17 17:35:06 +01:00
Christoph M. Becker	5003831260	Merge branch 'PHP-8.0' into PHP-8.1 * PHP-8.0: Fix GH-8208: mb_encode_mimeheader: $indent functionality broken	2022-03-17 17:34:31 +01:00
Christoph M. Becker	d0417ebc93	Fix GH-8208: mb_encode_mimeheader: $indent functionality broken We also need to factor in the indent, when getting the encoder result. Closes GH-8213.	2022-03-17 17:31:58 +01:00
Alex Dowad	ff76694f28	Merge branch 'PHP-8.1' * PHP-8.1: mb_check_encoding($str, '7bit') rejects strings with bytes over 0x7F	2022-02-22 23:58:57 +02:00
Alex Dowad	8a8533d263	mb_check_encoding($str, '7bit') rejects strings with bytes over 0x7F This was the old behavior of mb_check_encoding() before `3e7acf901d`, but yours truly broke it. If only we had more thorough tests at that time, this might not have slipped through the cracks. Thanks to divinity76 for the report.	2022-02-22 23:56:56 +02:00
Dmitry Stogov	e1782c08bf	Fix ASAN undefined behavior (unsigned char << 24) ext/mbstring/libmbfl/filters/mbfilter_utf32.c:259:20: runtime error: left shift of 128 by 24 places cannot be represented in type 'int'	2022-01-11 09:13:22 +03:00
Alex Dowad	53ffba967c	Implement fast text conversion interface for CP5022{0,1,2}	2021-12-26 22:19:51 +02:00
Alex Dowad	01afd9f141	Implement fast text conversion interface for JIS	2021-12-26 22:19:51 +02:00
Alex Dowad	cb4626c5b2	Implement fast text conversion interface for GB18030	2021-12-26 22:19:51 +02:00
Alex Dowad	3e8088dc80	Implement fast text conversion interface for EUC-JP-MS	2021-12-26 22:19:51 +02:00
Alex Dowad	e5af94b74f	Implement fast text conversion interface for CP51932	2021-12-26 22:19:51 +02:00
Alex Dowad	6ef1b35223	Implement fast text conversion interface for EUC-CN	2021-12-26 22:19:51 +02:00
Alex Dowad	9bd08a97d9	Implement fast text conversion interface for EUC-TW	2021-12-26 22:19:51 +02:00
Alex Dowad	661a10160b	Implement fast text conversion interface for CP936	2021-12-26 22:19:51 +02:00
Alex Dowad	20555371d5	Implement fast text conversion interface for CP932	2021-12-26 22:19:51 +02:00
Alex Dowad	43bb97c539	Implement fast text conversion interface for EUC-KR	2021-12-26 22:19:51 +02:00
Alex Dowad	c0936d48b0	Implement fast text conversion interface for UHC	2021-12-26 22:19:51 +02:00
Alex Dowad	40809cb19f	Implement fast text conversion interface for HZ	2021-12-26 22:19:51 +02:00
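
Note on e1782c08bf (Fix ASAN undefined behavior): the sanitizer message quoted in that commit, "left shift of 128 by 24 places cannot be represented in type 'int'", comes from C's integer promotion rules. An `unsigned char` operand is promoted to `int` before the shift, so shifting a byte of 0x80 or higher left by 24 overflows a 32-bit signed `int`. The usual remedy is to cast to an unsigned 32-bit type before shifting. The sketch below illustrates that pattern only; `decode_utf32be` is a hypothetical helper written for this note, not the code in mbfilter_utf32.c.

```c
#include <stdint.h>

/* Hypothetical helper (not taken from php-src): assemble a big-endian
 * 32-bit value from four bytes.
 *
 * Writing `p[0] << 24` directly would promote p[0] to int first; for
 * p[0] >= 0x80 the result does not fit in a 32-bit signed int, which is
 * undefined behavior. Casting each byte to uint32_t before shifting
 * keeps the whole computation in unsigned arithmetic. */
static uint32_t decode_utf32be(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) |
           ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |
            (uint32_t)p[3];
}
```

With this form, a leading byte of 0x80 yields 0x80000000 with well-defined behavior; a caller can then reject the value as out of the Unicode range in a separate step.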

513 Commits