archived-php-src

mirror of https://github.com/php/php-src.git synced 2026-04-29 03:03:26 +02:00

Author	SHA1	Message	Date
Eric Norris	09237f6126	Update request startup error messages	2022-07-18 23:19:59 +01:00
Alex Dowad	76a92c26e3	mb_decode_numericentity decodes valid entities which are truncated at end of string Since mb_decode_numericentity does not require all HTML entities to end with ';', but allows them to be terminated by ANY non-digit character, it doesn't make sense that valid entities which butt up against the end of the input string are not converted. As it turned out, supporting this case also made it possible to simplify the code nicely.	2022-07-18 15:11:47 +02:00
Alex Dowad	5d6bd557b3	mb_decode_numericentity converts entities which immediately follow a valid/invalid entity Thanks to Kamil Tieleka for suggesting that some of the behaviors of the legacy implementation which the new mb_decode_numericentity implementation took care to maintain were actually bugs and should be fixed. Thanks also to Trevor Rowbotham for providing a link to the HTML specification, showing how HTML numeric entities should be interpreted. mb_decode_numericentity now processes numeric entities in the following situations where the old implementation would not: - &<ENTITY> (for example, &&#65;) - &#<ENTITY> - &#x<ENTITY> - <VALID BUT UNTERMINATED DECIMAL ENTITY><ENTITY> (for example, &#65&#65;) - <VALID BUT UNTERMINATED HEX ENTITY><ENTITY> - <INVALID AND UNTERMINATED DECIMAL ENTITY><ENTITY> (it does not matter why the first entity is invalid; the value could be too big, it could have too many digits, or it could not match the 'convmap' parameter) - <INVALID AND UNTERMINATED HEX ENTITY><ENTITY> This is consistent with the way that web browsers process HTML entities.	2022-07-18 15:11:32 +02:00
Alex Dowad	40f5048aa7	Fix new conversion filter for UUEncode This code (written by yours truly) was very broken on input strings long enough to require processing in multiple chunks. Fuzzing revealed this very quickly; after initial rework, further fuzzing also found a couple of very obscure bugs in corner cases.	2022-07-18 15:11:32 +02:00
Alex Dowad	5fee30b630	Fix new conversion filter for QPrint (same order of check as legacy code) Because of checking for maximum line length before certain other checks, the new conversion filter for QPrint could produce different results from the old one in some cases. This was discovered while fuzzing the new implementation of mb_decode_numericentity.	2022-07-18 15:11:32 +02:00
Alex Dowad	3cf432798e	Fix new conversion filter for CP50220 (multi-codepoint kana at end of buffer) If two codepoints which needed to be collapsed into a single kuten code were separated, with one at the end of one buffer and the other at the beginning of the next buffer, they were not converted correctly. This was discovered while fuzzing the new implementation of mb_decode_numericentity.	2022-07-18 15:11:31 +02:00
Alex Dowad	7559bf77d2	Fix new conversion filters for mobile SJIS variants ('0' at end of buffer) Previously, I had adjusted this code so that if a character which could be part of a special Docomo/Softbank/KDDI 'keypad' emoji appeared at the end of one buffer, and the 'keypad' character appeared at the beginning of the next, they would still be combined. However, this broke the handling of such a character appearing at the end of one buffer, and a character which is NOT 'keypad' appearing at the beginning of the next. This was found while fuzzing the new implementation of mb_decode_numericentity.	2022-07-18 15:11:31 +02:00
Alex Dowad	fa83a8e15e	Fix new conversion filter for HTML entities While fuzzing the new mb_decode_numericentity implementation, I discovered that the fast conversion filter for 'HTML-ENTITIES' did not correctly handle an empty named entity ('&;'), nor did it correctly handle invalid named entities whose names were a prefix of a valid entity. Also, it did not correctly handle the case where a named entity is truncated and another named entity starts abruptly.	2022-07-18 15:11:31 +02:00
Alex Dowad	9c3972fb3d	Fix legacy conversion filter for HZ	2022-07-18 15:11:31 +02:00
Alex Dowad	1526bab6d0	Fix legacy conversion filter for GB18030	2022-07-18 15:11:31 +02:00
Alex Dowad	6938e35122	Fix legacy conversion filter for CP50220	2022-07-18 15:11:31 +02:00
Alex Dowad	91969e908f	New implementation of mb_{de,en}code_numericentity This new implementation uses the new encoding conversion filters. Aside from fewer LOC and (hopefully) improved readability, the differences are as follows: BEHAVIOR CHANGES: - The old implementation used signed arithmetic when operating on the 'convmap'. This meant that results could be surprising when using convmap entries with 1 in the MSB. Further, types like 'int' were used rather than those with a specific bit width, such as 'int32_t'. This meant that results could also depend on the platform width of an 'int'. Now unsigned arithmetic is used, with explicit bit widths. - Similarly, while converting decimal numeric entities, the legacy implementation would ensure that the value never overflowed INT_MAX, and if it did, the entity would be treated as invalid and passed through unconverted. However, that again means that results depend on the platform size of an 'int'. So now, we use a value with explicit bit width (32 bits) to hold the value of a deconverted decimal entity, and ensure that the entity value does not overflow that. Further, because we are using an UNSIGNED 32-bit value rather than a signed one, the ceiling for how large a decimal entity can be is higher now. All of this will probably not affect anyone, since Unicode codepoints above U+10FFFF are invalid anyways. To see the difference, you need to be using a text encoding like UCS-4, which allows huge 'codepoints'. - If it saw something which looked like a hex entity, but turned out not to be a valid numeric entity, the old implementation would sometimes convert the hexadecimal digits a-f to A-F (uppercase). The new implementation passes invalid numeric entities through without performing case conversion. - The old implementation of mb_encode_numericentity was limited in how many decimal/hex digits it could emit. If a text encoding like UCS-4 was in use, where 'codepoints' can have huge values (larger than the valid range stipulated by the Unicode standard), it would not error out on a 'codepoint' whose value was too large for it, but would rather mangle the value and emit a numeric entity which decoded to some other random codepoint. The new implementation is able to emit enough digits to express any value which fits in 32 bits. PERFORMANCE: Based on micro-benchmarks run on my development machine: Decoding numeric HTML entities is about 4 times faster, for both decimal and hexadecimal entities, across a variety of input string lengths. Encoding is about 3 times faster.	2022-07-18 15:11:30 +02:00
jcm	dbdef4a55c	QA -mb_convert_encoding_array - error for object item in array Closes GH-9023.	2022-07-15 17:34:35 +02:00
jcm	30d89b19cf	QA - mb_http_input - function returns FALSE for type 'L' or 'l' Closes GH-9018.	2022-07-15 14:22:39 +02:00
Christoph M. Becker	85a95a2982	Merge branch 'PHP-8.1' * PHP-8.1: Restore backwards-compatible mappings of 0x5C and 0x7E in SJIS	2022-06-11 16:32:33 +02:00
Alex Dowad	d62f535caa	Restore backwards-compatible mappings of 0x5C and 0x7E in SJIS According to the relevant Japan Industrial Standards Committee standards, SJIS 0x5C is a Yen sign, and 0x7E is an overline. However, this conflicts with the implementation of SJIS in various legacy software (notably Microsoft products), where SJIS 0x5C and 0x7E are taken as equivalent to the same ASCII bytes. Prior to PHP 8.1, mbstring's implementation of SJIS handled these bytes compatibly with Microsoft products. This was changed in PHP 8.1.0, in an attempt to comply with the JISC specifications. However, after discussion with various concerned Japanese developers, it seems that the historical behavior was more useful in the majority of applications which process SJIS-encoded text. Since we are now treating SJIS 0x5C as equivalent to U+005C and 0x7E as equivalent to U+007E, it does not make sense to convert U+203E (OVERLINE) to 0x7E, nor does it make sense to convert U+00A5 (YEN SIGN) to 0x5C. Restore the mappings for those codepoints from before PHP 8.1.0. Fixes GH-8281.	2022-06-11 16:31:47 +02:00
Alex Dowad	a789088527	Add more tests for mbstring encoding conversion When testing the preceding commits, I used a script to generate a large number of random strings and try to find strings which would yield different outputs from the new and old encoding conversion code. Some were found. In most cases, analysis revealed that the new code was correct and the old code was not. In all cases where the new code was incorrect, regression tests were added. However, there may be some value in adding regression tests for cases where the old code was incorrect as well. That is done here. This does not cover every case where the new and old code yielded different results. Some of them were very obscure, and it is proving difficult even to reproduce them (since I did not keep a record of all the input strings which triggered the differing output).	2022-05-28 21:53:38 +02:00
Alex Dowad	4afa72126e	Implement fast text conversion interface for QPrint	2022-05-28 21:53:37 +02:00
Alex Dowad	3fda9f5095	Implement fast text conversion interface for HTML-ENTITIES	2022-05-28 21:53:37 +02:00
Alex Dowad	85690ae26d	Implement fast text conversion interface for Base64	2022-05-28 21:53:37 +02:00
Alex Dowad	7c2587b1f6	Implement fast text conversion interface for UUENCODE	2022-05-28 21:53:37 +02:00
Alex Dowad	321dbd0413	Implement fast text conversion interface for ISO-2022-JP-2004 There were bugs in the legacy implementation. Lots of them. It did not properly track whether it had switched to JISX 0213 plane 1 or plane 2. If it processed a character in plane 1 and then immediately one in plane 2, it failed to emit the escape code to switch to plane 2. Further, when converting codepoints from 0x80-0xFF to ISO-2022-JP-2004, the legacy implementation would totally disregard which mode it was operating in. Such codepoints would pass through directly to the output without any escape sequences being emitted. As if that were not enough, all the legacy implementations of JISX 0213:2004 encodings had another common bug; their 'flush function' did not call the next flush function in the chain of conversion filters. So if any of these encodings were converted to an encoding where the flush function was needed to finish the output string, then the output would be truncated.	2022-05-28 21:53:36 +02:00
Alex Dowad	67d83f57c1	Implement fast text conversion interface for mobile SJIS variants	2022-05-28 21:53:36 +02:00
Alex Dowad	0d635d93f5	Implement fast text conversion interface for UTF7-IMAP The old code would convert a 0x00 byte in the input to 0x00 in the output, but this clearly violates the RFC which defines UTF7-IMAP.	2022-05-28 21:53:35 +02:00
Alex Dowad	6cf30356e0	Implement fast text conversion interface for SJIS-mac	2022-05-28 21:53:35 +02:00
Alex Dowad	c9479899c6	Implement fast text conversion interface for ISO-2022-KR When working on this, I read RFC 1557 again and realized that the comment at the top of the file was totally mistaken. Further, the legacy code did not obey the RFC. (It would emit the "ESC $ ) C" sequence anywhere, not just at the beginning of a line as the RFC requires.) The new code obeys the RFC; one quirk is that it always emits the escape sequence at the beginning of each output string, even if the string is completely ASCII (in which case the escape sequence is allowed, but not required). The new code doesn't always generate the same number of error markers for invalid escapes as the old code did. The old code could not emit the special KDDI emoji for national flags. Further, there was a bug in the test which the old code used to determine whether an 0xF byte should be emitted at the end of a string (to switch back to ASCII mode). As a result, it would not always switch back to ASCII mode, meaning that it was not always safe to concatenate the resulting strings.	2022-05-28 21:53:35 +02:00
Alex Dowad	8b70a7db4f	Merge branch 'PHP-8.1' * PHP-8.1: mb_detect_encoding recognizes all letters in Hungarian alphabet	2022-05-25 13:10:23 +02:00
Alex Dowad	58d0aad75c	mb_detect_encoding recognizes all letters in Hungarian alphabet	2022-05-25 08:22:07 +02:00
Alex Dowad	212b31b51c	Merge branch 'PHP-8.1' * PHP-8.1: mb_detect_encoding recognizes all letters in Czech alphabet	2022-05-25 07:53:39 +02:00
Alex Dowad	6a4b6d2344	mb_detect_encoding recognizes all letters in Czech alphabet	2022-05-25 07:52:39 +02:00
Alex Dowad	83db088fc2	Merge branch 'PHP-8.1' * PHP-8.1: Fix mb_detect_encoding's recognition of Slavic names	2022-05-24 15:33:24 +02:00
Alex Dowad	9bb97ee8bc	Fix mb_detect_encoding's recognition of Slavic names Thanks to Côme Chilliet for reporting that mb_detect_encoding was not detecting the desired text encoding for strings containing š or Ž. These characters are used in Czech, Serbian, Croatian, Bosnian, Macedonian, etc. names.	2022-05-24 15:32:20 +02:00
Christoph M. Becker	c9787b4785	Fix skip clause The function required is called `mb_ereg_search()`.	2022-05-06 15:41:10 +02:00
Alex Dowad	3f12d26e3a	Merge branch 'PHP-8.1' * PHP-8.1: Error handling for UTF-8 complies with WHATWG specification	2022-04-16 20:32:12 +02:00
Alex Dowad	04e59c916f	Error handling for UTF-8 complies with WHATWG specification In `7502c86342`, I adjusted the number of error markers emitted on invalid UTF-8 text to be more consistent with mbstring's behavior on other text encodings (generally, it emits one error marker for one unexpected byte). I didn't expect that anybody would actually care one way or the other, but felt that it was better to be consistent than not. Later, Martin Auswöger kindly pointed out that the WHATWG encoding specification, which governs how various text encodings are handled by web browsers, does actually specify how many error markers should be generated for any given piece of invalid UTF-8 text. Until now, we have never really paid much attention to the WHATWG specification, but we do want to comply with as many relevant specifications as possible. And since PHP is commonly used for web applications, compatibility with the behavior of web browsers is obviously a good thing.	2022-04-16 15:04:38 +02:00
Christoph M. Becker	20c0eb47df	Merge branch 'PHP-8.1' * PHP-8.1: Fix GH-8208: mb_encode_mimeheader: $indent functionality broken	2022-03-17 17:35:06 +01:00
Christoph M. Becker	5003831260	Merge branch 'PHP-8.0' into PHP-8.1 * PHP-8.0: Fix GH-8208: mb_encode_mimeheader: $indent functionality broken	2022-03-17 17:34:31 +01:00
Christoph M. Becker	d0417ebc93	Fix GH-8208: mb_encode_mimeheader: $indent functionality broken We also need to factor in the indent, when getting the encoder result. Closes GH-8213.	2022-03-17 17:31:58 +01:00
Alex Dowad	ff76694f28	Merge branch 'PHP-8.1' * PHP-8.1: mb_check_encoding($str, '7bit') rejects strings with bytes over 0x7F	2022-02-22 23:58:57 +02:00
Alex Dowad	8a8533d263	mb_check_encoding($str, '7bit') rejects strings with bytes over 0x7F This was the old behavior of mb_check_encoding() before `3e7acf901d`, but yours truly broke it. If only we had more thorough tests at that time, this might not have slipped through the cracks. Thanks to divinity76 for the report.	2022-02-22 23:56:56 +02:00
Christoph M. Becker	58cbee1ce3	Merge branch 'PHP-8.1' * PHP-8.1: Fix GH-7902: mb_send_mail may delimit headers with LF only	2022-01-18 13:11:01 +01:00
Christoph M. Becker	69f6b09b2a	Merge branch 'PHP-8.0' into PHP-8.1 * PHP-8.0: Fix GH-7902: mb_send_mail may delimit headers with LF only	2022-01-18 13:09:52 +01:00
Christoph M. Becker	03816fba46	Fix GH-7902: mb_send_mail may delimit headers with LF only Email headers are supposed to be separated with CRLF. Period. We introduce a `CRLF` macro for better comprehensibility right away. Closes GH-7907.	2022-01-18 13:08:08 +01:00
Christoph M. Becker	51eec5086f	Run mb_send_mail tests on Windows, too We use the run-tests.php `{MAIL}` abstraction instead of `cat`. Closes GH-7908.	2022-01-07 22:46:02 +01:00
Alex Dowad	53ffba967c	Implement fast text conversion interface for CP5022{0,1,2}	2021-12-26 22:19:51 +02:00
Alex Dowad	c0936d48b0	Implement fast text conversion interface for UHC	2021-12-26 22:19:51 +02:00
Alex Dowad	40809cb19f	Implement fast text conversion interface for HZ	2021-12-26 22:19:51 +02:00
Alex Dowad	3c73225125	New internal interface for fast text conversion in mbstring When converting text to/from wchars, mbstring makes one function call for each and every byte or wchar to be converted. Typically, each of these conversion functions contains a state machine, and its state has to be restored and then saved for every single one of these calls. It doesn't take much to see that this is grossly inefficient. Instead of converting one byte or wchar on each call, the new conversion functions will either fill up or drain a whole buffer of wchars on each call. In benchmarks, this is about 3-10× faster. Adding the new, faster conversion functions for all supported legacy text encodings still needs some work. Also, all the code which uses the old-style conversion functions needs to be converted to use the new ones. After that, the old code can be dropped. (The mailparse extension will also have to be fixed up so it will still compile.)	2021-12-21 08:33:11 +02:00
Alex Dowad	edc6b756c1	Merge branch 'PHP-8.1' * PHP-8.1: mb_convert_encoding will not auto-detect input string as UUEncode, Base64, QPrint	2021-12-20 22:47:18 +02:00
Alex Dowad	f07c193583	mb_convert_encoding will not auto-detect input string as UUEncode, Base64, QPrint In `a2bc57e0e5`, mb_detect_encoding was modified to ensure it would never return 'UUENCODE', 'QPrint', or other non-encodings as the "detected text encoding". Before mb_detect_encoding was enhanced so that it could detect any supported text encoding, those were never returned, and they are not desired. Actually, we want to eventually remove them completely from mbstring, since PHP already contains other implementations of UUEncode, QPrint, Base64, and HTML entities. For more clarity on why we need to suppress UUEncode, etc. from being detected by mb_detect_encoding, the existing UUEncode implementation in mbstring never treats any input as erroneous. It just accepts everything. This means that it would always be treated as a valid choice by mb_detect_encoding, and would be returned in many, many cases where the input is obviously not UUEncoded. It turns out that the form of mb_convert_encoding where the user passes multiple candidate encodings (and mbstring auto-detects which one to use) was also affected by the same issue. Apply the same fix.	2021-12-20 22:09:33 +02:00
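
The decoding rules described in the mb_decode_numericentity commits above (76a92c26e3 and 5d6bd557b3) can be illustrated with a small sketch. This is not mbstring's C code. It is a Python approximation that handles decimal entities only and replaces the 'convmap' parameter with a single hypothetical inclusive range (`lo`, `hi`), with no offset or mask and no limit on digit count. The function name `decode_decimal_entities` is invented for this example.

```python
def decode_decimal_entities(text: str, lo: int = 0, hi: int = 0x10FFFF) -> str:
    """Decode decimal entities such as '&#65;' (sketch, not the mbstring code).

    Assumptions: 'lo'/'hi' stand in for the real 'convmap' parameter, and
    'hi' must not exceed 0x10FFFF so that chr() accepts every decoded value.
    """
    out = []
    i, n = 0, len(text)
    while i < n:
        if text.startswith("&#", i):
            # Collect ASCII digits; any non-digit or end of string ends the entity.
            j = i + 2
            while j < n and "0" <= text[j] <= "9":
                j += 1
            digits = text[i + 2:j]
            if digits and lo <= int(digits) <= hi:
                out.append(chr(int(digits)))
                # A terminating ';' is optional; consume it when present.
                if j < n and text[j] == ";":
                    j += 1
            else:
                # Invalid or empty entity: pass it through unchanged.
                out.append(text[i:j])
            # Resume scanning right after the entity, so an entity that
            # immediately follows is still decoded.
            i = j
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    for sample in ["&#65", "&&#65;", "&#&#65;", "&#65&#65;", "&#99999999&#65;"]:
        print(repr(sample), "->", repr(decode_decimal_entities(sample)))
```

Under these assumptions, a truncated entity at the end of the string is decoded, and an entity that directly follows `&`, `&#`, an unterminated entity, or an out-of-range entity is decoded as well, matching the cases listed in the commit messages.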

777 Commits