archived-php-src

mirror of https://github.com/php/php-src.git synced 2026-03-26 01:02:25 +01:00

Author	SHA1	Message	Date
Christoph M. Becker	c2bdaa48e1	Fix GH-9008: mb_detect_encoding(): wrong results with null $encodings Passing `null` to `$encodings` is supposed to behave like passing the result of `mb_detect_order()`. Therefore, we need to remove the non- encodings from the `elist` in this case as well. Thus, we duplicate the global `elist`, so we can modify it. Closes GH-9063.	2022-07-20 16:58:55 +02:00
Alex Dowad	2dc9026cbc	Restore backwards-compatible mappings of 0x5C and 0x7E in SJIS According to the relevant Japan Industrial Standards Committee standards, SJIS 0x5C is a Yen sign, and 0x7E is an overline. However, this conflicts with the implementation of SJIS in various legacy software (notably Microsoft products), where SJIS 0x5C and 0x7E are taken as equivalent to the same ASCII bytes. Prior to PHP 8.1, mbstring's implementation of SJIS handled these bytes compatibly with Microsoft products. This was changed in PHP 8.1.0, in an attempt to comply with the JISC specifications. However, after discussion with various concerned Japanese developers, it seems that the historical behavior was more useful in the majority of applications which process SJIS-encoded text. Since we are now treating SJIS 0x5C as equivalent to U+005C and 0x7E as equivalent to U+007E, it does not make sense to convert U+203E (OVERLINE) to 0x7E, nor does it make sense to convert U+00A5 (YEN SIGN) to 0x5C. Restore the mappings for those codepoints from before PHP 8.1.0.	2022-06-10 21:04:36 +02:00
Alex Dowad	58d0aad75c	mb_detect_encoding recognizes all letters in Hungarian alphabet	2022-05-25 08:22:07 +02:00
Alex Dowad	6a4b6d2344	mb_detect_encoding recognizes all letters in Czech alphabet	2022-05-25 07:52:39 +02:00
Alex Dowad	9bb97ee8bc	Fix mb_detect_encoding's recognition of Slavic names Thanks to Côme Chilliet for reporting that mb_detect_encoding was not detecting the desired text encoding for strings containing š or Ž. These characters are used in Czech, Serbian, Croatian, Bosnian, Macedonian, etc. names.	2022-05-24 15:32:20 +02:00
Alex Dowad	04e59c916f	Error handling for UTF-8 complies with WHATWG specification In `7502c86342`, I adjusted the number of error markers emitted on invalid UTF-8 text to be more consistent with mbstring's behavior on other text encodings (generally, it emits one error marker for one unexpected byte). I didn't expect that anybody would actually care one way or the other, but felt that it was better to be consistent than not. Later, Martin Auswöger kindly pointed out that the WHATWG encoding specification, which governs how various text encodings are handled by web browsers, does actually specify how many error markers should be generated for any given piece of invalid UTF-8 text. Until now, we have never really paid much attention to the WHATWG specification, but we do want to comply with as many relevant specifications as possible. And since PHP is commonly used for web applications, compatibility with the behavior of web browsers is obviously a good thing.	2022-04-16 15:04:38 +02:00
Christoph M. Becker	5003831260	Merge branch 'PHP-8.0' into PHP-8.1 * PHP-8.0: Fix GH-8208: mb_encode_mimeheader: $indent functionality broken	2022-03-17 17:34:31 +01:00
Christoph M. Becker	d0417ebc93	Fix GH-8208: mb_encode_mimeheader: $indent functionality broken We also need to factor in the indent, when getting the encoder result. Closes GH-8213.	2022-03-17 17:31:58 +01:00
Alex Dowad	8a8533d263	mb_check_encoding($str, '7bit') rejects strings with bytes over 0x7F This was the old behavior of mb_check_encoding() before `3e7acf901d`, but yours truly broke it. If only we had more thorough tests at that time, this might not have slipped through the cracks. Thanks to divinity76 for the report.	2022-02-22 23:56:56 +02:00
Christoph M. Becker	69f6b09b2a	Merge branch 'PHP-8.0' into PHP-8.1 * PHP-8.0: Fix GH-7902: mb_send_mail may delimit headers with LF only	2022-01-18 13:09:52 +01:00
Christoph M. Becker	03816fba46	Fix GH-7902: mb_send_mail may delimit headers with LF only Email headers are supposed to be separated with CRLF. Period. We introduce a `CRLF` macro for better comprehensibility right away. Closes GH-7907.	2022-01-18 13:08:08 +01:00
Alex Dowad	f07c193583	mb_convert_encoding will not auto-detect input string as UUEncode, Base64, QPrint In `a2bc57e0e5`, mb_detect_encoding was modified to ensure it would never return 'UUENCODE', 'QPrint', or other non-encodings as the "detected text encoding". Before mb_detect_encoding was enhanced so that it could detect any supported text encoding, those were never returned, and they are not desired. Actually, we want to eventually remove them completely from mbstring, since PHP already contains other implementations of UUEncode, QPrint, Base64, and HTML entities. For more clarity on why we need to suppress UUEncode, etc. from being detected by mb_detect_encoding, the existing UUEncode implementation in mbstring never treats any input as erroneous. It just accepts everything. This means that it would always be treated as a valid choice by mb_detect_encoding, and would be returned in many, many cases where the input is obviously not UUEncoded. It turns out that the form of mb_convert_encoding where the user passes multiple candidate encodings (and mbstring auto-detects which one to use) was also affected by the same issue. Apply the same fix.	2021-12-20 22:09:33 +02:00
Christoph M. Becker	929d847152	Fix #81693 : mb_check_encoding(7bit) segfaults `php_mb_check_encoding()` now uses conversion to `mbfl_encoding_wchar`. Since `mbfl_encoding_7bit` has no `input_filter`, no filter can be found. Since we don't actually need to convert to wchar, we encode to 8bit. Closes GH-7712.	2021-12-03 22:49:47 +01:00
Alex Dowad	1a2c608053	Add unit tests for mb_detect_encoding on Polish text	2021-11-26 17:42:53 +02:00
Alex Dowad	a2bc57e0e5	mb_detect_encoding will not return non-encodings Among the text encodings supported by mbstring are several which are not really 'text encodings'. These include Base64, QPrint, UUencode, HTML entities, '7 bit', and '8 bit'. Rather than providing an explicit list of text encodings which they are interested in, users may pass the output of mb_list_encodings to mb_detect_encoding. Since Base64, QPrint, and so on are included in the output of mb_list_encodings, mb_detect_encoding can return one of these as its 'detected encoding' (and in fact, this often happens). Before mb_detect_encoding was enhanced so it could detect any of the supported text encodings, this did not happen, and it is never desired.	2021-10-19 18:05:52 +02:00
Alex Dowad	28b346bc06	Improve detection accuracy of mb_detect_encoding Originally, `mb_detect_encoding` essentially just checked all candidate encodings to see which ones the input string was valid in. However, it was only able to do this for a limited few of all the text encodings which are officially supported by mbstring. In `3e7acf901d`, I modified it so it could 'detect' any text encoding supported by mbstring. While this is arguably an improvement, if the only text encodings one is interested in are those which `mb_detect_encoding` could originally handle, the old `mb_detect_encoding` may have been preferable. Because the new one has more possible encodings which it can guess, it also has more chances to get the answer wrong. This commit adjusts the detection heuristics to provide accurate detection in a wider variety of scenarios. While the previous detection code would frequently confuse UTF-32BE with UTF-32LE or UTF-16BE with UTF-16LE, the adjusted code is extremely accurate in those cases. Detection for Chinese text in Chinese encodings like GB18030 or BIG5 and for Japanese text in Japanese encodings like EUC-JP or SJIS is greatly improved. Detection of UTF-7 is also greatly improved. An 8KB table, with one bit for each codepoint from U+0000 up to U+FFFF, is used to achieve this. One significant constraint is that the heuristics are completely based on looking at each codepoint in a string in isolation, treating some codepoints as 'likely' and others as 'unlikely'. It might still be possible to achieve great gains in detection accuracy by looking at sequences of codepoints rather than individual codepoints. However, this might require huge tables. Further, we might need a huge corpus of text in various languages to derive those tables. Accuracy is still dismal when trying to distinguish single-byte encodings like ISO-8859-1, ISO-8859-2, KOI8-R, and so on. This is because the valid bytes in these encodings are basically all the same, and all valid bytes decode to 'likely' codepoints, so our method of detection (which is based on rating codepoints as likely or unlikely) cannot tell any difference between the candidates at all. It just selects the first encoding in the provided list of candidates. Speaking of which, if one wants to get good results from `mb_detect_encoding`, it is important to order the list of candidate encodings according to your prior belief of which are more likely to be correct. When the function cannot tell any difference between two candidates, it returns whichever appeared earlier in the array.	2021-10-19 18:05:51 +02:00
Alex Dowad	c25a1ef8d0	Bug #81390 : mb_detect_encoding should not prematurely stop processing input As a performance optimization, mb_detect_encoding tries to stop processing the input string early when there is only one 'candidate' encoding which the input string is valid in. However, the code which keeps count of how many candidate encodings have already been rejected was buggy. This caused mb_detect_encoding to prematurely stop processing the input when it should have continued. As a result, it did not notice that in the test case provided by Alec, the input string was not valid in UTF-16.	2021-09-20 11:21:39 +02:00
Alex Dowad	df32267494	Add more tests for UTF7-IMAP text conversion	2021-08-31 13:41:34 +02:00
Alex Dowad	16a1e0a219	In UTF7-IMAP, reject the 2nd part of surrogate pair if it appears unexpectedly	2021-08-31 13:41:34 +02:00
Alex Dowad	355464935d	Add another test for UTF-7 text conversion	2021-08-31 13:41:34 +02:00
Alex Dowad	51b6c687db	Add another test for GB18030 text conversion	2021-08-31 13:41:34 +02:00
Alex Dowad	a0415b22ab	Add more tests for CP5022{0,1,2} text conversion	2021-08-31 13:41:34 +02:00
Alex Dowad	e3f6a9fbfe	CP5022{0,1,2} supports 'IBM extension' codes from ku 115-119 mbstring has always had the conversion tables to support CP932 codes in ku 115-119, and the conversion code for CP5022x has an 'if' clause specifically to handle such characters... but that 'if' clause was dead code, since a guard clause earlier in the same function prevented it from accepting 2-byte characters with a starting byte of 0x93-0x97. Adjust the guard clause so that these characters can be converted as the original author apparently intended. The code which handles ku 115-119 is the part which reads: } else if (s >= cp932ext3_ucs_table_min && s < cp932ext3_ucs_table_max) { w = cp932ext3_ucs_table[s - cp932ext3_ucs_table_min];	2021-08-31 13:41:34 +02:00
Alex Dowad	671dcee01e	Add test for mb_str_split on UCS-2 text	2021-08-31 13:41:34 +02:00
Alex Dowad	776296e12f	mbstring no longer provides 'long' substitutions for erroneous input bytes Previously, mbstring had a special mode whereby it would convert erroneous input byte sequences to output like "BAD+XXXX", where "XXXX" would be the erroneous bytes expressed in hexadecimal. This mode could be enabled by calling `mb_substitute_character("long")`. However, accurately reproducing input byte sequences from the cached state of a conversion filter is often tricky, and this significantly complicates the implementation. Further, the means used for passing the erroneous bytes through to where the "BAD+XXXX" text is generated only allows for up to 3 bytes to be passed, meaning that some erroneous byte sequences are truncated anyways. More to the point, a search of publically available PHP code indicates that nobody is really using this feature anyways. Incidentally, this feature also provided error output like "JIS+XXXX" if the input 'should have' represented a JISX 0208 codepoint, but it decodes to a codepoint which does not exist in the JISX 0208 charset. Similarly, specific error output was provided for non-existent JISX 0212 codepoints, and likewise for JISX 0213, CP932, and a few other charsets. All of that is now consigned to the flames. However, "long" error markers also include a somewhat more useful "U+XXXX" marker for Unicode codepoints which were successfully decoded from the input text, but cannot be represented in the output encoding. Those are still supported. With this change, there is no need to use a variety of special values in the high bits of a wchar to represent different types of error values. We can (and will) just use a single error value. This will be equal to -1. One complicating factor: Text conversion functions return an integer to indicate whether the conversion operation should be immediately aborted, and the magic 'abort' marker is -1. Also, almost all of these functions would return the received byte/codepoint to indicate success. That doesn't work with the new error value; if an input filter detects an error and passes -1 to the output filter, and the output filter returns it back, that would be taken to mean 'abort'. Therefore, amend all these functions to return 0 for success.	2021-08-31 13:41:34 +02:00
Alex Dowad	15ba73cee3	Add more tests for UTF-8 text conversion	2021-08-30 16:29:58 +02:00
Alex Dowad	51a32ccaf4	Add another test for UTF-16LE	2021-08-30 16:29:58 +02:00
Alex Dowad	7472c82c45	Add tests for UCS-4 text conversion	2021-08-30 16:29:58 +02:00
Alex Dowad	79015b23aa	Add tests for UCS-2 text encoding	2021-08-30 16:29:58 +02:00
Alex Dowad	34ef8f3ca2	Add tests for '7bit' and '8bit' text encodings in mbstring	2021-08-30 16:29:58 +02:00
Alex Dowad	e6f1a72235	Add test suite for mobile variants of UTF-8 (and fix bugs)	2021-08-30 16:29:58 +02:00
Alex Dowad	1865576694	Add test suite for EUC-JP-WIN (or EUC-JP-MS) text encoding (and fix bugs)	2021-08-30 16:29:58 +02:00
Alex Dowad	0de4d6872e	Add more tests for SJIS-2004 text conversion	2021-08-30 16:29:58 +02:00
Alex Dowad	c7d47cbb4c	Add more tests for SJIS text conversion	2021-08-30 16:29:58 +02:00
Alex Dowad	299690a1cf	Add more tests for ISO-2022-JP/JIS7/JIS8 text conversion	2021-08-30 16:29:58 +02:00
Alex Dowad	b2be85d11a	Add more tests for ISO-2022-JP-MS text conversion	2021-08-30 16:29:58 +02:00
Alex Dowad	ae4c956089	Add more tests for ISO-2022-JP-KDDI text conversion	2021-08-30 16:29:58 +02:00
Alex Dowad	51e0d323e4	ISO-2022-JP-MS treats truncated multi-byte chars as error Sigh. I included tests which were intended to check this case in the test suite for ISO-2022-JP-MS, but those tests were faulty and didn't actually test what they were supposed to. Fixing the tests revealed that there were still bugs in this area.	2021-08-30 16:29:58 +02:00
Alex Dowad	57a81af041	ISO-2022-JP-KDDI text conversion doesn't swallow PUA codepoints There was a bit of legacy code here which looks like the original author of mbstring intended to allow conversion of Unicode Private Use Area codepoints to ISO-2022-JP-KDDI. However, that code never worked. It set the output variable to values which were not matched by any of the 'if' clauses below, which meant that nothing was actually emitted to the output. In other words, if one tried to convert Unicode to ISO-2022-JP-KDDI, and the Unicode string contained PUA codepoints, they would be quietly 'swallowed' and disappear. I don't know what ISO-2022-JP-KDDI byte sequences the author wanted to map those PUA codepoints to, and anyways, this use case is so obscure that there is little point in worrying about it. However, it is better to remove the non-functioning code than to leave it in. This means that if now one tries to convert PUA codepoints to ISO-2022-JP-KDDI, those codepoints will be treated as erroneous rather than silently ignored.	2021-08-30 16:29:58 +02:00
Alex Dowad	51b9d7a5e1	Test behavior of 'long' illegal character markers After mb_substitute_character("long"), mbstring will respond to erroneous input by inserting 'long' error markers into the output. Depending on the situation, these error markers will either look like BAD+XXXX (for general bad input), U+XXXX (when the input is OK, but it converts to Unicode codepoints which cannot be represented in the output encoding), or an encoding-specific marker like JISX+XXXX or W932+XXXX. We have almost no tests for this feature. Add a bunch of tests to ensure that all our legacy encoding handlers work in a reasonable way when 'long' error markers are enabled.	2021-08-30 16:29:58 +02:00
Nikita Popov	43cb2548f7	Flush filter during non-strict encoding detection If we reach the end of the string without reducing to a single encoding, then we should flush to check whether the last character is incomplete.	2021-08-27 14:48:32 +02:00
Nikita Popov	14173186db	Add EXTENSIONS section	2021-08-11 14:03:18 +02:00
Nikita Popov	28500fe4ef	Fixed bug #81349 The ascii to wchar was reporting errors using conv_illegal_output, while it should have been using WCSGROUP_THROUGH. Effectively that replaced illegal characters with '?' for the purpose of identification.	2021-08-11 11:37:02 +02:00
Nikita Popov	89aa42c74b	Add missing EXTENSIONS section	2021-07-28 12:38:41 +02:00
Nikita Popov	a1c1ee6a48	Don't use opaque for encoding detection score opaque is used by the htmlentities filter, which means that we end up trying to free the score value as a pointer. Don't try to be overly tricky here and simply allocate a separate structure to hold the number of illegal characters and the score.	2021-07-28 10:54:27 +02:00
Nikita Popov	9d0db2e98a	Fixed bug #81298 Creation of the filter may fail for some special encodings, for which detection is not supported.	2021-07-28 10:11:46 +02:00
Alex Dowad	13136a575d	Fix conversion of GB18030 text (and add test suite) - Truncated multi-byte characters are treated as an error - Reject GB18030 4-byte codes which translate to (non-existent) Unicode codepoints above 0x10FFFF - Add a number of missing mappings from the GB18030 standards (These mappings are supported by iconv. I don't know why they were missing from mbstring.)	2021-07-19 12:17:00 +02:00
Alex Dowad	73c6a5b89d	Fix conversion of Big5 and CP950 text (and add test suite) - Truncated multi-byte characters are treated as an error - Follow recommended mappings from Unicode consortium	2021-07-19 12:17:00 +02:00
Nikita Popov	639015845f	Deprecate calling mb_check_encoding() without argument Part of https://wiki.php.net/rfc/deprecations_php_8_1.	2021-07-08 15:34:49 +02:00
Alex Dowad	b626e893ff	Fix conversion of ISO-2022-KR text (and add test suite) - Truncated multi-byte characters are treated as an error - Truncated or unrecognized escape sequences are treated as an error - ASCII control characters are not allowed to appear in the middle of a multi-byte character	2021-07-05 16:28:16 +02:00
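
Commit 28b346bc06 above mentions an 8KB table with one bit for each codepoint from U+0000 up to U+FFFF, used to rate codepoints as 'likely' or 'unlikely'. The following is only a minimal sketch of how such a bit table could be laid out in C. The names `likely_table`, `mark_likely_range`, `is_likely` and `count_unlikely` are hypothetical, the table contents are left to the caller, and neither the table contents nor the scoring are taken from mbstring's actual implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not mbstring's code.
 * One bit per codepoint in U+0000..U+FFFF: 65536 bits / 8 = 8192 bytes (8KB). */
#define BMP_CODEPOINTS 0x10000u

static uint8_t likely_table[BMP_CODEPOINTS / 8];

/* Mark every codepoint in first..last (inclusive) as 'likely'.
 * Codepoints at or above U+10000 are outside the table and are skipped. */
void mark_likely_range(uint32_t first, uint32_t last)
{
    for (uint32_t cp = first; cp <= last && cp < BMP_CODEPOINTS; cp++) {
        likely_table[cp >> 3] |= (uint8_t)(1u << (cp & 7));
    }
}

/* Return 1 if the codepoint's bit is set, 0 otherwise.
 * Assumption of this sketch: anything outside the table counts as 'unlikely'. */
int is_likely(uint32_t cp)
{
    if (cp >= BMP_CODEPOINTS) {
        return 0;
    }
    return (likely_table[cp >> 3] >> (cp & 7)) & 1;
}

/* Look at each decoded codepoint in isolation and count the 'unlikely' ones.
 * A candidate encoding whose decoding yields a lower count would rank higher. */
size_t count_unlikely(const uint32_t *codepoints, size_t len)
{
    size_t unlikely = 0;
    for (size_t i = 0; i < len; i++) {
        if (!is_likely(codepoints[i])) {
            unlikely++;
        }
    }
    return unlikely;
}
```

In this sketch, a caller would first fill the table with `mark_likely_range()`, then decode the input under each candidate encoding and compare the `count_unlikely()` results. As the commit message notes, a per-codepoint rating like this cannot separate candidates whose valid bytes all decode to 'likely' codepoints, so ties would still fall to whichever candidate was listed first.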


718 Commits