archived-php-src

mirror of https://github.com/php/php-src.git synced 2026-04-18 05:21:02 +02:00

Author	SHA1	Message	Date
Alex Dowad	9962aa9774	Merge branch 'PHP-8.1' * PHP-8.1: mb_detect_encoding will not return non-encodings Improve detection accuracy of mb_detect_encoding	2021-10-19 18:11:35 +02:00
Alex Dowad	28b346bc06	Improve detection accuracy of mb_detect_encoding Originally, `mb_detect_encoding` essentially just checked all candidate encodings to see which ones the input string was valid in. However, it was only able to do this for a limited few of all the text encodings which are officially supported by mbstring. In `3e7acf901d`, I modified it so it could 'detect' any text encoding supported by mbstring. While this is arguably an improvement, if the only text encodings one is interested in are those which `mb_detect_encoding` could originally handle, the old `mb_detect_encoding` may have been preferable. Because the new one has more possible encodings which it can guess, it also has more chances to get the answer wrong. This commit adjusts the detection heuristics to provide accurate detection in a wider variety of scenarios. While the previous detection code would frequently confuse UTF-32BE with UTF-32LE or UTF-16BE with UTF-16LE, the adjusted code is extremely accurate in those cases. Detection for Chinese text in Chinese encodings like GB18030 or BIG5 and for Japanese text in Japanese encodings like EUC-JP or SJIS is greatly improved. Detection of UTF-7 is also greatly improved. An 8KB table, with one bit for each codepoint from U+0000 up to U+FFFF, is used to achieve this. One significant constraint is that the heuristics are completely based on looking at each codepoint in a string in isolation, treating some codepoints as 'likely' and others as 'unlikely'. It might still be possible to achieve great gains in detection accuracy by looking at sequences of codepoints rather than individual codepoints. However, this might require huge tables. Further, we might need a huge corpus of text in various languages to derive those tables. Accuracy is still dismal when trying to distinguish single-byte encodings like ISO-8859-1, ISO-8859-2, KOI8-R, and so on. 
This is because the valid bytes in these encodings are basically all the same, and all valid bytes decode to 'likely' codepoints, so our method of detection (which is based on rating codepoints as likely or unlikely) cannot tell any difference between the candidates at all. It just selects the first encoding in the provided list of candidates. Speaking of which, if one wants to get good results from `mb_detect_encoding`, it is important to order the list of candidate encodings according to your prior belief of which are more likely to be correct. When the function cannot tell any difference between two candidates, it returns whichever appeared earlier in the array.	2021-10-19 18:05:51 +02:00
Alex Dowad	dcaa010fff	Strict validation of conversion flags to mb_convert_kana mb_convert_kana is controlled by user-provided flags, which specify what it should convert and to what. These flags come in inverse pairs, for example "fullwidth numerals to halfwidth numerals" and "halfwidth numerals to fullwidth numerals". It does not make sense to combine inverse flags. But, clever reader of commit logs, you will surely say: What if I want all my halfwidth numerals to become fullwidth, and all my fullwidth numerals to become halfwidth? Much too clever, you are! Let's put aside the fact that this bizarre switch-up is ridiculous and will never be used, and face up to another stark reality: mb_convert_kana does not work for that case, and never has. This was probably never noticed because nobody ever tried. Disallowing useless combinations of flags gives freedom to rearrange the kana conversion code without changing behavior. We can also reject unrecognized flags. This may help users to catch bugs. Interestingly, the existing tests used a 'Z' flag, which is useless (it's not recognized at all).	2021-10-01 19:27:39 +02:00
Alex Dowad	0b32a15eb0	Optimize mb_str{,im}width for performance Rather than doing a linear search of a table of fullwidth codepoint ranges for every input character, 1) Short-cut the search if the codepoint is below the first such range 2) Otherwise, do a binary (rather than linear) search	2021-09-29 18:19:01 +02:00
Alex Dowad	f4365d2c26	Remove unused typedef 'mbfl_encoding_id'	2021-09-29 18:19:01 +02:00
Alex Dowad	3bf431969e	Don't check for impossible error condition in mb_substr_count	2021-09-29 18:19:01 +02:00
Alex Dowad	8c32deb605	Don't check for impossible error condition in mb_strwidth	2021-09-29 18:19:01 +02:00
Alex Dowad	bf78070cbe	Don't check for impossible error condition in mb_strlen	2021-09-29 18:19:01 +02:00
Alex Dowad	07c4b3b8c0	Simplify code for handling mbstring language aliases Rather than using pointers to pointers to pointers (3 levels of indirection), what makes sense is two levels. This reduces unnecessary pointer dereference operations.	2021-09-20 11:27:54 +02:00
Alex Dowad	2f096c4039	Remove useless constant MBFL_ENCTYPE_MWC2	2021-09-20 11:27:54 +02:00
Alex Dowad	68176fdfb1	Use char literals in HTML numeric entity {en,de}coding functions	2021-09-20 11:27:54 +02:00
Alex Dowad	f663344f33	Merge branch 'PHP-8.1' * PHP-8.1: Bug #81390: mb_detect_encoding should not prematurely stop processing input mb_detect_encoding with only one candidate encoding uses mb_check_encoding Optimize text encoding detection for speed (eliminate Unicode property lookups)	2021-09-20 11:27:07 +02:00
Alex Dowad	c25a1ef8d0	Bug #81390: mb_detect_encoding should not prematurely stop processing input As a performance optimization, mb_detect_encoding tries to stop processing the input string early when there is only one 'candidate' encoding which the input string is valid in. However, the code which keeps count of how many candidate encodings have already been rejected was buggy. This caused mb_detect_encoding to prematurely stop processing the input when it should have continued. As a result, it did not notice that in the test case provided by Alec, the input string was not valid in UTF-16.	2021-09-20 11:21:39 +02:00
Alex Dowad	6acd4f7f3a	Optimize text encoding detection for speed (eliminate Unicode property lookups) ...By just testing the input codepoints if they are within a few fixed ranges instead. This avoids hash lookups in property tables. From (micro-)benchmarking on my PC, this looks to be a bit less than 4x faster than the existing code.	2021-09-20 11:20:53 +02:00
Nikita Popov	e740907ec9	Merge branch 'PHP-8.1' * PHP-8.1: Update Unicode tables to 14.0.0	2021-09-20 09:58:32 +02:00
Colin O'Dell	fe36b81d5e	Update Unicode tables to 14.0.0 Closes GH-7502.	2021-09-20 09:58:20 +02:00
Alex Dowad	92fb3de9d7	Remove unused MBFL_FILT_TL_*_MASK constants Sending more unused, unneeded, unwanted, unrequired, unloved and uncalled-for code where it belongs.	2021-09-06 13:16:23 +02:00
Alex Dowad	9e1447dbf3	Rename KANA2HIRA and HIRA2KANA constants (for mb_convert_kana) mb_convert_kana is able to convert fullwidth katakana to fullwidth hiragana (and vice versa). The constants referring to these modes had names like MBFL_FILT_TL_ZEN2HAN_KANA2HIRA. The "ZEN2HAN" part of the name is misleading, since these modes do not convert fullwidth (zenkaku) kana to halfwidth (hankaku). The converted characters are fullwidth both before and after the conversion. So... let's name the constants accordingly.	2021-09-06 13:16:23 +02:00
Alex Dowad	c8e65c9d74	Remove COMPAT2 conversion modes for mb_convert_kana mb_convert_kana has conversion modes selected using 'M'/'m', which convert a few various punctuation and symbol characters between 'ordinary' and full-width forms. The constants which refer to these modes have names ending with COMPAT1. Internally, there are similar conversion modes with names ending in COMPAT2. They are like COMPAT1 modes, but they operate on a smaller set of characters. But... that is all just dead code, because there is no way for user code to select the COMPAT2 modes. I have no idea what the original author intended those COMPAT2 modes to actually be used for. Guess it doesn't really matter, anyways. At this point, it's just more food for the flames.	2021-09-06 13:16:23 +02:00
Alex Dowad	d2f5a8b328	Add more tests for SJIS-mac text conversion	2021-09-06 13:16:23 +02:00
Alex Dowad	0957f54eb1	Treat truncated escape sequences for CP5022{0,1,2} as error	2021-09-06 13:16:23 +02:00
Alex Dowad	64e379d81e	Declare CP50222 flush function as 'static'	2021-09-06 13:16:23 +02:00
Alex Dowad	a312620607	Remove redundant NULL checks in mbstring Whoever originally wrote mbstring seems to have a deathly fear of NULL pointers lurking behind every corner. A common pattern is that one function will check if a pointer is NULL, then pass it to another function, which will again check if it is NULL, then pass to yet another function, which will yet again check if it is NULL... it's NULL checks all the way down. Remove all the NULL checks in places where pointers could not possibly be NULL.	2021-09-06 13:16:23 +02:00
Alex Dowad	626f0fec54	Remove some dead code from mbstring mbstring has a great deal of dead code. Some common types are: - Default switch clauses which will never be taken - If clauses intended to convert codepoints which were not present in a conversion table... but the codepoint in question is in the table, so the if clause is not needed. - Bounds checks in places where it is not possible for a value to ever be out of bounds. - Checks to see if an unmatched Unicode codepoint is in CP932 extension range 3... but every codepoint in range 3 is also in range 2, so no codepoint will ever be matched and converted by that code.	2021-09-06 13:16:23 +02:00
Alex Dowad	16a1e0a219	In UTF7-IMAP, reject the 2nd part of surrogate pair if it appears unexpectedly	2021-08-31 13:41:34 +02:00
Alex Dowad	e3f6a9fbfe	CP5022{0,1,2} supports 'IBM extension' codes from ku 115-119 mbstring has always had the conversion tables to support CP932 codes in ku 115-119, and the conversion code for CP5022x has an 'if' clause specifically to handle such characters... but that 'if' clause was dead code, since a guard clause earlier in the same function prevented it from accepting 2-byte characters with a starting byte of 0x93-0x97. Adjust the guard clause so that these characters can be converted as the original author apparently intended. The code which handles ku 115-119 is the part which reads: } else if (s >= cp932ext3_ucs_table_min && s < cp932ext3_ucs_table_max) { w = cp932ext3_ucs_table[s - cp932ext3_ucs_table_min];	2021-08-31 13:41:34 +02:00
Alex Dowad	f303fc8a9b	Use bool in mbfl_filt_conv_output_hex (rather than int)	2021-08-31 13:41:34 +02:00
Alex Dowad	776296e12f	mbstring no longer provides 'long' substitutions for erroneous input bytes Previously, mbstring had a special mode whereby it would convert erroneous input byte sequences to output like "BAD+XXXX", where "XXXX" would be the erroneous bytes expressed in hexadecimal. This mode could be enabled by calling `mb_substitute_character("long")`. However, accurately reproducing input byte sequences from the cached state of a conversion filter is often tricky, and this significantly complicates the implementation. Further, the means used for passing the erroneous bytes through to where the "BAD+XXXX" text is generated only allows for up to 3 bytes to be passed, meaning that some erroneous byte sequences are truncated anyways. More to the point, a search of publicly available PHP code indicates that nobody is really using this feature anyways. Incidentally, this feature also provided error output like "JIS+XXXX" if the input 'should have' represented a JISX 0208 codepoint, but it decodes to a codepoint which does not exist in the JISX 0208 charset. Similarly, specific error output was provided for non-existent JISX 0212 codepoints, and likewise for JISX 0213, CP932, and a few other charsets. All of that is now consigned to the flames. However, "long" error markers also include a somewhat more useful "U+XXXX" marker for Unicode codepoints which were successfully decoded from the input text, but cannot be represented in the output encoding. Those are still supported. With this change, there is no need to use a variety of special values in the high bits of a wchar to represent different types of error values. We can (and will) just use a single error value. This will be equal to -1. One complicating factor: Text conversion functions return an integer to indicate whether the conversion operation should be immediately aborted, and the magic 'abort' marker is -1. Also, almost all of these functions would return the received byte/codepoint to indicate success. 
That doesn't work with the new error value; if an input filter detects an error and passes -1 to the output filter, and the output filter returns it back, that would be taken to mean 'abort'. Therefore, amend all these functions to return 0 for success.	2021-08-31 13:41:34 +02:00
Alex Dowad	97f8495e0f	UCS-4 conversion does not pass BOM through to output This is to match the way that we handle UCS-2. When a BOM is found at the beginning of a 'UCS-2' string (NOT 'UCS-2BE' or 'UCS-2LE'), we take note of the intended byte order and handle the string accordingly, but do NOT emit a BOM to the output. Rather, we just use the default byte order for the requested output encoding. Some might argue that if the input string used a BOM, and we are emitting output in a text encoding where both big-endian and little-endian byte orders are possible, we should include a BOM in the output string. To such hypothetical debaters of minutiae, I can only offer you a shoulder shrug. No reasonable program which handles UCS-2 and UCS-4 text should require a BOM. Really, the concept of the BOM is a poor idea and should not have been included in Unicode. Standardizing on a single byte order would have been much better, similar to 'network byte order' for the Internet Protocol. But this is not the place to speak at length of such things.	2021-08-30 16:29:58 +02:00
Alex Dowad	e6f1a72235	Add test suite for mobile variants of UTF-8 (and fix bugs)	2021-08-30 16:29:58 +02:00
Alex Dowad	1865576694	Add test suite for EUC-JP-WIN (or EUC-JP-MS) text encoding (and fix bugs)	2021-08-30 16:29:58 +02:00
Alex Dowad	6a693d2d33	Remove useless variable: mbfl_encoding_utf8_kddi_a_aliases	2021-08-30 16:29:58 +02:00
Alex Dowad	d4561894ea	Extraneous trailing UCS-4 bytes are treated as error	2021-08-30 16:29:58 +02:00
Alex Dowad	51e0d323e4	ISO-2022-JP-MS treats truncated multi-byte chars as error Sigh. I included tests which were intended to check this case in the test suite for ISO-2022-JP-MS, but those tests were faulty and didn't actually test what they were supposed to. Fixing the tests revealed that there were still bugs in this area.	2021-08-30 16:29:58 +02:00
Alex Dowad	57a81af041	ISO-2022-JP-KDDI text conversion doesn't swallow PUA codepoints There was a bit of legacy code here which looks like the original author of mbstring intended to allow conversion of Unicode Private Use Area codepoints to ISO-2022-JP-KDDI. However, that code never worked. It set the output variable to values which were not matched by any of the 'if' clauses below, which meant that nothing was actually emitted to the output. In other words, if one tried to convert Unicode to ISO-2022-JP-KDDI, and the Unicode string contained PUA codepoints, they would be quietly 'swallowed' and disappear. I don't know what ISO-2022-JP-KDDI byte sequences the author wanted to map those PUA codepoints to, and anyways, this use case is so obscure that there is little point in worrying about it. However, it is better to remove the non-functioning code than to leave it in. This means that if one now tries to convert PUA codepoints to ISO-2022-JP-KDDI, those codepoints will be treated as erroneous rather than silently ignored.	2021-08-30 16:29:58 +02:00
Alex Dowad	51b9d7a5e1	Test behavior of 'long' illegal character markers After mb_substitute_character("long"), mbstring will respond to erroneous input by inserting 'long' error markers into the output. Depending on the situation, these error markers will either look like BAD+XXXX (for general bad input), U+XXXX (when the input is OK, but it converts to Unicode codepoints which cannot be represented in the output encoding), or an encoding-specific marker like JISX+XXXX or W932+XXXX. We have almost no tests for this feature. Add a bunch of tests to ensure that all our legacy encoding handlers work in a reasonable way when 'long' error markers are enabled.	2021-08-30 16:29:58 +02:00
Alex Dowad	f6f0506c84	Correct comment in mbfilter_ucs4.c	2021-08-30 16:29:58 +02:00
Alex Dowad	03392ecd50	Simplify code for converting UHC to Unicode	2021-08-30 16:29:58 +02:00
Alex Dowad	9363b0b5a7	Declare ARMSCII-8 conversion functions as 'static'	2021-08-30 16:29:58 +02:00
Alex Dowad	97b7fc893c	Output illegal character marker for 4-byte illegal characters > 0x7FFFFFFF Some text encodings supported by mbstring (such as UCS-4) accept 4-byte characters. When mbstring encounters an illegal byte sequence for the encoding it is using, it should emit an 'illegal character' marker, which can either be a single character like '?', an HTML hexadecimal entity, or a marker string like 'BAD+XXXX'. Because of the use of signed integers to hold 4-byte characters, illegal 4-byte sequences with a 'negative' value (one with the high bit set) were not handled correctly when emitting the illegal char marker. The result is that such illegal sequences were just skipped over (and the marker was not emitted to the output). Fix that.	2021-08-30 16:29:58 +02:00
Nikita Popov	634f2e21d3	Don't expose wchar encoding to users (#7415) The "wchar" encoding isn't really an encoding -- it's what we internally use as the representation of decoded characters. In practice, it tends to behave a lot like the 8bit encoding when used from userland, because input code units end up being treated as code points. This patch removes the wchar encoding from the public encoding list and reserves it for internal use only.	2021-08-30 11:11:33 +02:00
Nikita Popov	43cb2548f7	Flush filter during non-strict encoding detection If we reach the end of the string without reducing to a single encoding, then we should flush to check whether the last character is incomplete.	2021-08-27 14:48:32 +02:00
Nikita Popov	28500fe4ef	Fixed bug #81349 The ascii to wchar was reporting errors using conv_illegal_output, while it should have been using WCSGROUP_THROUGH. Effectively that replaced illegal characters with '?' for the purpose of identification.	2021-08-11 11:37:02 +02:00
Nikita Popov	a1c1ee6a48	Don't use opaque for encoding detection score opaque is used by the htmlentities filter, which means that we end up trying to free the score value as a pointer. Don't try to be overly tricky here and simply allocate a separate structure to hold the number of illegal characters and the score.	2021-07-28 10:54:27 +02:00
Nikita Popov	9d0db2e98a	Fixed bug #81298 Creation of the filter may fail for some special encodings, for which detection is not supported.	2021-07-28 10:11:46 +02:00
Alex Dowad	26fc7c4256	Fix typo in mbfilter.h As pointed out by Bruno Haible (https://haible.de/bruno).	2021-07-19 12:17:00 +02:00
Alex Dowad	13136a575d	Fix conversion of GB18030 text (and add test suite) - Truncated multi-byte characters are treated as an error - Reject GB18030 4-byte codes which translate to (non-existent) Unicode codepoints above 0x10FFFF - Add a number of missing mappings from the GB18030 standards (These mappings are supported by iconv. I don't know why they were missing from mbstring.)	2021-07-19 12:17:00 +02:00
Alex Dowad	340164bcc9	Reduce size of conversion tables for CP936	2021-07-19 12:17:00 +02:00
Alex Dowad	73c6a5b89d	Fix conversion of Big5 and CP950 text (and add test suite) - Truncated multi-byte characters are treated as an error - Follow recommended mappings from Unicode consortium	2021-07-19 12:17:00 +02:00
Alex Dowad	b626e893ff	Fix conversion of ISO-2022-KR text (and add test suite) - Truncated multi-byte characters are treated as an error - Truncated or unrecognized escape sequences are treated as an error - ASCII control characters are not allowed to appear in the middle of a multi-byte character	2021-07-05 16:28:16 +02:00
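
Commit 0b32a15eb0 above describes replacing a per-character linear scan of the fullwidth codepoint ranges with an early exit followed by a binary search. The following is a minimal sketch of that idea in C, not the code from the commit: the names `width_range`, `demo_fullwidth_ranges`, `is_fullwidth`, and `demo_self_check` are hypothetical, and the five ranges are a small illustrative sample rather than mbstring's actual table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical inclusive range of codepoints that occupy two columns. */
struct width_range {
    uint32_t first;
    uint32_t last;
};

/* Illustrative sample only (sorted, non-overlapping); NOT mbstring's table. */
static const struct width_range demo_fullwidth_ranges[] = {
    { 0x1100, 0x115F }, /* Hangul Jamo initial consonants */
    { 0x3000, 0x303E }, /* CJK symbols and punctuation */
    { 0x3041, 0x3096 }, /* Hiragana */
    { 0x4E00, 0x9FFF }, /* CJK unified ideographs */
    { 0xFF01, 0xFF60 }, /* Fullwidth ASCII variants */
};

/* Returns 1 if cp falls inside one of the sample ranges, else 0. */
static int is_fullwidth(uint32_t cp)
{
    size_t lo = 0;
    size_t hi = sizeof(demo_fullwidth_ranges) / sizeof(demo_fullwidth_ranges[0]);

    /* Step 1: short-cut. Anything below the first range cannot match. */
    if (cp < demo_fullwidth_ranges[0].first) {
        return 0;
    }

    /* Step 2: binary search over the half-open index interval [lo, hi). */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < demo_fullwidth_ranges[mid].first) {
            hi = mid;
        } else if (cp > demo_fullwidth_ranges[mid].last) {
            lo = mid + 1;
        } else {
            return 1;
        }
    }
    return 0;
}

/* Small self-check exercising both the short-cut and the search. */
void demo_self_check(void)
{
    assert(is_fullwidth(0x0041) == 0); /* 'A': rejected by the short-cut */
    assert(is_fullwidth(0x3042) == 1); /* Hiragana A: inside a range */
    assert(is_fullwidth(0x20AC) == 0); /* Euro sign: falls between ranges */
}
```

The short-cut matters because mostly-ASCII input never reaches the search loop at all, which is presumably why the commit lists it as a separate step.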


458 Commits
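
Commit 28b346bc06 in the log above mentions an 8KB table with one bit per codepoint from U+0000 to U+FFFF, a per-codepoint 'likely'/'unlikely' rating, and a tie-break in favour of the earlier candidate. The sketch below illustrates how such a table and tie-break could fit together in C. It is an assumption-laden illustration, not the mbstring implementation: every `demo_*` name is hypothetical, the ranges marked as likely are invented for the example, and treating codepoints above U+FFFF as unlikely is a choice made here, not something the commit message states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One bit per codepoint U+0000..U+FFFF: 65536 bits / 8 = 8192 bytes (8KB). */
#define DEMO_BMP_CODEPOINTS 0x10000u
static uint8_t demo_likely_bits[DEMO_BMP_CODEPOINTS / 8];

/* Mark every codepoint in [first, last] as 'likely' (sample data only). */
static void demo_mark_likely(uint32_t first, uint32_t last)
{
    for (uint32_t cp = first; cp <= last && cp < DEMO_BMP_CODEPOINTS; cp++) {
        demo_likely_bits[cp >> 3] |= (uint8_t)(1u << (cp & 7));
    }
}

/* Look up one bit. Assumption: codepoints outside the table are 'unlikely'. */
static int demo_is_likely(uint32_t cp)
{
    if (cp >= DEMO_BMP_CODEPOINTS) {
        return 0;
    }
    return (demo_likely_bits[cp >> 3] >> (cp & 7)) & 1;
}

/* Rate a decoded string by counting 'unlikely' codepoints, each in isolation. */
static size_t demo_unlikely_count(const uint32_t *cps, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (!demo_is_likely(cps[i])) {
            count++;
        }
    }
    return count;
}

/* Pick the candidate with the fewest unlikely codepoints. The strict '<'
 * means that on a tie the candidate listed earlier wins. */
static size_t demo_pick_candidate(const size_t *penalties, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (penalties[i] < penalties[best]) {
            best = i;
        }
    }
    return best;
}

/* Self-check: the bytes 30 53 30 93 decoded with two different byte orders. */
void demo_self_check(void)
{
    const uint32_t as_utf16be[] = { 0x3053, 0x3093 }; /* Hiragana */
    const uint32_t as_utf16le[] = { 0x5330, 0x9330 }; /* not marked likely here */
    size_t penalties[2];
    size_t tied[2] = { 0, 0 };

    demo_mark_likely(0x0020, 0x007E); /* printable ASCII */
    demo_mark_likely(0x3041, 0x3096); /* Hiragana */

    penalties[0] = demo_unlikely_count(as_utf16le, 2);
    penalties[1] = demo_unlikely_count(as_utf16be, 2);

    assert(penalties[0] == 2 && penalties[1] == 0);
    assert(demo_pick_candidate(penalties, 2) == 1); /* fewer unlikely wins */
    assert(demo_pick_candidate(tied, 2) == 0);      /* tie: earlier wins */
}
```

The tie case also shows why the commit message advises ordering the candidate list by prior belief: when every byte decodes to a 'likely' codepoint in each candidate, as with the single-byte encodings it mentions, the scores are identical and only the order decides.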