archived-php-src

mirror of https://github.com/php/php-src.git synced 2026-04-18 05:21:02 +02:00

Author	SHA1	Message	Date
Alex Dowad	776296e12f	mbstring no longer provides 'long' substitutions for erroneous input bytes Previously, mbstring had a special mode whereby it would convert erroneous input byte sequences to output like "BAD+XXXX", where "XXXX" would be the erroneous bytes expressed in hexadecimal. This mode could be enabled by calling `mb_substitute_character("long")`. However, accurately reproducing input byte sequences from the cached state of a conversion filter is often tricky, and this significantly complicates the implementation. Further, the means used for passing the erroneous bytes through to where the "BAD+XXXX" text is generated only allows for up to 3 bytes to be passed, meaning that some erroneous byte sequences are truncated anyways. More to the point, a search of publically available PHP code indicates that nobody is really using this feature anyways. Incidentally, this feature also provided error output like "JIS+XXXX" if the input 'should have' represented a JISX 0208 codepoint, but it decodes to a codepoint which does not exist in the JISX 0208 charset. Similarly, specific error output was provided for non-existent JISX 0212 codepoints, and likewise for JISX 0213, CP932, and a few other charsets. All of that is now consigned to the flames. However, "long" error markers also include a somewhat more useful "U+XXXX" marker for Unicode codepoints which were successfully decoded from the input text, but cannot be represented in the output encoding. Those are still supported. With this change, there is no need to use a variety of special values in the high bits of a wchar to represent different types of error values. We can (and will) just use a single error value. This will be equal to -1. One complicating factor: Text conversion functions return an integer to indicate whether the conversion operation should be immediately aborted, and the magic 'abort' marker is -1. Also, almost all of these functions would return the received byte/codepoint to indicate success. That doesn't work with the new error value; if an input filter detects an error and passes -1 to the output filter, and the output filter returns it back, that would be taken to mean 'abort'. Therefore, amend all these functions to return 0 for success.	2021-08-31 13:41:34 +02:00
Alex Dowad	0de4d6872e	Add more tests for SJIS-2004 text conversion	2021-08-30 16:29:58 +02:00
Alex Dowad	51b9d7a5e1	Test behavior of 'long' illegal character markers After mb_substitute_character("long"), mbstring will respond to erroneous input by inserting 'long' error markers into the output. Depending on the situation, these error markers will either look like BAD+XXXX (for general bad input), U+XXXX (when the input is OK, but it converts to Unicode codepoints which cannot be represented in the output encoding), or an encoding-specific marker like JISX+XXXX or W932+XXXX. We have almost no tests for this feature. Add a bunch of tests to ensure that all our legacy encoding handlers work in a reasonable way when 'long' error markers are enabled.	2021-08-30 16:29:58 +02:00
Nikita Popov	39131219e8	Migrate more SKIPIF -> EXTENSIONS (#7139 ) This is a mix of more automated and manual migration. It should remove all applicable extension_loaded() checks outside of skipif.inc files.	2021-06-11 12:58:44 +02:00
Alex Dowad	0d0029d729	0x7E is not a tilde in Shift-JIS{,-2004}	2020-11-25 20:51:45 +02:00
Alex Dowad	315d48b434	0x5C is not a backslash in Shift-JIS-2004 Shift-JIS-2004 is an extension of Shift-JIS, which uses 0x5C for the Yen sign. Therefore, it is not correct to convert ASCII 0x5C (backslash) to Shift-JIS-2004 0x5C (yen sign). JIS X 0208 does have a backslash, so we can convert ASCII backslash to SJIS-2004 backslash instead. From time immemorial, there has been confusion around the treatment of 0x5C bytes on systems using legacy Japanese encodings. JIS X 0201 specified that 0x5C means a yen sign, and thus fonts on Japanese systems, including early versions of Windows, displayed a 0x5C byte as a yen sign. This meant that when ASCII text files were displayed on such systems, what were meant to be backslashes would appear as yen signs. Japanese C programmers could write character escapes using yen signs, and C compilers built on the assumption that the input was ASCII would interpret these escapes as desired. Likewise for shell scripts. Et cetera, et cetera... Therefore, if the input to `mb_convert_encoding` is (for example) a C program, and after converting to Shift-JIS-2004, the user wishes to feed the output into a C compiler, then perhaps ASCII 0x5C should be mapped to SJIS 0x5C. However, this scenario is ridiculous and will never happen. A more realistic scenario might be: an article written in SJIS-2004 has embedded Windows file paths (like 'C:\Program Files'), with yen signs used as a path separator. If we convert SJIS-2004 0x5C to ASCII 0x5C, then the path separators will be 'fixed' by the conversion. For general written texts, it is much better to convert backslashes to... backslashes. And yen signs, to yen signs.	2020-11-25 20:51:44 +02:00
Alex Dowad	d40f9cf735	Add test suite for SJIS-2004 encoding	2020-11-11 11:18:58 +02:00

7 Commits