archived-php-src

mirror of https://github.com/php/php-src.git synced 2026-04-01 13:12:16 +02:00

Author	SHA1	Message	Date
Alex Dowad	caeaa662ab	Strict conversion of UHC text to Unicode Previously, mbstring would accept a lot of things which were not valid UHC text. No more. - Don't allow single-byte control characters to appear where the 2nd byte of a multi-byte character should be. - Validate that the 2nd byte of a multi-byte character is in the expected range. - Treat it as an error if a multi-byte character is truncated. Also add a test suite to confirm that UHC conversion (both to and from Unicode) works according to spec.	2021-06-17 13:12:40 +02:00
Alex Dowad	9868c17368	Mark CP932 and CP51932 encoding tests as 'slow tests'	2021-06-17 13:12:40 +02:00
Alex Dowad	e2459857af	Remove duplicate implementation of CP932 from mbstring Sigh. Double sigh. After fruitlessly searching the Internet for information on this mysterious text encoding called "SJIS-open", I wrote a script to try converting every Unicode codepoint from 0-0xFFFF and compare the results from different variants of Shift-JIS, to see which one "SJIS-open" would be most similar to. The result? It's just CP932. There is no difference at all. So why do we have two implementations of CP932 in mbstring? In case somebody, somewhere is using "SJIS-open" (or its aliases "SJIS-win" or "SJIS-ms"), add these as aliases to CP932 so existing code will continue to work.	2021-06-17 13:12:40 +02:00
Alex Dowad	7502c86342	Add test suite for UTF-{7,8,16,32} Also fix a couple small problems with UTF-32 and UTF-8 support: - UTF-32 would pass very large codepoints (>= 0x80000000), which are not valid. - UTF-8 would sometimes emit two error marker characters for a single bad input byte.	2021-06-17 13:12:40 +02:00
Nikita Popov	a06d015e61	Remove unnecessary mbstring skipifs These functions are always available (if the extension is available at all).	2021-06-14 15:27:28 +02:00
Nikita Popov	6600ad6067	Add some missing EXTENSIONS sections to misc tests	2021-06-14 14:52:44 +02:00
Nikita Popov	4083600bd5	Port mbstring to use EXTENSIONS	2021-06-11 14:00:43 +02:00
Nikita Popov	39131219e8	Migrate more SKIPIF -> EXTENSIONS (#7139) This is a mix of more automated and manual migration. It should remove all applicable extension_loaded() checks outside of skipif.inc files.	2021-06-11 12:58:44 +02:00
Nikita Popov	7485978339	Migrate SKIPIF -> EXTENSIONS (#7138) This is an automated migration of most SKIPIF extension_loaded checks.	2021-06-11 11:57:42 +02:00
Ayesh Karunaratne	b8e380ab09	Update deprecation message for incompatible float to int conversion Updates the deprecation message for implicit incompatible float to int conversion from: ``` Implicit conversion from non-compatible float %.H to int in %s on line %d ``` to ``` Implicit conversion from float %.H to int loses precision in %s on line %d ``` Related: #6661	2021-06-07 14:36:11 +02:00
George Peter Banyard	b6958bb847	Implement "Deprecate implicit non-integer-compatible float to int conversions" RFC. (#6661) RFC: https://wiki.php.net/rfc/implicit-float-int-deprecate Co-authored-by: Nikita Popov <nikita.ppv@gmail.com>	2021-05-31 15:48:45 +01:00
Christoph M. Becker	592cfa309e	Merge branch 'PHP-8.0' * PHP-8.0: Fix #81011: mb_convert_encoding removes references from arrays	2021-05-04 18:40:23 +02:00
Christoph M. Becker	d1c0cbdcb1	Merge branch 'PHP-7.4' into PHP-8.0 * PHP-7.4: Fix #81011: mb_convert_encoding removes references from arrays	2021-05-04 18:39:39 +02:00
Christoph M. Becker	0cafd53d18	Fix #81011: mb_convert_encoding removes references from arrays We need to dereference references. Closes GH-6938.	2021-05-04 18:37:40 +02:00
Alex Dowad	7159907d30	Fix mbstring support for ISO-2022-JP-MS encoding - Treat it as error if multi-byte string or escape sequence is truncated - Don't allow 'control' characters or escape sequences to appear in the middle of a multi-byte char As with ISO-2022-JP-KDDI, the main reference used to develop the tests was the behavior of the existing code. It would have been better to have some independent reference which we could cross-check our code against, but I couldn't find one.	2021-04-15 15:52:31 +02:00
Alex Dowad	570e89a9f3	Fix mbstring support for ISO-2022-JP-KDDI encoding - Treat it as an error if a multi-byte character or escape sequence is truncated - When converting other encodings to ISO-2022-JP-KDDI, don't swallow trailing hash characters or digits - Don't allow 'control' characters to appear in the middle of a multi-byte char Note: I was not able to find any kind of official or even semi-official specification for this legacy encoding. Therefore, the test suite for ISO-2022-JP-KDDI is based largely on the behavior of the existing code. Verifying the correctness of program code in this way is very questionable. In a sense, all you are proving is that the code "does what it does". However, the test suite will still expose any unintended _changes_ to behavior.	2021-04-15 15:52:31 +02:00
Alex Dowad	f5f3ee7aee	Add test suite for mUTF-7 (IMAP) encoding	2021-04-15 15:52:31 +02:00
Alex Dowad	ebe6500a0b	Fix error reporting bug for Unicode -> CP50220 conversion To detect errors in conversion from Unicode to another text encoding, each mbstring conversion filter object maintains a count of 'bad' characters. After a conversion operation finishes, this count is checked to see if there was any error. The problem with CP50220 was that mbstring used a chain of two conversion filter objects. The 'bad character count' would be incremented on the second object in the chain, but this didn't do anything, as only the count on the first such object is ever checked. Fix this by implementing the conversion using a single conversion filter object, rather than a chain of two. This is possible because of the recent refactoring, which pulled out the needed logic for CP50220 conversion into a helper function.	2021-04-15 15:52:31 +02:00
Max Semenik	b11771271e	Remove stray mentions of mbstring.func_overload This feature has been completely removed. Closes GH-6688.	2021-02-15 09:47:28 +01:00
Nikita Popov	b10416a652	Deprecate passing null to non-nullable arg of internal function This deprecates passing null to non-nullable scalar arguments of internal functions, with the eventual goal of making the behavior consistent with userland functions, where null is never accepted for non-nullable arguments. This change is expected to cause quite a lot of fallout. In most cases, calling code should be adjusted to avoid passing null. In some cases, PHP should be adjusted to make some function arguments nullable. I have already fixed a number of functions before landing this, but feel free to file a bug if you encounter a function that doesn't accept null, but probably should. (The rule of thumb for this to be applicable is that the function must have special behavior for 0 or "", which is distinct from the natural behavior of the parameter.) RFC: https://wiki.php.net/rfc/deprecate_null_to_scalar_internal_arg Closes GH-6475.	2021-02-11 21:46:13 +01:00
Alex Dowad	d8c785b894	Update 'East Asian Width' table to comply with Unicode 13.0 Instead of manually maintaining the data in eaw_table.h, it is now automatically generated by ucgendat/ucgendat.php, using the EastAsianWidth.txt file from the Unicode Consortium. Something must be said about the deleted test case. Back in 2004, someone noticed that `mb_strwidth` didn't comply with Unicode 4.0. A test case was added to expose the problem. Well, time keeps moving on, and with the changing years, new Unicodes are born and old Unicodes die. Some characters which were counted as double-width in Unicode 4.0 are no longer such in Unicode 13.0, which renders the test case obsolete. At the same time, make a couple of spelling/grammar fixes in ucgendat.php.	2021-01-19 20:38:44 +02:00
Alex Dowad	888f5d7729	CP5022{0,1,2}: treat truncated multibyte characters as error	2021-01-15 21:55:41 +02:00
Alex Dowad	2a93a8bb8c	Add test suite for CP5022{0,1,2}	2021-01-15 21:55:41 +02:00
Nikita Popov	e2c8ab7c33	Print "interned" instead of fake refcount in debug_zval_dump() debug_zval_dump() currently prints refcount 1 for interned strings and arrays, which does not really reflect the truth. These values are not refcounted, so the refcount is misleading. Instead print an "interned" tag. Closes GH-6598.	2021-01-15 12:21:24 +01:00
Alex Dowad	4299e2de42	JIS7/JIS8 encoding: treat truncated multibyte characters as error	2021-01-14 22:34:16 +02:00
Alex Dowad	b67e358e75	JIS7/JIS8 encoding: handle invalid 2nd byte for Kanji correctly Previously, in ISO-2022-JP/JIS7/JIS8, if an escape sequence (starting with 0x1B) appeared where the 2nd byte of a multibyte character should have been, mbstring would forget all about the truncated multibyte character and happily accept the escape sequence. However, such sequences are not legal and should be flagged as errors. Also, any other illegal bytes appearing where the 2nd byte of a multibyte character was expected were just passed through quietly to the output. Fix that. Also add a test suite for both ISO-2022-JP and JIS7/JIS8. (These are extremely similar encodings; JIS7 and JIS8 are variants of ISO-2022-JP. mbstring's 'JIS' is actually a combination of JIS7 _and_ JIS8, since the extensions which each one adds to ISO-2022-JP are disjoint.)	2021-01-14 22:31:31 +02:00
Alex Dowad	4b95fdf2ca	ISO-2022-JP-2004 conversion: handle invalid characters correctly	2021-01-14 22:26:24 +02:00
Alex Dowad	c9fea7db72	Convert U+00AF (MACRON) to 0x8150 (FULLWIDTH MACRON) in some SJIS variants Except for vanilla Shift-JIS, where 0x7E is a halfwidth overline/macron. As for Shift-JIS-2004, it has an added character (byte sequence 0x854A) which was defined as a halfwidth macron in JIS X 0213:2000, so we use that.	2020-11-25 20:51:45 +02:00
Alex Dowad	ecf718470b	Convert U+FF5E (FULLWIDTH TILDE) to 0x8160 (WAVE DASH) in SJIS variants By entering this character in the JIS X 0208 conversion table, we can remove a bunch of explicit `if` clauses in different conversion filters. It also means that U+FF5E can be converted into SJIS-mac now; I don't know why this one SJIS variant rejected U+FF5E before, since 0x8160 means the same thing in SJIS-mac as the others.	2020-11-25 20:51:45 +02:00
Alex Dowad	4f3bd2e235	Convert U+203E (OVERLINE) to 0x8150 (FULLWIDTH MACRON) in some SJIS variants Converting U+203E to 0x7E was especially wrong for CP932, where 0x7E represents a tilde. For vanilla Shift-JIS and Shift-JIS-2004, converting to 0x7E is acceptable, since 0x7E does represent an overline/macron in those encodings. Follow the same principle in CP51932, which is closely related to CP932.	2020-11-25 20:51:45 +02:00
Alex Dowad	0d0029d729	0x7E is not a tilde in Shift-JIS{,-2004}	2020-11-25 20:51:45 +02:00
Alex Dowad	e4ee979111	0x5C is not a Yen sign in CP932 (or CP51932) When Microsoft created CP932 (their version of Shift-JIS), they explicitly used bytes 0-0x7F to represent ASCII characters rather than JIS X 0201 characters. So when converting Unicode to CP932, it is not correct to convert U+00A5 to CP932 0x5C. Fortunately, CP932 does have a multi-byte FULLWIDTH YEN SIGN character which we can use instead. CP51932 uses the same extended character set as CP932; while CP932 is Microsoft's extended version of Shift-JIS, CP51932 is their extended version of EUC-JP. So the same reasoning applies to CP51932.	2020-11-25 20:51:45 +02:00
Alex Dowad	315d48b434	0x5C is not a backslash in Shift-JIS-2004 Shift-JIS-2004 is an extension of Shift-JIS, which uses 0x5C for the Yen sign. Therefore, it is not correct to convert ASCII 0x5C (backslash) to Shift-JIS-2004 0x5C (yen sign). JIS X 0208 does have a backslash, so we can convert ASCII backslash to SJIS-2004 backslash instead. From time immemorial, there has been confusion around the treatment of 0x5C bytes on systems using legacy Japanese encodings. JIS X 0201 specified that 0x5C means a yen sign, and thus fonts on Japanese systems, including early versions of Windows, displayed a 0x5C byte as a yen sign. This meant that when ASCII text files were displayed on such systems, what were meant to be backslashes would appear as yen signs. Japanese C programmers could write character escapes using yen signs, and C compilers built on the assumption that the input was ASCII would interpret these escapes as desired. Likewise for shell scripts. Et cetera, et cetera... Therefore, if the input to `mb_convert_encoding` is (for example) a C program, and after converting to Shift-JIS-2004, the user wishes to feed the output into a C compiler, then perhaps ASCII 0x5C should be mapped to SJIS 0x5C. However, this scenario is ridiculous and will never happen. A more realistic scenario might be: an article written in SJIS-2004 has embedded Windows file paths (like 'C:\Program Files'), with yen signs used as a path separator. If we convert SJIS-2004 0x5C to ASCII 0x5C, then the path separators will be 'fixed' by the conversion. For general written texts, it is much better to convert backslashes to... backslashes. And yen signs, to yen signs.	2020-11-25 20:51:44 +02:00
Alex Dowad	5c805655db	Enhance handling of CP51932 encoding - Don't pass 'control' characters through in the middle of a multi-byte char - Treat truncated multi-byte characters as an error	2020-11-25 20:51:44 +02:00
Alex Dowad	beef597124	Fix mbstring support for SJIS-Mobile (DoCoMo, KDDI, and Softbank variants of Shift-JIS) Lots of problems here. - Don't pass 'control' characters through silently in the middle of a multi-byte character. - Treat it as an error if a multi-byte character is truncated. - For ESC sequences used to encode emoji on earlier Softbank phones, if an invalid ESC sequence is found, don't pass it through. Rather, handle it as an error and respect `mb_substitute_character`. - In ranges used by mobile vendors for emoji, if a certain byte sequence doesn't map to any emoji, don't emit a mangled value (actually a raw (ku*94)+ten value, which may not even be a valid Unicode codepoint at all). - When converting Unicode to SJIS-Mobile, don't mangle codepoints which fall in the 2nd range of Microsoft vendor extensions. Some vendor-specific emoji have been mapped to standard Unicode codepoints now, rather than 'private use area' codepoints. When the legacy code was written, these codepoints may not have existed yet in the Unicode standard which was current at that time. Also do a major code cleanup -- remove dead code, rearrange what is left, use some new macros and helper functions to make the code clearer...	2020-11-25 20:51:44 +02:00
Alex Dowad	2759874a42	Enhance handling of CP932 text encoding - Don't allow control characters to appear in the middle of a multi-byte character. (This was a strange feature of mbstring; it doesn't make much sense, and iconv doesn't allow it.) - Treat truncated multi-byte characters as an error.	2020-11-25 19:52:19 +02:00
Alex Dowad	b489c1bc4d	Bugfixes for findInvalidChars (helper for mbstring test suite)	2020-11-25 19:52:19 +02:00
Alex Dowad	6dd75478d5	Leading BOM is stripped for UTF-32 For consistency with UTF-16 and UCS-4. Also, do some code cleanup.	2020-11-11 11:18:59 +02:00
Alex Dowad	1cf12c02f0	Add test suite for SJIS-mac encoding	2020-11-11 11:18:58 +02:00
Alex Dowad	d40f9cf735	Add test suite for SJIS-2004 encoding	2020-11-11 11:18:58 +02:00
Alex Dowad	d1d50c2b7a	Test EUC-JP and Shift-JIS more thoroughly Previously, the unit tests for these text encodings covered all mappings from legacy -> Unicode, and all _reversible_ mappings from Unicode -> legacy. However, we should also test the few Unicode -> legacy mappings which are not reversible.	2020-11-11 11:18:58 +02:00
Alex Dowad	3e7acf901d	Remove mbstring identify filters mbstring had an 'identify filter' for almost every supported text encoding which was used when auto-detecting the most likely encoding for a string. It would run over the string and set a 'flag' if it saw anything which did not appear likely to be the encoding in question. One problem with this scheme was that encodings which merely appeared less likely to be the correct one were completely rejected, even if there was no better candidate. Another problem was that the 'identify filters' had a huge amount of code duplication with the 'conversion filters'. Eliminate the identify filters. Instead, when auto-detecting text encoding, use conversion filters to see whether the input string is valid in candidate encodings or not. At the same time, watch the type of codepoints which the string decodes to and mark it as less likely if non-printable characters (ESC, form feed, bell, etc.) or 'private use area' codepoints are seen. Interestingly, one old test case in which JIS text was misidentified as UTF-8 (and this wrong behavior was enshrined in the test) was 'fixed' and the JIS string is now auto-detected as JIS.	2020-11-09 13:45:17 +02:00
Alex Dowad	8f6889b20d	Fix mbstring support for EUC-JP text encoding - Don't allow control characters to appear in the middle of a multi-byte character. (A strange feature, or perhaps misfeature, of mbstring which is not present in other libraries such as iconv.) - When checking whether string is valid, reject kuten codes which do not map to any character, whether converting from EUC-JP to another encoding, or converting another encoding which uses JIS X 0208/0212 charsets to EUC-JP. - Truncated multi-byte characters are treated as an error.	2020-11-09 13:45:17 +02:00
Alex Dowad	ad7e0f16cc	Fix mbstring support for Shift-JIS - Reject otherwise valid kuten codes which don't map to anything in JIS X 0208. - Handle truncated multi-byte characters as an error. - Convert Shift-JIS 0x7E to Unicode 0x203E (overline) as recommended by the Unicode Consortium, and as iconv does. - Convert Shift-JIS 0x5C to Unicode 0xA5 (yen sign) as recommended by the Unicode Consortium, and as iconv does. (NOTE: This will affect PHP scripts which use an internal encoding of Shift-JIS! PHP assigns a special meaning to 0x5C, the backslash. For example, it is used for escapes in double-quoted strings. Mapping the Shift-JIS yen sign to the Unicode yen sign means the yen sign will not be usable for C escapes in double-quoted strings. Japanese PHP programmers who want to write their source code in Shift-JIS for some strange reason will have to use the JIS X 0208 backslash or 'REVERSE SOLIDUS' character for their C escapes.) - Convert Unicode 0x5C (backslash) to Shift-JIS 0x815F (reverse solidus). - Immediately handle error if first Shift-JIS byte is over 0xEF, rather than waiting to see the next byte. (Previously, the value used was 0xFC, which is the limit for the 2nd byte and not the 1st byte of a multi-byte character.) - Don't allow 'control characters' to appear in the middle of a multi-byte character. The test case for bug 47399 is now obsolete. That test assumed that a number of Shift-JIS byte sequences which don't map to any character were 'valid' (because the byte values were within the legal ranges).	2020-11-09 13:45:16 +02:00
Alex Dowad	cc03c54c36	Remove useless byte{2,4}{be,le} encodings from mbstring There is no meaningful difference between these and UCS-{2,4}. They are just a little bit more lax about passing errors silently. They also have no known use. Alias to UCS-{2,4} in case someone, somewhere is using them.	2020-11-09 13:45:16 +02:00
Alex Dowad	3eb8828d1a	Fix issues with mbstring encoding tests I made some mistakes on this code, which meant that not everything which should be tested was actually being tested.	2020-11-09 13:45:16 +02:00
Alex Dowad	ff953f254c	Add test suite for ARMSCII-8 encoding	2020-11-02 21:31:06 +02:00
Alex Dowad	335c1b98c2	Add test suite for KOI8-U encoding	2020-11-02 21:31:06 +02:00
Alex Dowad	9db4387f14	Add test suite for KOI8-R encoding	2020-11-02 21:31:06 +02:00
Alex Dowad	9980534a4e	Add test suite for CP850 encoding	2020-11-02 21:31:06 +02:00
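
Commit e2459857af describes comparing encodings by converting every Unicode codepoint from 0 to 0xFFFF and checking where the results differ. The script itself is not shown in the log, so the following is only a minimal sketch of that idea. It assumes Python's built-in codecs rather than mbstring, and the helper names `encode_or_none` and `diff_encodings` are hypothetical.

```python
def encode_or_none(codepoint, encoding):
    """Return the bytes for one codepoint, or None if it cannot be encoded."""
    try:
        return chr(codepoint).encode(encoding)
    except UnicodeEncodeError:
        return None


def diff_encodings(enc_a, enc_b, limit=0x10000):
    """List (codepoint, bytes_a, bytes_b) wherever the two encodings disagree.

    Note: this exercises Python's codec tables, not mbstring's, so the
    results are not guaranteed to match what PHP would produce.
    """
    diffs = []
    for cp in range(limit):
        a = encode_or_none(cp, enc_a)
        b = encode_or_none(cp, enc_b)
        if a != b:
            diffs.append((cp, a, b))
    return diffs
```

Calling `diff_encodings("cp932", "shift_jis")` lists the codepoints on which Python's two codecs disagree. An empty list would indicate that two names refer to the same mapping, which is the conclusion the commit reached for "SJIS-open" and CP932 in mbstring.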


662 Commits