archived-php-src

mirror of https://github.com/php/php-src.git synced 2026-04-27 18:23:26 +02:00

Author	SHA1	Message	Date
Alex Dowad	3cf432798e	Fix new conversion filter for CP50220 (multi-codepoint kana at end of buffer) If two codepoints which needed to be collapsed into a single kuten code were separated, with one at the end of one buffer and the other at the beginning of the next buffer, they were not converted correctly. This was discovered while fuzzing the new implementation of mb_decode_numericentity.	2022-07-18 15:11:31 +02:00
Alex Dowad	6938e35122	Fix legacy conversion filter for CP50220	2022-07-18 15:11:31 +02:00
Alex Dowad	a789088527	Add more tests for mbstring encoding conversion When testing the preceding commits, I used a script to generate a large number of random strings and try to find strings which would yield different outputs from the new and old encoding conversion code. Some were found. In most cases, analysis revealed that the new code was correct and the old code was not. In all cases where the new code was incorrect, regression tests were added. However, there may be some value in adding regression tests for cases where the old code was incorrect as well. That is done here. This does not cover every case where the new and old code yielded different results. Some of them were very obscure, and it is proving difficult even to reproduce them (since I did not keep a record of all the input strings which triggered the differing output).	2022-05-28 21:53:38 +02:00
Alex Dowad	53ffba967c	Implement fast text conversion interface for CP5022{0,1,2}	2021-12-26 22:19:51 +02:00
Alex Dowad	0957f54eb1	Treat truncated escape sequences for CP5022{0,1,2} as error	2021-09-06 13:16:23 +02:00
Alex Dowad	a0415b22ab	Add more tests for CP5022{0,1,2} text conversion	2021-08-31 13:41:34 +02:00
Alex Dowad	e3f6a9fbfe	CP5022{0,1,2} supports 'IBM extension' codes from ku 115-119 mbstring has always had the conversion tables to support CP932 codes in ku 115-119, and the conversion code for CP5022x has an 'if' clause specifically to handle such characters... but that 'if' clause was dead code, since a guard clause earlier in the same function prevented it from accepting 2-byte characters with a starting byte of 0x93-0x97. Adjust the guard clause so that these characters can be converted as the original author apparently intended. The code which handles ku 115-119 is the part which reads: } else if (s >= cp932ext3_ucs_table_min && s < cp932ext3_ucs_table_max) { w = cp932ext3_ucs_table[s - cp932ext3_ucs_table_min];	2021-08-31 13:41:34 +02:00
Alex Dowad	776296e12f	mbstring no longer provides 'long' substitutions for erroneous input bytes Previously, mbstring had a special mode whereby it would convert erroneous input byte sequences to output like "BAD+XXXX", where "XXXX" would be the erroneous bytes expressed in hexadecimal. This mode could be enabled by calling `mb_substitute_character("long")`. However, accurately reproducing input byte sequences from the cached state of a conversion filter is often tricky, and this significantly complicates the implementation. Further, the means used for passing the erroneous bytes through to where the "BAD+XXXX" text is generated only allows for up to 3 bytes to be passed, meaning that some erroneous byte sequences are truncated anyways. More to the point, a search of publicly available PHP code indicates that nobody is really using this feature anyways. Incidentally, this feature also provided error output like "JIS+XXXX" if the input 'should have' represented a JISX 0208 codepoint, but it decodes to a codepoint which does not exist in the JISX 0208 charset. Similarly, specific error output was provided for non-existent JISX 0212 codepoints, and likewise for JISX 0213, CP932, and a few other charsets. All of that is now consigned to the flames. However, "long" error markers also include a somewhat more useful "U+XXXX" marker for Unicode codepoints which were successfully decoded from the input text, but cannot be represented in the output encoding. Those are still supported. With this change, there is no need to use a variety of special values in the high bits of a wchar to represent different types of error values. We can (and will) just use a single error value. This will be equal to -1. One complicating factor: Text conversion functions return an integer to indicate whether the conversion operation should be immediately aborted, and the magic 'abort' marker is -1. Also, almost all of these functions would return the received byte/codepoint to indicate success. That doesn't work with the new error value; if an input filter detects an error and passes -1 to the output filter, and the output filter returns it back, that would be taken to mean 'abort'. Therefore, amend all these functions to return 0 for success.	2021-08-31 13:41:34 +02:00
Alex Dowad	51b9d7a5e1	Test behavior of 'long' illegal character markers After mb_substitute_character("long"), mbstring will respond to erroneous input by inserting 'long' error markers into the output. Depending on the situation, these error markers will either look like BAD+XXXX (for general bad input), U+XXXX (when the input is OK, but it converts to Unicode codepoints which cannot be represented in the output encoding), or an encoding-specific marker like JISX+XXXX or W932+XXXX. We have almost no tests for this feature. Add a bunch of tests to ensure that all our legacy encoding handlers work in a reasonable way when 'long' error markers are enabled.	2021-08-30 16:29:58 +02:00
Nikita Popov	39131219e8	Migrate more SKIPIF -> EXTENSIONS (#7139) This is a mix of more automated and manual migration. It should remove all applicable extension_loaded() checks outside of skipif.inc files.	2021-06-11 12:58:44 +02:00
Alex Dowad	ebe6500a0b	Fix error reporting bug for Unicode -> CP50220 conversion To detect errors in conversion from Unicode to another text encoding, each mbstring conversion filter object maintains a count of 'bad' characters. After a conversion operation finishes, this count is checked to see if there was any error. The problem with CP50220 was that mbstring used a chain of two conversion filter objects. The 'bad character count' would be incremented on the second object in the chain, but this didn't do anything, as only the count on the first such object is ever checked. Fix this by implementing the conversion using a single conversion filter object, rather than a chain of two. This is possible because of the recent refactoring, which pulled out the needed logic for CP50220 conversion into a helper function.	2021-04-15 15:52:31 +02:00
Alex Dowad	888f5d7729	CP5022{0,1,2}: treat truncated multibyte characters as error	2021-01-15 21:55:41 +02:00
Alex Dowad	2a93a8bb8c	Add test suite for CP5022{0,1,2}	2021-01-15 21:55:41 +02:00

13 Commits