archived-php-src

mirror of https://github.com/php/php-src.git synced 2026-04-04 22:52:40 +02:00

Author	SHA1	Message	Date
Alex Dowad	c9fea7db72	Convert U+00AF (MACRON) to 0x8150 (FULLWIDTH MACRON) in some SJIS variants Except for vanilla Shift-JIS, where 0x7E is a halfwidth overline/macron. As for Shift-JIS-2004, it has an added character (byte sequence 0x854A) which was defined as a halfwidth macron in JIS X 0213:2000, so we use that.	2020-11-25 20:51:45 +02:00
Alex Dowad	ecf718470b	Convert U+FF5E (FULLWIDTH TILDE) to 0x8160 (WAVE DASH) in SJIS variants By entering this character in the JIS X 0208 conversion table, we can remove a bunch of explicit `if` clauses in different conversion filters. It also means that U+FF5E can be converted into SJIS-mac now; I don't know why this one SJIS variant rejected U+FF5E before, since 0x8160 means the same thing in SJIS-mac as the others.	2020-11-25 20:51:45 +02:00
Alex Dowad	4f3bd2e235	Convert U+203E (OVERLINE) to 0x8150 (FULLWIDTH MACRON) in some SJIS variants Converting U+203E to 0x7E was especially wrong for CP932, where 0x7E represents a tilde. For vanilla Shift-JIS and Shift-JIS-2004, converting to 0x7E is acceptable, since 0x7E does represent an overline/macron in those encodings. Follow the same principle in CP51932, which is closely related to CP932.	2020-11-25 20:51:45 +02:00
Alex Dowad	0d0029d729	0x7E is not a tilde in Shift-JIS{,-2004}	2020-11-25 20:51:45 +02:00
Alex Dowad	e4ee979111	0x5C is not a Yen sign in CP932 (or CP51932) When Microsoft created CP932 (their version of Shift-JIS), they explicitly used bytes 0-0x7F to represent ASCII characters rather than JIS X 0201 characters. So when converting Unicode to CP932, it is not correct to convert U+00A5 to CP932 0x5C. Fortunately, CP932 does have a multi-byte FULLWIDTH YEN SIGN character which we can use instead. CP51932 uses the same extended character set as CP932; while CP932 is Microsoft's extended version of Shift-JIS, CP51932 is their extended version of EUC-JP. So the same reasoning applies to CP51932.	2020-11-25 20:51:45 +02:00
Alex Dowad	315d48b434	0x5C is not a backslash in Shift-JIS-2004 Shift-JIS-2004 is an extension of Shift-JIS, which uses 0x5C for the Yen sign. Therefore, it is not correct to convert ASCII 0x5C (backslash) to Shift-JIS-2004 0x5C (yen sign). JIS X 0208 does have a backslash, so we can convert ASCII backslash to SJIS-2004 backslash instead. From time immemorial, there has been confusion around the treatment of 0x5C bytes on systems using legacy Japanese encodings. JIS X 0201 specified that 0x5C means a yen sign, and thus fonts on Japanese systems, including early versions of Windows, displayed a 0x5C byte as a yen sign. This meant that when ASCII text files were displayed on such systems, what were meant to be backslashes would appear as yen signs. Japanese C programmers could write character escapes using yen signs, and C compilers built on the assumption that the input was ASCII would interpret these escapes as desired. Likewise for shell scripts. Et cetera, et cetera... Therefore, if the input to `mb_convert_encoding` is (for example) a C program, and after converting to Shift-JIS-2004, the user wishes to feed the output into a C compiler, then perhaps ASCII 0x5C should be mapped to SJIS 0x5C. However, this scenario is ridiculous and will never happen. A more realistic scenario might be: an article written in SJIS-2004 has embedded Windows file paths (like 'C:\Program Files'), with yen signs used as a path separator. If we convert SJIS-2004 0x5C to ASCII 0x5C, then the path separators will be 'fixed' by the conversion. For general written texts, it is much better to convert backslashes to... backslashes. And yen signs, to yen signs.	2020-11-25 20:51:44 +02:00
Alex Dowad	5c805655db	Enhance handling of CP51932 encoding - Don't pass 'control' characters through in the middle of a multi-byte char - Treat truncated multi-byte characters as an error	2020-11-25 20:51:44 +02:00
Alex Dowad	beef597124	Fix mbstring support for SJIS-Mobile (DoCoMo, KDDI, and Softbank variants of Shift-JIS) Lots of problems here. - Don't pass 'control' characters through silently in the middle of a multi-byte character. - Treat it as an error if a multi-byte character is truncated. - For ESC sequences used to encode emoji on earlier Softbank phones, if an invalid ESC sequence is found, don't pass it through. Rather, handle it as an error and respect `mb_substitute_character`. - In ranges used by mobile vendors for emoji, if a certain byte sequence doesn't map to any emoji, don't emit a mangled value (actually a raw (ku*94)+ten value, which may not even be a valid Unicode codepoint at all). - When converting Unicode to SJIS-Mobile, don't mangle codepoints which fall in the 2nd range of Microsoft vendor extensions. Some vendor-specific emoji have been mapped to standard Unicode codepoints now, rather than 'private use area' codepoints. When the legacy code was written, these codepoints may not have existed yet in the Unicode standard which was current at that time. Also do a major code cleanup -- remove dead code, rearrange what is left, use some new macros and helper functions to make the code clearer...	2020-11-25 20:51:44 +02:00
Alex Dowad	bbbadae0ae	Combine MBFL_ENCTYPE_MWC2{BE,LE} constants These constants indicate that a text encoding uses 2+ bytes for each character, and is either big endian or little endian (respectively). But nothing in mbstring cares about the difference between MBFL_ENCTYPE_MWC2BE and MBFL_ENCTYPE_MWC2LE. (Actually, nothing cares about whether these flags are set at all... maybe we should just remove them?)	2020-11-25 19:52:19 +02:00
Alex Dowad	72660c416a	Combine MBFL_ENCTYPE_WCS{2,4}{BE,LE} constants These flags identify text encodings in mbstring which use a constant number of bytes per character. While some parts of the code do use these flags, usually to detect cases which can be optimized due to constant-width encoding, nothing cares whether the encodings are 'LE' (little-endian) or 'BE' (big-endian). So we can simplify things by combining constants.	2020-11-25 19:52:19 +02:00
Alex Dowad	5ffcf563bd	Don't pass invalid JIS X 0212, JIS X 0213, and Windows-CP932 characters through Similarly to JIS X 0208, mbstring would pass kuten codes which are not mapped in the JIS X 0212, JIS X 0213, or CP932 character sets through silently when converting to another Japanese encoding.	2020-11-25 19:52:19 +02:00
Alex Dowad	8ae0473324	Don't pass invalid JIS X 0208 characters through Many Japanese encodings, such as JIS7/8, Shift JIS, ISO-2022-JP, EUC-JP, and so on encode characters from the JIS X 0208 character set. JIS X 0208 is based on the concept of a 94x94 table, with numbered rows and columns. However, more than a thousand of the cells in that table are empty; JIS X 0208 does not actually use all 94x94=8,836 possible kuten codes. mbstring had a dubious feature whereby, if a Japanese string contained one of these 'unmapped' kuten codes, and it was being converted to another Japanese encoding which was also based on JIS X 0208, the non-existent character would be silently passed through, and the unmapped kuten code would be re-encoded using the normal encoding method of the target text encoding. Again, this _only_ happened if converting the text with the funky kuten code to a Japanese encoding. If one tried converting it to Unicode, mbstring would treat that as an error. If somebody, somewhere, made their own private extension to JIS X 0208, and used the regular Japanese encodings like Shift JIS and EUC-JP to encode this private character set, then this feature might conceivably be useful. But how likely is that? If someone is using Shift JIS, EUC-JP, ISO-2022-JP, etc. to encode a funky version of JIS X 0208 with extra characters added, then that should be treated as a separate text encoding. The code which flags such characters with MBFL_WCSPLANE_JIS0208 is retained solely for error reporting in `mbfl_filt_conv_illegal_output`.	2020-11-25 19:52:19 +02:00
Alex Dowad	2759874a42	Enhance handling of CP932 text encoding - Don't allow control characters to appear in the middle of a multi-byte character. (This was a strange feature of mbstring; it doesn't make much sense, and iconv doesn't allow it.) - Treat truncated multi-byte characters as an error.	2020-11-25 19:52:19 +02:00
Alex Dowad	b489c1bc4d	Bugfixes for findInvalidChars (helper for mbstring test suite)	2020-11-25 19:52:19 +02:00
Alex Dowad	e169ad3b61	Consolidate all single-byte encodings in one source file We can squeeze out a lot of duplicated code in this way.	2020-11-11 11:18:59 +02:00
Alex Dowad	17d82b6886	Enhance mbstring support for UCS-2 text - For consistency with UTF-16, UTF-32, and UCS-4, strip leading byte order marks. - Treat it as an error if string is truncated (i.e. has an odd number of bytes).	2020-11-11 11:18:59 +02:00
Alex Dowad	6dd75478d5	Leading BOM is stripped for UTF-32 For consistency with UTF-16 and UCS-4. Also, do some code cleanup.	2020-11-11 11:18:59 +02:00
Alex Dowad	1cf12c02f0	Add test suite for SJIS-mac encoding	2020-11-11 11:18:58 +02:00
Alex Dowad	fbdcab953d	Unicode -> SJIS-mac conversion doesn't reject valid codepoints after a bad transcoding hint To give the background on this issue, here is an excerpt from JAPANESE.txt, from the Unicode Consortium: Apple has defined a block of 32 corporate characters as "transcoding hints." These are used in combination with standard Unicode characters to force them to be treated in a special way for mapping to other encodings; they have no other effect. Sixteen of these transcoding hints are "grouping hints" - they indicate that the next 2-4 Unicode characters should be treated as a single entity for transcoding. The other sixteen transcoding hints are "variant tags" - they are like combining characters, and can follow a standard Unicode (or a sequence consisting of a base character and other combining characters) to cause it to be treated in a special way for transcoding. These always terminate a combining-character sequence. The transcoding coding hints used in this mapping table are: 0xF860 group next 2 characters as a single entity for transcoding 0xF861 group next 3 characters as a single entity for transcoding 0xF862 group next 4 characters as a single entity for transcoding 0xF87A variant tag for "negative" (i.e. black & white reversed) 0xF87E variant tag for vertical form 0xF87F variant tag for other alternate form For example, the Apple addition character 0x85AB is Roman numeral thirteen. There is no single Unicode for this (although there are standard Unicodes for Roman numerals 1-12). Using the grouping hint 0xF862 in combination with standard Unicodes, we can map this as 0xF862+0x0058+0x0049+0x0049+0x0049 (i.e. X + I + I + I). Our SJIS-mac conversion code actually recognizes some special sequences which start with an Apple 'transcoding hint'. However, if a transcoding hint is misplaced and is not followed by one of the expected sequences, we can just emit one error marker for the bad transcoding hint and then process the following codepoint as normal.	2020-11-11 11:18:58 +02:00
Alex Dowad	b27a34c5a9	SJIS-mac encoding conversion: Stop the carnage of innocent Unicode codepoints When converting Unicode to MacJapanese, some special sequences of Unicode codepoints are collapsed into a single SJIS character. When the implementation sees a codepoint which might begin such a sequence, it is cached and examined again after the next codepoint arrives. If it turns out that it wasn't one of the 'special' sequences, then a 'fallback' conversion table is consulted to convert the cached codepoint. Then we re-enter the regular conversion code to convert the immediately following codepoint. BUT, local variables need to be reinitialized properly when doing this! Because the locals weren't reinitialized, the sad result was that some codepoints would get chopped up into bit salad and emitted as something totally bogus (which might not even be valid SJIS-mac text at all).	2020-11-11 11:18:58 +02:00
Alex Dowad	7f0e86b2dc	Convert Unicode halfwidth Yen sign to MacJapanese halfwidth Yen sign Since 1993, Unicode has had a specific codepoint for a fullwidth Yen sign. Likewise, MacJapanese has separate kuten codes for halfwidth and fullwidth Yen signs. But mbstring mapped _both_ Yen sign codepoints to the MacJapanese fullwidth Yen sign. It's probably more appropriate to map the 'ordinary' Yen sign to the MacJapanese halfwidth Yen sign. Besides, this means that the conversion between Unicode and MacJapanese is closer to being lossless and reversible.	2020-11-11 11:18:58 +02:00
Alex Dowad	4c39cd3d1d	SJIS-mac encoding conversion: handle invalid (or truncated) 2nd byte for Kanji correctly Also, don't accept 1st bytes above 0xED, since none of the possible 2-byte sequences starting with 0xEE and above are actually mapped to any character.	2020-11-11 11:18:58 +02:00
Alex Dowad	d40f9cf735	Add test suite for SJIS-2004 encoding	2020-11-11 11:18:58 +02:00
Alex Dowad	eda73a5f6f	Don't mangle non-Japanese chars which appear after a 'combining' kana in SJIS-2004 Unicode has 'combining' characters which join with another following character. Japanese hiragana and katakana with the 'two dots' voice mark can be represented in this way, with one Unicode character for the 'base' kana and another one which adds the voice mark. In SJIS-2004, however, there are dedicated characters for voiced and unvoiced kana. So some special checks are done to identify sequences of Unicode characters which need to be 'collapsed' into a single SJIS-2004 character. If a kana is immediately followed by some other unrelated character, like a Cyrillic letter, then the cached kana should be output 'as is' and we proceed with encoding the unrelated character. When doing this, though, we need to re-initialize local variables, or else the unrelated character will be mangled in some cases.	2020-11-11 11:18:58 +02:00
Alex Dowad	2f98bd8844	SJIS-2004 encoding conversion: handle invalid (or truncated) 2nd byte for Kanji correctly If the 2nd byte of a 2-byte character is invalid, then mb_substitute_character() should be respected. Instead, what mbstring was doing was 'swallowing' the first byte, then emitting the 2nd byte as if it was an ASCII character. Likewise, if the 2nd byte is missing, instead of just keeping quiet, report an illegal character as specified by mb_substitute_character().	2020-11-11 11:18:58 +02:00
Alex Dowad	a5827c2d35	Fix broken binary search function in mbstring This faulty binary search would never reject values at the very high end of the range being searched, even if they were not actually in the table. Among other things, this meant that some Unicode codepoints which do not correspond to any character in JIS X 0213 would be converted to bogus Shift-JIS-2004 values rather than being rejected.	2020-11-11 11:18:58 +02:00
Alex Dowad	b05ad5112a	Don't redundantly flush mbstring filters multiple times Each flush function in a chain of mbstring conversion filters always calls the next flush function in the chain. So it is not necessary to explicitly flush the second filter in a chain. (Due to this bug, in many cases, flush functions were actually being called three times.)	2020-11-11 11:18:58 +02:00
Alex Dowad	d1d50c2b7a	Test EUC-JP and Shift-JIS more thoroughly Previously, the unit tests for these text encodings covered all mappings from legacy -> Unicode, and all _reversible_ mappings from Unicode -> legacy. However, we should also test the few Unicode -> legacy mappings which are not reversible.	2020-11-11 11:18:58 +02:00
Alex Dowad	3e7acf901d	Remove mbstring identify filters mbstring had an 'identify filter' for almost every supported text encoding which was used when auto-detecting the most likely encoding for a string. It would run over the string and set a 'flag' if it saw anything which did not appear likely to be the encoding in question. One problem with this scheme was that encodings which merely appeared less likely to be the correct one were completely rejected, even if there was no better candidate. Another problem was that the 'identify filters' had a huge amount of code duplication with the 'conversion filters'. Eliminate the identify filters. Instead, when auto-detecting text encoding, use conversion filters to see whether the input string is valid in candidate encodings or not. At the same time, watch the type of codepoints which the string decodes to and mark it as less likely if non-printable characters (ESC, form feed, bell, etc.) or 'private use area' codepoints are seen. Interestingly, one old test case in which JIS text was misidentified as UTF-8 (and this wrong behavior was enshrined in the test) was 'fixed' and the JIS string is now auto-detected as JIS.	2020-11-09 13:45:17 +02:00
Alex Dowad	a416f938f3	Treat non-ASCII characters as erroneous when converting ASCII text	2020-11-09 13:45:17 +02:00
Alex Dowad	8f6889b20d	Fix mbstring support for EUC-JP text encoding - Don't allow control characters to appear in the middle of a multi-byte character. (A strange feature, or perhaps misfeature, of mbstring which is not present in other libraries such as iconv.) - When checking whether string is valid, reject kuten codes which do not map to any character, whether converting from EUC-JP to another encoding, or converting another encoding which uses JIS X 0208/0212 charsets to EUC-JP. - Truncated multi-byte characters are treated as an error.	2020-11-09 13:45:17 +02:00
Alex Dowad	ad7e0f16cc	Fix mbstring support for Shift-JIS - Reject otherwise valid kuten codes which don't map to anything in JIS X 0208. - Handle truncated multi-byte characters as an error. - Convert Shift-JIS 0x7E to Unicode 0x203E (overline) as recommended by the Unicode Consortium, and as iconv does. - Convert Shift-JIS 0x5C to Unicode 0xA5 (yen sign) as recommended by the Unicode Consortium, and as iconv does. (NOTE: This will affect PHP scripts which use an internal encoding of Shift-JIS! PHP assigns a special meaning to 0x5C, the backslash. For example, it is used for escapes in double-quoted strings. Mapping the Shift-JIS yen sign to the Unicode yen sign means the yen sign will not be usable for C escapes in double-quoted strings. Japanese PHP programmers who want to write their source code in Shift-JIS for some strange reason will have to use the JIS X 0208 backslash or 'REVERSE SOLIDUS' character for their C escapes.) - Convert Unicode 0x5C (backslash) to Shift-JIS 0x815F (reverse solidus). - Immediately handle error if first Shift-JIS byte is over 0xEF, rather than waiting to see the next byte. (Previously, the value used was 0xFC, which is the limit for the 2nd byte and not the 1st byte of a multi-byte character.) - Don't allow 'control characters' to appear in the middle of a multi-byte character. The test case for bug 47399 is now obsolete. That test assumed that a number of Shift-JIS byte sequences which don't map to any character were 'valid' (because the byte values were within the legal ranges).	2020-11-09 13:45:16 +02:00
Alex Dowad	cc03c54c36	Remove useless byte{2,4}{be,le} encodings from mbstring There is no meaningful difference between these and UCS-{2,4}. They are just a little bit more lax about passing errors silently. They also have no known use. Alias to UCS-{2,4} in case someone, somewhere is using them.	2020-11-09 13:45:16 +02:00
Alex Dowad	3eb8828d1a	Fix issues with mbstring encoding tests I made some mistakes on this code, which meant that not everything which should be tested was actually being tested.	2020-11-09 13:45:16 +02:00
Alex Dowad	ff953f254c	Add test suite for ARMSCII-8 encoding	2020-11-02 21:31:06 +02:00
Alex Dowad	9f5a4b3bd9	Fix mbstring support for ARMSCII-8 - Identify filter was completely wrong. - Respect `mb_substitute_character` rather than converting invalid bytes to Unicode 0xFFFD (generic replacement character). - Don't convert Unicode 0xFFFD to a valid ARMSCII-8 character. - When converting ARMSCII-8 to ARMSCII-8, don't pass invalid bytes through silently.	2020-11-02 21:31:06 +02:00
Alex Dowad	be1a215538	Optimize (AND FIX) mb_check_encoding (cut execution time by 50%+) Previously, `mb_check_encoding` did an awful lot of unneeded work. In order to determine whether a string was valid or not, it would convert the whole string into wchar (code points), which required dynamically allocating a (potentially large) buffer. Then it would turn right around and convert that big ol' buffer of code points back to the original encoding again. Finally, it would check whether any invalid bytes were detected during that long and onerous process. The thing is, mbstring _already_ has machinery for detecting whether a string is valid in a certain encoding or not, and it doesn't require copying any data around or allocating buffers. Better yet, it can fail fast when an invalid byte is found. Why not use it? It's sure a lot faster! Further, the legacy code was also badly broken. Why? Because aside from checking whether illegal characters were detected, it would also check whether the conversion to and from wchars was lossless. But, some encodings have more than one valid encoding for the same character. In such cases, it is not possible to make the conversion to and from wchars lossless for every valid character. So `mb_check_encoding` would actually reject good strings in a lot of encodings!	2020-11-02 21:31:06 +02:00
Alex Dowad	335c1b98c2	Add test suite for KOI8-U encoding	2020-11-02 21:31:06 +02:00
Alex Dowad	e81458862b	Remove dead code from mbfilter_koi8u.c (and do general code cleanup)	2020-11-02 21:31:06 +02:00
Alex Dowad	f9826fba46	All bytes are valid in KOI8-U encoding	2020-11-02 21:31:06 +02:00
Alex Dowad	9db4387f14	Add test suite for KOI8-R encoding	2020-11-02 21:31:06 +02:00
Alex Dowad	fde7794556	Remove dead code from mbfilter_iso8859_{2,4,5,9,10,13,14,15,16}.c ...Plus some dead code related to ISO-8859-1.	2020-11-02 21:31:06 +02:00
Alex Dowad	0a8ebb36a5	Remove dead code from mbfilter_koi8r.c	2020-11-02 21:31:06 +02:00
Alex Dowad	7b97789ec0	All bytes are valid in KOI8-R encoding	2020-11-02 21:31:06 +02:00
Alex Dowad	9980534a4e	Add test suite for CP850 encoding	2020-11-02 21:31:06 +02:00
Alex Dowad	b6e75265d0	Remove dead code from mbfilter_cp850.c (and do general code cleanup) Since there are no invalid bytes in CP850, these `if` conditions will never be true.	2020-11-02 21:31:06 +02:00
Alex Dowad	8926252ee8	All bytes are valid in CP850 encoding	2020-11-02 21:31:06 +02:00
Alex Dowad	0485bed4c7	Add test suite for CP866 encoding	2020-11-02 21:31:06 +02:00
Alex Dowad	20a404f765	Remove dead code from mbfilter_cp866.c (and do general code cleanup) Since there are no invalid bytes in CP866, these `if` conditions will never be true.	2020-11-02 21:31:06 +02:00
Alex Dowad	bc04e0cc6d	All bytes are valid in CP866 encoding	2020-11-02 21:31:05 +02:00
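
Several of the commits above (8ae0473324, ad7e0f16cc, 8f6889b20d) refer to "kuten codes": the row (ku) and column (ten) numbers, each 1-94, that address a cell in the JIS X 0208 table. The sketch below is not mbstring's code. It is a minimal Python illustration, with hypothetical function names, of the standard arithmetic by which one kuten pair is re-encoded as EUC-JP or Shift-JIS bytes. It also shows why an unmapped cell could be "passed through" between Japanese encodings: the arithmetic works for any pair in range, whether or not a character is assigned there.

```python
def _check_kuten(ku: int, ten: int) -> None:
    # Both coordinates of the 94x94 table are 1-based.
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError(f"kuten out of range: {ku}-{ten}")


def kuten_to_euc_jp(ku: int, ten: int) -> bytes:
    """Encode a JIS X 0208 kuten pair as two EUC-JP bytes."""
    _check_kuten(ku, ten)
    return bytes([ku + 0xA0, ten + 0xA0])


def kuten_to_shift_jis(ku: int, ten: int) -> bytes:
    """Encode a JIS X 0208 kuten pair as two Shift-JIS bytes."""
    _check_kuten(ku, ten)
    # Two rows share one lead byte; lead bytes skip from 0x9F to 0xE0.
    lead = ((ku - 1) >> 1) + (0x81 if ku <= 62 else 0xC1)
    if ku % 2 == 1:
        # Odd rows use trail bytes 0x40-0x9E, skipping 0x7F.
        trail = ten + 0x3F
        if trail >= 0x7F:
            trail += 1
    else:
        # Even rows use trail bytes 0x9F-0xFC.
        trail = ten + 0x9E
    return bytes([lead, trail])


if __name__ == "__main__":
    # Row 1, cell 32 is the JIS X 0208 REVERSE SOLIDUS mentioned above.
    print(kuten_to_shift_jis(1, 32).hex())   # 815f
    # Row 16, cell 1 is the first kanji in the table.
    print(kuten_to_shift_jis(16, 1).hex())   # 889f
    print(kuten_to_euc_jp(16, 1).hex())      # b0a1
```

Note that neither function consults a mapping table, so deciding whether a given kuten pair is actually assigned a character requires a separate lookup; that check is what the commits above add.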


1997 Commits