archived-php-src

mirror of https://github.com/php/php-src.git synced 2026-03-31 20:53:00 +02:00

Author	SHA1	Message	Date
Alex Dowad	b1ab76f742	Minor formatting tweaks in mbfilter_euc_kr.c	2021-06-17 13:12:40 +02:00
Alex Dowad	958ef47d2b	When flushing CP5022x conversion filter, also flush next filter in chain All the mbstring encoding conversion filters do this. I missed it when adding a flush function for CP5022x.	2021-06-17 13:12:40 +02:00
Alex Dowad	caeaa662ab	Strict conversion of UHC text to Unicode Previously, mbstring would accept a lot of things which were not valid UHC text. No more. - Don't allow single-byte control characters to appear where the 2nd byte of a multi-byte character should be. - Validate that the 2nd byte of a multi-byte character is in the expected range. - Treat it as an error if a multi-byte character is truncated. Also add a test suite to confirm that UHC conversion (both to and from Unicode) works according to spec.	2021-06-17 13:12:40 +02:00
Alex Dowad	4550036d96	Minor formatting tweaks in mbfilter_uhc.c	2021-06-17 13:12:40 +02:00
Alex Dowad	e2459857af	Remove duplicate implementation of CP932 from mbstring Sigh. Double sigh. After fruitlessly searching the Internet for information on this mysterious text encoding called "SJIS-open", I wrote a script to try converting every Unicode codepoint from 0-0xFFFF and compare the results from different variants of Shift-JIS, to see which one "SJIS-open" would be most similar to. The result? It's just CP932. There is no difference at all. So why do we have two implementations of CP932 in mbstring? In case somebody, somewhere is using "SJIS-open" (or its aliases "SJIS-win" or "SJIS-ms"), add these as aliases to CP932 so existing code will continue to work.	2021-06-17 13:12:40 +02:00
Alex Dowad	7502c86342	Add test suite for UTF-{7,8,16,32} Also fix a couple small problems with UTF-32 and UTF-8 support: - UTF-32 would pass very large codepoints (>= 0x80000000), which are not valid. - UTF-8 would sometimes emit two error marker characters for a single bad input byte.	2021-06-17 13:12:40 +02:00
George Peter Banyard	c40231afbf	Mark various functions with void arguments. This fixes a bunch of [-Wstrict-prototypes] warnings, because in C func() and func(void) have different semantics.	2021-05-12 14:55:53 +01:00
KsaR	01b3fc03c3	Update http->https in license (#6945) 1. Update http://www.php.net/license/3_01.txt to https, as the server already sends a "Location:" header redirecting to https. 2. Update a few license 3.0 references to 3.01, as 3.0 states "php 5.1.1, 4.1.1, and earlier". 3. Some license comments say "at through the world-wide-web" while most omit "at", so delete it. 4. Fix indentation in some files before |	2021-05-06 12:16:35 +02:00
Alex Dowad	7159907d30	Fix mbstring support for ISO-2022-JP-MS encoding - Treat it as error if multi-byte string or escape sequence is truncated - Don't allow 'control' characters or escape sequences to appear in the middle of a multi-byte char As with ISO-2022-JP-KDDI, the main reference used to develop the tests was the behavior of the existing code. It would have been better to have some independent reference which we could cross-check our code against, but I couldn't find one.	2021-04-15 15:52:31 +02:00
Alex Dowad	570e89a9f3	Fix mbstring support for ISO-2022-JP-KDDI encoding - Treat it as an error if a multi-byte character or escape sequence is truncated - When converting other encodings to ISO-2022-JP-KDDI, don't swallow trailing hash characters or digits - Don't allow 'control' characters to appear in the middle of a multi-byte char Note: I was not able to find any kind of official or even semi-official specification for this legacy encoding. Therefore, the test suite for ISO-2022-JP-KDDI is based largely on the behavior of the existing code. Verifying the correctness of program code in this way is very questionable. In a sense, all you are proving is that the code "does what it does". However, the test suite will still expose any unintended _changes_ to behavior.	2021-04-15 15:52:31 +02:00
Alex Dowad	78dc160e3b	Catch and handle errors in mUTF-7 (IMAP) conversion	2021-04-15 15:52:31 +02:00
Alex Dowad	cef4b94eef	Code cleanup in mbfilter_utf7imap.c	2021-04-15 15:52:31 +02:00
Alex Dowad	8abc5e6827	Catch and handle errors in UTF-7 text conversion	2021-04-15 15:52:31 +02:00
Alex Dowad	689978a63b	Code cleanup in mbfilter_utf7.c	2021-04-15 15:52:31 +02:00
Alex Dowad	ebe6500a0b	Fix error reporting bug for Unicode -> CP50220 conversion To detect errors in conversion from Unicode to another text encoding, each mbstring conversion filter object maintains a count of 'bad' characters. After a conversion operation finishes, this count is checked to see if there was any error. The problem with CP50220 was that mbstring used a chain of two conversion filter objects. The 'bad character count' would be incremented on the second object in the chain, but this didn't do anything, as only the count on the first such object is ever checked. Fix this by implementing the conversion using a single conversion filter object, rather than a chain of two. This is possible because of the recent refactoring, which pulled out the needed logic for CP50220 conversion into a helper function.	2021-04-15 15:52:31 +02:00
Alex Dowad	1f130d4e58	Refactor mbfl_filt_tl_jisx0201_jisx0208 by moving kana conversion into helper function This will enable us to simplify the code for CP50220 conversion, which also relies on this same kana conversion logic.	2021-04-15 15:52:31 +02:00
Alex Dowad	319a340843	Simplify code for working with halfwidth/fullwidth kana conversion filter There's no need to dynamically allocate a struct to hold the 'mode' parameter; just store it directly in `filt->opaque`. Some other things were also being done in an unnecessarily roundabout way. Also, the 'copy' function for CP50220 conversion filters was both broken and unnecessary. Broken, because it malloc'd memory which was never freed by anything. Unnecessary, because the point of the copy is so that various algorithms can try running bytes through a conversion filter and see how many output bytes or characters result, and then back out by restoring the filters to their previous state. But here's the thing; CP50220 conversion filters don't hold cached bytes, which is the main thing which would need to be restored to a previous state.	2021-04-15 15:52:31 +02:00
Alex Dowad	a900ec3397	Remove unneeded 'filter_ctor' member from mbfl_convert_filter struct This function pointer is only called when initializing the struct. After that nothing is done with it. Therefore, there is no need to keep it in the struct.	2021-04-15 15:52:31 +02:00
Alex Dowad	affc3076f3	Remove unused 'next_filter' member from mbfl_filt_tl_jisx0201_jisx0208_param struct	2021-04-15 15:52:31 +02:00
Alex Dowad	636251a522	Remove useless function mbfl_filt_tl_jisx0201_jisx0208_init This constructor function doesn't do anything different than the generic one. There's no need to invoke it, either, when initializing a CP50220 conversion filter.	2021-04-15 15:52:31 +02:00
George Peter Banyard	5caaf40b43	Introduce pseudo-keyword ZEND_FALLTHROUGH And use it instead of comments	2021-04-07 00:46:29 +01:00
Alex Dowad	d8c785b894	Update 'East Asian Width' table to comply with Unicode 13.0 Instead of manually maintaining the data in eaw_table.h, it is now automatically generated by ucgendat/ucgendat.php, using the EastAsianWidth.txt file from the Unicode Consortium. Something must be said about the deleted test case. Back in 2004, someone noticed that `mb_strwidth` didn't comply with Unicode 4.0. A test case was added to expose the problem. Well, time keeps moving on, and with the changing years, new Unicodes are born and old Unicodes die. Some characters which were counted as double-width in Unicode 4.0 are no longer such in Unicode 13.0, which renders the test case obsolete. At the same time, make a couple of spelling/grammar fixes in ucgendat.php.	2021-01-19 20:38:44 +02:00
Alex Dowad	a06c20a17c	Remove useless constant MBFL_ENCTYPE_MBCS This flag indicated that an encoding was 'multi-byte'; it can use a variable number of bytes to encode each character. As it turns out, we don't actually need to check this flag anywhere, so it's better to remove it.	2021-01-15 21:55:41 +02:00
Alex Dowad	6cbeb6476e	Remove unused macros from mbfilter_cp51932.c, mbfilter_iso2022jp_mobile.c	2021-01-15 21:55:41 +02:00
Alex Dowad	34ece40872	Remove useless mbstring encoding 'JIS-ms' MicroSoft invented three encodings very similar to ISO-2022-JP/JIS7/JIS8, called CP50220, CP50221, and CP50222. All three are supported by mbstring. Since these encodings are very similar, some code can be shared. Actually, conversion of CP50220/1/2 to Unicode is exactly the same operation; it's when converting from Unicode to CP50220/1/2 that some small differences arise in how certain katakana are handled. The most important common code was a function called `mbfl_filt_wchar_jis_ms`. The `jis_ms` part doubtless refers to the fact that these encodings are modified versions of 'JIS' invented by 'MS'. mbstring also went a step further and exported 'JIS-ms' to userland as a separate encoding from CP50220/1/2. If users requested 'JIS-ms' conversion, they got something like CP50220/1/2, minus their special ways of handling half-width katakana when converting from Unicode. But... that 'encoding' is not something which actually exists in the world outside of mbstring. CP50220/1/2 do exist in MicroSoft software, but not 'JIS-ms'. For a text encoding conversion library, inventing new variant encodings and implementing them is not very productive. Our interest is in handling text encodings which real people actually use for... you know, storing actual text and things like that.	2021-01-15 21:55:41 +02:00
Alex Dowad	fcbe45de10	Remove useless mbstring encoding 'CP50220-raw' CP50220 is a variant of ISO-2022-JP invented by MicroSoft, which handles some Unicode characters which are not representable in ISO-2022-JP by converting them to similar characters which are representable. What, then, is CP50220-raw? An Internet search turns up absolutely nothing. Reference works which I consulted don't say anything about it. Other text conversion libraries don't support it. From looking at the code: It's just the same as CP50220, but it accepts unmapped JIS X 0208 characters passed through from other Japanese encodings and silently encodes them using the usual ISO-2022-JP escape sequence and representation for JIS X 0208 characters. It's hard to see how this could be useful. OK, let me come out and say it: it's _not_ useful. We can confidently jettison this (mis)feature.	2021-01-15 21:55:41 +02:00
Alex Dowad	888f5d7729	CP5022{0,1,2}: treat truncated multibyte characters as error	2021-01-15 21:55:41 +02:00
Alex Dowad	0ec34da8e0	CP5022{0,1,2}: treat unrecognized escapes as error	2021-01-15 08:30:36 +02:00
Alex Dowad	a50607d11d	CP5022{0,1,2}: use JISX0201 for U+203E (overline) Same issue as `d497c0e96f` addressed for JIS7/JIS8, but for CP5022{0,1,2} this time.	2021-01-15 08:30:30 +02:00
Alex Dowad	5e5243ab65	CP5022{0,1,2}: convert Unicode codepoints in 'user' area (0xE000-E757) correctly Unicode has a range of 'private' codepoints which individual applications can use for their own purposes. When they were inventing CP932, MicroSoft mapped these 'private' or 'user' codepoints to twenty new rows added to the JIS X 0208 character table. (JIS X 0208 is based on a 94x94 table; MS used rows 95-114 for private characters.) `mbfl_filt_conv_wchar_jis_ms` converted these private codepoints to rows 85-94 rather than 95-114. The code included a link to a document on the OpenGroup web site, dating back to 1996 [1], which proposed mapping private codepoints to these rows. However, that is not consistent with what mbstring does when converting CP5022x to Unicode. There seems to be a dearth of information on CP5022x on the web. However, I did find one (Japanese-language) page on CP50221, which states that it maps kuten codes 0x7F21-0x927E to the 'private' Unicode codepoints [2]. As a side note, using rows higher than 95 does seem to defeat one purpose of using an ISO-2022-JP variant: ISO-2022-JP was specifically designed to be "7-bit clean", but once you go beyond row 95, the ku codes are 0x80 and up, so 8 bits are needed. [1] https://web.archive.org/web/20000229180004/http://www.opengroup.or.jp/jvc/cde/ucs-conv.html [2] https://www.wdic.org/w/WDIC/Microsoft%20Windows%20Codepage%20%3A%2050221	2021-01-15 08:26:46 +02:00
Alex Dowad	6e9c8386cb	CP5022{0,1,2}: convert characters in ku 0x2D (13th row) correctly Essentially, CP5022{0,1,2} are to CP932 as ISO-2022-JP is to Shift-JIS. As Shift-JIS and ISO-2022-JP both encode characters from the JIS X 0208 charset, CP932 and CP5022x both encode characters from JIS X 0208 _plus_ extra characters added as MicroSoft vendor extensions. Among the added characters are a number of symbols which MS put in the 13th row of the 94x94 character table. (In JIS X 0208, that row is empty.) mbfilter_cp50220x.c had an `if` clause which was intended to handle the conversion of characters in that 13th row, but it was dead code, as the previous clause was always true in those cases. The solution is to reverse the order of those two clauses (just as they already appeared in mbfilter_cp932.c).	2021-01-15 08:26:38 +02:00
Alex Dowad	cdd0724291	Stricter handling of erroneous input when converting CP5022{0,1,2} text encoding Don't allow escape sequences to start in the middle of a multibyte character. Also, don't silently pass through illegal bytes which appear where the 2nd byte of a multibyte character should be.	2021-01-15 08:25:44 +02:00
Alex Dowad	4299e2de42	JIS7/JIS8 encoding: treat truncated multibyte characters as error	2021-01-14 22:34:16 +02:00
Alex Dowad	b67e358e75	JIS7/JIS8 encoding: handle invalid 2nd byte for Kanji correctly Previously, in ISO-2022-JP/JIS7/JIS8, if an escape sequence (starting with 0x1B) appeared where the 2nd byte of a multibyte character should have been, mbstring would forget all about the truncated multibyte character and happily accept the escape sequence. However, such sequences are not legal and should be flagged as errors. Also, any other illegal bytes appearing where the 2nd byte of a multibyte character was expected were just passed through quietly to the output. Fix that. Also add a test suite for both ISO-2022-JP and JIS7/JIS8. (These are extremely similar encodings; JIS7 and JIS8 are variants of ISO-2022-JP. mbstring's 'JIS' is actually a combination of JIS7 _and_ JIS8, since the extensions which each one adds to ISO-2022-JP are disjoint.)	2021-01-14 22:31:31 +02:00
Alex Dowad	d497c0e96f	JIS7/JIS8 encoding: use JISX0201 for U+203E (overline) In other legacy Japanese encodings like Shift-JIS, we are now using a specific JISX 0208 character for the Unicode overline (U+203E). Previously, the single byte 0x7E was used, but an ASCII 0x7E does not represent an overline, so this was changed. However, JIS7/JIS8 can represent characters in the JISX 0201 character set as well. That character set also includes an overline character, which takes less bytes to encode than the corresponding JISX 0208 character, so we'll use it. This is what mbstring had been doing for a long time; but it changed as a side effect of the recent changes to how U+203E is encoded in Shift-JIS, etc. So change it back.	2021-01-14 22:26:24 +02:00
Alex Dowad	40384da36a	JIS7/JIS8 encoding: treat unrecognized escapes as error	2021-01-14 22:26:24 +02:00
Alex Dowad	c11e12ffe0	Add comment explaining why ISO-2022-JP-2004, etc strings end with ESC ( B These encodings have multiple modes which can be selected via escape sequences. The default starting mode is ASCII. If a string _ends_ in a different mode, we emit a 'redundant' escape sequence to switch back to ASCII. If the resulting string is never concatenated with other strings, that extra escape sequence serves no purpose. But if the resulting string is concatenated with other strings of the same encoding, it ensures that the resulting string will be valid.	2021-01-14 22:26:24 +02:00
Alex Dowad	4b95fdf2ca	ISO-2022-JP-2004 conversion: handle invalid characters correctly	2021-01-14 22:26:24 +02:00
Alex Dowad	e14bdc041a	ISO-2022-JP-2004 conversion: treat unrecognized escapes as error	2021-01-14 22:26:24 +02:00
Alex Dowad	4d65c2a992	ISO-2022-JP-2004 conversion: represent backslash and tilde as ASCII This issue dates back to some commits I merged recently, which made encodings like Shift-JIS-2004 use appropriate JIS X 0208 characters to represent backslashes and tildes, rather than single-byte characters which are used in those encodings with a different meaning (for example, in these encodings, 0x5C is used for a halfwidth Yen sign, rather than a backslash). There was an unintended side effect: ISO-2022-JP-2004 was also made to represent backslashes and tildes using JIS X 0208 characters. However, ISO-2022-JP explicitly includes ASCII as one of its selectable character sets, and ISO-2022-JP-2004 is just an extension of ISO-2022-JP. So when converting text to ISO-2022-JP-2004, we can convert Unicode backslashes and tildes to ASCII rather than using the corresponding JIS X 0208 characters.	2021-01-14 22:26:24 +02:00
Alex Dowad	c9fea7db72	Convert U+00AF (MACRON) to 0x8150 (FULLWIDTH MACRON) in some SJIS variants Except for vanilla Shift-JIS, where 0x7E is a halfwidth overline/macron. As for Shift-JIS-2004, it has an added character (byte sequence 0x854A) which was defined as a halfwidth macron in JIS X 0213:2000, so we use that.	2020-11-25 20:51:45 +02:00
Alex Dowad	ecf718470b	Convert U+FF5E (FULLWIDTH TILDE) to 0x8160 (WAVE DASH) in SJIS variants By entering this character in the JIS X 0208 conversion table, we can remove a bunch of explicit `if` clauses in different conversion filters. It also means that U+FF5E can be converted into SJIS-mac now; I don't know why this one SJIS variant rejected U+FF5E before, since 0x8160 means the same thing in SJIS-mac as the others.	2020-11-25 20:51:45 +02:00
Alex Dowad	4f3bd2e235	Convert U+203E (OVERLINE) to 0x8150 (FULLWIDTH MACRON) in some SJIS variants Converting U+203E to 0x7E was especially wrong for CP932, where 0x7E represents a tilde. For vanilla Shift-JIS and Shift-JIS-2004, converting to 0x7E is acceptable, since 0x7E does represent an overline/macron in those encodings. Follow the same principle in CP51932, which is closely related to CP932.	2020-11-25 20:51:45 +02:00
Alex Dowad	0d0029d729	0x7E is not a tilde in Shift-JIS{,-2004}	2020-11-25 20:51:45 +02:00
Alex Dowad	e4ee979111	0x5C is not a Yen sign in CP932 (or CP51932) When Microsoft created CP932 (their version of Shift-JIS), they explicitly used bytes 0-0x7F to represent ASCII characters rather than JIS X 0201 characters. So when converting Unicode to CP932, it is not correct to convert U+00A5 to CP932 0x5C. Fortunately, CP932 does have a multi-byte FULLWIDTH YEN SIGN character which we can use instead. CP51932 uses the same extended character set as CP932; while CP932 is MicroSoft's extended version of Shift-JIS, CP51932 is their extended version of EUC-JP. So the same reasoning applies to CP51932.	2020-11-25 20:51:45 +02:00
Alex Dowad	315d48b434	0x5C is not a backslash in Shift-JIS-2004 Shift-JIS-2004 is an extension of Shift-JIS, which uses 0x5C for the Yen sign. Therefore, it is not correct to convert ASCII 0x5C (backslash) to Shift-JIS-2004 0x5C (yen sign). JIS X 0208 does have a backslash, so we can convert ASCII backslash to SJIS-2004 backslash instead. From time immemorial, there has been confusion around the treatment of 0x5C bytes on systems using legacy Japanese encodings. JIS X 0201 specified that 0x5C means a yen sign, and thus fonts on Japanese systems, including early versions of Windows, displayed a 0x5C byte as a yen sign. This meant that when ASCII text files were displayed on such systems, what were meant to be backslashes would appear as yen signs. Japanese C programmers could write character escapes using yen signs, and C compilers built on the assumption that the input was ASCII would interpret these escapes as desired. Likewise for shell scripts. Et cetera, et cetera... Therefore, if the input to `mb_convert_encoding` is (for example) a C program, and after converting to Shift-JIS-2004, the user wishes to feed the output into a C compiler, then perhaps ASCII 0x5C should be mapped to SJIS 0x5C. However, this scenario is ridiculous and will never happen. A more realistic scenario might be: an article written in SJIS-2004 has embedded Windows file paths (like 'C:\Program Files'), with yen signs used as a path separator. If we convert SJIS-2004 0x5C to ASCII 0x5C, then the path separators will be 'fixed' by the conversion. For general written texts, it is much better to convert backslashes to... backslashes. And yen signs, to yen signs.	2020-11-25 20:51:44 +02:00
Alex Dowad	5c805655db	Enhance handling of CP51932 encoding - Don't pass 'control' characters through in the middle of a multi-byte char - Treat truncated multi-byte characters as an error	2020-11-25 20:51:44 +02:00
Alex Dowad	beef597124	Fix mbstring support for SJIS-Mobile (DoCoMo, KDDI, and Softbank variants of Shift-JIS) Lots of problems here. - Don't pass 'control' characters through silently in the middle of a multi-byte character. - Treat it as an error if a multi-byte character is truncated. - For ESC sequences used to encode emoji on earlier Softbank phones, if an invalid ESC sequence is found, don't pass it through. Rather, handle it as an error and respect `mb_substitute_character`. - In ranges used by mobile vendors for emoji, if a certain byte sequence doesn't map to any emoji, don't emit a mangled value (actually a raw (ku*94)+ten value, which may not even be a valid Unicode codepoint at all). - When converting Unicode to SJIS-Mobile, don't mangle codepoints which fall in the 2nd range of MicroSoft vendor extensions. Some vendor-specific emoji have been mapped to standard Unicode codepoints now, rather than 'private use area' codepoints. When the legacy code was written, these codepoints may not have existed yet in the Unicode standard which was current at that time. Also do a major code cleanup -- remove dead code, rearrange what is left, use some new macros and helper functions to make the code clearer...	2020-11-25 20:51:44 +02:00
Alex Dowad	bbbadae0ae	Combine MBFL_ENCTYPE_MWC2{BE,LE} constants These constants indicate that a text encoding uses 2+ bytes for each character, and is either big endian or little endian (respectively). But nothing in mbstring cares about the difference between MBFL_ENCTYPE_MWC2BE and MBFL_ENCTYPE_MWC2LE. (Actually, nothing cares about whether these flags are set at all... maybe we should just remove them?)	2020-11-25 19:52:19 +02:00
Alex Dowad	72660c416a	Combine MBFL_ENCTYPE_WCS{2,4}{BE,LE} constants These flags identify text encodings in mbstring which use a constant number of bytes per character. While some parts of the code do use these flags, usually to detect cases which can be optimized due to constant-width encoding, nothing cares whether the encodings are 'LE' (little-endian) or 'BE' (big-endian). So we can simplify things by combining constants.	2020-11-25 19:52:19 +02:00
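
The row arithmetic described in commit 5e5243ab65 (CP5022{0,1,2} 'user' area) can be checked with a short sketch. It is not mbstring code: the function and constant names are hypothetical, and it assumes that the private codepoints 0xE000-0xE757 are laid out sequentially from row 95, cell 1, and that each ku/ten value is encoded ISO-2022-JP style by adding 0x20. Under those assumptions the range covers 1880 = 20 x 94 codepoints and ends at kuten code 0x927E, which agrees with the range quoted in that commit message.

```python
# Sketch only: ku/ten arithmetic for the 'user' area discussed in 5e5243ab65.
# All names are hypothetical; the sequential layout is an assumption.

ROW_SIZE = 94          # JIS X 0208 is a 94x94 table
PUA_START = 0xE000     # first 'user' codepoint named in the commit
PUA_END = 0xE757       # last 'user' codepoint named in the commit
FIRST_USER_ROW = 95    # rows 95-114 hold the user-defined characters


def pua_to_kuten(codepoint: int) -> tuple:
    """Map a private-use codepoint to a 1-based (ku, ten) pair."""
    if not PUA_START <= codepoint <= PUA_END:
        raise ValueError(f"U+{codepoint:04X} is outside 0xE000-0xE757")
    offset = codepoint - PUA_START
    return FIRST_USER_ROW + offset // ROW_SIZE, 1 + offset % ROW_SIZE


def kuten_to_bytes(ku: int, ten: int) -> bytes:
    """Encode (ku, ten) as two bytes by adding 0x20 to each value."""
    return bytes([ku + 0x20, ten + 0x20])


if __name__ == "__main__":
    # First codepoint, first codepoint of the next row, last codepoint.
    for cp in (PUA_START, PUA_START + ROW_SIZE, PUA_END):
        ku, ten = pua_to_kuten(cp)
        encoded = kuten_to_bytes(ku, ten)
        print(f"U+{cp:04X} -> ku {ku}, ten {ten} -> {encoded.hex()}")
```

Row 95 still encodes to 0x7F, but every row from 96 onward produces a first byte of 0x80 or higher, which is the "7-bit clean" concern raised in the commit message.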

400 Commits