archived-php-src

mirror of https://github.com/php/php-src.git synced 2026-04-03 22:22:18 +02:00

Author	SHA1	Message	Date
Nikita Popov	2dafb0e30f	Add comments to grouped character properties [ci skip]	2021-08-24 22:09:26 +02:00
Nikita Popov	425c2e3ba1	Combine control into one character group Same as with punct, we're currently not interested in distinguishing between Cc and Cf, so only store their union.	2021-08-24 20:39:16 +02:00
Nikita Popov	f458b16041	Combine punctuation into one character group We're not currently interested in distinguishing between individual punctuation types, so just merge everything into one general category to make the property lookup more efficient.	2021-08-24 19:21:21 +02:00
Nikita Popov	d2073179e3	Return bool from php_unicode_is_prop()	2021-08-24 19:21:21 +02:00
Nikita Popov	3be94217f4	Don't use sentinel value for unicode property lookup 0xffff was used to mark character properties without any members. This made the code unnecessarily complicated, because we need to check for 0xffff values when looking up the property ranges. We can simply encode this as an empty set of ranges.	2021-08-24 15:53:43 +02:00
Nikita Popov	14173186db	Add EXTENSIONS section	2021-08-11 14:03:18 +02:00
Nikita Popov	28500fe4ef	Fixed bug #81349 The ascii to wchar was reporting errors using conv_illegal_output, while it should have been using WCSGROUP_THROUGH. Effectively that replaced illegal characters with '?' for the purpose of identification.	2021-08-11 11:37:02 +02:00
Nikita Popov	89aa42c74b	Add missing EXTENSIONS section	2021-07-28 12:38:41 +02:00
Nikita Popov	a1c1ee6a48	Don't use opaque for encoding detection score opaque is used by the htmlentities filter, which means that we end up trying to free the score value as a pointer. Don't try to be overly tricky here and simply allocate a separate structure to hold the number of illegal characters and the score.	2021-07-28 10:54:27 +02:00
Nikita Popov	9d0db2e98a	Fixed bug #81298 Creation of the filter may fail for some special encodings, for which detection is not supported.	2021-07-28 10:11:46 +02:00
Alex Dowad	26fc7c4256	Fix typo in mbfilter.h As pointed out by Bruno Haible (https://haible.de/bruno).	2021-07-19 12:17:00 +02:00
Alex Dowad	13136a575d	Fix conversion of GB18030 text (and add test suite) - Truncated multi-byte characters are treated as an error - Reject GB18030 4-byte codes which translate to (non-existent) Unicode codepoints above 0x10FFFF - Add a number of missing mappings from the GB18030 standards (These mappings are supported by iconv. I don't know why they were missing from mbstring.)	2021-07-19 12:17:00 +02:00
Alex Dowad	340164bcc9	Reduce size of conversion tables for CP936	2021-07-19 12:17:00 +02:00
Alex Dowad	73c6a5b89d	Fix conversion of Big5 and CP950 text (and add test suite) - Truncated multi-byte characters are treated as an error - Follow recommended mappings from Unicode consortium	2021-07-19 12:17:00 +02:00
Nikita Popov	639015845f	Deprecate calling mb_check_encoding() without argument Part of https://wiki.php.net/rfc/deprecations_php_8_1.	2021-07-08 15:34:49 +02:00
Alex Dowad	b626e893ff	Fix conversion of ISO-2022-KR text (and add test suite) - Truncated multi-byte characters are treated as an error - Truncated or unrecognized escape sequences are treated as an error - ASCII control characters are not allowed to appear in the middle of a multi-byte character	2021-07-05 16:28:16 +02:00
Alex Dowad	658db1f6ea	Code cleanup in mbfilter_uhc.c	2021-07-05 16:28:16 +02:00
Alex Dowad	0a8c00755d	Fix conversion of EUC-JP-2004 text (and add test suite) - Truncated multi-byte characters are treated as an error now - Invalid multi-byte characters are treated as an error rather than being quietly swallowed - ASCII control characters are not allowed to appear in the middle of a multi-byte character	2021-07-05 16:28:16 +02:00
Alex Dowad	ff85ed8adc	Fix conversion of EUC-TW text (and add test suite) - Treat text which ends abruptly in the middle of a multi-byte character as erroneous. - Don't allow ASCII control characters to appear in the middle of a multi-byte character. - If an illegal byte appears in the middle of a multi-byte character, go back to the initial state rather than trying to finish the multi-byte character. - There was a bug in the file with the conversion tables, which set the 'maximum codepoint which can be converted using table A2' using the size of table A1, not table A2. This meant that several hundred Unicode codepoints which should have been able to be converted to EUC-TW were flagged as erroneous instead. - When a sequence which cannot possibly be a prefix of a valid multi-byte character is found, immediately flag it as an error, rather than waiting to read more bytes first. - Allow characters in CNS-11643 plane 1 to be encoded as 4-byte sequences (although they can also be encoded as 2-byte sequences). This is allowed by the standard for EUC-TW text.	2021-06-29 12:25:21 +02:00
Alex Dowad	8b25e38b21	Fix conversion of EUC-CN text (and add test suite) - Flag truncated multi-byte characters as erroneous. - Don't allow ASCII control characters to appear in the middle of a multi-byte character. - There was a bug whereby some unrecognized Unicode codepoints would be passed through unchanged to the output when converting Unicode to EUC-CN. - Stick to the original EUC-CN standard, rather than CP936 (an extended version invented by MS).	2021-06-29 12:25:21 +02:00
Alex Dowad	69c979aaea	Fix conversion of EUC-KR text (and add test suite) - Treat truncated multi-byte characters as an error. - Don't allow ASCII control characters to appear in the middle of a multi-byte character. - There was also a bug whereby some unrecognized Unicode codepoints would be passed through to the output unchanged when converting Unicode to EUC-KR.	2021-06-29 12:25:21 +02:00
Alex Dowad	ddea06699b	Remove table generation scripts which have not been used for years	2021-06-29 12:25:21 +02:00
Alex Dowad	ebae1a4524	Fix conversion of CP936 text (and add test suite) - Treat truncated multi-byte characters as an error. - Don't allow ASCII control characters to appear in the middle of a multi-byte character. - Adjust some mappings to match recommendations in conversion table from Unicode Consortium.	2021-06-29 12:25:21 +02:00
Alex Dowad	1e5c3c13fd	Fix conversion of HZ text (and add test suite) - Treat truncated multi-byte characters as an error. - Don't allow ASCII control characters to appear in the middle of a multi-byte character. - Handle ~ escapes according to the HZ standard (RFC 1843). - Treat unrecognized ~ escapes as an error. - Multi-byte characters (between ~{ ~} escapes) are GB2312, not CP936. (CP936 is an extended version from MicroSoft, but the RFC does not state that this extended version of GB should be used.)	2021-06-29 12:25:21 +02:00
Patrick Allaert	aff365871a	Fixed some spaces used instead of tabs	2021-06-29 11:30:26 +02:00
Alex Dowad	b1ab76f742	Minor formatting tweaks in mbfilter_euc_kr.c	2021-06-17 13:12:40 +02:00
Alex Dowad	958ef47d2b	When flushing CP5022x conversion filter, also flush next filter in chain All the mbstring encoding conversion filters do this. I missed it when adding a flush function for CP5022x.	2021-06-17 13:12:40 +02:00
Alex Dowad	caeaa662ab	Strict conversion of UHC text to Unicode Previously, mbstring would accept a lot of things which were not valid UHC text. No more. - Don't allow single-byte control characters to appear where the 2nd byte of a multi-byte character should be. - Validate that the 2nd byte of a multi-byte character is in the expected range. - Treat it as an error if a multi-byte character is truncated. Also add a test suite to confirm that UHC conversion (both to and from Unicode) works according to spec.	2021-06-17 13:12:40 +02:00
Alex Dowad	4550036d96	Minor formatting tweaks in mbfilter_uhc.c	2021-06-17 13:12:40 +02:00
Alex Dowad	9868c17368	Mark CP932 and CP51932 encoding tests as 'slow tests'	2021-06-17 13:12:40 +02:00
Alex Dowad	e2459857af	Remove duplicate implementation of CP932 from mbstring Sigh. Double sigh. After fruitlessly searching the Internet for information on this mysterious text encoding called "SJIS-open", I wrote a script to try converting every Unicode codepoint from 0-0xFFFF and compare the results from different variants of Shift-JIS, to see which one "SJIS-open" would be most similar to. The result? It's just CP932. There is no difference at all. So why do we have two implementations of CP932 in mbstring? In case somebody, somewhere is using "SJIS-open" (or its aliases "SJIS-win" or "SJIS-ms"), add these as aliases to CP932 so existing code will continue to work.	2021-06-17 13:12:40 +02:00
Alex Dowad	7502c86342	Add test suite for UTF-{7,8,16,32} Also fix a couple small problems with UTF-32 and UTF-8 support: - UTF-32 would pass very large codepoints (>= 0x80000000), which are not valid. - UTF-8 would sometimes emit two error marker characters for a single bad input byte.	2021-06-17 13:12:40 +02:00
Nikita Popov	a06d015e61	Remove unnecessary mbstring skipifs These functions are always available (if the extension is available at all).	2021-06-14 15:27:28 +02:00
Nikita Popov	6600ad6067	Add some missing EXTENSIONS sections to misc tests	2021-06-14 14:52:44 +02:00
Nikita Popov	4083600bd5	Port mbstring to use EXTENSIONS	2021-06-11 14:00:43 +02:00
Nikita Popov	39131219e8	Migrate more SKIPIF -> EXTENSIONS (#7139) This is a mix of more automated and manual migration. It should remove all applicable extension_loaded() checks outside of skipif.inc files.	2021-06-11 12:58:44 +02:00
Nikita Popov	7485978339	Migrate SKIPIF -> EXTENSIONS (#7138) This is an automated migration of most SKIPIF extension_loaded checks.	2021-06-11 11:57:42 +02:00
Ayesh Karunaratne	b8e380ab09	Update deprecation message for incompatible float to int conversion Updates the deprecation message for implicit incompatible float to int conversion from: ``` Implicit conversion from non-compatible float %.H to int in %s on line %d ``` to ``` Implicit conversion from float %.H to int loses precision in %s on line %d ``` Related: #6661	2021-06-07 14:36:11 +02:00
George Peter Banyard	b6958bb847	Implement "Deprecate implicit non-integer-compatible float to int conversions" RFC. (#6661) RFC: https://wiki.php.net/rfc/implicit-float-int-deprecate Co-authored-by: Nikita Popov <nikita.ppv@gmail.com>	2021-05-31 15:48:45 +01:00
George Peter Banyard	e7135cb817	Use zend_string_equals_* API in a couple of more place Closes GH-6979	2021-05-14 13:45:17 +01:00
George Peter Banyard	aca6aefd85	Remove 'register' type qualifier (#6980) The compiler should be smart enough to optimize this on its own	2021-05-14 13:38:01 +01:00
George Peter Banyard	c40231afbf	Mark various functions with void arguments. This fixes a bunch of [-Wstrict-prototypes] warning, because in C func() and func(void) have different semantics.	2021-05-12 14:55:53 +01:00
KsaR	01b3fc03c3	Update http->https in license (#6945) 1. Update: http://www.php.net/license/3_01.txt to https, as there is anyway server header "Location:" to https. 2. Update few license 3.0 to 3.01 as 3.0 states "php 5.1.1, 4.1.1, and earlier". 3. In some license comments is "at through the world-wide-web" while most is without "at", so deleted. 4. fixed indentation in some files before |	2021-05-06 12:16:35 +02:00
Christoph M. Becker	592cfa309e	Merge branch 'PHP-8.0' * PHP-8.0: Fix #81011: mb_convert_encoding removes references from arrays	2021-05-04 18:40:23 +02:00
Christoph M. Becker	d1c0cbdcb1	Merge branch 'PHP-7.4' into PHP-8.0 * PHP-7.4: Fix #81011: mb_convert_encoding removes references from arrays	2021-05-04 18:39:39 +02:00
Christoph M. Becker	0cafd53d18	Fix #81011: mb_convert_encoding removes references from arrays We need to dereference references. Closes GH-6938.	2021-05-04 18:37:40 +02:00
Alex Dowad	7159907d30	Fix mbstring support for ISO-2022-JP-MS encoding - Treat it as error if multi-byte string or escape sequence is truncated - Don't allow 'control' characters or escape sequences to appear in the middle of a multi-byte char As with ISO-2022-JP-KDDI, the main reference used to develop the tests was the behavior of the existing code. It would have been better to have some independent reference which we could cross-check our code against, but I couldn't find one.	2021-04-15 15:52:31 +02:00
Alex Dowad	570e89a9f3	Fix mbstring support for ISO-2022-JP-KDDI encoding - Treat it as an error if a multi-byte character or escape sequence is truncated - When converting other encodings to ISO-2022-JP-KDDI, don't swallow trailing hash characters or digits - Don't allow 'control' characters to appear in the middle of a multi-byte char Note: I was not able to find any kind of official or even semi-official specification for this legacy encoding. Therefore, the test suite for ISO-2022-JP-KDDI is based largely on the behavior of the existing code. Verifying the correctness of program code in this way is very questionable. In a sense, all you are proving is that the code "does what it does". However, the test suite will still expose any unintended _changes_ to behavior.	2021-04-15 15:52:31 +02:00
Alex Dowad	f5f3ee7aee	Add test suite for mUTF-7 (IMAP) encoding	2021-04-15 15:52:31 +02:00
Alex Dowad	78dc160e3b	Catch and handle errors in mUTF-7 (IMAP) conversion	2021-04-15 15:52:31 +02:00


2084 Commits