archived-php-src

mirror of https://github.com/php/php-src.git synced 2026-04-22 07:28:09 +02:00

Author	SHA1	Message	Date
Alex Dowad	76a92c26e3	mb_decode_numericentity decodes valid entities which are truncated at end of string Since mb_decode_numericentity does not require all HTML entities to end with ';', but allows them to be terminated by ANY non-digit character, it doesn't make sense that valid entities which butt up against the end of the input string are not converted. As it turned out, supporting this case also made it possible to simplify the code nicely.	2022-07-18 15:11:47 +02:00
Alex Dowad	91969e908f	New implementation of mb_{de,en}code_numericentity This new implementation uses the new encoding conversion filters. Aside from fewer LOC and (hopefully) improved readability, the differences are as follows: BEHAVIOR CHANGES: - The old implementation used signed arithmetic when operating on the 'convmap'. This meant that results could be surprising when using convmap entries with 1 in the MSB. Further, types like 'int' were used rather than those with a specific bit width, such as 'int32_t'. This meant that results could also depend on the platform width of an 'int'. Now unsigned arithmetic is used, with explicit bit widths. - Similarly, while converting decimal numeric entities, the legacy implementation would ensure that the value never overflowed INT_MAX, and if it did, the entity would be treated as invalid and passed through unconverted. However, that again means that results depend on the platform size of an 'int'. So now, we use a value with explicit bit width (32 bits) to hold the value of a deconverted decimal entity, and ensure that the entity value does not overflow that. Further, because we are using an UNSIGNED 32-bit value rather than a signed one, the ceiling for how large a decimal entity can be is higher now. All of this will probably not affect anyone, since Unicode codepoints above U+10FFFF are invalid anyways. To see the difference, you need to be using a text encoding like UCS-4, which allows huge 'codepoints'. - If it saw something which looked like a hex entity, but turned out not to be a valid numeric entity, the old implementation would sometimes convert the hexadecimal digits a-f to A-F (uppercase). The new implementation passes invalid numeric entities through without performing case conversion. - The old implementation of mb_encode_numericentity was limited in how many decimal/hex digits it could emit. If a text encoding like UCS-4 was in use, where 'codepoints' can have huge values (larger than the valid range stipulated by the Unicode standard), it would not error out on a 'codepoint' whose value was too large for it, but would rather mangle the value and emit a numeric entity which decoded to some other random codepoint. The new implementation is able to emit enough digits to express any value which fits in 32 bits. PERFORMANCE: Based on micro-benchmarks run on my development machine: Decoding numeric HTML entities is about 4 times faster, for both decimal and hexadecimal entities, across a variety of input string lengths. Encoding is about 3 times faster.	2022-07-18 15:11:30 +02:00
Nikita Popov	39131219e8	Migrate more SKIPIF -> EXTENSIONS (#7139 ) This is a mix of more automated and manual migration. It should remove all applicable extension_loaded() checks outside of skipif.inc files.	2021-06-11 12:58:44 +02:00
Gabriel Caruso	ded3d984c6	Use EXPECT instead of EXPECTF when possible EXPECTF logic in run-tests.php is considerable, so let's avoid it.	2018-02-20 21:53:48 +01:00
Rui Hirokawa	cb17a1b216	added tests for #40685 .	2011-09-24 02:11:48 +00:00

5 Commits