archived-php-src

mirror of https://github.com/php/php-src.git synced 2026-04-14 11:32:11 +02:00

Author	SHA1	Message	Date
Alex Dowad	5370f344d2	mb_strimwidth inserts error markers in invalid input string (for backwards compatibility) The old implementation did this. It also did the same to the trim marker, if the trim marker was invalid in the specified encoding, but I have not imitated that behavior (for performance).	2022-08-02 11:07:06 +02:00
Alex Dowad	78ee18413f	Move kana conversion function to mbfilter_cp5022x.c ...To avoid a dependency from libmbfl to mbstring. Thanks to Nikita Popov for pointing this issue out.	2022-08-02 11:07:06 +02:00
Alex Dowad	e1351eb0a6	Fix legacy text conversion filter for UTF-16 Make necessary changes to filter state before using CK macro.	2022-08-02 11:07:06 +02:00
Alex Dowad	219fff376b	Fix legacy text conversion filter for UTF7-IMAP Make necessary updates to filter state before using CK macro.	2022-08-02 11:07:06 +02:00
Alex Dowad	0a6ea5bd4e	Fix legacy text conversion filter for UCS-4 If a downstream filter returns -1 (error), the CK macro will make the UCS-4 conversion filter also immediately return. This means that any necessary updates to the filter state have to be done before using CK, or it will be left in an invalid state and will not behave correctly when flushed.	2022-08-02 11:07:06 +02:00
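The three fixes above share one pattern: state must be updated before the early-return macro runs. A minimal sketch of that pattern follows, assuming a toy two-byte filter. The `toy_filter` type, `failing_output`, and both `feed_*` functions are hypothetical names invented for illustration, and the `CK` definition is written to match the behaviour described in the commit message (return immediately when the downstream call reports an error); none of this is the actual libmbfl code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Early-return helper in the spirit of the CK macro described above. */
#define CK(statement) do { if ((statement) < 0) return -1; } while (0)

/* Hypothetical two-byte filter: buffers one byte, emits on the second. */
typedef struct toy_filter {
    int status;                  /* 1 while a first byte is buffered */
    uint32_t cache;              /* the buffered first byte, shifted */
    int (*output)(uint32_t cp);  /* downstream filter */
} toy_filter;

/* Downstream filter that always reports an error. */
int failing_output(uint32_t cp) { (void)cp; return -1; }

/* Buggy ordering: the state reset sits after CK, so it is skipped on error. */
int feed_buggy(toy_filter *f, int c) {
    assert(f->output != NULL);
    if (f->status == 0) {
        f->cache = (uint32_t)(c & 0xFF) << 8;
        f->status = 1;
    } else {
        CK(f->output(f->cache | (uint32_t)(c & 0xFF)));
        f->status = 0;           /* never reached when output fails */
        f->cache = 0;
    }
    return 0;
}

/* Fixed ordering: reset the state first, then call downstream through CK. */
int feed_fixed(toy_filter *f, int c) {
    assert(f->output != NULL);
    if (f->status == 0) {
        f->cache = (uint32_t)(c & 0xFF) << 8;
        f->status = 1;
    } else {
        uint32_t cp = f->cache | (uint32_t)(c & 0xFF);
        f->status = 0;
        f->cache = 0;
        CK(f->output(cp));
    }
    return 0;
}
```

With a failing downstream filter, `feed_buggy` leaves `status` at 1 after the second byte, so a later flush would still see a half-consumed character, whereas `feed_fixed` leaves the filter in its initial state.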
Alex Dowad	44b4fb2c36	Fix legacy text conversion filter for CP50220 In my recent commit which replaced the implementation of mb_convert_kana, the commit message noted that mb_convert_kana previously had a bug whereby null bytes would be 'swallowed' and not passed to the output. This was actually the reason.	2022-08-02 11:07:06 +02:00
Alex Dowad	7299096095	New implementation of mb_strimwidth

This new implementation of mb_strimwidth uses the new text encoding conversion filters. Changes from the previous implementation:

• mb_strimwidth allows a negative 'from' argument, which should count backwards from the end of the string. However, the implementation of this feature was buggy (starting right from when it was first implemented). It used the following code:

    if ((from < 0) || (width < 0)) {
        swidth = mbfl_strwidth(&string);
    }
    if (from < 0) {
        from += swidth;
    }

  Do you see the bug? 'from' is a count of CODEPOINTS, but 'swidth' is a count of TERMINAL COLUMNS. Adding those two together does not make sense. If there were no fullwidth characters in the input string, then the two counts coincide and the feature would work correctly. However, each fullwidth character would throw the result off by one, causing more characters to be skipped than was requested.

• mb_strimwidth also allows a negative 'width' argument, which again counts backwards from the end of the string; in this case, it is not determining the START of the portion which we want to extract, but rather, the END of that portion. Perhaps unsurprisingly, this feature was also buggy. Code:

    if (width < 0) {
        width = swidth + width - from;
    }

  'swidth + width' is fine here; the problem is '- from'. Again, that is subtracting a count of CODEPOINTS from a count of TERMINAL COLUMNS. In this case, we really need to count the terminal width of the string prefix skipped over by 'from', and subtract that rather than the number of codepoints which are being skipped. As a result, if a 'from' count was passed along with a negative 'width', for every fullwidth character in the skipped prefix, the result of mb_strimwidth was one terminal column wider than requested.

  Since these situations were covered by unit tests, you might wonder why the bugs were not caught. Well, as far as I can see, it looks like the author of the 'tests' just captured the actual output of mb_strimwidth and defined it as 'correct'. The tests were written in such a way that it was difficult to examine them and see whether they made sense or not; but a careful examination of the inputs and outputs clearly shows that the legacy tests did not conform to the documented contract of mb_strimwidth.

• The old implementation would always pass the input string through decoding/encoding filters before returning it to the caller, even if it fit within the specified width. This means that invalid byte sequences would be converted to error markers. For performance, the new implementation returns the very same string which was passed in if it does not exceed the specified width. This means that erroneous byte sequences are not converted to error markers unless it is necessary to trim the string.

• The same applies to the 'trim marker' string.

• The old implementation was buggy in the (unusual) case that the trim marker is wider than the requested maximum width of the result. It did an unsigned subtraction of the requested width and the width of the trim marker. If the width of the trim marker was greater, that subtraction would underflow and yield a huge number. As a result, mb_strimwidth would then pass the input string through, even if it was far wider than the requested maximum width. In that case, since the input string is wider than the requested width, and NONE of it will fit together with the trim marker, the new implementation returns just the trim marker. This is the one case where the output can be wider than the requested width: when BOTH the input string and also the trim marker are too wide.

• Since it passed the input string and trim marker through decoding/encoding filters, when using "Quoted-Printable" as the encoding, newlines could be inserted into the trim marker to maintain the maximum line length for QP. This is an extremely bizarre use case and I don't think there is any point in worrying about it. QP will be removed from mbstring in time, anyways.

PERFORMANCE:

• From micro-benchmarking with various input string lengths and text encodings, it appears that the new implementation is 2-3x faster for UTF-8 and UTF-16. For legacy Japanese text encodings like ISO-2022-JP or SJIS, the new implementation is perhaps 25% faster.

• Note that correctly implementing negative 'from' and 'width' arguments imposes a small performance burden in such cases; one which the old implementation did not pay. This slightly skews benchmarking results in favor of the old implementation. However, even so, the new implementation is faster in all cases which I tested.	2022-08-02 11:07:06 +02:00
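The unit mismatch described in the commit above (codepoints versus terminal columns) can be sketched in isolation. The following is an illustration only, not the mbstring implementation: the strings are pre-decoded arrays of codepoints, `cp_width` is a deliberately simplified width rule assumed for this example, and every function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified width rule assumed for this sketch (NOT mbstring's real table):
 * treat U+3000..U+9FFF and U+FF01..U+FF60 as fullwidth (2 columns). */
int cp_width(uint32_t cp) {
    return ((cp >= 0x3000 && cp <= 0x9FFF) || (cp >= 0xFF01 && cp <= 0xFF60)) ? 2 : 1;
}

/* Terminal columns occupied by the first n codepoints of s. */
long str_columns(const uint32_t *s, size_t n) {
    long w = 0;
    for (size_t i = 0; i < n; i++) {
        w += cp_width(s[i]);
    }
    return w;
}

/* Legacy arithmetic: adds a column count to a codepoint offset. */
long legacy_from(const uint32_t *s, size_t n, long from) {
    return from < 0 ? from + str_columns(s, n) : from;
}

/* Consistent units: a negative 'from' counts back from the codepoint count. */
long fixed_from(size_t n, long from) {
    return from < 0 ? from + (long)n : from;
}

/* Legacy arithmetic: subtracts a codepoint offset from a column count. */
long legacy_width(const uint32_t *s, size_t n, long from, long width) {
    return width < 0 ? str_columns(s, n) + width - from : width;
}

/* Consistent units: subtract the column width of the skipped prefix.
 * 'from' must already be resolved to a value in the range 0..n. */
long fixed_width(const uint32_t *s, size_t n, long from, long width) {
    assert(from >= 0 && (size_t)from <= n);
    return width < 0 ? str_columns(s, n) + width - str_columns(s, (size_t)from) : width;
}
```

For the six-codepoint string U+65E5 U+672C U+8A9E 'a' 'b' 'c' (nine columns under the assumed rule), `from = -2` resolves to 7 with the legacy arithmetic, which is past the end of the string, and to 4 with consistent units. With `from = 2` and `width = -1`, the legacy arithmetic yields 6 columns and the consistent version yields 4, which matches the "one column too wide per fullwidth character in the skipped prefix" effect the commit message describes.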
Alex Dowad	94fde1566f	Move implementation of mb_strlen to mbstring.c mbfl_strlen (in mbfilter.c) is still being used in a couple of places but will go away soon.	2022-08-02 11:07:06 +02:00
Christoph M. Becker	3e922bf08f	Merge branch 'PHP-8.1' * PHP-8.1: Fix GH-9008: mb_detect_encoding(): wrong results with null $encodings	2022-07-20 17:01:42 +02:00
Christoph M. Becker	c2bdaa48e1	Fix GH-9008: mb_detect_encoding(): wrong results with null $encodings Passing `null` to `$encodings` is supposed to behave like passing the result of `mb_detect_order()`. Therefore, we need to remove the non-encodings from the `elist` in this case as well. Thus, we duplicate the global `elist`, so we can modify it. Closes GH-9063.	2022-07-20 16:58:55 +02:00
Alex Dowad	6d525a425e	Fix legacy conversion filter for ISO-2022-KR	2022-07-20 07:44:20 +02:00
Alex Dowad	8a915ed26c	Fix legacy conversion filter for SJIS-2004	2022-07-20 07:44:20 +02:00
Alex Dowad	d8a61cef4f	Fix legacy conversion filter for ISO-2022-JP-KDDI	2022-07-20 07:44:20 +02:00
Alex Dowad	9ac49c0dd3	New implementation of mb_convert_kana mb_convert_kana now uses the new text encoding conversion filters. Microbenchmarking shows speed gains of 50%-150% across various text encodings and input string lengths. The behavior is the same as the old mb_convert_kana except for one fix: if the 'zero codepoint' U+0000 appeared in the input, the old implementation would sometimes drop it, not passing it through to the output. This is now fixed.	2022-07-20 07:44:19 +02:00
Máté Kocsis	e328c68305	Rename @cname to @cvalue in stubs (#9043) @cname currently refers to the constant name in C. However, it is not always a (constant) name, but sometimes a function invocation, so naming it as @cvalue would be more appropriate.	2022-07-19 15:11:42 +02:00
Eric Norris	09237f6126	Update request startup error messages	2022-07-18 23:19:59 +01:00
Alex Dowad	76a92c26e3	mb_decode_numericentity decodes valid entities which are truncated at end of string Since mb_decode_numericentity does not require all HTML entities to end with ';', but allows them to be terminated by ANY non-digit character, it doesn't make sense that valid entities which butt up against the end of the input string are not converted. As it turned out, supporting this case also made it possible to simplify the code nicely.	2022-07-18 15:11:47 +02:00
Alex Dowad	5d6bd557b3	mb_decode_numericentity converts entities which immediately follow a valid/invalid entity

Thanks to Kamil Tieleka for suggesting that some of the behaviors of the legacy implementation which the new mb_decode_numericentity implementation took care to maintain were actually bugs and should be fixed. Thanks also to Trevor Rowbotham for providing a link to the HTML specification, showing how HTML numeric entities should be interpreted.

mb_decode_numericentity now processes numeric entities in the following situations where the old implementation would not:

- &<ENTITY> (for example, &&#65;)
- &#<ENTITY>
- &#x<ENTITY>
- <VALID BUT UNTERMINATED DECIMAL ENTITY><ENTITY> (for example, &#65&#65;)
- <VALID BUT UNTERMINATED HEX ENTITY><ENTITY>
- <INVALID AND UNTERMINATED DECIMAL ENTITY><ENTITY> (it does not matter why the first entity is invalid; the value could be too big, it could have too many digits, or it could not match the 'convmap' parameter)
- <INVALID AND UNTERMINATED HEX ENTITY><ENTITY>

This is consistent with the way that web browsers process HTML entities.	2022-07-18 15:11:32 +02:00
Alex Dowad	30bfeef48d	mbfl_strwidth does not need to use legacy conversion filters now ...Because we have the new (faster) conversion filters now for ALL text encodings supported by mbstring.	2022-07-18 15:11:32 +02:00
Alex Dowad	40f5048aa7	Fix new conversion filter for UUEncode This code (written by yours truly) was very broken on input strings long enough to require processing in multiple chunks. Fuzzing revealed this very quickly; after initial rework, further fuzzing also found a couple of very obscure bugs in corner cases.	2022-07-18 15:11:32 +02:00
Alex Dowad	5fee30b630	Fix new conversion filter for QPrint (same order of check as legacy code) Because of checking for maximum line length before certain other checks, the new conversion filter for QPrint could produce different results from the old one in some cases. This was discovered while fuzzing the new implementation of mb_decode_numericentity.	2022-07-18 15:11:32 +02:00
Alex Dowad	3cf432798e	Fix new conversion filter for CP50220 (multi-codepoint kana at end of buffer) If two codepoints which needed to be collapsed into a single kuten code were separated, with one at the end of one buffer and the other at the beginning of the next buffer, they were not converted correctly. This was discovered while fuzzing the new implementation of mb_decode_numericentity.	2022-07-18 15:11:31 +02:00
Alex Dowad	7559bf77d2	Fix new conversion filters for mobile SJIS variants ('0' at end of buffer) Previously, I had adjusted this code so that if a character which could be part of a special Docomo/Softbank/KDDI 'keypad' emoji appeared at the end of one buffer, and the 'keypad' character appeared at the beginning of the next, they would still be combined. However, this broke the handling of such a character appearing at the end of one buffer, and a character which is NOT 'keypad' appearing at the beginning of the next. This was found while fuzzing the new implementation of mb_decode_numericentity.	2022-07-18 15:11:31 +02:00
Alex Dowad	fa83a8e15e	Fix new conversion filter for HTML entities While fuzzing the new mb_decode_numericentity implementation, I discovered that the fast conversion filter for 'HTML-ENTITIES' did not correctly handle an empty named entity ('&;'), nor did it correctly handle invalid named entities whose names were a prefix of a valid entity. Also, it did not correctly handle the case where a named entity is truncated and another named entity starts abruptly.	2022-07-18 15:11:31 +02:00
Alex Dowad	9c3972fb3d	Fix legacy conversion filter for HZ	2022-07-18 15:11:31 +02:00
Alex Dowad	1526bab6d0	Fix legacy conversion filter for GB18030	2022-07-18 15:11:31 +02:00
Alex Dowad	6938e35122	Fix legacy conversion filter for CP50220	2022-07-18 15:11:31 +02:00
Alex Dowad	1662f7f79f	Fix legacy conversion filter for UTF-7	2022-07-18 15:11:31 +02:00
Alex Dowad	c8e4f313fa	Fix legacy conversion filter for ISO-2022-KR When I was working on this code before, it really, really looked like the index into `uhc3_ucs_table` could never overrun the size of the table. Why did I get this wrong? Don't know. Anyways, libfuzzer tore away my illusions and unequivocally demonstrated that the index CAN be larger than the size of the table.	2022-07-18 15:11:31 +02:00
Alex Dowad	cebb8009c6	Fix legacy conversion filters for... almost all 8-bit text encodings	2022-07-18 15:11:31 +02:00
Alex Dowad	2eff19e38f	Fix legacy conversion filter for HTML entities	2022-07-18 15:11:31 +02:00
Alex Dowad	87b71595ba	Fix legacy conversion filter for Base64	2022-07-18 15:11:31 +02:00
Alex Dowad	7ece8f18b0	Fix legacy conversion filter for MacJapanese	2022-07-18 15:11:31 +02:00
Alex Dowad	d7bab66135	Fix legacy conversion filter for SJIS-2004	2022-07-18 15:11:31 +02:00
Alex Dowad	31cbb7a3a5	Fix legacy conversion filter for QPrint	2022-07-18 15:11:30 +02:00
Alex Dowad	048f6cbcde	Fix legacy conversion filter for JIS	2022-07-18 15:11:30 +02:00
Alex Dowad	91969e908f	New implementation of mb_{de,en}code_numericentity

This new implementation uses the new encoding conversion filters. Aside from fewer LOC and (hopefully) improved readability, the differences are as follows:

BEHAVIOR CHANGES:

- The old implementation used signed arithmetic when operating on the 'convmap'. This meant that results could be surprising when using convmap entries with 1 in the MSB. Further, types like 'int' were used rather than those with a specific bit width, such as 'int32_t'. This meant that results could also depend on the platform width of an 'int'. Now unsigned arithmetic is used, with explicit bit widths.

- Similarly, while converting decimal numeric entities, the legacy implementation would ensure that the value never overflowed INT_MAX, and if it did, the entity would be treated as invalid and passed through unconverted. However, that again means that results depend on the platform size of an 'int'. So now, we use a value with explicit bit width (32 bits) to hold the value of a deconverted decimal entity, and ensure that the entity value does not overflow that. Further, because we are using an UNSIGNED 32-bit value rather than a signed one, the ceiling for how large a decimal entity can be is higher now. All of this will probably not affect anyone, since Unicode codepoints above U+10FFFF are invalid anyways. To see the difference, you need to be using a text encoding like UCS-4, which allows huge 'codepoints'.

- If it saw something which looked like a hex entity, but turned out not to be a valid numeric entity, the old implementation would sometimes convert the hexadecimal digits a-f to A-F (uppercase). The new implementation passes invalid numeric entities through without performing case conversion.

- The old implementation of mb_encode_numericentity was limited in how many decimal/hex digits it could emit. If a text encoding like UCS-4 was in use, where 'codepoints' can have huge values (larger than the valid range stipulated by the Unicode standard), it would not error out on a 'codepoint' whose value was too large for it, but would rather mangle the value and emit a numeric entity which decoded to some other random codepoint. The new implementation is able to emit enough digits to express any value which fits in 32 bits.

PERFORMANCE:

Based on micro-benchmarks run on my development machine: Decoding numeric HTML entities is about 4 times faster, for both decimal and hexadecimal entities, across a variety of input string lengths. Encoding is about 3 times faster.	2022-07-18 15:11:30 +02:00
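The second behavior change above (accumulating a decimal entity in an unsigned value of explicit 32-bit width and rejecting overflow) can be illustrated with a small sketch. This is not the code from the commit; `parse_decimal_u32` is a hypothetical helper that reads leading decimal digits and reports whether the value fits in a `uint32_t`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Parse the leading decimal digits of 'digits' into *out.
 * Returns 1 on success; returns 0 if there are no digits or if the
 * value would not fit in an unsigned 32-bit integer. Parsing stops at
 * the first non-digit character (for example ';'). */
int parse_decimal_u32(const char *digits, uint32_t *out) {
    assert(digits != NULL && out != NULL);
    uint32_t value = 0;
    size_t i = 0;
    for (; digits[i] >= '0' && digits[i] <= '9'; i++) {
        uint32_t d = (uint32_t)(digits[i] - '0');
        /* value * 10 + d fits exactly when value <= (UINT32_MAX - d) / 10 */
        if (value > (UINT32_MAX - d) / 10) {
            return 0;
        }
        value = value * 10 + d;
    }
    if (i == 0) {
        return 0;
    }
    *out = value;
    return 1;
}
```

Because the accumulator has a fixed width and is unsigned, the result does not depend on the platform's `int`, and values between 2147483648 and 4294967295 are accepted, whereas a check against `INT_MAX` would reject them on platforms with a 32-bit `int`.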
jcm	dbdef4a55c	QA -mb_convert_encoding_array - error for object item in array Closes GH-9023.	2022-07-15 17:34:35 +02:00
Christoph M. Becker	d6fc165028	Drop useless TODO comment Cf. <https://github.com/php/php-src/pull/9018#issuecomment-1185481492>.	2022-07-15 14:23:59 +02:00
jcm	30d89b19cf	QA - mb_http_input - function returns FALSE for type 'L' or 'l' Closes GH-9018.	2022-07-15 14:22:39 +02:00
Arnaud Le Blanc	4df3dd7679	Reduce memory allocated by var_export, json_encode, serialize, and other (#8902) smart_str uses an over-allocated string to optimize for append operations. Functions that use smart_str tend to return the over-allocated string directly. This results in unnecessary memory usage, especially for small strings. The overhead can be up to 231 bytes for strings smaller than that, and 4095 for other strings. This can be avoided for strings smaller than `4096 - zend_string header size - 1` by reallocating the string. This change introduces `smart_str_trim_to_size()`, and calls it in `smart_str_extract()`. Functions that use `smart_str` are updated to use `smart_str_extract()`. Fixes GH-8896	2022-07-08 14:47:46 +02:00
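The idea behind the commit above (build with an over-allocated buffer, then shrink it before handing the result to the caller) can be sketched with plain `malloc`/`realloc`. This is a generic illustration, not the Zend `smart_str` API: the `growbuf` type and its functions are hypothetical, the 256-byte initial capacity is an arbitrary assumption, and the size threshold mentioned in the commit message is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable string buffer (stand-in for an append-optimized string). */
typedef struct growbuf {
    char *data;
    size_t len;  /* bytes used, excluding the terminating NUL */
    size_t cap;  /* bytes allocated */
} growbuf;

/* Append n bytes, growing the allocation geometrically so that repeated
 * appends stay cheap. Returns 0 on success, -1 on allocation failure. */
int growbuf_append(growbuf *b, const char *s, size_t n) {
    assert(b != NULL && s != NULL);
    if (b->len + n + 1 > b->cap) {
        size_t new_cap = b->cap ? b->cap : 256;
        while (new_cap < b->len + n + 1) {
            new_cap *= 2;
        }
        char *p = realloc(b->data, new_cap);
        if (p == NULL) {
            return -1;
        }
        b->data = p;
        b->cap = new_cap;
    }
    memcpy(b->data + b->len, s, n);
    b->len += n;
    b->data[b->len] = '\0';
    return 0;
}

/* Give back the unused tail of the allocation before returning the string. */
void growbuf_trim_to_size(growbuf *b) {
    assert(b != NULL);
    if (b->data != NULL && b->cap > b->len + 1) {
        char *p = realloc(b->data, b->len + 1);
        if (p != NULL) {  /* on failure, keep the larger block */
            b->data = p;
            b->cap = b->len + 1;
        }
    }
}
```

A short string appended to an empty `growbuf` occupies the full initial capacity until `growbuf_trim_to_size` is called, after which the allocation is just large enough for the contents and the terminating NUL.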
Máté Kocsis	56137cd26e	Declare ext/mbstring constants in stubs (#8798)	2022-06-23 17:34:08 +02:00
Alex Dowad	880803a21e	Use fast conversion filters to implement php_mb_ord Even for single-character strings, this is about 50% faster for ASCII, UTF-8, and UTF-16. For long strings, the performance gain is enormous, since the old code would convert the ENTIRE string, just to pick out the first codepoint.	2022-06-12 15:24:41 +02:00
Alex Dowad	9468fa7ff2	mbfl_strlen does not need to use old conversion filters any more	2022-06-12 15:24:41 +02:00
Alex Dowad	950a7db9fe	Use fast text conversion filters to implement mb_check_encoding Benchmarking reveals that this is about 8% slower for UTF-8 strings which have a bad codepoint at the very beginning of the string. For good strings, or those where the first bad codepoint is much later in the string, it is significantly faster (2-3 times faster in many cases).	2022-06-12 15:24:41 +02:00
Alex Dowad	8533fccd63	Assert minimum size of wchar buffer in text conversion filters In all text conversion filters which require the wchar buffer used for output to have some minimum size, it's better to include an assertion; this will help us to catch bugs, and will also help future readers to understand what we expect of the function arguments. For UTF-7 and UTF7-IMAP, these assertions were already there, but I have added comments explaining why the minimum size is what it is.	2022-06-12 15:24:40 +02:00
Alex Dowad	871e61f942	Fully use available buffer space where converting Base64 I didn't think this through carefully enough when first writing this code, but it's not necessary to reserve space for the 1-2 wchars which may be emitted before exiting the function. Why? Well, we are guaranteed that when we enter the function, there are at least 3 spaces in the wchar buffer. The only way those can be consumed is if wchars are emitted in the main 'while' loop, but if it does emit any wchars, it will set 'bits' to zero at the same time, which means the final part will not emit anything. 'bits' can be incremented again by the main loop, but the main loop only runs while there are still at least 3 spaces in the buffer. So basically, we are guaranteed that when the main loop terminates, either there are 3 or more spaces remaining in the wchar buffer, or else 'bits' is zero, or both.	2022-06-12 15:24:40 +02:00
Alex Dowad	13479ee2bd	Restore backwards-compatible mappings for 0x5C/0x7E in SJIS (for fast conversion filter) In `d62f535caa`, the legacy mbstring conversion filters for Shift-JIS were updated to restore backwards-compatible mappings for 0x5C/0x7E. Make the same change to the newer fast conversion filters.	2022-06-11 17:09:16 +02:00
Christoph M. Becker	85a95a2982	Merge branch 'PHP-8.1' * PHP-8.1: Restore backwards-compatible mappings of 0x5C and 0x7E in SJIS	2022-06-11 16:32:33 +02:00
Alex Dowad	d62f535caa	Restore backwards-compatible mappings of 0x5C and 0x7E in SJIS According to the relevant Japan Industrial Standards Committee standards, SJIS 0x5C is a Yen sign, and 0x7E is an overline. However, this conflicts with the implementation of SJIS in various legacy software (notably Microsoft products), where SJIS 0x5C and 0x7E are taken as equivalent to the same ASCII bytes. Prior to PHP 8.1, mbstring's implementation of SJIS handled these bytes compatibly with Microsoft products. This was changed in PHP 8.1.0, in an attempt to comply with the JISC specifications. However, after discussion with various concerned Japanese developers, it seems that the historical behavior was more useful in the majority of applications which process SJIS-encoded text. Since we are now treating SJIS 0x5C as equivalent to U+005C and 0x7E as equivalent to U+007E, it does not make sense to convert U+203E (OVERLINE) to 0x7E, nor does it make sense to convert U+00A5 (YEN SIGN) to 0x5C. Restore the mappings for those codepoints from before PHP 8.1.0. Fixes GH-8281.	2022-06-11 16:31:47 +02:00


2293 Commits