archived-php-src

mirror of https://github.com/php/php-src.git synced 2026-04-29 11:13:36 +02:00

Author	SHA1	Message	Date
pakutoma	b721d0f71e	Fix GH-10648: add check function pointer into mbfl_encoding Previously, mbstring used the same logic for encoding validation as for encoding conversion. However, there are cases where we want to use different logic for validation and conversion. For example, if a string ends up with missing input required by the encoding, or if a character is input that is invalid as an encoding but can be converted, the conversion should succeed and the validation should fail. To achieve this, a function pointer mb_check_fn has been added to struct mbfl_encoding to implement the logic used for validation. Also, added implementation of validation logic for UTF-7, UTF7-IMAP, ISO-2022-JP and JIS. (The same change has already been made to PHP 8.2 and 8.3; see `6fc8d014df`. This commit is backporting the change to PHP 8.1.)	2023-03-25 09:52:10 +02:00
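The split between validation and conversion described in this commit can be illustrated outside mbstring. The following is a minimal Python sketch, not mbstring's C implementation: `convert` and `is_valid` are hypothetical helpers built on Python's own codecs, and UTF-8 stands in for the encodings the commit names.

```python
def convert(data: bytes, encoding: str) -> str:
    """Lenient conversion: malformed or truncated input becomes U+FFFD."""
    return data.decode(encoding, errors="replace")


def is_valid(data: bytes, encoding: str) -> bool:
    """Strict validation: any malformed or truncated input fails."""
    try:
        data.decode(encoding, errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# A string that ends with missing input: the last byte of a
# three-byte UTF-8 character has been cut off.
truncated = "あ".encode("utf-8")[:-1]

converted = convert(truncated, "utf-8")   # conversion still succeeds
valid = is_valid(truncated, "utf-8")      # validation fails
```

Keeping the two paths separate lets conversion stay forgiving while validation stays strict, which is the behaviour the commit message asks for.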
NathanFreeman	fa0401b0b5	Fix GH-9535 (unintended behavior change for mb_strcut in PHP 8.1) The existing implementation of mb_strcut extracts part of a multi-byte encoded string by pulling out raw bytes and then running them through a conversion filter to ensure that the output is valid in the requested encoding. If the conversion filter emits error markers when doing the final 'flush' operation which ends the conversion of the extracted bytes, these error markers may (in some cases) be included in the output. The conversion operation does not respect the value of mb_substitute_character; rather, it always uses '?' as an error marker. So this issue manifests itself as unwanted '?' characters being inserted into the output. This issue has existed for a long time, but became noticeable in PHP 8.1 because for at least some of the supported text encodings, mbstring is now more strict about emitting error markers when strings end in an illegal state. The simplest fix is to suppress error markers during the final flush operation. While working on a fix for this problem, another problem with mb_strcut was discovered; since it decides when to stop consuming bytes from the input by looking at the byte length of its OUTPUT, anything which causes extra bytes to be emitted to the output may cause mb_strcut to not consume all the bytes in the requested range. The one case where we DO emit extra output bytes is for encodings which have a selectable mode, like ISO-2022-JP; if a string in such an encoding ends in a mode which is not the default, we emit an ending escape sequence which changes back to the default mode. This is done so that concatenating strings in such encodings is safe. However, as mentioned, this can cause the output of mb_strcut to be shorter than it logically should be. This bug has existed for a long time, and fixing it now will be a BC break, so we may not fix it right away. 
Therefore, tests for THIS fix which don't pass because of that OTHER bug have been split out into a separate test file (gh9535b.phpt), and that file has been marked XFAIL.	2022-11-13 14:37:55 +02:00
Alex Dowad	371367ce3e	Reintroduce legacy 'SJIS-win' text encoding in mbstring In `e2459857af`, I combined mbstring's "SJIS-win" text encoding into CP932. This was done after doing some testing which appeared to show that the mappings for "SJIS-win" were the same as those for "CP932". Later, it was found that there was actually a small difference prior to `e2459857af` when converting Unicode to CP932. The mappings for the following two codepoints were different: U+203E was mapped to 0x7E by CP932 but to 0x81 0x50 by SJIS-win; U+00A5 was mapped to 0x5C by CP932 but to 0x81 0x8F by SJIS-win. As shown, mbstring's "CP932" mapped Unicode's 'OVERLINE' and 'YEN SIGN' to the ASCII bytes which have conflicting uses in most legacy Japanese text encodings. "SJIS-win" mapped these to equivalent JIS X 0208 fullwidth characters. Since `e2459857af` was not intended to cause any user-visible change in behavior, I am rolling back the merge of "CP932" and "SJIS-win". It seems doubtful whether these two text encodings should be kept separate or merged in a future release. An extensive discussion of the related historical background and compatibility issues involved can be found in this GitHub thread: https://github.com/php/php-src/issues/8308	2022-08-16 20:18:54 +02:00
Christoph M. Becker	5003831260	Merge branch 'PHP-8.0' into PHP-8.1 * PHP-8.0: Fix GH-8208: mb_encode_mimeheader: $indent functionality broken	2022-03-17 17:34:31 +01:00
Christoph M. Becker	d0417ebc93	Fix GH-8208: mb_encode_mimeheader: $indent functionality broken We also need to factor in the indent, when getting the encoder result. Closes GH-8213.	2022-03-17 17:31:58 +01:00
Christoph M. Becker	929d847152	Fix #81693: mb_check_encoding(7bit) segfaults `php_mb_check_encoding()` now uses conversion to `mbfl_encoding_wchar`. Since `mbfl_encoding_7bit` has no `input_filter`, no filter can be found. Since we don't actually need to convert to wchar, we encode to 8bit. Closes GH-7712.	2021-12-03 22:49:47 +01:00
Alex Dowad	28b346bc06	Improve detection accuracy of mb_detect_encoding Originally, `mb_detect_encoding` essentially just checked all candidate encodings to see which ones the input string was valid in. However, it was only able to do this for a limited few of all the text encodings which are officially supported by mbstring. In `3e7acf901d`, I modified it so it could 'detect' any text encoding supported by mbstring. While this is arguably an improvement, if the only text encodings one is interested in are those which `mb_detect_encoding` could originally handle, the old `mb_detect_encoding` may have been preferable. Because the new one has more possible encodings which it can guess, it also has more chances to get the answer wrong. This commit adjusts the detection heuristics to provide accurate detection in a wider variety of scenarios. While the previous detection code would frequently confuse UTF-32BE with UTF-32LE or UTF-16BE with UTF-16LE, the adjusted code is extremely accurate in those cases. Detection for Chinese text in Chinese encodings like GB18030 or BIG5 and for Japanese text in Japanese encodings like EUC-JP or SJIS is greatly improved. Detection of UTF-7 is also greatly improved. An 8KB table, with one bit for each codepoint from U+0000 up to U+FFFF, is used to achieve this. One significant constraint is that the heuristics are completely based on looking at each codepoint in a string in isolation, treating some codepoints as 'likely' and others as 'unlikely'. It might still be possible to achieve great gains in detection accuracy by looking at sequences of codepoints rather than individual codepoints. However, this might require huge tables. Further, we might need a huge corpus of text in various languages to derive those tables. Accuracy is still dismal when trying to distinguish single-byte encodings like ISO-8859-1, ISO-8859-2, KOI8-R, and so on. 
This is because the valid bytes in these encodings are basically all the same, and all valid bytes decode to 'likely' codepoints, so our method of detection (which is based on rating codepoints as likely or unlikely) cannot tell any difference between the candidates at all. It just selects the first encoding in the provided list of candidates. Speaking of which, if one wants to get good results from `mb_detect_encoding`, it is important to order the list of candidate encodings according to your prior belief of which are more likely to be correct. When the function cannot tell any difference between two candidates, it returns whichever appeared earlier in the array.	2021-10-19 18:05:51 +02:00
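The per-codepoint heuristic described in this commit can be sketched as follows. This is an illustrative Python sketch, not the mbstring code: the 'likely' ranges, the scoring rule, and the names `mark_likely`, `is_likely`, and `detect` are all assumptions made for the example. Only the table size (one bit per codepoint from U+0000 to U+FFFF, 8KB in total) and the tie-breaking rule (the earlier candidate wins) come from the commit message.

```python
TABLE_BYTES = 0x10000 // 8          # one bit per BMP codepoint = 8192 bytes
likely = bytearray(TABLE_BYTES)


def mark_likely(first: int, last: int) -> None:
    """Set the 'likely' bit for every codepoint in [first, last]."""
    for cp in range(first, last + 1):
        likely[cp >> 3] |= 1 << (cp & 7)


# Assumed 'likely' ranges, chosen only for this example.
for cp in (0x09, 0x0A, 0x0D):       # tab, line feed, carriage return
    mark_likely(cp, cp)
mark_likely(0x20, 0x7E)             # printable ASCII
mark_likely(0xA0, 0xFF)             # Latin-1 supplement


def is_likely(cp: int) -> bool:
    return cp <= 0xFFFF and bool(likely[cp >> 3] & (1 << (cp & 7)))


def detect(data: bytes, candidates: list):
    """Return the valid candidate with the fewest 'unlikely' codepoints."""
    best, best_score = None, None
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue                # input is not valid in this encoding
        score = sum(1 for ch in text if not is_likely(ord(ch)))
        # Strict '<' means a tie keeps the earlier candidate.
        if best_score is None or score < best_score:
            best, best_score = enc, score
    return best
```

With these assumed ranges, `detect("hello".encode("utf_16_le"), ["utf_16_be", "utf_16_le"])` picks `utf_16_le`, because the big-endian reading yields only 'unlikely' codepoints. For `b"caf\xe9"` with candidates `["iso8859_2", "latin_1"]`, both readings score the same, so the first candidate is returned. That is why the order of the candidate list matters.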
Alex Dowad	c25a1ef8d0	Bug #81390: mb_detect_encoding should not prematurely stop processing input As a performance optimization, mb_detect_encoding tries to stop processing the input string early when there is only one 'candidate' encoding which the input string is valid in. However, the code which keeps count of how many candidate encodings have already been rejected was buggy. This caused mb_detect_encoding to prematurely stop processing the input when it should have continued. As a result, it did not notice that in the test case provided by Alec, the input string was not valid in UTF-16.	2021-09-20 11:21:39 +02:00
Alex Dowad	6acd4f7f3a	Optimize text encoding detection for speed (eliminate Unicode property lookups) ...By just testing the input codepoints if they are within a few fixed ranges instead. This avoids hash lookups in property tables. From (micro-)benchmarking on my PC, this looks to be a bit less than 4x faster than the existing code.	2021-09-20 11:20:53 +02:00
Colin O'Dell	fe36b81d5e	Update Unicode tables to 14.0.0 Closes GH-7502.	2021-09-20 09:58:20 +02:00
Alex Dowad	f303fc8a9b	Use bool in mbfl_filt_conv_output_hex (rather than int)	2021-08-31 13:41:34 +02:00
Alex Dowad	776296e12f	mbstring no longer provides 'long' substitutions for erroneous input bytes Previously, mbstring had a special mode whereby it would convert erroneous input byte sequences to output like "BAD+XXXX", where "XXXX" would be the erroneous bytes expressed in hexadecimal. This mode could be enabled by calling `mb_substitute_character("long")`. However, accurately reproducing input byte sequences from the cached state of a conversion filter is often tricky, and this significantly complicates the implementation. Further, the means used for passing the erroneous bytes through to where the "BAD+XXXX" text is generated only allows for up to 3 bytes to be passed, meaning that some erroneous byte sequences are truncated anyway. More to the point, a search of publicly available PHP code indicates that nobody is really using this feature anyway. Incidentally, this feature also provided error output like "JIS+XXXX" if the input 'should have' represented a JISX 0208 codepoint, but it decodes to a codepoint which does not exist in the JISX 0208 charset. Similarly, specific error output was provided for non-existent JISX 0212 codepoints, and likewise for JISX 0213, CP932, and a few other charsets. All of that is now consigned to the flames. However, "long" error markers also include a somewhat more useful "U+XXXX" marker for Unicode codepoints which were successfully decoded from the input text, but cannot be represented in the output encoding. Those are still supported. With this change, there is no need to use a variety of special values in the high bits of a wchar to represent different types of error values. We can (and will) just use a single error value. This will be equal to -1. One complicating factor: Text conversion functions return an integer to indicate whether the conversion operation should be immediately aborted, and the magic 'abort' marker is -1. Also, almost all of these functions would return the received byte/codepoint to indicate success.
That doesn't work with the new error value; if an input filter detects an error and passes -1 to the output filter, and the output filter returns it back, that would be taken to mean 'abort'. Therefore, amend all these functions to return 0 for success.	2021-08-31 13:41:34 +02:00
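The clash between the error value and the 'abort' marker can be shown with a toy filter chain. This is a hedged Python sketch, not the C code in mbstring: `old_output`, `new_output`, and `feed` are hypothetical names, and only the values -1 (error and abort) and 0 (success) are taken from the commit message.

```python
ERROR = -1   # the single error value passed between filters
ABORT = -1   # return value meaning "stop converting immediately"


def old_output(c: int, out: list) -> int:
    """Old convention: return the received value to signal success."""
    out.append(c)
    return c


def new_output(c: int, out: list) -> int:
    """New convention: return 0 to signal success."""
    out.append(c)
    return 0


def feed(codepoints, output_fn):
    """Push codepoints through output_fn; stop if it reports ABORT."""
    out = []
    for c in codepoints:
        if output_fn(c, out) == ABORT:
            return out, True      # conversion was aborted
    return out, False


stream = [0x41, ERROR, 0x42]      # 'A', an error marker, 'B'
old_result = feed(stream, old_output)   # aborts at the error marker
new_result = feed(stream, new_output)   # processes the whole stream
```

Under the old convention the error marker is echoed back as -1 and is mistaken for 'abort', so 'B' is never processed. Returning 0 for success removes the ambiguity.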
Alex Dowad	97b7fc893c	Output illegal character marker for 4-byte illegal characters > 0x7FFFFFFF Some text encodings supported by mbstring (such as UCS-4) accept 4-byte characters. When mbstring encounters an illegal byte sequence for the encoding it is using, it should emit an 'illegal character' marker, which can either be a single character like '?', an HTML hexadecimal entity, or a marker string like 'BAD+XXXX'. Because of the use of signed integers to hold 4-byte characters, illegal 4-byte sequences with a 'negative' value (one with the high bit set) were not handled correctly when emitting the illegal char marker. The result is that such illegal sequences were just skipped over (and the marker was not emitted to the output). Fix that.	2021-08-30 16:29:58 +02:00
Nikita Popov	634f2e21d3	Don't expose wchar encoding to users (#7415) The "wchar" encoding isn't really an encoding -- it's what we internally use as the representation of decoded characters. In practice, it tends to behave a lot like the 8bit encoding when used from userland, because input code units end up being treated as code points. This patch removes the wchar encoding from the public encoding list and reserves it for internal use only.	2021-08-30 11:11:33 +02:00
Nikita Popov	43cb2548f7	Flush filter during non-strict encoding detection If we reach the end of the string without reducing to a single encoding, then we should flush to check whether the last character is incomplete.	2021-08-27 14:48:32 +02:00
Nikita Popov	a1c1ee6a48	Don't use opaque for encoding detection score opaque is used by the htmlentities filter, which means that we end up trying to free the score value as a pointer. Don't try to be overly tricky here and simply allocate a separate structure to hold the number of illegal characters and the score.	2021-07-28 10:54:27 +02:00
Nikita Popov	9d0db2e98a	Fixed bug #81298 Creation of the filter may fail for some special encodings, for which detection is not supported.	2021-07-28 10:11:46 +02:00
Alex Dowad	26fc7c4256	Fix typo in mbfilter.h As pointed out by Bruno Haible (https://haible.de/bruno).	2021-07-19 12:17:00 +02:00
Alex Dowad	e2459857af	Remove duplicate implementation of CP932 from mbstring Sigh. Double sigh. After fruitlessly searching the Internet for information on this mysterious text encoding called "SJIS-open", I wrote a script to try converting every Unicode codepoint from 0-0xFFFF and compare the results from different variants of Shift-JIS, to see which one "SJIS-open" would be most similar to. The result? It's just CP932. There is no difference at all. So why do we have two implementations of CP932 in mbstring? In case somebody, somewhere is using "SJIS-open" (or its aliases "SJIS-win" or "SJIS-ms"), add these as aliases to CP932 so existing code will continue to work.	2021-06-17 13:12:40 +02:00
George Peter Banyard	c40231afbf	Mark various functions with void arguments. This fixes a bunch of [-Wstrict-prototypes] warnings, because in C func() and func(void) have different semantics.	2021-05-12 14:55:53 +01:00
Alex Dowad	319a340843	Simplify code for working with halfwidth/fullwidth kana conversion filter There's no need to dynamically allocate a struct to hold the 'mode' parameter; just store it directly in `filt->opaque`. Some other things were also being done in an unnecessarily roundabout way. Also, the 'copy' function for CP50220 conversion filters was both broken and unnecessary. Broken, because it malloc'd memory which was never freed by anything. Unnecessary, because the point of the copy is so that various algorithms can try running bytes through a conversion filter and see how many output bytes or characters result, and then back out by restoring the filters to their previous state. But here's the thing; CP50220 conversion filters don't hold cached bytes, which is the main thing which would need to be restored to a previous state.	2021-04-15 15:52:31 +02:00
Alex Dowad	a900ec3397	Remove unneeded 'filter_ctor' member from mbfl_convert_filter struct This function pointer is only called when initializing the struct. After that nothing is done with it. Therefore, there is no need to keep it in the struct.	2021-04-15 15:52:31 +02:00
Alex Dowad	d8c785b894	Update 'East Asian Width' table to comply with Unicode 13.0 Instead of manually maintaining the data in eaw_table.h, it is now automatically generated by ucgendat/ucgendat.php, using the EastAsianWidth.txt file from the Unicode Consortium. Something must be said about the deleted test case. Back in 2004, someone noticed that `mb_strwidth` didn't comply with Unicode 4.0. A test case was added to expose the problem. Well, time keeps moving on, and with the changing years, new Unicodes are born and old Unicodes die. Some characters which were counted as double-width in Unicode 4.0 are no longer such in Unicode 13.0, which renders the test case obsolete. At the same time, make a couple of spelling/grammar fixes in ucgendat.php.	2021-01-19 20:38:44 +02:00
Alex Dowad	a06c20a17c	Remove useless constant MBFL_ENCTYPE_MBCS This flag indicated that an encoding was 'multi-byte'; it can use a variable number of bytes to encode each character. As it turns out, we don't actually need to check this flag anywhere, so it's better to remove it.	2021-01-15 21:55:41 +02:00
Alex Dowad	34ece40872	Remove useless mbstring encoding 'JIS-ms' Microsoft invented three encodings very similar to ISO-2022-JP/JIS7/JIS8, called CP50220, CP50221, and CP50222. All three are supported by mbstring. Since these encodings are very similar, some code can be shared. Actually, conversion of CP50220/1/2 to Unicode is exactly the same operation; it's when converting from Unicode to CP50220/1/2 that some small differences arise in how certain katakana are handled. The most important common code was a function called `mbfl_filt_wchar_jis_ms`. The `jis_ms` part doubtless refers to the fact that these encodings are modified versions of 'JIS' invented by 'MS'. mbstring also went a step further and exported 'JIS-ms' to userland as a separate encoding from CP50220/1/2. If users requested 'JIS-ms' conversion, they got something like CP50220/1/2, minus their special ways of handling half-width katakana when converting from Unicode. But... that 'encoding' is not something which actually exists in the world outside of mbstring. CP50220/1/2 do exist in Microsoft software, but not 'JIS-ms'. For a text encoding conversion library, inventing new variant encodings and implementing them is not very productive. Our interest is in handling text encodings which real people actually use for... you know, storing actual text and things like that.	2021-01-15 21:55:41 +02:00
Alex Dowad	fcbe45de10	Remove useless mbstring encoding 'CP50220-raw' CP50220 is a variant of ISO-2022-JP invented by Microsoft, which handles some Unicode characters which are not representable in ISO-2022-JP by converting them to similar characters which are representable. What, then, is CP50220-raw? An Internet search turns up absolutely nothing. Reference works which I consulted don't say anything about it. Other text conversion libraries don't support it. From looking at the code: It's just the same as CP50220, but it accepts unmapped JIS X 0208 characters passed through from other Japanese encodings and silently encodes them using the usual ISO-2022-JP escape sequence and representation for JIS X 0208 characters. It's hard to see how this could be useful. OK, let me come out and say it: it's _not_ useful. We can confidently jettison this (mis)feature.	2021-01-15 21:55:41 +02:00
Alex Dowad	bbbadae0ae	Combine MBFL_ENCTYPE_MWC2{BE,LE} constants These constants indicate that a text encoding uses 2+ bytes for each character, and is either big endian or little endian (respectively). But nothing in mbstring cares about the difference between MBFL_ENCTYPE_MWC2BE and MBFL_ENCTYPE_MWC2LE. (Actually, nothing cares about whether these flags are set at all... maybe we should just remove them?)	2020-11-25 19:52:19 +02:00
Alex Dowad	72660c416a	Combine MBFL_ENCTYPE_WCS{2,4}{BE,LE} constants These flags identify text encodings in mbstring which use a constant number of bytes per character. While some parts of the code do use these flags, usually to detect cases which can be optimized due to constant-width encoding, nothing cares whether the encodings are 'LE' (little-endian) or 'BE' (big-endian). So we can simplify things by combining constants.	2020-11-25 19:52:19 +02:00
Alex Dowad	e169ad3b61	Consolidate all single-byte encodings in one source file We can squeeze out a lot of duplicated code in this way.	2020-11-11 11:18:59 +02:00
Alex Dowad	b05ad5112a	Don't redundantly flush mbstring filters multiple times Each flush function in a chain of mbstring conversion filters always calls the next flush function in the chain. So it is not necessary to explicitly flush the second filter in a chain. (Due to this bug, in many cases, flush functions were actually being called three times.)	2020-11-11 11:18:58 +02:00
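The redundant flushing this commit removes can be reproduced with a two-filter chain. The Python sketch below is only a model of the behaviour described: the `Filter` class and both helper functions are hypothetical, and the real filters are C structs with flush function pointers.

```python
class Filter:
    """Toy conversion filter whose flush always calls the next filter's flush."""

    def __init__(self, name, next_filter=None):
        self.name = name
        self.next = next_filter
        self.flush_count = 0

    def flush(self):
        self.flush_count += 1
        if self.next is not None:
            self.next.flush()     # flushing propagates down the chain


def flush_redundantly(first, second):
    """Old pattern: flush both filters explicitly."""
    first.flush()
    second.flush()


def flush_once(first):
    """Fixed pattern: flushing the head of the chain is enough."""
    first.flush()


old_second = Filter("second")
flush_redundantly(Filter("first", old_second), old_second)

new_second = Filter("second")
flush_once(Filter("first", new_second))
```

In this model the second filter is flushed twice under the old pattern and once under the fixed one.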
Alex Dowad	3e7acf901d	Remove mbstring identify filters mbstring had an 'identify filter' for almost every supported text encoding which was used when auto-detecting the most likely encoding for a string. It would run over the string and set a 'flag' if it saw anything which did not appear likely to be the encoding in question. One problem with this scheme was that encodings which merely appeared less likely to be the correct one were completely rejected, even if there was no better candidate. Another problem was that the 'identify filters' had a huge amount of code duplication with the 'conversion filters'. Eliminate the identify filters. Instead, when auto-detecting text encoding, use conversion filters to see whether the input string is valid in candidate encodings or not. At the same time, watch the type of codepoints which the string decodes to and mark it as less likely if non-printable characters (ESC, form feed, bell, etc.) or 'private use area' codepoints are seen. Interestingly, one old test case in which JIS text was misidentified as UTF-8 (and this wrong behavior was enshrined in the test) was 'fixed' and the JIS string is now auto-detected as JIS.	2020-11-09 13:45:17 +02:00
Alex Dowad	cc03c54c36	Remove useless byte{2,4}{be,le} encodings from mbstring There is no meaningful difference between these and UCS-{2,4}. They are just a little bit more lax about passing errors silently. They also have no known use. Alias to UCS-{2,4} in case someone, somewhere is using them.	2020-11-09 13:45:16 +02:00
Alex Dowad	9f5a4b3bd9	Fix mbstring support for ARMSCII-8 - Identify filter was completely wrong. - Respect `mb_substitute_character` rather than converting invalid bytes to Unicode 0xFFFD (generic replacement character). - Don't convert Unicode 0xFFFD to a valid ARMSCII-8 character. - When converting ARMSCII-8 to ARMSCII-8, don't pass invalid bytes through silently.	2020-11-02 21:31:06 +02:00
Alex Dowad	e81458862b	Remove dead code from mbfilter_koi8u.c (and do general code cleanup)	2020-11-02 21:31:06 +02:00
Alex Dowad	fde7794556	Remove dead code from mbfilter_iso8859_{2,4,5,9,10,13,14,15,16}.c ...Plus some dead code related to ISO-8859-1.	2020-11-02 21:31:06 +02:00
Alex Dowad	0a8ebb36a5	Remove dead code from mbfilter_koi8r.c	2020-11-02 21:31:06 +02:00
Alex Dowad	b6e75265d0	Remove dead code from mbfilter_cp850.c (and do general code cleanup) Since there are no invalid bytes in CP850, these `if` conditions will never be true.	2020-11-02 21:31:06 +02:00
Alex Dowad	20a404f765	Remove dead code from mbfilter_cp866.c (and do general code cleanup) Since there are no invalid bytes in CP866, these `if` conditions will never be true.	2020-11-02 21:31:06 +02:00
Alex Dowad	e6d17cfe44	Fix mbstring support for CP1254 encoding One funny thing: while the original author used Unicode 0xFFFD (generic replacement character) for invalid bytes in CP1251 and CP1252, for CP1254 they used 0xFFFE, which is not a valid Unicode codepoint at all, but is a reversed byte-order mark. Probably this was by mistake. Anyways, - Fixed identify filter, which was completely wrong. - Don't convert Unicode 0xFFFE to a random (but valid) CP1254 byte. - When converting CP1254 to CP1254, don't pass invalid bytes through silently.	2020-11-02 21:31:05 +02:00
Alex Dowad	44bd5804b0	Fix mbstring support for CP1251 encoding - Identify filter was as wrong as wrong can be. - Invalid CP1251 byte 0x98 was converted to Unicode 0xFFFD (generic replacement character), rather than respecting `mb_substitute_character`. - Unicode 0xFFFD was converted to some random CP1251 byte. - When converting CP1251 to CP1251, don't pass invalid bytes through silently.	2020-11-02 21:31:05 +02:00
Alex Dowad	7047e5d2c4	Add identify filter for UTF-32{,BE,LE}	2020-10-27 10:19:01 +02:00
Alex Dowad	d8895cd054	Improve error handling for UTF-16{,BE,LE} Catch various errors such as the first part of a surrogate pair not being followed by a proper second part, the first part of a surrogate pair appearing at the end of a string, the second part of a surrogate pair appearing out of place, and so on.	2020-10-27 10:19:01 +02:00
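The surrogate errors listed in this commit can be checked with a short scan over 16-bit code units. This is a Python sketch for UTF-16BE only, written for illustration: the function name `utf16be_errors` and the error strings are hypothetical, and the mbstring filters report errors through their own mechanism.

```python
def utf16be_errors(data: bytes) -> list:
    """Return a description of each surrogate error found in UTF-16BE data."""
    errors = []
    if len(data) % 2:
        errors.append("odd byte length")
        data = data[:-1]          # ignore the dangling byte for the scan
    units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:                 # first part of a pair
            if i + 1 == len(units):
                errors.append("high surrogate at end of input")
            elif 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 1                            # valid pair, skip second part
            else:
                errors.append("high surrogate not followed by low surrogate")
        elif 0xDC00 <= u <= 0xDFFF:               # second part with no first part
            errors.append("low surrogate out of place")
        i += 1
    return errors
```

For example, `utf16be_errors(b"\xd8\x3d\xde\x00")` returns an empty list because the two units form a valid pair, while `utf16be_errors(b"\xd8\x3d")` reports a high surrogate at the end of the input.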
Alex Dowad	7b9bed0150	Add identify filter for ISO-8859-16 (Latin-10) encoding Interestingly, it looks like the original author intended to add an identify filter for this encoding, but never did so. The needed struct is there, but was never added to the list of identify filters in mbfl_ident.c.	2020-10-16 20:56:45 +02:00
Alex Dowad	648c1cb51e	Add identify filter for UCS-2, UCS-2BE, and UCS-2LE encodings	2020-10-13 20:26:14 +02:00
Alex Dowad	374f31e364	Add mbstring identify filter for 'binary' encoding	2020-10-13 20:26:13 +02:00
Alex Dowad	97beecc251	Add identify filter for UTF-16, UTF-16LE, UTF-16BE There was one faulty test in the suite which only passed before because UTF-16 had no identify filter. After this was fixed, it exposed the problem with the test.	2020-10-13 20:26:13 +02:00
Alex Dowad	4aa7430f68	Add mbstring identify filter for '7bit' encoding	2020-10-13 06:12:38 +02:00
Alex Dowad	0ffc1f55b3	Refactor mbfl_ident.c, mbfl_encoding.c, mbfl_memory_device.c, mbfl_string.c - Make everything less gratuitously verbose - Don't litter the code with lots of unneeded NULL checks (for things which will never be NULL) - Don't return success/failure code from functions which can never fail - For encoding structs, don't use pointers to pointers to pointers for the list of alias strings. Pointers to pointers (2 levels of indirection) is what actually makes sense. This gets rid of some extraneous dereference operations.	2020-10-13 06:12:38 +02:00
Alex Dowad	3f1851dec2	Avoid compiler warnings related to mbstring flush functions	2020-10-13 06:12:37 +02:00
Remi Collet	b1c5532ad1	fix mbfl function prototypes re-add mbfl_convert_filter_feed API re-add pointer cast	2020-09-15 15:15:06 +02:00


195 Commits