archived-php-src

mirror of https://github.com/php/php-src.git synced 2026-04-05 07:02:33 +02:00

Author	SHA1	Message	Date
Dmitry Stogov	1a4f49f1fe	Use cheaper memchr() instead of php_memnstr()	2021-11-10 10:19:49 +03:00
Alex Dowad	9308974f8c	Deprecate use of mbstring to convert text to Base64/QPrint/HTML entities/etc The purpose of mbstring is for working with Unicode and legacy text encodings; but Base64, QPrint, etc. are not text encodings and don't really belong in mbstring. PHP already contains separate implementations of Base64, QPrint, and HTML entities. It will be better to eventually remove these non-encodings from mbstring. Regarding HTML entities... there is a bit more to say. mbstring's implementation of HTML entities is different from the other built-in implementation (htmlspecialchars and htmlentities). Those functions convert <, >, and & to HTML entities, but mbstring does not. It appears that the original author of mbstring intended for something to be done with <, >, and &. He used a table to identify which characters should be converted to HTML entities, and </>/& all have a special value in that table. However, nothing ever checks for that special value, so the characters are passed through unconverted. This seems like a very useless implementation of HTML entities. The most important characters which need to be expressed as entities in HTML documents are those three!	2021-11-01 11:23:21 +02:00
Christoph M. Becker	7c75c61206	Merge branch 'PHP-8.1' * PHP-8.1: Fix #76167: mbstring may use pointer from some previous request	2021-10-25 12:41:46 +02:00
Christoph M. Becker	7fcf17c41e	Merge branch 'PHP-8.0' into PHP-8.1 * PHP-8.0: Fix #76167: mbstring may use pointer from some previous request	2021-10-25 12:41:21 +02:00
Christoph M. Becker	6e6a8443a8	Merge branch 'PHP-7.4' into PHP-8.0 * PHP-7.4: Fix #76167: mbstring may use pointer from some previous request	2021-10-25 12:39:57 +02:00
Christoph M. Becker	d3d6d7906e	Fix #76167: mbstring may use pointer from some previous request We must not reuse per-request memory across multiple requests, so this check triggered during RINIT makes no sense. As explained in the bug report[1], it can be even harmful, if some request startup fails, and the pointers refer to already freed memory in the next request. [1] <https://bugs.php.net/76167> Closes GH-7604.	2021-10-25 12:37:28 +02:00
Alex Dowad	9962aa9774	Merge branch 'PHP-8.1' * PHP-8.1: mb_detect_encoding will not return non-encodings Improve detection accuracy of mb_detect_encoding	2021-10-19 18:11:35 +02:00
Alex Dowad	a2bc57e0e5	mb_detect_encoding will not return non-encodings Among the text encodings supported by mbstring are several which are not really 'text encodings'. These include Base64, QPrint, UUencode, HTML entities, '7 bit', and '8 bit'. Rather than providing an explicit list of text encodings which they are interested in, users may pass the output of mb_list_encodings to mb_detect_encoding. Since Base64, QPrint, and so on are included in the output of mb_list_encodings, mb_detect_encoding can return one of these as its 'detected encoding' (and in fact, this often happens). Before mb_detect_encoding was enhanced so it could detect any of the supported text encodings, this did not happen, and it is never desired.	2021-10-19 18:05:52 +02:00
Alex Dowad	28b346bc06	Improve detection accuracy of mb_detect_encoding Originally, `mb_detect_encoding` essentially just checked all candidate encodings to see which ones the input string was valid in. However, it was only able to do this for a limited few of all the text encodings which are officially supported by mbstring. In `3e7acf901d`, I modified it so it could 'detect' any text encoding supported by mbstring. While this is arguably an improvement, if the only text encodings one is interested in are those which `mb_detect_encoding` could originally handle, the old `mb_detect_encoding` may have been preferable. Because the new one has more possible encodings which it can guess, it also has more chances to get the answer wrong. This commit adjusts the detection heuristics to provide accurate detection in a wider variety of scenarios. While the previous detection code would frequently confuse UTF-32BE with UTF-32LE or UTF-16BE with UTF-16LE, the adjusted code is extremely accurate in those cases. Detection for Chinese text in Chinese encodings like GB18030 or BIG5 and for Japanese text in Japanese encodings like EUC-JP or SJIS is greatly improved. Detection of UTF-7 is also greatly improved. An 8KB table, with one bit for each codepoint from U+0000 up to U+FFFF, is used to achieve this. One significant constraint is that the heuristics are completely based on looking at each codepoint in a string in isolation, treating some codepoints as 'likely' and others as 'unlikely'. It might still be possible to achieve great gains in detection accuracy by looking at sequences of codepoints rather than individual codepoints. However, this might require huge tables. Further, we might need a huge corpus of text in various languages to derive those tables. Accuracy is still dismal when trying to distinguish single-byte encodings like ISO-8859-1, ISO-8859-2, KOI8-R, and so on. This is because the valid bytes in these encodings are basically all the same, and all valid bytes decode to 'likely' codepoints, so our method of detection (which is based on rating codepoints as likely or unlikely) cannot tell any difference between the candidates at all. It just selects the first encoding in the provided list of candidates. Speaking of which, if one wants to get good results from `mb_detect_encoding`, it is important to order the list of candidate encodings according to your prior belief of which are more likely to be correct. When the function cannot tell any difference between two candidates, it returns whichever appeared earlier in the array.	2021-10-19 18:05:51 +02:00
Alex Dowad	dcaa010fff	Strict validation of conversion flags to mb_convert_kana mb_convert_kana is controlled by user-provided flags, which specify what it should convert and to what. These flags come in inverse pairs, for example "fullwidth numerals to halfwidth numerals" and "halfwidth numerals to fullwidth numerals". It does not make sense to combine inverse flags. But, clever reader of commit logs, you will surely say: What if I want all my halfwidth numerals to become fullwidth, and all my fullwidth numerals to become halfwidth? Much too clever, you are! Let's put aside the fact that this bizarre switch-up is ridiculous and will never be used, and face up to another stark reality: mb_convert_kana does not work for that case, and never has. This was probably never noticed because nobody ever tried. Disallowing useless combinations of flags gives freedom to rearrange the kana conversion code without changing behavior. We can also reject unrecognized flags. This may help users to catch bugs. Interestingly, the existing tests used a 'Z' flag, which is useless (it's not recognized at all).	2021-10-01 19:27:39 +02:00
Alex Dowad	7800491289	Inline SKIP_LONG_HEADER... macro which is only used once I don't find that pulling this code out into a macro makes anything clearer. Not at all.	2021-09-29 18:19:01 +02:00
Alex Dowad	0b32a15eb0	Optimize mb_str{,im}width for performance Rather than doing a linear search of a table of fullwidth codepoint ranges for every input character, 1) Short-cut the search if the codepoint is below the first such range 2) Otherwise, do a binary (rather than linear) search	2021-09-29 18:19:01 +02:00
Alex Dowad	f4365d2c26	Remove unused typedef 'mbfl_encoding_id'	2021-09-29 18:19:01 +02:00
Alex Dowad	3bf431969e	Don't check for impossible error condition in mb_substr_count	2021-09-29 18:19:01 +02:00
Alex Dowad	8c32deb605	Don't check for impossible error condition in mb_strwidth	2021-09-29 18:19:01 +02:00
Alex Dowad	bf78070cbe	Don't check for impossible error condition in mb_strlen	2021-09-29 18:19:01 +02:00
Alex Dowad	d3f56e5ac9	Rename php_mb_mbchar_bytes_ex to php_mb_mbchar_bytes ...And remove the original php_mb_mbchar_bytes, which was not being used.	2021-09-29 18:19:01 +02:00
Alex Dowad	774cd960ab	No need to null-terminate buffer in php_mb_chr `mbfl_buffer_converter_feed_result` will not overrun the specified length.	2021-09-29 18:19:01 +02:00
Alex Dowad	abf83e5079	Rename php_mb_safe_strrchr_ex to php_mb_safe_strrchr ...And remove the original php_mb_safe_strrchr, which was not being used anywhere.	2021-09-29 18:19:01 +02:00
Nikita Popov	c37b35fa41	Merge branch 'PHP-8.1' * PHP-8.1: Use locale-independent case conversion in mb_send_mail()	2021-09-23 17:21:14 +02:00
Nikita Popov	46315defc7	Use locale-independent case conversion in mb_send_mail() Headers should not be processed in a locale-dependent fashion. Switch from upper to lowercasing because that's the standard for PHP and we provide an ASCII implementation of this operation. This is adapted from GH-7506.	2021-09-23 17:20:54 +02:00
Alex Dowad	4e51810f9b	Optimize mbstring upper/lowercasing: use fast path in more cases The 'fast path' in the uppercase/lowercase functions for Unicode text can be used for a slightly greater range of characters. This is not expected to have a big impact on performance, since the number of characters which will use the 'fast path' is only increased by about 50-60, and these are not very commonly used characters... but still, it doesn't cost anything.	2021-09-20 11:27:54 +02:00
Alex Dowad	36c979e2b6	Use stack-allocated buffer in php_mb_chr	2021-09-20 11:27:54 +02:00
Alex Dowad	07c4b3b8c0	Simplify code for handling mbstring language aliases Rather than using pointers to pointers to pointers (3 levels of indirection), what makes sense is two levels. This reduces unnecessary pointer dereference operations.	2021-09-20 11:27:54 +02:00
Alex Dowad	2f096c4039	Remove useless constant MBFL_ENCTYPE_MWC2	2021-09-20 11:27:54 +02:00
Alex Dowad	1170981b33	Fix mb_str_split on empty strings in variable-length text encodings Previously, when passed an empty string, and given an encoding which uses a variable number of bytes per character (and which doesn't have a 'character length table'), mb_str_split would return an array containing a single empty string, rather than an empty array. The ISO-2022 encodings are among those which were affected by this bug.	2021-09-20 11:27:54 +02:00
Alex Dowad	57eafd44c6	Add more tests for mb_decode_numericentity	2021-09-20 11:27:54 +02:00
Alex Dowad	be11d95170	Add more tests for mb_encode_numericentity	2021-09-20 11:27:54 +02:00
Alex Dowad	68176fdfb1	Use char literals in HTML numeric entity {en,de}coding functions	2021-09-20 11:27:54 +02:00
Alex Dowad	1c905434b9	Add more tests for mb_substr	2021-09-20 11:27:54 +02:00
Alex Dowad	f663344f33	Merge branch 'PHP-8.1' * PHP-8.1: Bug #81390: mb_detect_encoding should not prematurely stop processing input mb_detect_encoding with only one candidate encoding uses mb_check_encoding Optimize text encoding detection for speed (eliminate Unicode property lookups)	2021-09-20 11:27:07 +02:00
Alex Dowad	c25a1ef8d0	Bug #81390: mb_detect_encoding should not prematurely stop processing input As a performance optimization, mb_detect_encoding tries to stop processing the input string early when there is only one 'candidate' encoding which the input string is valid in. However, the code which keeps count of how many candidate encodings have already been rejected was buggy. This caused mb_detect_encoding to prematurely stop processing the input when it should have continued. As a result, it did not notice that in the test case provided by Alec, the input string was not valid in UTF-16.	2021-09-20 11:21:39 +02:00
Alex Dowad	ca33ab59ad	mb_detect_encoding with only one candidate encoding uses mb_check_encoding ...Because it's about 5% faster.	2021-09-20 11:20:53 +02:00
Alex Dowad	6acd4f7f3a	Optimize text encoding detection for speed (eliminate Unicode property lookups) ...By just testing the input codepoints if they are within a few fixed ranges instead. This avoids hash lookups in property tables. From (micro-)benchmarking on my PC, this looks to be a bit less than 4x faster than the existing code.	2021-09-20 11:20:53 +02:00
Nikita Popov	e740907ec9	Merge branch 'PHP-8.1' * PHP-8.1: Update Unicode tables to 14.0.0	2021-09-20 09:58:32 +02:00
Colin O'Dell	fe36b81d5e	Update Unicode tables to 14.0.0 Closes GH-7502.	2021-09-20 09:58:20 +02:00
Alex Dowad	86a0d4b22d	Add more tests for mb_convert_kana	2021-09-06 13:16:23 +02:00
Alex Dowad	92fb3de9d7	Remove unused MBFL_FILT_TL_*_MASK constants Sending more unused, unneeded, unwanted, unrequired, unloved and uncalled-for code where it belongs.	2021-09-06 13:16:23 +02:00
Alex Dowad	9e1447dbf3	Rename KANA2HIRA and HIRA2KANA constants (for mb_convert_kana) mb_convert_kana is able to convert fullwidth katakana to fullwidth hiragana (and vice versa). The constants referring to these modes had names like MBFL_FILT_TL_ZEN2HAN_KANA2HIRA. The "ZEN2HAN" part of the name is misleading, since these modes do not convert fullwidth (zenkaku) kana to halfwidth (hankaku). The converted characters are fullwidth both before and after the conversion. So... let's name the constants accordingly.	2021-09-06 13:16:23 +02:00
Alex Dowad	c8e65c9d74	Remove COMPAT2 conversion modes for mb_convert_kana mb_convert_kana has conversion modes selected using 'M'/'m', which convert a few various punctuation and symbol characters between 'ordinary' and full-width forms. The constants which refer to these modes have names ending with COMPAT1. Internally, there are similar conversion modes with names ending in COMPAT2. They are like COMPAT1 modes, but they operate on a smaller set of characters. But... that is all just dead code, because there is no way for user code to select the COMPAT2 modes. I have no idea what the original author intended those COMPAT2 modes to actually be used for. Guess it doesn't really matter, anyways. At this point, it's just more food for the flames.	2021-09-06 13:16:23 +02:00
Alex Dowad	d7eb442993	Add more tests for ISO-2022-JP-2004 text conversion	2021-09-06 13:16:23 +02:00
Alex Dowad	907d0c3248	Add more tests for UTF7-IMAP text conversion	2021-09-06 13:16:23 +02:00
Alex Dowad	bf940a13ff	Add another test for SJIS-Mobile text conversion	2021-09-06 13:16:23 +02:00
Alex Dowad	32df61c558	Add more tests for UTF-7 text conversion	2021-09-06 13:16:23 +02:00
Alex Dowad	ae71bfdee7	Add more tests for UCS-4 text conversion	2021-09-06 13:16:23 +02:00
Alex Dowad	fd0e0c7390	Add another test for UCS-2 text conversion	2021-09-06 13:16:23 +02:00
Alex Dowad	edf2bd95d9	Add more tests for ISO-2022-JP and JIS7/8 text conversion	2021-09-06 13:16:23 +02:00
Alex Dowad	6a2dca3420	Add more tests for ISO-2022-JP-KDDI text conversion	2021-09-06 13:16:23 +02:00
Alex Dowad	d2f5a8b328	Add more tests for SJIS-mac text conversion	2021-09-06 13:16:23 +02:00
Alex Dowad	0957f54eb1	Treat truncated escape sequences for CP5022{0,1,2} as error	2021-09-06 13:16:23 +02:00
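
The lookup strategy described in commit 0b32a15eb0 ("Optimize mb_str{,im}width for performance") can be sketched roughly as follows: short-cut when the codepoint is below the first fullwidth range, otherwise binary-search a sorted range table. This is a minimal illustration, not the code in php-src. The names `cp_range`, `wide_ranges`, and `is_wide` are hypothetical, and the table is a small illustrative subset rather than the table mbstring actually ships.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical range type: inclusive [first, last] codepoint interval. */
typedef struct {
    uint32_t first;
    uint32_t last;
} cp_range;

/* Illustrative subset only (assumption): sorted, non-overlapping ranges.
 * The real table in mbstring is larger and generated from Unicode data. */
static const cp_range wide_ranges[] = {
    { 0x1100, 0x115F }, /* Hangul Jamo initial consonants */
    { 0x3000, 0x303E }, /* CJK symbols and punctuation */
    { 0x3041, 0x33FF }, /* Hiragana, Katakana, and neighbours */
    { 0x4E00, 0x9FFF }, /* CJK Unified Ideographs */
    { 0xAC00, 0xD7A3 }, /* Hangul syllables */
    { 0xFF01, 0xFF60 }, /* Fullwidth ASCII variants */
};

#define WIDE_RANGE_COUNT (sizeof(wide_ranges) / sizeof(wide_ranges[0]))

/* Returns true if cp falls inside one of the ranges above. */
bool is_wide(uint32_t cp)
{
    size_t lo = 0;
    size_t hi = WIDE_RANGE_COUNT; /* search window is [lo, hi) */

    /* Step 1: short-cut. Anything below the first range (which covers
     * all of ASCII and Latin-1) cannot match, so skip the search. */
    if (cp < wide_ranges[0].first) {
        return false;
    }

    /* Step 2: binary search instead of a linear scan. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < wide_ranges[mid].first) {
            hi = mid;       /* cp lies before this range */
        } else if (cp > wide_ranges[mid].last) {
            lo = mid + 1;   /* cp lies after this range */
        } else {
            return true;    /* first <= cp <= last */
        }
    }
    return false;
}
```

A width function built on this would presumably add 2 for codepoints where `is_wide` returns true and 1 otherwise. The short-cut matters most for predominantly ASCII input, where every character exits before the search loop starts.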


2172 Commits