archived-php-src

mirror of https://github.com/php/php-src.git synced 2026-03-24 16:22:37 +01:00

Author	SHA1	Message	Date
Alex Dowad	c34b84ed81	Remove unused conversion code from mbstring Over the last few years, I refactored mbstring to perform encoding conversion a buffer at a time, rather than a single byte at a time. This resulted in a huge performance increase. After the refactoring, the old "byte-at-a-time" code was retained for two reasons: 1) It was used by the mailparse PECL extension. 2) It was used to implement mb_strcut for some text encodings. However, after reviewing mailparse's use of mbstring, it is clear that mailparse only relies on mbstring for decoding of QPrint, and possibly Base64. It does not use the byte-at-a-time conversion code for any other encoding. Further, mb_strcut only relies on the byte-at-a-time conversion code for a limited number of legacy text encodings, such as ISO-2022-JP, HZ, UTF-7, etc. Hence, we can remove over 5000 lines of unused code without breaking anything. This will help to reduce binary size, and make the mbstring codebase easier to navigate for new contributors.	2026-01-13 11:43:44 +09:00
Alex Dowad	11bec6b92f	Remove some now-unused code from mbfl_strcut The legacy mbfl_strcut function is only used to implement mb_strcut for legacy text encodings which 1) do not use a fixed number of bytes per codepoint, 2) do not have an 'mblen_table' which can be used to quickly determine the codepoint length of a byte sequence, and 3) do not have a specialized 'mb_cut' function which implements mb_strcut for that text encoding. Remove unused code from mbfl_strcut, and leave only what is currently needed for the implementation of mb_strcut.	2026-01-13 11:43:44 +09:00
tekimen	edc2671227	ext/mbstring: Update to Unicode 17.0 (#19796) Updates UCD to Unicode 17.0 (released 2025 Sep).	2025-09-13 08:07:51 +09:00
Niels Dossche	719419a6e5	Fix unterminated string GCC warnings in mbstring (#19192) Necessary for Werror builds	2025-07-23 11:49:16 +02:00
Peter Kokot	8622362394	Remove unused strcasecmp definition (#17050) The strcasecmp usage was removed via `dc5f3b9562`.	2025-03-21 18:30:22 +01:00
Niels Dossche	44a2a5d3e9	Merge branch 'PHP-8.4' * PHP-8.4: Resolve GH-17112 for lower branches	2024-12-11 19:33:03 +01:00
Niels Dossche	ab47c189f3	Merge branch 'PHP-8.3' into PHP-8.4 * PHP-8.3: Resolve GH-17112 for lower branches	2024-12-11 19:32:48 +01:00
Niels Dossche	754aa7706b	Resolve GH-17112 for lower branches See https://github.com/php/php-src/pull/17114#issuecomment-2533050450	2024-12-11 19:32:36 +01:00
Niels Dossche	75e7234f70	Drop redundant macro definitions from mbfl_defs.h These are defined by the standard include headers we already use. These cause conflicts which cause compiler warnings on some toolchains (see GH-17112). Closes GH-17114.	2024-12-11 19:22:32 +01:00
Ayesh Karunaratne	3afb96184e	ext/mbstring: Update to Unicode 16 Updates UCD to Unicode 16.0 (released 2024 Sept). Previously: `0fdffc18`, #7502, #14680 Unicode 16 adds several new character sets and case folding rules. However, the existing ucgendat script can still parse them. This also adds a couple test cases to make sure the new rules for East Asian Wide characters and case folding work correctly. These tests fail on Unicode 15.1 and older because those versions do not contain those rules.	2024-09-17 10:40:00 +09:00
tekimen	dc5f3b9562	Fix GH-15824 mb_detect_encoding() invalid "UTF8" (#15829) I switched from strcasecmp to strncasecmp. However, strncasecmp takes a size as its third parameter. Hence, add a length check for MIME names and aliases. Co-authored-by: Niels Dossche <7771979+nielsdos@users.noreply.github.com>	2024-09-11 09:40:35 +09:00
Peter Kokot	9e94d2b040	Autotools: Refactor builtin checks (#14835) This creates a single M4 macro PHP_CHECK_BUILTIN and removes other PHP_CHECK_BUILTIN_* macros. Checks are wrapped in AC_CACHE_CHECK and PHP_HAVE_BUILTIN_* CPP macro definitions are defined to 1 if builtin is found and undefined if not. This also changes all PHP_HAVE_BUILTIN_ symbols to be either undefined or defined (to value 1) and syncs all #if/ifdef/defined usages of them in the php-src code. This way it is simpler to use them because they don't need to be defined to value 0 on Windows, for example. This is done as previous usages in php-src were mixed and in many places they were only checked with ifdef.	2024-07-08 21:25:16 +02:00
Ayesh Karunaratne	421ac9ac28	ext/mbstring: update to Unicode 15 Updates UCD to Unicode 15.1 (released 2023 Sept). The upcoming Unicode 16 version will be released roughly on 2024 Sept. Previously: `0fdffc18`, #7502 UCD 15.1 `DerivedNormalizationProps` contains multiple properties in the same line, which breaks the parser. This also updates the `ucgendat.php` script to allow 2 or three fields in each line, and to look for the `Cased` and `Case_Ignorable` properties in either of the fields to mimic the previous behavior.	2024-06-29 17:24:52 +02:00
Peter Kokot	a82d86479c	Replace WIN32 conditions with _WIN32 or PHP_WIN32 (#14462) * Replace WIN32 conditions with _WIN32 or PHP_WIN32 WIN32 is defined by the SDK and not defined all the time on Windows by compilers or the environment. _WIN32 is defined as 1 when the compilation target is 32-bit ARM, 64-bit ARM, x86, or x64. Otherwise, undefined. This syncs these usages one step further. Upstream libgd has replaced WIN32 with _WIN32 via `c60d9fe577` PHP_WIN32 is added to ext/sockets/sockets.stub.php as done in other .stub.php files at this point. Use PHP_WIN32 in ext/random * Use PHP_WIN32 in ext/sockets * Use _WIN32 in xxhash.h as done upstream See https://github.com/Cyan4973/xxHash/pull/931 * Update end comment with PHP_WIN32	2024-06-10 21:59:41 +02:00
Gina Peter Banyard	672539870d	ext/mbstring: Fix some [-Wsign-compare] warnings	2024-06-06 16:18:23 +01:00
Peter Kokot	056c43f848	[skip ci] Sync file permissions in Git repository Git can track executable (0755) and non-executable (0644) file modes. This is a minor file permissions sync across the php-src Git repository. - build/config.guess (0755 as done upstream) - build/config.sub (0755 as done upstream) - ext//?.stub.php (0644) - ext/mbstring/libmbfl/mbfl/mk_eaw_tbl.awk (0755 due to shebang usage)	2024-02-20 17:58:47 +01:00
Peter Kokot	474edd6eb5	Remove unused symbols in ext/mbstring/libmbfl/config.h.w32 (#13152) HAVE_WIN32_NATIVE_THREAD, USE_WIN32_NATIVE_THREAD and ENABLE_THREADS were once part of the libmbfl build https://github.com/moriyoshi/libmbfl but are not used anymore.	2024-01-15 10:27:21 +01:00
Niels Dossche	da6766d778	Use more optimal perfect hash table	2023-12-30 18:29:47 +02:00
Niels Dossche	0ea4f39a5c	Add script to aid generation of perfect hash table	2023-12-30 18:29:47 +02:00
Alex Dowad	5fdb27246c	Add mbstring support for GB18030-2022 text encoding The previous version of the GB-18030 standard was published in 2005. This commit adds support for the updated (2022) version of this text encoding. The existing GB18030 implementation has been left unchanged for backwards compatibility; users who want to use the new standard must explicitly indicate the desired text encoding is 'GB18030-2022'. The document which defines GB18030-2022, published by the government of the People's Republic of China, defines three levels of standards compliance. This implementation is intended to achieve Implementation Level 3, which is the highest level of compliance. Experts in the GB18030 standard are requested to assess this implementation and report any deviation from the standard.	2023-12-30 18:29:47 +02:00
Alex Dowad	cffdeb81d5	Add specialized implementation of mb_strcut for GB18030 For GB18030, it is not generally possible to identify character boundaries without scanning through the entire string. Therefore, implement mb_strcut using a similar strategy as the mblen_table based implementation in mbstring.c. The difference is that for GB18030, we need to look at two leading bytes to determine the byte length of a multi-byte character. The new implementation is 4-5x faster for short strings, and more than 10x faster for long strings. (Part of the reason why this new code has such a great performance advantage is because it is replacing code based on the older text conversion filters provided by libmbfl, which were quite slow.) The behavior is the same as before for valid GB18030 strings; for some invalid strings, mb_strcut will choose different 'cut' points as compared to before. (Clang's libFuzzer was used to compare the old and new implementations, searching for test cases where they had different behavior; no such cases were found.)	2023-12-18 17:01:20 +02:00
Alex Dowad	b0f7df1a67	Use optimized implementation of mb_strcut for Japanese mobile vendor UTF-8 variants To facilitate sharing of mb_cut_utf8, I combined mbfilter_utf8.c and mbfilter_utf8_mobile.c into a single source file.	2023-12-07 20:37:15 +02:00
Niels Dossche	91279cfdc0	Use binary search for cp932ext_ucs_table lookups (#12712) A large amount of time is spent doing a linear search through these tables in the CP932 encoding. Instead of that, we can add sorted versions of these tables that also store the index of the non-sorted version and perform a binary search on those sorted versions. This reduces the time spent from 1.54s to 0.91s for the reference benchmark [1]. [1] https://github.com/php/php-src/issues/12684#issuecomment-1813799924 Fix search bounds	2023-11-18 12:09:12 +01:00
Niels Dossche	3ad422ebd0	Avoid temporary string allocations in php_mb_parse_encoding_list() (#12714) This brings execution time down from 0.91s to 0.86s on the reference benchmark [1]. [1] https://github.com/php/php-src/issues/12684#issuecomment-1813799924	2023-11-18 12:08:59 +01:00
Niels Dossche	7658220599	Improve performance of mbfl_name2encoding() by using perfect hashing (#12707) mbfl_name2encoding() uses a linear loop through the encodings, comparing the name one by one, which is very slow. For the benchmark [1] just looking up the name takes about 50% of run-time. By using perfect hashing instead, we no longer have to loop over the list, and the number of string comparisons is reduced to just a single one. The perfect hashing table is generated using GNU gperf and amended manually to fit in with mbstring and manually changed to reduce the cache size. [1] https://github.com/php/php-src/issues/12684#issuecomment-1813799924	2023-11-17 19:38:43 +01:00
Alex Dowad	d04854b38c	Add fast mb_strcut implementation for UTF-16 Similar to the fast, specialized mb_strcut implementation for UTF-8 in `1f0cf133db`, this new implementation of mb_strcut for UTF-16 strings just examines a few bytes before each cut point. Even for short strings, the new implementation is around 2x faster. For strings around 10,000 bytes in length, it comes out about 100-500x faster in my microbenchmarks. The new implementation behaves identically to the old one on valid UTF-16 strings; a fuzzer was used to help verify this.	2023-10-28 19:09:08 +02:00
Alex Dowad	0c22276888	PHP_HAVE_BUILTIN_USUB_OVERFLOW macro is defined even if __builtin_usub_overflow not available ...So conditionally including code which uses __builtin_usub_overflow (for performance) if the macro is defined is not correct. We also need to check if the macro is defined as a non-zero value. Apparently this broke the build for a user whose C compiler is GCC 4.9.4. Sorry, user! That was my fault! Thanks to Jakub Zelenka for reporting the issue.	2023-10-23 14:05:48 +01:00
Alex Dowad	1f0cf133db	Add fast mb_strcut implementation for UTF-8 The old implementation runs through the entire string to pick out the part which should be returned by mb_strcut. This creates significant performance overhead. The new specialized implementation of mb_strcut for UTF-8 usually only examines a few bytes around the starting and ending cut points, meaning it generally runs in constant time. For UTF-8 strings just a few bytes long, the new implementation is around 10% faster (according to microbenchmarks which I ran locally). For strings around 10,000 bytes in length, it is 50-300x faster. (Yes, that is 300x and not 300%.) The new implementation behaves identically to the old one on VALID UTF-8 strings; a fuzzer was used to help ensure this is the case. On invalid UTF-8 strings, there is a difference: in some cases, the old implementation will pass invalid byte sequences through unchanged, while in others it will remove them. The new implementation has behavior which is perhaps slightly more predictable: it simply backs up the starting and ending cut points to the preceding "starter byte" (one which is not a UTF-8 continuation byte).	2023-10-04 09:10:38 +02:00
Alex Dowad	a57fdea149	Add assertion to mb_utf7imap_to_wchar to catch buffer overrun I don't believe such a buffer overrun will ever occur, but just in case the code is changed in the future, it will be good to have an assertion here to help catch bugs. (A similar assertion is already used in the UTF-7 version of this function.)	2023-10-01 14:43:35 +02:00
Alex Dowad	50ca24251d	PHP_HAVE_BUILTIN_USUB_OVERFLOW macro is defined even if __builtin_usub_overflow not available ...So conditionally including code which uses __builtin_usub_overflow (for performance) if the macro is defined is not correct. We also need to check if the macro is defined as a non-zero value. Apparently this broke the build for a user whose C compiler is GCC 4.9.4. Sorry, user! That was my fault! Thanks to Jakub Zelenka for reporting the issue.	2023-09-08 20:36:24 +02:00
Alex Dowad	6930ef5837	Merge branch 'PHP-8.2' * PHP-8.2: Fix mb_strlen is wrong length for CP932 when 0x80.	2023-05-30 14:02:16 -07:00
Alex Dowad	c33589ea11	Merge branch 'PHP-8.1' into PHP-8.2 * PHP-8.1: Fix mb_strlen is wrong length for CP932 when 0x80.	2023-05-30 13:45:36 -07:00
Yuya Hamada	c50172e812	Fix mb_strlen is wrong length for CP932 when 0x80.	2023-05-30 13:44:30 -07:00
Alex Dowad	18ca489347	Convert mbfilter_conv{,_r}_map_tbl to return bool Thanks to Girgias for pointing this out.	2023-05-20 21:27:48 -07:00
Alex Dowad	8e6be14372	Fix problem with CP949 conversion when 0xC9 precedes byte lower than 0xA1 This bug was introduced in `e837a8800b`. In that commit, I increased the performance of CP949 text conversion, but accidentally broke the case where 0xC9 (illegal byte to start a character) is followed by a valid character with a first byte less than 0xA1. The 'broken' behavior is that both the 0xC9 byte and the following valid character would be converted to error markers.	2023-05-20 21:27:48 -07:00
Alex Dowad	245daedb41	Move kana translation tables to mbfilter_cjk.c These (static) tables were defined in a header file, which was included in two different .c files. That will result in two copies of the tables being included in the PHP binary. But the tables were only used in one of the two .c files. Move it where it is used to avoid needlessly bloating the binary. (I checked in a hex editor and confirmed that while the previous binary contained two copies of these tables, it now only contains one.)	2023-05-20 21:27:48 -07:00
Alex Dowad	175154dbcc	Optimize conversion of CP932 text to Unicode Conversion of CP932 text to UTF-8 using `mb_convert_encoding` is now about 20% faster than before.	2023-05-20 21:27:48 -07:00
Alex Dowad	73633bf1c3	Optimize conversion of SJIS-2004 text to Unicode Conversion of SJIS-2004 text to UTF-8 using `mb_convert_encoding` is now about 60% faster than before. (Many other mbstring functions will also be faster now on SJIS-2004 text.)	2023-05-20 21:27:48 -07:00
Alex Dowad	c717c79a09	Combine CJK encoding conversion code in a single source file This will make it easier to combine duplicated code between all the CJK text encodings (a significant amount is already combined in this commit, such as the repeated definitions of SJIS_DECODE and SJIS_ENCODE), but I hope to remove even more redundancy in the future. The table used to implement mb_strlen for CP932 has been changed to the same table as "SJIS-win".	2023-05-20 21:27:48 -07:00
Ilija Tovilo	aa553af911	Fix segfault in mb_strrpos/mb_strripos with ASCII encoding and negative offset We're setting the encoding from PHP_FUNCTION(mb_strpos), but mbfl_strpos would discard it, setting it to mbfl_encoding_pass, making zend_memnrstr fail due to a null-pointer exception. Fixes GH-11217 Closes GH-11220	2023-05-15 10:36:37 +02:00
Niels Dossche	b915a1d8d7	Fix uninitialised variable warning in mbfilter_sjis.c Compiling in release mode with UBSAN gives me the following compiler warning: ``` In function ‘mb_wchar_to_sjismac’: mbfilter_sjis.c:1419:89: warning: ‘i’ may be used uninitialized [-Wmaybe-uninitialized] 1419 | buf->state = (i << 24) | (index << 16) | (w & 0xFFFF); | ^~ mbfilter_sjis.c:1398:42: note: ‘i’ was declared here 1398 | for (int i = 0; i < code_tbl_m_len; i++) { | ^ ``` Since the if condition will always be taken after the goto, we can get rid of the warning by moving the label inside the if. Signed-off-by: Alex Dowad <alexinbeijing@gmail.com>	2023-04-30 13:51:52 +02:00
Javier Eguiluz	732d92c0e5	[skip ci] Fix various typos and grammar issues (#11143)	2023-04-28 11:05:32 +02:00
Alex Dowad	6df7557e43	mb_parse_str, mb_http_input, and mb_convert_variables use fast text conversion code for automatic encoding detection For mb_parse_str, when mbstring.http_input (INI parameter) is a list of multiple possible text encodings (which is not the case by default), this new implementation is about 25% faster. When mbstring.http_input is a single value, then nothing is changed. (No automatic encoding detection is done in that case.)	2023-04-12 19:57:52 +02:00
Alex Dowad	c4fb049bf6	For UTF-7, emit error marker if Base64 section ends abruptly after first half of surrogate pair This (rare) situation was already handled correctly for the 1st and 2nd of every 3 codepoints in a Base64-encoded section of a UTF-7 string. However, it was not handled correctly if it happened on the 3rd, 6th, 9th, etc. codepoint of such a Base64-encoded section.	2023-03-27 11:34:11 +02:00
pakutoma	b721d0f71e	Fix phpGH-10648: add check function pointer into mbfl_encoding Previously, mbstring used the same logic for encoding validation as for encoding conversion. However, there are cases where we want to use different logic for validation and conversion. For example, if a string ends up with missing input required by the encoding, or if a character is input that is invalid as an encoding but can be converted, the conversion should succeed and the validation should fail. To achieve this, a function pointer mb_check_fn has been added to struct mbfl_encoding to implement the logic used for validation. Also, added implementation of validation logic for UTF-7, UTF7-IMAP, ISO-2022-JP and JIS. (The same change has already been made to PHP 8.2 and 8.3; see `6fc8d014df`. This commit is backporting the change to PHP 8.1.)	2023-03-25 09:52:10 +02:00
Alex Dowad	345abce590	Fix compile errors caused by missing initializers in `0779950768` When I built and tested `0779950768` locally, the build was successful and all tests passed. However, in CI, some CI jobs are failing due to compile errors. Fix those.	2023-03-24 22:18:18 +02:00
Alex Dowad	0779950768	Merge branch 'PHP-8.2' * PHP-8.2: Fix phpGH-10648: add check function pointer into mbfl_encoding	2023-03-24 21:15:32 +02:00
pakutoma	6fc8d014df	Fix phpGH-10648: add check function pointer into mbfl_encoding Previously, mbstring used the same logic for encoding validation as for encoding conversion. However, there are cases where we want to use different logic for validation and conversion. For example, if a string ends up with missing input required by the encoding, or if a character is input that is invalid as an encoding but can be converted, the conversion should succeed and the validation should fail. To achieve this, a function pointer mb_check_fn has been added to struct mbfl_encoding to implement the logic used for validation. Also, added implementation of validation logic for UTF-7, UTF7-IMAP, ISO-2022-JP and JIS.	2023-03-24 20:34:22 +02:00
Alex Dowad	0ce755be26	Implement mb_encode_mimeheader using fast text conversion filters The behavior of the new mb_encode_mimeheader implementation closely follows the old implementation, except for three points: • The old implementation was missing a call to the mbfl_convert_filter flush function. So it would sometimes truncate the input string just before its end. • The old implementation would drop zero bytes when QPrint-encoding. So for example, if you tried to QPrint-encode the UTF-32BE string "\x00\x00\x12\x34", its QPrint-encoding would be "=12=34", which does not decode to a valid UTF-32BE string. This is now fixed. • In some rare corner cases, the new implementation will choose to Base64-encode or QPrint-encode the input string, where the old implementation would have just added newlines to it. Specifically, this can happen when there is a non-space ASCII character, followed by a large number of ASCII spaces, followed by a non-ASCII character. The new implementation is around 2.5-8x faster than the old one, depending on the text encoding and transfer encoding used. Performance gains are greater with Base64 transfer encoding than with QPrint transfer encoding; this is not because QPrint-encoding bytes is slow, but because QPrint-encoded output is much bigger than Base64-encoded output and takes more lines, so we have to go through the process of finding the right place to break a line many more times.	2023-03-15 15:53:08 +02:00
Alex Dowad	e447036dc6	Merge branch 'PHP-8.2' * PHP-8.2: Propagate error checks for mbfl_filt_conv_illegal_output() Use CK() macro to check the output function in mbfilter_unicode2sjis_emoji_sb() Make error checks on encoding methods for docomo, kddi, sb consistent	2023-03-02 23:12:09 +02:00
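
The starter-byte rule described in `1f0cf133db` (back each cut point up to the preceding byte that is not a UTF-8 continuation byte) can be sketched in C roughly as follows. This is an illustrative sketch only, not the code from php-src: the names `is_continuation_byte`, `back_up_to_starter`, and `sketch_utf8_cut` are hypothetical, and measuring the end point from the adjusted start is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx. */
static int is_continuation_byte(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

/* Hypothetical helper: move pos backward until it rests on a byte that is
 * not a continuation byte (a "starter byte"), or on the start of the string.
 * A position at or past the end of the string is clamped to len. */
static size_t back_up_to_starter(const unsigned char *s, size_t len, size_t pos)
{
    if (pos >= len) {
        return len;
    }
    while (pos > 0 && is_continuation_byte(s[pos])) {
        pos--;
    }
    return pos;
}

/* Hypothetical sketch of a byte-oriented cut: given a starting byte offset
 * and a maximum byte count, report the offset and length of a slice whose
 * two ends have both been backed up to starter bytes. Only a few bytes near
 * each cut point are examined; the rest of the string is never scanned. */
static void sketch_utf8_cut(const unsigned char *s, size_t len,
                            size_t from, size_t max_bytes,
                            size_t *out_start, size_t *out_len)
{
    assert(s != NULL && out_start != NULL && out_len != NULL);

    if (from > len) {
        from = len;
    }
    size_t start = back_up_to_starter(s, len, from);

    /* Compute the tentative end without overflowing size_t. */
    size_t end = (max_bytes > len - start) ? len : start + max_bytes;
    end = back_up_to_starter(s, len, end);

    *out_start = start;
    *out_len = end - start;
}
```

For the six-byte string "h", "é" (0xC3 0xA9), "l", "l", "o", a requested start of 2 lands on the continuation byte 0xA9 and is backed up to offset 1, so the slice begins with the whole "é" instead of half of it.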

663 Commits