1
0
mirror of https://github.com/php/php-src.git synced 2026-03-24 16:22:37 +01:00

663 Commits

Author SHA1 Message Date
Alex Dowad
c34b84ed81 Remove unused conversion code from mbstring
Over the last few years, I refactored mbstring to perform encoding conversion
a buffer at a time, rather than a single byte at a time. This resulted in a
huge performance increase.

After the refactoring, the old "byte-at-a-time" code was retained for two
reasons:

1) It was used by the mailparse PECL extension.
2) It was used to implement mb_strcut for some text encodings.

However, after reviewing mailparse's use of mbstring, it is clear that
mailparse only relies on mbstring for decoding of QPrint, and possibly
Base64. It does not use the byte-at-a-time conversion code for any
other encoding.

Further, mb_strcut only relies on the byte-at-a-time conversion code
for a limited number of legacy text encodings, such as ISO-2022-JP,
HZ, UTF-7, etc.

Hence, we can remove over 5000 lines of unused code without breaking
anything. This will help to reduce binary size, and make the mbstring
codebase easier to navigate for new contributors.
2026-01-13 11:43:44 +09:00
Alex Dowad
11bec6b92f Remove some now-unused code from mbfl_strcut
The legacy mbfl_strcut function is only used to implement mb_strcut
for legacy text encodings which 1) do not use a fixed number of bytes
per codepoint, 2) do not have an 'mblen_table' which can be used to
quickly determine the codepoint length of a byte sequence, and 3) do
not have a specialized 'mb_cut' function which implements mb_strcut
for that text encoding.

Remove unused code from mbfl_strcut, and leave only what is currently
needed for the implementation of mb_strcut.
2026-01-13 11:43:44 +09:00
tekimen
edc2671227 ext/mbstring: Update to Unicode 17.0 (#19796)
Updates UCD to Unicode 17.0 (released 2025 Sep).
2025-09-13 08:07:51 +09:00
Niels Dossche
719419a6e5 Fix unterminated string GCC warnings in mbstring (#19192)
Necessary for for Werror builds
2025-07-23 11:49:16 +02:00
Peter Kokot
8622362394 Remove unused strcasecmp definition (#17050)
The strcasecmp usage was removed via
dc5f3b9562.
2025-03-21 18:30:22 +01:00
Niels Dossche
44a2a5d3e9 Merge branch 'PHP-8.4'
* PHP-8.4:
  Resolve GH-17112 for lower branches
2024-12-11 19:33:03 +01:00
Niels Dossche
ab47c189f3 Merge branch 'PHP-8.3' into PHP-8.4
* PHP-8.3:
  Resolve GH-17112 for lower branches
2024-12-11 19:32:48 +01:00
Niels Dossche
754aa7706b Resolve GH-17112 for lower branches
See https://github.com/php/php-src/pull/17114#issuecomment-2533050450
2024-12-11 19:32:36 +01:00
Niels Dossche
75e7234f70 Drop redundant macro definitions from mbfl_defs.h
These are defined by the standard include headers we already use.
These cause conflicts which cause compiler warnings on some toolchains
(see GH-17112).

Closes GH-17114.
2024-12-11 19:22:32 +01:00
Ayesh Karunaratne
3afb96184e ext/mbstring: Update to Unicode 16
Updates UCD to Unicode 16.0 (released 2024 Sept).

Previously: 0fdffc18, #7502, #14680

Unicode 16 adds several new character sets and case folding rules.
However, the existing ucgendat script can still parse them.

This also adds a couple test cases to make sure the new rules for
East Asian Wide characters and case folding work correctly. These
tests fail on Unicode 15.1 and older because those verisons do not
contain those rules.
2024-09-17 10:40:00 +09:00
tekimen
dc5f3b9562 Fix GH-15824 mb_detect_encoding() invalid "UTF8" (#15829)
I fixed from strcasecmp to strncasecmp.
However, strncasecmp is specify size to #3 parameter.
Hence, Add check length to mime and aliases.

Co-authored-by: Niels Dossche <7771979+nielsdos@users.noreply.github.com>
2024-09-11 09:40:35 +09:00
Peter Kokot
9e94d2b040 Autotools: Refactor builtin checks (#14835)
This creates a single M4 macro PHP_CHECK_BUILTIN and removes other
PHP_CHECK_BUILTIN_* macros. Checks are wrapped in AC_CACHE_CHECK and
PHP_HAVE_BUILTIN_* CPP macro definitions are defined to 1 if builtin
is found and undefined if not.

This also changes all PHP_HAVE_BUILTIN_ symbols to be either undefined
or defined (to value 1) and syncs all #if/ifdef/defined usages of them
in the php-src code. This way it is simpler to use them because they
don't need to be defined to value 0 on Windows, for example. This is
done as previous usages in php-src were mixed and on many places they
were only checked with ifdef.
2024-07-08 21:25:16 +02:00
Ayesh Karunaratne
421ac9ac28 ext/mbstring: update to Unicode 15
Updates UCD to Unicode 15.1 (released 2023 Sept). The upcoming
Unicode 16 version will be released roughly on 2024 Sept.

Previously: 0fdffc18, #7502

UCD 15.1 `DerivedNormalizationProps` contains multiple properties in
the same line, which breaks the parser. This also updates the
`ucgendat.php` script to allow 2 or three fields in each line, and to
look for the `Cased` and `Case_Ignorable` properties in either of the
fields to mimic the previous behavior.
2024-06-29 17:24:52 +02:00
Peter Kokot
a82d86479c Replace WIN32 conditions with _WIN32 or PHP_WIN32 (#14462)
* Replace WIN32 conditions with _WIN32 or PHP_WIN32

WIN32 is defined by the SDK and not defined all the time on Windows by
compilers or the environment. _WIN32 is defined as 1 when the
compilation target is 32-bit ARM, 64-bit ARM, x86, or x64. Otherwise,
undefined.

This syncs these usages one step further.

Upstream libgd has replaced WIN32 with _WIN32 via
c60d9fe577

PHP_WIN32 is added to ext/sockets/sockets.stub.php as done in other
*.stub.php files at this point.

* Use PHP_WIN32 in ext/random

* Use PHP_WIN32 in ext/sockets

* Use _WIN32 in xxhash.h as done upstream

See https://github.com/Cyan4973/xxHash/pull/931

* Update end comment with PHP_WIN32
2024-06-10 21:59:41 +02:00
Gina Peter Banyard
672539870d ext/mbstring: Fix some [-Wsign-compare] warnings 2024-06-06 16:18:23 +01:00
Peter Kokot
056c43f848 [skip ci] Sync file permissions in Git repository
Git can track executable (0755) and non-executable (0644) file modes.
This is a minor file permissions sync across the php-src Git repository.

- build/config.guess (0755 as done upstream)
- build/config.sub (0755 as done upstream)
- ext/*/?*.stub.php (0644)
- ext/mbstring/libmbfl/mbfl/mk_eaw_tbl.awk (0755 due to shebang usage)
2024-02-20 17:58:47 +01:00
Peter Kokot
474edd6eb5 Remove unused symbols in ext/mbstring/libmbfl/config.h.w32 (#13152)
HAVE_WIN32_NATIVE_THREAD, USE_WIN32_NATIVE_THREAD and ENABLE_THREADS
were once part of the libmbfl build https://github.com/moriyoshi/libmbfl
but are not used anymore.
2024-01-15 10:27:21 +01:00
Niels Dossche
da6766d778 Use more optimal perfect hash table 2023-12-30 18:29:47 +02:00
Niels Dossche
0ea4f39a5c Add script to aid generation of perfect hash table 2023-12-30 18:29:47 +02:00
Alex Dowad
5fdb27246c Add mbstring support for GB18030-2022 text encoding
The previous version of the GB-18030 standard was published in 2005.
This commit adds support for the updated (2022) version of this text
encoding. The existing GB18030 implementation has been left unchanged
for backwards compatibility; users who want to use the new standard
must explicitly indicate the desired text encoding is 'GB18030-2022'.

The document which defines GB18030-2022, published by the government
of the People's Republic of China, defines three levels of standards
compliance. This implementation is intended to achieve Implementation
Level 3, which is the highest level of compliance.

Experts in the GB18030 standard are requested to assess this
implementation and report any deviation from the standard.
2023-12-30 18:29:47 +02:00
Alex Dowad
cffdeb81d5 Add specialized implementation of mb_strcut for GB18030
For GB18030, it is not generally possible to identify character
boundaries without scanning through the entire string. Therefore,
implement mb_strcut using a similar strategy as the mblen_table based
implementation in mbstring.c. The difference is that for GB18030, we
need to look at two leading bytes to determine the byte length of a
multi-byte character.

The new implementation is 4-5x faster for short strings, and more than
10x faster for long strings. (Part of the reason why this new code has
such a great performance advantage is because it is replacing code
based on the older text conversion filters provided by libmbfl, which
were quite slow.)

The behavior is the same as before for valid GB18030 strings; for
some invalid strings, mb_strcut will choose different 'cut' points
as compared to before. (Clang's libFuzzer was used to compare the
old and new implementations, searching for test cases where they had
different behavior; no such cases were found.)
2023-12-18 17:01:20 +02:00
Alex Dowad
b0f7df1a67 Use optimized implementation of mb_strcut for Japanese mobile vendor UTF-8 variants
To facilitate sharing of mb_cut_utf8, I combined mbfilter_utf8.c and
mbfilter_utf8_mobile.c into a single source file.
2023-12-07 20:37:15 +02:00
Niels Dossche
91279cfdc0 Use binary search for cp932ext*_ucs_table lookups (#12712)
* Use binary search for cp932ext*_ucs_table lookups

A large amount of time is spent doing a linear search through these
tables in the CP932 encoding. Instead of that, we can add sorted
versions of these tables that also store the index of the non-sorted
version and perform a binary search on those sorted versions.

This reduces the time spent from 1.54s to 0.91s for the reference
benchmark [1].

[1] https://github.com/php/php-src/issues/12684#issuecomment-1813799924

* Fix search bounds
2023-11-18 12:09:12 +01:00
Niels Dossche
3ad422ebd0 Avoid temporary string allocations in php_mb_parse_encoding_list() (#12714)
This brings execution time down from 0.91s to 0.86s on the reference
benchmark [1].

[1] https://github.com/php/php-src/issues/12684#issuecomment-1813799924
2023-11-18 12:08:59 +01:00
Niels Dossche
7658220599 Improve performance of mbfl_name2encoding() by using perfect hashing (#12707)
mbfl_name2encoding() uses a linear loop through the encodings, comparing
the name one by one, which is very slow. For the benchmark [1] just
looking up the name takes about 50% of run-time.

By using perfect hashing instead, we no longer have to loop over the
list, and the number of string comparisons is reduced to just a single
one. The perfect hashing table is generated using GNU gperf and amended
manually to fit in with mbstring and manually changed to  reduce the
cache size.

[1] https://github.com/php/php-src/issues/12684#issuecomment-1813799924
2023-11-17 19:38:43 +01:00
Alex Dowad
d04854b38c Add fast mb_strcut implementation for UTF-16
Similar to the fast, specialized mb_strcut implementation for UTF-8
in 1f0cf133db, this new implementation of mb_strcut for UTF-16 strings
just examines a few bytes before each cut point.

Even for short strings, the new implementation is around 2x faster.
For strings around 10,000 bytes in length, it comes out about 100-500x
faster in my microbenchmarks.

The new implementation behaves identically to the old one on valid
UTF-16 strings; a fuzzer was used to help verify this.
2023-10-28 19:09:08 +02:00
Alex Dowad
0c22276888 PHP_HAVE_BUILTIN_USUB_OVERFLOW macro is defined even if __builtin_usub_overflow not available
...So conditionally including code which uses __builtin_usub_overflow
(for performance) if the macro is defined is not correct. We also need
to check if the macro is defined as a non-zero value.

Apparently this broke the build for a user whose C compiler is GCC
4.9.4. Sorry, user! That was my fault!

Thanks to Jakub Zelenka for reporting the issue.
2023-10-23 14:05:48 +01:00
Alex Dowad
1f0cf133db Add fast mb_strcut implementation for UTF-8
The old implementation runs through the entire string to pick out the
part which should be returned by mb_strcut. This creates significant
performance overhead. The new specialized implementation of mb_strcut
for UTF-8 usually only examines a few bytes around the starting and
ending cut points, meaning it generally runs in constant time.

For UTF-8 strings just a few bytes long, the new implementation is
around 10% faster (according to microbenchmarks which I ran locally).
For strings around 10,000 bytes in length, it is 50-300x faster.
(Yes, that is 300x and not 300%.)

The new implementation behaves identically to the old one on VALID
UTF-8 strings; a fuzzer was used to help ensure this is the case.
On invalid UTF-8 strings, there is a difference: in some cases, the
old implementation will pass invalid byte sequences through unchanged,
while in others it will remove them. The new implementation has
behavior which is perhaps slightly more predictable: it simply backs
up the starting and ending cut points to the preceding "starter
byte" (one which is not a UTF-8 continuation byte).
2023-10-04 09:10:38 +02:00
Alex Dowad
a57fdea149 Add assertion to mb_utf7imap_to_wchar to catch buffer overrun
I don't believe such a buffer overrun will ever occur, but just in
case the code is changed in the future, it will be good to have an
assertion here to help catch bugs. (A similar assertion is already
used in the UTF-7 version of this function.)
2023-10-01 14:43:35 +02:00
Alex Dowad
50ca24251d PHP_HAVE_BUILTIN_USUB_OVERFLOW macro is defined even if __builtin_usub_overflow not available
...So conditionally including code which uses __builtin_usub_overflow
(for performance) if the macro is defined is not correct. We also need
to check if the macro is defined as a non-zero value.

Apparently this broke the build for a user whose C compiler is GCC
4.9.4. Sorry, user! That was my fault!

Thanks to Jakub Zelenka for reporting the issue.
2023-09-08 20:36:24 +02:00
Alex Dowad
6930ef5837 Merge branch 'PHP-8.2'
* PHP-8.2:
  Fix mb_strlen is wrong length for CP932 when 0x80.
2023-05-30 14:02:16 -07:00
Alex Dowad
c33589ea11 Merge branch 'PHP-8.1' into PHP-8.2
* PHP-8.1:
  Fix mb_strlen is wrong length for CP932 when 0x80.
2023-05-30 13:45:36 -07:00
Yuya Hamada
c50172e812 Fix mb_strlen is wrong length for CP932 when 0x80. 2023-05-30 13:44:30 -07:00
Alex Dowad
18ca489347 Convert mbfilter_conv{,_r}_map_tbl to return bool
Thanks to Girgias for pointing this out.
2023-05-20 21:27:48 -07:00
Alex Dowad
8e6be14372 Fix problem with CP949 conversion when 0xC9 precedes byte lower than 0xA1
This bug was introduced in e837a8800b. In that commit, I increased the
performance of CP949 text conversion, but accidentally broke the case
where 0xC9 (illegal byte to start a character) is followed by a valid
character with a first byte less than 0xA1. The 'broken' behavior is
that both the 0xC9 byte and the following valid character would be
converted to error markers.
2023-05-20 21:27:48 -07:00
Alex Dowad
245daedb41 Move kana translation tables to mbfilter_cjk.c
These (static) tables were defined in a header file, which was included
in two different .c files. That will result in two copies of the tables
being included in the PHP binary.

But the tables were only used in one of the two .c files. Move it where
it is used to avoid needlessly bloating the binary. (I checked in a
hex editor and confirmed that while the previous binary contained two
copies of these tables, it now only contains one.)
2023-05-20 21:27:48 -07:00
Alex Dowad
175154dbcc Optimize conversion of CP932 text to Unicode
Conversion of CP932 text to UTF-8 using `mb_convert_encoding` is
now about 20% faster than before.
2023-05-20 21:27:48 -07:00
Alex Dowad
73633bf1c3 Optimize conversion of SJIS-2004 text to Unicode
Conversion of SJIS-2004 text to UTF-8 using `mb_convert_encoding` is
now about 60% faster than before. (Many other mbstring functions will
also be faster now on SJIS-2004 text.)
2023-05-20 21:27:48 -07:00
Alex Dowad
c717c79a09 Combine CJK encoding conversion code in a single source file
This will make it easier to combine duplicated code between all the
CJK text encodings (a significant amount is already combined in this
commit, such as the repeated definitions of SJIS_DECODE and
SJIS_ENCODE), but I hope to remove even more redundancy in the future.

The table used to implement mb_strlen for CP932 has been changed to
the same table as "SJIS-win".
2023-05-20 21:27:48 -07:00
Ilija Tovilo
aa553af911 Fix segfault in mb_strrpos/mb_strripos with ASCII encoding and negative offset
We're setting the encoding from PHP_FUNCTION(mb_strpos), but mbfl_strpos would
discard it, setting it to mbfl_encoding_pass, making zend_memnrstr fail due to a
null-pointer exception.

Fixes GH-11217
Closes GH-11220
2023-05-15 10:36:37 +02:00
Niels Dossche
b915a1d8d7 Fix uninitialised variable warning in mbfilter_sjis.c
Compiling in release mode with UBSAN gives me the following compiler warning:
```
In function ‘mb_wchar_to_sjismac’:
mbfilter_sjis.c:1419:89: warning: ‘i’ may be used uninitialized [-Wmaybe-uninitialized]
 1419 | buf->state = (i << 24) | (index << 16) | (w & 0xFFFF);
      |                 ^~
mbfilter_sjis.c:1398:42: note: ‘i’ was declared here
 1398 | for (int i = 0; i < code_tbl_m_len; i++) {
      |          ^
```

Since the if condition will always be taken after the goto, we can get
rid of the warning by moving the label inside the if.

Signed-off-by: Alex Dowad <alexinbeijing@gmail.com>
2023-04-30 13:51:52 +02:00
Javier Eguiluz
732d92c0e5 [skip ci] Fix various typos and grammar issues (#11143) 2023-04-28 11:05:32 +02:00
Alex Dowad
6df7557e43 mb_parse_str, mb_http_input, and mb_convert_variables use fast text conversion code for automatic encoding detection
For mb_parse_str, when mbstring.http_input (INI parameter) is a list of
multiple possible text encodings (which is not the case by default),
this new implementation is about 25% faster.

When mbstring.http_input is a single value, then nothing is changed.
(No automatic encoding detection is done in that case.)
2023-04-12 19:57:52 +02:00
Alex Dowad
c4fb049bf6 For UTF-7, emit error marker if Base64 section ends abruptly after first half of surrogate pair
This (rare) situation was already handled correctly for the 1st and 2nd
of every 3 codepoints in a Base64-encoded section of a UTF-7 string.
However, it was not handled correctly if it happened on the 3rd,
6th, 9th, etc. codepoint of such a Base64-encoded section.
2023-03-27 11:34:11 +02:00
pakutoma
b721d0f71e Fix phpGH-10648: add check function pointer into mbfl_encoding
Previously, mbstring used the same logic for encoding validation as for
encoding conversion.

However, there are cases where we want to use different logic for validation
and conversion. For example, if a string ends up with missing input
required by the encoding, or if a character is input that is invalid
as an encoding but can be converted, the conversion should succeed and
the validation should fail.

To achieve this, a function pointer mb_check_fn has been added to
struct mbfl_encoding to implement the logic used for validation.
Also, added implementation of validation logic for UTF-7, UTF7-IMAP,
ISO-2022-JP and JIS.

(The same change has already been made to PHP 8.2 and 8.3; see
6fc8d014df. This commit is backporting the change to PHP 8.1.)
2023-03-25 09:52:10 +02:00
Alex Dowad
345abce590 Fix compile errors caused by missing initializers in 0779950768
When I built and tested 0779950768 locally, the build was successful
and all tests passed. However, in CI, some CI jobs are failing due to
compile errors. Fix those.
2023-03-24 22:18:18 +02:00
Alex Dowad
0779950768 Merge branch 'PHP-8.2'
* PHP-8.2:
  Fix phpGH-10648: add check function pointer into mbfl_encoding
2023-03-24 21:15:32 +02:00
pakutoma
6fc8d014df Fix phpGH-10648: add check function pointer into mbfl_encoding
Previously, mbstring used the same logic for encoding validation as for
encoding conversion.

However, there are cases where we want to use different logic for validation
and conversion. For example, if a string ends up with missing input
required by the encoding, or if a character is input that is invalid
as an encoding but can be converted, the conversion should succeed and
the validation should fail.

To achieve this, a function pointer mb_check_fn has been added to
struct mbfl_encoding to implement the logic used for validation.
Also, added implementation of validation logic for UTF-7, UTF7-IMAP,
ISO-2022-JP and JIS.
2023-03-24 20:34:22 +02:00
Alex Dowad
0ce755be26 Implement mb_encode_mimeheader using fast text conversion filters
The behavior of the new mb_encode_mimeheader implementation closely
follows the old implementation, except for three points:

• The old implementation was missing a call to the mbfl_convert_filter
  flush function. So it would sometimes truncate the input string just
  before its end.

• The old implementation would drop zero bytes when QPrint-encoding.
  So for example, if you tried to QPrint-encode the UTF-32BE string
  "\x00\x00\x12\x34", its QPrint-encoding would be "=12=34", which
  does not decode to a valid UTF-32BE string. This is now fixed.

• In some rare corner cases, the new implementation will choose to
  Base64-encode or QPrint-encode the input string, where the old
  implementation would have just added newlines to it. Specifically,
  this can happen when there is a non-space ASCII character, followed
  by a large number of ASCII spaces, followed by a non-ASCII character.

The new implementation is around 2.5-8x faster than the old one,
depending on the text encoding and transfer encoding used. Performance
gains are greater with Base64 transfer encoding than with QPrint
transfer encoding; this is not because QPrint-encoding bytes is slow,
but because QPrint-encoded output is much bigger than Base64-encoded
output and takes more lines, so we have to go through the process of
finding the right place to break a line many more times.
2023-03-15 15:53:08 +02:00
Alex Dowad
e447036dc6 Merge branch 'PHP-8.2'
* PHP-8.2:
  Propagate error checks for mbfl_filt_conv_illegal_output()
  Use CK() macro to check the output function in mbfilter_unicode2sjis_emoji_sb()
  Make error checks on encoding methods for docomo, kddi, sb consistent
2023-03-02 23:12:09 +02:00