Over the last few years, I refactored mbstring to perform encoding conversion
a buffer at a time, rather than a single byte at a time. This resulted in a
huge performance increase.
After the refactoring, the old "byte-at-a-time" code was retained for two
reasons:
1) It was used by the mailparse PECL extension.
2) It was used to implement mb_strcut for some text encodings.
However, after reviewing mailparse's use of mbstring, it is clear that
mailparse only relies on mbstring for decoding of QPrint, and possibly
Base64. It does not use the byte-at-a-time conversion code for any
other encoding.
Further, mb_strcut only relies on the byte-at-a-time conversion code
for a limited number of legacy text encodings, such as ISO-2022-JP,
HZ, UTF-7, etc.
Hence, we can remove over 5000 lines of unused code without breaking
anything. This will help to reduce binary size, and make the mbstring
codebase easier to navigate for new contributors.
The legacy mbfl_strcut function is only used to implement mb_strcut
for legacy text encodings which 1) do not use a fixed number of bytes
per codepoint, 2) do not have an 'mblen_table' which can be used to
quickly determine the codepoint length of a byte sequence, and 3) do
not have a specialized 'mb_cut' function which implements mb_strcut
for that text encoding.
Remove unused code from mbfl_strcut, and leave only what is currently
needed for the implementation of mb_strcut.
These are defined by the standard include headers we already use.
These cause conflicts which cause compiler warnings on some toolchains
(see GH-17112).
Closes GH-17114.
Updates UCD to Unicode 16.0 (released 2024 Sept).
Previously: 0fdffc18, #7502, #14680
Unicode 16 adds several new character sets and case folding rules.
However, the existing ucgendat script can still parse them.
This also adds a couple test cases to make sure the new rules for
East Asian Wide characters and case folding work correctly. These
tests fail on Unicode 15.1 and older because those verisons do not
contain those rules.
I fixed from strcasecmp to strncasecmp.
However, strncasecmp is specify size to #3 parameter.
Hence, Add check length to mime and aliases.
Co-authored-by: Niels Dossche <7771979+nielsdos@users.noreply.github.com>
This creates a single M4 macro PHP_CHECK_BUILTIN and removes other
PHP_CHECK_BUILTIN_* macros. Checks are wrapped in AC_CACHE_CHECK and
PHP_HAVE_BUILTIN_* CPP macro definitions are defined to 1 if builtin
is found and undefined if not.
This also changes all PHP_HAVE_BUILTIN_ symbols to be either undefined
or defined (to value 1) and syncs all #if/ifdef/defined usages of them
in the php-src code. This way it is simpler to use them because they
don't need to be defined to value 0 on Windows, for example. This is
done as previous usages in php-src were mixed and on many places they
were only checked with ifdef.
Updates UCD to Unicode 15.1 (released 2023 Sept). The upcoming
Unicode 16 version will be released roughly on 2024 Sept.
Previously: 0fdffc18, #7502
UCD 15.1 `DerivedNormalizationProps` contains multiple properties in
the same line, which breaks the parser. This also updates the
`ucgendat.php` script to allow 2 or three fields in each line, and to
look for the `Cased` and `Case_Ignorable` properties in either of the
fields to mimic the previous behavior.
* Replace WIN32 conditions with _WIN32 or PHP_WIN32
WIN32 is defined by the SDK and not defined all the time on Windows by
compilers or the environment. _WIN32 is defined as 1 when the
compilation target is 32-bit ARM, 64-bit ARM, x86, or x64. Otherwise,
undefined.
This syncs these usages one step further.
Upstream libgd has replaced WIN32 with _WIN32 via
c60d9fe577
PHP_WIN32 is added to ext/sockets/sockets.stub.php as done in other
*.stub.php files at this point.
* Use PHP_WIN32 in ext/random
* Use PHP_WIN32 in ext/sockets
* Use _WIN32 in xxhash.h as done upstream
See https://github.com/Cyan4973/xxHash/pull/931
* Update end comment with PHP_WIN32
Git can track executable (0755) and non-executable (0644) file modes.
This is a minor file permissions sync across the php-src Git repository.
- build/config.guess (0755 as done upstream)
- build/config.sub (0755 as done upstream)
- ext/*/?*.stub.php (0644)
- ext/mbstring/libmbfl/mbfl/mk_eaw_tbl.awk (0755 due to shebang usage)
HAVE_WIN32_NATIVE_THREAD, USE_WIN32_NATIVE_THREAD and ENABLE_THREADS
were once part of the libmbfl build https://github.com/moriyoshi/libmbfl
but are not used anymore.
The previous version of the GB-18030 standard was published in 2005.
This commit adds support for the updated (2022) version of this text
encoding. The existing GB18030 implementation has been left unchanged
for backwards compatibility; users who want to use the new standard
must explicitly indicate the desired text encoding is 'GB18030-2022'.
The document which defines GB18030-2022, published by the government
of the People's Republic of China, defines three levels of standards
compliance. This implementation is intended to achieve Implementation
Level 3, which is the highest level of compliance.
Experts in the GB18030 standard are requested to assess this
implementation and report any deviation from the standard.
For GB18030, it is not generally possible to identify character
boundaries without scanning through the entire string. Therefore,
implement mb_strcut using a similar strategy as the mblen_table based
implementation in mbstring.c. The difference is that for GB18030, we
need to look at two leading bytes to determine the byte length of a
multi-byte character.
The new implementation is 4-5x faster for short strings, and more than
10x faster for long strings. (Part of the reason why this new code has
such a great performance advantage is because it is replacing code
based on the older text conversion filters provided by libmbfl, which
were quite slow.)
The behavior is the same as before for valid GB18030 strings; for
some invalid strings, mb_strcut will choose different 'cut' points
as compared to before. (Clang's libFuzzer was used to compare the
old and new implementations, searching for test cases where they had
different behavior; no such cases were found.)
* Use binary search for cp932ext*_ucs_table lookups
A large amount of time is spent doing a linear search through these
tables in the CP932 encoding. Instead of that, we can add sorted
versions of these tables that also store the index of the non-sorted
version and perform a binary search on those sorted versions.
This reduces the time spent from 1.54s to 0.91s for the reference
benchmark [1].
[1] https://github.com/php/php-src/issues/12684#issuecomment-1813799924
* Fix search bounds
mbfl_name2encoding() uses a linear loop through the encodings, comparing
the name one by one, which is very slow. For the benchmark [1] just
looking up the name takes about 50% of run-time.
By using perfect hashing instead, we no longer have to loop over the
list, and the number of string comparisons is reduced to just a single
one. The perfect hashing table is generated using GNU gperf and amended
manually to fit in with mbstring and manually changed to reduce the
cache size.
[1] https://github.com/php/php-src/issues/12684#issuecomment-1813799924
Similar to the fast, specialized mb_strcut implementation for UTF-8
in 1f0cf133db, this new implementation of mb_strcut for UTF-16 strings
just examines a few bytes before each cut point.
Even for short strings, the new implementation is around 2x faster.
For strings around 10,000 bytes in length, it comes out about 100-500x
faster in my microbenchmarks.
The new implementation behaves identically to the old one on valid
UTF-16 strings; a fuzzer was used to help verify this.
...So conditionally including code which uses __builtin_usub_overflow
(for performance) if the macro is defined is not correct. We also need
to check if the macro is defined as a non-zero value.
Apparently this broke the build for a user whose C compiler is GCC
4.9.4. Sorry, user! That was my fault!
Thanks to Jakub Zelenka for reporting the issue.
The old implementation runs through the entire string to pick out the
part which should be returned by mb_strcut. This creates significant
performance overhead. The new specialized implementation of mb_strcut
for UTF-8 usually only examines a few bytes around the starting and
ending cut points, meaning it generally runs in constant time.
For UTF-8 strings just a few bytes long, the new implementation is
around 10% faster (according to microbenchmarks which I ran locally).
For strings around 10,000 bytes in length, it is 50-300x faster.
(Yes, that is 300x and not 300%.)
The new implementation behaves identically to the old one on VALID
UTF-8 strings; a fuzzer was used to help ensure this is the case.
On invalid UTF-8 strings, there is a difference: in some cases, the
old implementation will pass invalid byte sequences through unchanged,
while in others it will remove them. The new implementation has
behavior which is perhaps slightly more predictable: it simply backs
up the starting and ending cut points to the preceding "starter
byte" (one which is not a UTF-8 continuation byte).
I don't believe such a buffer overrun will ever occur, but just in
case the code is changed in the future, it will be good to have an
assertion here to help catch bugs. (A similar assertion is already
used in the UTF-7 version of this function.)
...So conditionally including code which uses __builtin_usub_overflow
(for performance) if the macro is defined is not correct. We also need
to check if the macro is defined as a non-zero value.
Apparently this broke the build for a user whose C compiler is GCC
4.9.4. Sorry, user! That was my fault!
Thanks to Jakub Zelenka for reporting the issue.
This bug was introduced in e837a8800b. In that commit, I increased the
performance of CP949 text conversion, but accidentally broke the case
where 0xC9 (illegal byte to start a character) is followed by a valid
character with a first byte less than 0xA1. The 'broken' behavior is
that both the 0xC9 byte and the following valid character would be
converted to error markers.
These (static) tables were defined in a header file, which was included
in two different .c files. That will result in two copies of the tables
being included in the PHP binary.
But the tables were only used in one of the two .c files. Move it where
it is used to avoid needlessly bloating the binary. (I checked in a
hex editor and confirmed that while the previous binary contained two
copies of these tables, it now only contains one.)
Conversion of SJIS-2004 text to UTF-8 using `mb_convert_encoding` is
now about 60% faster than before. (Many other mbstring functions will
also be faster now on SJIS-2004 text.)
This will make it easier to combine duplicated code between all the
CJK text encodings (a significant amount is already combined in this
commit, such as the repeated definitions of SJIS_DECODE and
SJIS_ENCODE), but I hope to remove even more redundancy in the future.
The table used to implement mb_strlen for CP932 has been changed to
the same table as "SJIS-win".
We're setting the encoding from PHP_FUNCTION(mb_strpos), but mbfl_strpos would
discard it, setting it to mbfl_encoding_pass, making zend_memnrstr fail due to a
null-pointer exception.
Fixes GH-11217
Closes GH-11220
Compiling in release mode with UBSAN gives me the following compiler warning:
```
In function ‘mb_wchar_to_sjismac’:
mbfilter_sjis.c:1419:89: warning: ‘i’ may be used uninitialized [-Wmaybe-uninitialized]
1419 | buf->state = (i << 24) | (index << 16) | (w & 0xFFFF);
| ^~
mbfilter_sjis.c:1398:42: note: ‘i’ was declared here
1398 | for (int i = 0; i < code_tbl_m_len; i++) {
| ^
```
Since the if condition will always be taken after the goto, we can get
rid of the warning by moving the label inside the if.
Signed-off-by: Alex Dowad <alexinbeijing@gmail.com>
For mb_parse_str, when mbstring.http_input (INI parameter) is a list of
multiple possible text encodings (which is not the case by default),
this new implementation is about 25% faster.
When mbstring.http_input is a single value, then nothing is changed.
(No automatic encoding detection is done in that case.)
This (rare) situation was already handled correctly for the 1st and 2nd
of every 3 codepoints in a Base64-encoded section of a UTF-7 string.
However, it was not handled correctly if it happened on the 3rd,
6th, 9th, etc. codepoint of such a Base64-encoded section.
Previously, mbstring used the same logic for encoding validation as for
encoding conversion.
However, there are cases where we want to use different logic for validation
and conversion. For example, if a string ends up with missing input
required by the encoding, or if a character is input that is invalid
as an encoding but can be converted, the conversion should succeed and
the validation should fail.
To achieve this, a function pointer mb_check_fn has been added to
struct mbfl_encoding to implement the logic used for validation.
Also, added implementation of validation logic for UTF-7, UTF7-IMAP,
ISO-2022-JP and JIS.
(The same change has already been made to PHP 8.2 and 8.3; see
6fc8d014df. This commit is backporting the change to PHP 8.1.)
When I built and tested 0779950768 locally, the build was successful
and all tests passed. However, in CI, some CI jobs are failing due to
compile errors. Fix those.
Previously, mbstring used the same logic for encoding validation as for
encoding conversion.
However, there are cases where we want to use different logic for validation
and conversion. For example, if a string ends up with missing input
required by the encoding, or if a character is input that is invalid
as an encoding but can be converted, the conversion should succeed and
the validation should fail.
To achieve this, a function pointer mb_check_fn has been added to
struct mbfl_encoding to implement the logic used for validation.
Also, added implementation of validation logic for UTF-7, UTF7-IMAP,
ISO-2022-JP and JIS.
The behavior of the new mb_encode_mimeheader implementation closely
follows the old implementation, except for three points:
• The old implementation was missing a call to the mbfl_convert_filter
flush function. So it would sometimes truncate the input string just
before its end.
• The old implementation would drop zero bytes when QPrint-encoding.
So for example, if you tried to QPrint-encode the UTF-32BE string
"\x00\x00\x12\x34", its QPrint-encoding would be "=12=34", which
does not decode to a valid UTF-32BE string. This is now fixed.
• In some rare corner cases, the new implementation will choose to
Base64-encode or QPrint-encode the input string, where the old
implementation would have just added newlines to it. Specifically,
this can happen when there is a non-space ASCII character, followed
by a large number of ASCII spaces, followed by a non-ASCII character.
The new implementation is around 2.5-8x faster than the old one,
depending on the text encoding and transfer encoding used. Performance
gains are greater with Base64 transfer encoding than with QPrint
transfer encoding; this is not because QPrint-encoding bytes is slow,
but because QPrint-encoded output is much bigger than Base64-encoded
output and takes more lines, so we have to go through the process of
finding the right place to break a line many more times.
* PHP-8.2:
Propagate error checks for mbfl_filt_conv_illegal_output()
Use CK() macro to check the output function in mbfilter_unicode2sjis_emoji_sb()
Make error checks on encoding methods for docomo, kddi, sb consistent