1
0
mirror of https://github.com/php/php-src.git synced 2026-04-04 22:52:40 +02:00
Commit Graph

795 Commits

Author SHA1 Message Date
Alex Dowad
5f8993bc28 Merge branch 'PHP-8.1'
* PHP-8.1:
  Reintroduce legacy 'SJIS-win' text encoding in mbstring
2022-08-16 20:47:04 +02:00
Alex Dowad
371367ce3e Reintroduce legacy 'SJIS-win' text encoding in mbstring
In e2459857af, I combined mbstring's "SJIS-win" text encoding
into CP932. This was done after doing some testing which appeared
to show that the mappings for "SJIS-win" were the same as those
for "CP932".

Later, it was found that there was actually a small difference
prior to e2459857af when converting Unicode to CP932. The
mappings for the following two codepoints were different:

        CP932  SJIS-win
U+203E  0x7E   0x81 0x50
U+00A5  0x5C   0x81 0x8F

As shown, mbstring's "CP932" mapped Unicode's 'OVERLINE' and
'YEN SIGN' to the ASCII bytes which have conflicting uses in
most legacy Japanese text encodings. "SJIS-win" mapped these
to equivalent JIS X 0208 fullwidth characters.

Since e2459867af was not intended to cause any user-visible
change in behavior, I am rolling back the merge of "CP932"
and "SJIS-win".

It seems doubtful whether these two text encodings should
be kept separate or merged in a future release. An extensive
discussion of the related historical background and
compatibility issues involved can be found in this
GitHub thread:

https://github.com/php/php-src/issues/8308
2022-08-16 20:18:54 +02:00
Alex Dowad
93207535fa Add test to exercise _php_mb_encoding_handler_ex with multiple possible input encodings
Thanks to Kamil Tekiela for pointing out that there was no test
case for this.
2022-08-16 16:43:38 +02:00
Alex Dowad
d9269becca Fix problems with ISO-2022-KR conversion
• The legacy conversion code did not emit an error marker if an
  escape sequence was truncated.

• BOTH old and new conversion code would shift from KSC5601
  (KS X 1001) mode to ASCII mode on an invalid escape sequence.
  This doesn't make any sense.
2022-08-16 16:43:27 +02:00
Alex Dowad
bfccdbd858 SJIS-Mobile#SOFTBANK string can end immediately after special escape sequence
SJIS-Mobile#SOFTBANK text encoding supports special escape sequences,
which shift the decoder into a mode where each single byte represents
an emoji. To get out of this mode, a 0xF (SHIFT OUT) byte can be
used.

After one of these special escape sequences, the new conversion
code expected to see at least one more byte. However, there doesn't
seem to be any particular reason why it should be treated as an
error condition if a string ends abruptly after one of these
escapes. Well, the escape sequence is useless in that case, but
it is a complete and valid escape sequence.

The legacy conversion code did allow a string to end immediately
after one of these escape sequences. Amend the new code to allow
the same.
2022-08-16 16:43:27 +02:00
Alex Dowad
983a29d3c0 Legacy conversion code for '7bit' to '8bit' inserts error markers
The use of a special 'vtbl' for converting between '7bit' and
'8bit' text meant that '7bit' text would not be converted to
wchars before going to '8bit'. This meant that the special
value MBFL_BAD_INPUT, which we use to flag an erroneous byte
sequence in input text (and which is required by functions
like mb_check_encoding), would pass directly to the output,
instead of being converted to the error marker specified
by mb_substitute_character.

This issue dates back to the time when I removed the mbfl
'identify filters' and made encoding validity checking and
encoding detection rely only on the conversion filters.
2022-08-16 16:43:27 +02:00
Alex Dowad
c6bd08530e Adjust number of error markers emitted for truncated ISO-2022-JP escape sequence
Fuzzing revealed a small difference between the number of error
markers which the legacy ISO-2022-JP and JIS7/8 conversion code
emitted for truncated escape sequences and those emitted by the
new code. The behavior of the old code seems more reasonable
here, so we will imitate it.
2022-08-16 16:43:27 +02:00
Alex Dowad
128768a450 Adjust number of error markers emitted for truncated UTF-8 code units
In 04e59c916f, I amended the UTF-8 conversion code, so that when given
invalid input, it would emit a number of errors markers harmonizing
with the WHATWG's specification of the standard UTF-8 decoding
algorithm. (Which, gentle reader of commit logs, you can find online
at https://encoding.spec.whatwg.org/#utf-8-decoder.) However, the code
in 04e59c916f was faulty in the case that a truncated UTF-8 code unit
starts with 0xF1.

Then, in dc1ba61d09, when making a small refactoring to a different
part of the UTF-8 conversion code, I inexplicably broke part of the
working code, causing the same fault which was already present with
truncated UTF-8 code units starting with 0xF1 to also occur with
0xF2 and 0xF3 as well. I don't remember what inane thoughts I was
thinking when I pulled off this feat of utter mental confusion.

None of these cases were covered by unit tests, by the way.

Thankfully, my trusty fuzzer picked up on this when testing the
new implementation of mb_parse_str (since the legacy UTF-8
conversion filter did not suffer from the same problem, and I was
fuzzing to find any differences in behavior between the old and
new implementations).

Fortuitously, the fuzzer also picked up another issue which was
present in 04e59c916f. I was emitting only one error marker for
truncated code units starting with 0xE0 or 0xED, in cases where
the WHATWG standard indicates two should be emitted. Examples
are 0xE0 0x9F <END OF STRING> or 0xED 0xA0 <END OF STRING>.

Code units starting with 0xE0-0xED should have 3 bytes. If the
first byte is 0xE0, the second MUST be 0xA0 or greater. (Otherwise,
the codepoint could have fit in a two-byte code unit.) And if the
first byte is 0xED, the second MUST be 0x9F or less. According to
the WHATWG algorithm, step 4, if the second byte is outside the
legal range, then the decoder should emit an error... AND
reprocess the out-of-range byte. The reprocessing will then
cause another error. That's why the decoder should indicate two
errors and not one.
2022-08-16 16:43:27 +02:00
Alex Dowad
a4656895dd Imitate legacy behavior when converting non-encodings using mbstring
Fuzzing revealed that something was missed here when making the new
encoding conversion code match the behavior of the old code. In the
next major release of PHP, support for these non-encodings will be
dropped, but in the meantime, it is better to match the legacy
behavior.
2022-08-16 16:43:27 +02:00
Alex Dowad
aeccb139c3 Use new encoding conversion filters for mb_parse_str and php_mb_post_handler
When micro-benchmarking on relatively short ASCII strings, the new
implementation was about 30% faster than the old one.
2022-08-16 16:43:26 +02:00
Christoph M. Becker
d013d94985 Fix GH-9248: Segmentation fault in mb_strimwidth()
We need to initialize the optional argument `trimmarker` with its
default value.

Closes GH-9273.
2022-08-08 18:35:37 +02:00
Alex Dowad
5370f344d2 mb_strimwidth inserts error markers in invalid input string (for backwards compatibility)
The old implementation did this. It also did the same to the
trim marker, if the trim marker was invalid in the specified
encoding, but I have not imitated that behavior (for performance).
2022-08-02 11:07:06 +02:00
Alex Dowad
7299096095 New implementation of mb_strimwidth
This new implementation of mb_strimwidth uses the new text
encoding conversion filters. Changes from the previous
implementation:

• mb_strimwidth allows a negative 'from' argument, which
should count backwards from the end of the string. However,
the implementation of this feature was buggy (starting right
from when it was first implemented).

It used the following code:

    if ((from < 0) || (width < 0)) {
        swidth = mbfl_strwidth(&string);
    }
    if (from < 0) {
        from += swidth;
    }

Do you see the bug? 'from' is a count of CODEPOINTS, but
'swidth' is a count of TERMINAL COLUMNS. Adding those two
together does not make sense. If there were no fullwidth
characters in the input string, then the two counts coincide
and the feature would work correctly. However, each
fullwidth character would throw the result off by one,
causing more characters to be skipped than was requested.

• mb_strimwidth also allows a negative 'width' argument,
which again counts backwards from the end of the string;
in this case, it is not determining the START of the portion
which we want to extract, but rather, the END of that portion.
Perhaps unsurprisingly, this feature was also buggy.

Code:

    if (width < 0) {
        width = swidth + width - from;
    }

'swidth + width' is fine here; the problem is '- from'.
Again, that is subtracting a count of CODEPOINTS from a
count of TERMINAL COLUMNS. In this case, we really need
to count the terminal width of the string prefix skipped
over by 'from', and subtract that rather than the number
of codepoints which are being skipped.

As a result, if a 'from' count was passed along with a
negative 'width', for every fullwidth character in the
skipped prefix, the result of mb_strimwidth was one
terminal column wider than requested.

Since these situations were covered by unit tests, you
might wonder why the bugs were not caught. Well, as far as
I can see, it looks like the author of the 'tests' just
captured the actual output of mb_strimwidth and defined it
as 'correct'. The tests were written in such a way that it
was difficult to examine them and see whether they made
sense or not; but a careful examination of the inputs and
outputs clearly shows that the legacy tests did not conform
to the documented contract of mb_strimwidth.

• The old implementation would always pass the input string
through decoding/encoding filters before returning it to
the caller, even if it fit within the specified width. This
means that invalid byte sequences would be converted to
error markers. For performance, the new implementation
returns the very same string which was passed in if it
does not exceed the specified width. This means that
erroneous byte sequences are not converted to error markers
unless it is necessary to trim the string.

• The same applies to the 'trim marker' string.

• The old implementation was buggy in the (unusual)
case that the trim marker is wider than the requested
maximum width of the result. It did an unsigned subtraction
of the requested width and the width of the trim marker. If the
width of the trim marker was greater, that subtraction would
underflow and yield a huge number. As a result, mb_strimwidth
would then pass the input string through, even if it was
far wider than the requested maximum width.

In that case, since the input string is wider than the
requested width, and NONE of it will fit together with the
trim marker, the new implementation returns just the trim
marker. This is the one case where the output can be wider
than the requested width: when BOTH the input string and
also the trim marker are too wide.

• Since it passed the input string and trim marker through
decoding/encoding filters, when using "Quoted-Printable" as
the encoding, newlines could be inserted into the trim marker
to maintain the maximum line length for QP.

This is an extremely bizarre use case and I don't think there
is any point in worrying about it. QP will be removed from
mbstring in time, anyways.

PERFORMANCE:

• From micro-benchmarking with various input string lengths and
text encodings, it appears that the new implementation is 2-3x
faster for UTF-8 and UTF-16. For legacy Japanese text encodings
like ISO-2022-JP or SJIS, the new implementation is perhaps 25%
faster.

• Note that correctly implementing negative 'from' and 'width'
arguments imposes a small performance burden in such cases; one
which the old implementation did not pay. This slightly skews
benchmarking results in favor of the old implementation. However,
even so, the new implementation is faster in all cases which I
tested.
2022-08-02 11:07:06 +02:00
Christoph M. Becker
3e922bf08f Merge branch 'PHP-8.1'
* PHP-8.1:
  Fix GH-9008: mb_detect_encoding(): wrong results with null $encodings
2022-07-20 17:01:42 +02:00
Christoph M. Becker
c2bdaa48e1 Fix GH-9008: mb_detect_encoding(): wrong results with null $encodings
Passing `null` to `$encodings` is supposed to behave like passing the
result of `mb_detect_order()`.  Therefore, we need to remove the non-
encodings from the `elist` in this case as well.  Thus, we duplicate
the global `elist`, so we can modify it.

Closes GH-9063.
2022-07-20 16:58:55 +02:00
Alex Dowad
6d525a425e Fix legacy conversion filter for ISO-2022-KR 2022-07-20 07:44:20 +02:00
Alex Dowad
9ac49c0dd3 New implementation of mb_convert_kana
mb_convert_kana now uses the new text encoding conversion
filters. Microbenchmarking shows speed gains of 50%-150%
across various text encodings and input string lengths.

The behavior is the same as the old mb_convert_kana
except for one fix: if the 'zero codepoint' U+0000 appeared
in the input, the old implementation would sometimes drop
it, not passing it through to the output. This is now
fixed.
2022-07-20 07:44:19 +02:00
Eric Norris
09237f6126 Update request startup error messages 2022-07-18 23:19:59 +01:00
Alex Dowad
76a92c26e3 mb_decode_numericentity decodes valid entities which are truncated at end of string
Since mb_decode_numericentity does not require all HTML entities
to end with ';', but allows them to be terminated by ANY non-digit
character, it doesn't make sense that valid entities which butt
up against the end of the input string are not converted.

As it turned out, supporting this case also made it possible
to simplify the code nicely.
2022-07-18 15:11:47 +02:00
Alex Dowad
5d6bd557b3 mb_decode_numericentity converts entities which immediately follow a valid/invalid entity
Thanks to Kamil Tieleka for suggesting that some of the behaviors of
the legacy implementation which the new mb_decode_numericentity
implementation took care to maintain were actually bugs and should
be fixed. Thanks also to Trevor Rowbotham for providing a link to
the HTML specification, showing how HTML numeric entities should
be interpreted.

mb_decode_numericentity now processes numeric entities in the
following situations where the old implementation would not:

- &<ENTITY> (for example, &&#65;)
- &#<ENTITY>
- &#x<ENTITY>
- <VALID BUT UNTERMINATED DECIMAL ENTITY><ENTITY> (for example, &#65&#65;)
- <VALID BUT UNTERMINATED HEX ENTITY><ENTITY>
- <INVALID AND UNTERMINATED DECIMAL ENTITY><ENTITY> (it does not matter why
  the first entity is invalid; the value could be too big, it could have
  too many digits, or it could not match the 'convmap' parameter)
- <INVALID AND UNTERMINATED HEX ENTITY><ENTITY>

This is consistent with the way that web browsers process
HTML entities.
2022-07-18 15:11:32 +02:00
Alex Dowad
40f5048aa7 Fix new conversion filter for UUEncode
This code (written by yours truly) was very broken on input
strings long enough to require processing in multiple chunks.
Fuzzing revealed this very quickly; after initial rework,
further fuzzing also found a couple of very obscure bugs in
corner cases.
2022-07-18 15:11:32 +02:00
Alex Dowad
5fee30b630 Fix new conversion filter for QPrint (same order of check as legacy code)
Because of checking for maximum line length *before* certain other checks,
the new conversion filter for QPrint could produce different results from
the old one in some cases. This was discovered while fuzzing the new
implementation of mb_decode_numericentity.
2022-07-18 15:11:32 +02:00
Alex Dowad
3cf432798e Fix new conversion filter for CP50220 (multi-codepoint kana at end of buffer)
If two codepoints which needed to be collapsed into a single kuten code
were separated, with one at the end of one buffer and the other at the
beginning of the next buffer, they were not converted correctly.
This was discovered while fuzzing the new implementation of
mb_decode_numericentity.
2022-07-18 15:11:31 +02:00
Alex Dowad
7559bf77d2 Fix new conversion filters for mobile SJIS variants ('0' at end of buffer)
Previously, I had adjusted this code so that if a character which could
be part of a special Docomo/Softbank/KDDI 'keypad' emoji appeared at
the end of one buffer, and the 'keypad' character appeared at the
beginning of the next, they would still be combined. However, this
broke the handling of such a character appearing at the end of one
buffer, and a character which is NOT 'keypad' appearing at the
beginning of the next.

This was found while fuzzing the new implementation of
mb_decode_numericentity.
2022-07-18 15:11:31 +02:00
Alex Dowad
fa83a8e15e Fix new conversion filter for HTML entities
While fuzzing the new mb_decode_numericentity implementation, I discovered
that the fast conversion filter for 'HTML-ENTITIES' did not correctly
handle an empty named entity ('&;'), nor did it correctly handle
invalid named entities whose names were a prefix of a valid entity.
Also, it did not correctly handle the case where a named entity is
truncated and another named entity starts abruptly.
2022-07-18 15:11:31 +02:00
Alex Dowad
9c3972fb3d Fix legacy conversion filter for HZ 2022-07-18 15:11:31 +02:00
Alex Dowad
1526bab6d0 Fix legacy conversion filter for GB18030 2022-07-18 15:11:31 +02:00
Alex Dowad
6938e35122 Fix legacy conversion filter for CP50220 2022-07-18 15:11:31 +02:00
Alex Dowad
91969e908f New implementation of mb_{de,en}code_numericentity
This new implementation uses the new encoding conversion filters.
Aside from fewer LOC and (hopefully) improved readability,
the differences are as follows:

BEHAVIOR CHANGES:

- The old implementation used signed arithmetic when operating
on the 'convmap'. This meant that results could be surprising when
using convmap entries with 1 in the MSB. Further, types like 'int'
were used rather than those with a specific bit width, such as
'int32_t'. This meant that results could also depend on the
platform width of an 'int'.

Now unsigned arithmetic is used, with explicit bit widths.

- Similarly, while converting decimal numeric entities, the
legacy implementation would ensure that the value never overflowed
INT_MAX, and if it did, the entity would be treated as invalid
and passed through unconverted.

However, that again means that results depend on the platform
size of an 'int'. So now, we use a value with explicit bit width
(32 bits) to hold the value of a deconverted decimal entity, and
ensure that the entity value does not overflow that.

Further, because we are using an UNSIGNED 32-bit value rather
than a signed one, the ceiling for how large a decimal entity
can be is higher now.

All of this will probably not affect anyone, since Unicode
codepoints above U+10FFFF are invalid anyways. To see the
difference, you need to be using a text encoding like UCS-4,
which allows huge 'codepoints'.

- If it saw something which looked like a hex entity, but
turned out not to be a valid numeric entity, the old
implementation would sometimes convert the hexadecimal
digits a-f to A-F (uppercase). The new implementation passes
invalid numeric entities through without performing case
conversion.

- The old implementation of mb_encode_numericentity was
limited in how many decimal/hex digits it could emit.
If a text encoding like UCS-4 was in use, where 'codepoints'
can have huge values (larger than the valid range
stipulated by the Unicode standard), it would not error
out on a 'codepoint' whose value was too large for it,
but would rather mangle the value and emit a numeric
entity which decoded to some other random codepoint.
The new implementation is able to emit enough digits to
express any value which fits in 32 bits.

PERFORMANCE:

Based on micro-benchmarks run on my development machine:

Decoding numeric HTML entities is about 4 times faster, for
both decimal and hexadecimal entities, across a variety of
input string lengths. Encoding is about 3 times faster.
2022-07-18 15:11:30 +02:00
jcm
dbdef4a55c QA -mb_convert_encoding_array - error for object item in array
Closes GH-9023.
2022-07-15 17:34:35 +02:00
jcm
30d89b19cf QA - mb_http_input - function returns FALSE for type 'L' or 'l'
Closes GH-9018.
2022-07-15 14:22:39 +02:00
Christoph M. Becker
85a95a2982 Merge branch 'PHP-8.1'
* PHP-8.1:
  Restore backwards-compatible mappings of 0x5C and 0x7E in SJIS
2022-06-11 16:32:33 +02:00
Alex Dowad
d62f535caa Restore backwards-compatible mappings of 0x5C and 0x7E in SJIS
According to the relevant Japan Industrial Standards Committee standards,
SJIS 0x5C is a Yen sign, and 0x7E is an overline.

However, this conflicts with the implementation of SJIS in various legacy
software (notably Microsoft products), where SJIS 0x5C and 0x7E are taken
as equivalent to the same ASCII bytes.

Prior to PHP 8.1, mbstring's implementation of SJIS handled these bytes
compatibly with Microsoft products. This was changed in PHP 8.1.0, in an
attempt to comply with the JISC specifications. However, after discussion
with various concerned Japanese developers, it seems that the historical
behavior was more useful in the majority of applications which process
SJIS-encoded text.

Since we are now treating SJIS 0x5C as equivalent to U+005C and 0x7E as
equivalent to U+007E, it does not make sense to convert U+203E (OVERLINE)
to 0x7E, nor does it make sense to convert U+00A5 (YEN SIGN) to 0x5C. Restore
the mappings for those codepoints from before PHP 8.1.0.

Fixes GH-8281.
2022-06-11 16:31:47 +02:00
Alex Dowad
2dc9026cbc Restore backwards-compatible mappings of 0x5C and 0x7E in SJIS
According to the relevant Japan Industrial Standards Committee standards,
SJIS 0x5C is a Yen sign, and 0x7E is an overline.

However, this conflicts with the implementation of SJIS in various legacy
software (notably Microsoft products), where SJIS 0x5C and 0x7E are taken
as equivalent to the same ASCII bytes.

Prior to PHP 8.1, mbstring's implementation of SJIS handled these bytes
compatibly with Microsoft products. This was changed in PHP 8.1.0, in an
attempt to comply with the JISC specifications. However, after discussion
with various concerned Japanese developers, it seems that the historical
behavior was more useful in the majority of applications which process
SJIS-encoded text.

Since we are now treating SJIS 0x5C as equivalent to U+005C and 0x7E as
equivalent to U+007E, it does not make sense to convert U+203E (OVERLINE)
to 0x7E, nor does it make sense to convert U+00A5 (YEN SIGN) to 0x5C. Restore
the mappings for those codepoints from before PHP 8.1.0.
2022-06-10 21:04:36 +02:00
Alex Dowad
a789088527 Add more tests for mbstring encoding conversion
When testing the preceding commits, I used a script to generate a large
number of random strings and try to find strings which would yield
different outputs from the new and old encoding conversion code.
Some were found. In most cases, analysis revealed that the new code
was correct and the old code was not.

In all cases where the new code was incorrect, regression tests were
added. However, there may be some value in adding regression tests
for cases where the old code was incorrect as well. That is done here.

This does not cover every case where the new and old code yielded
different results. Some of them were very obscure, and it is proving
difficult even to reproduce them (since I did not keep a record of
all the input strings which triggered the differing output).
2022-05-28 21:53:38 +02:00
Alex Dowad
4afa72126e Implement fast text conversion interface for QPrint 2022-05-28 21:53:37 +02:00
Alex Dowad
3fda9f5095 Implement fast text conversion interface for HTML-ENTITIES 2022-05-28 21:53:37 +02:00
Alex Dowad
85690ae26d Implement fast text conversion interface for Base64 2022-05-28 21:53:37 +02:00
Alex Dowad
7c2587b1f6 Implement fast text conversion interface for UUENCODE 2022-05-28 21:53:37 +02:00
Alex Dowad
321dbd0413 Implement fast text conversion interface for ISO-2022-JP-2004
There were bugs in the legacy implementation. Lots of them.

It did not properly track whether it has switched to JISX 0213 plane 1
or plane 2. If it processes a character in plane 1 and then immediately
one in plane 2, it failed to emit the escape code to switch to plane 2.

Further, when converting codepoints from 0x80-0xFF to ISO-2022-JP-2004,
the legacy implementation would totally disregard which mode it was
operating in. Such codepoints would pass through directly to the output
without any escape sequences being emitted.

If that was not enough, all the legacy implementations of JISX 0213:2004
encodings had another common bug; their 'flush function' did not call
the next flush function in the chain of conversion filters. So if any
of these encodings were converted to an encoding where the flush
function was needed to finish the output string, then the output
would be truncated.
2022-05-28 21:53:36 +02:00
Alex Dowad
67d83f57c1 Implement fast text conversion interface for mobile SJIS variants 2022-05-28 21:53:36 +02:00
Alex Dowad
0d635d93f5 Implement fast text conversion interface for UTF7-IMAP
The old code would convert a 0x00 byte in the input to 0x00 in the
output, but this clearly violates the RFC which defines UTF7-IMAP.
2022-05-28 21:53:35 +02:00
Alex Dowad
6cf30356e0 Implement fast text conversion interface for SJIS-mac 2022-05-28 21:53:35 +02:00
Alex Dowad
c9479899c6 Implement fast text conversion interface for ISO-2022-KR
When working on this, I read RFC 1557 again and realized that the
comment at the top of the file was totally mistaken. Further, the
legacy code did not obey the RFC. (It would emit the "ESC $ ) C"
sequence anywhere, not just at the beginning of a line as the RFC
requires.)

The new code obeys the RFC; one quirk is that it always emits the
escape sequence at the beginning of each output string, even if the
string is completely ASCII (in which case the escape sequence is
allowed, but not required).

The new code doesn't always generate the same number of error markers
for invalid escapes as the old code did.

The old code could not emit the special KDDI emoji for national flags.

Further, there was a bug in the test which the old code used to
determine whether an 0xF byte should be emitted at the end of a string
(to switch back to ASCII mode). As a result, it would not always switch
back to ASCII mode, meaning that it was not always safe to concatenate
the resulting strings.
2022-05-28 21:53:35 +02:00
Alex Dowad
8b70a7db4f Merge branch 'PHP-8.1'
* PHP-8.1:
  mb_detect_encoding recognizes all letters in Hungarian alphabet
2022-05-25 13:10:23 +02:00
Alex Dowad
58d0aad75c mb_detect_encoding recognizes all letters in Hungarian alphabet 2022-05-25 08:22:07 +02:00
Alex Dowad
212b31b51c Merge branch 'PHP-8.1'
* PHP-8.1:
  mb_detect_encoding recognizes all letters in Czech alphabet
2022-05-25 07:53:39 +02:00
Alex Dowad
6a4b6d2344 mb_detect_encoding recognizes all letters in Czech alphabet 2022-05-25 07:52:39 +02:00
Alex Dowad
83db088fc2 Merge branch 'PHP-8.1'
* PHP-8.1:
  Fix mb_detect_encoding's recognition of Slavic names
2022-05-24 15:33:24 +02:00
Alex Dowad
9bb97ee8bc Fix mb_detect_encoding's recognition of Slavic names
Thanks to Côme Chilliet for reporting that mb_detect_encoding was not
detecting the desired text encoding for strings containing š or Ž.
These characters are used in Czech, Serbian, Croatian, Bosnian,
Macedonian, etc. names.
2022-05-24 15:32:20 +02:00