mirror of https://github.com/php/php-src.git synced 2026-04-29 11:13:36 +02:00
195 Commits

Author SHA1 Message Date
pakutoma b721d0f71e Fix GH-10648: add check function pointer into mbfl_encoding
Previously, mbstring used the same logic for encoding validation as for
encoding conversion.

However, there are cases where we want to use different logic for validation
and conversion. For example, if a string ends before the input required by
the encoding is complete, or if it contains a character which is invalid in
the encoding but can still be converted, the conversion should succeed and
the validation should fail.

To achieve this, a function pointer mb_check_fn has been added to
struct mbfl_encoding to implement the logic used for validation.
Validation logic has also been implemented for UTF-7, UTF7-IMAP,
ISO-2022-JP and JIS.

(The same change has already been made to PHP 8.2 and 8.3; see
6fc8d014df. This commit is backporting the change to PHP 8.1.)
2023-03-25 09:52:10 +02:00
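The split between conversion and validation described in b721d0f71e can be illustrated with a small Python sketch. This is not mbstring's C code: the `Encoding` class, its `convert` and `check` fields, and the use of UTF-8 as the example encoding are all assumptions made for illustration. The only idea taken from the commit is that an encoding may carry a dedicated validation function alongside its conversion logic.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Encoding:
    """Toy stand-in for an encoding descriptor (hypothetical, not mbfl_encoding)."""
    name: str
    convert: Callable[[bytes], str]                   # lenient: always yields output
    check: Optional[Callable[[bytes], bool]] = None   # strict: optional validator


def lenient_utf8(data: bytes) -> str:
    # Conversion succeeds even on bad input; bad bytes become U+FFFD.
    return data.decode("utf-8", errors="replace")


def strict_utf8(data: bytes) -> bool:
    # Validation rejects anything the strict decoder refuses.
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def is_valid(enc: Encoding, data: bytes) -> bool:
    # Prefer the dedicated validator when the encoding provides one.
    if enc.check is not None:
        return enc.check(data)
    # Crude fallback that reuses conversion: treat any replacement
    # character in the output as a sign of invalid input.
    return "\ufffd" not in enc.convert(data)


utf8 = Encoding("UTF-8", lenient_utf8, strict_utf8)
truncated = "日本".encode("utf-8")[:-1]   # last character is missing a byte
```

With this sketch, `utf8.convert(truncated)` still returns a string, while `is_valid(utf8, truncated)` returns `False`.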
NathanFreeman fa0401b0b5 Fix GH-9535 (unintended behavior change for mb_strcut in PHP 8.1)
The existing implementation of mb_strcut extracts part of a
multi-byte encoded string by pulling out raw bytes and then running
them through a conversion filter to ensure that the output is valid
in the requested encoding.

If the conversion filter emits error markers when doing the final
'flush' operation which ends the conversion of the extracted bytes,
these error markers may (in some cases) be included in the output.
The conversion operation does not respect the value of
mb_substitute_character; rather, it always uses '?' as an error marker.
So this issue manifests itself as unwanted '?' characters being
inserted into the output.

This issue has existed for a long time, but became noticeable in PHP
8.1 because for at least some of the supported text encodings, mbstring
is now more strict about emitting error markers when strings end in an
illegal state.

The simplest fix is to suppress error markers during the final flush
operation.

While working on a fix for this problem, another problem with mb_strcut
was discovered; since it decides when to stop consuming bytes from
the input by looking at the byte length of its OUTPUT, anything which
causes extra bytes to be emitted to the output may cause mb_strcut to
not consume all the bytes in the requested range.

The one case where we DO emit extra output bytes is for encodings
which have a selectable mode, like ISO-2022-JP; if a string in such
an encoding ends in a mode which is not the default, we emit an ending
escape sequence which changes back to the default mode. This is done
so that concatenating strings in such encodings is safe.

However, as mentioned, this can cause the output of mb_strcut to be
shorter than it logically should be. This bug has existed for a long
time, and fixing it now will be a BC break, so we may not fix it right
away.

Therefore, tests for THIS fix which don't pass because of that OTHER
bug have been split out into a separate test file (gh9535b.phpt), and
that file has been marked XFAIL.
2022-11-13 14:37:55 +02:00
Alex Dowad 371367ce3e Reintroduce legacy 'SJIS-win' text encoding in mbstring
In e2459857af, I combined mbstring's "SJIS-win" text encoding
into CP932. This was done after doing some testing which appeared
to show that the mappings for "SJIS-win" were the same as those
for "CP932".

Later, it was found that there was actually a small difference
prior to e2459857af when converting Unicode to CP932. The
mappings for the following two codepoints were different:

        CP932  SJIS-win
U+203E  0x7E   0x81 0x50
U+00A5  0x5C   0x81 0x8F

As shown, mbstring's "CP932" mapped Unicode's 'OVERLINE' and
'YEN SIGN' to the ASCII bytes which have conflicting uses in
most legacy Japanese text encodings. "SJIS-win" mapped these
to equivalent JIS X 0208 fullwidth characters.

Since e2459857af was not intended to cause any user-visible
change in behavior, I am rolling back the merge of "CP932"
and "SJIS-win".

It is unclear whether these two text encodings should
be kept separate or merged in a future release. An extensive
discussion of the related historical background and
compatibility issues involved can be found in this
GitHub thread:

https://github.com/php/php-src/issues/8308
2022-08-16 20:18:54 +02:00
Christoph M. Becker 5003831260 Merge branch 'PHP-8.0' into PHP-8.1
* PHP-8.0:
  Fix GH-8208: mb_encode_mimeheader: $indent functionality broken
2022-03-17 17:34:31 +01:00
Christoph M. Becker d0417ebc93 Fix GH-8208: mb_encode_mimeheader: $indent functionality broken
We also need to factor in the indent, when getting the encoder result.

Closes GH-8213.
2022-03-17 17:31:58 +01:00
Christoph M. Becker 929d847152 Fix #81693: mb_check_encoding(7bit) segfaults
`php_mb_check_encoding()` now uses conversion to `mbfl_encoding_wchar`.
Since `mbfl_encoding_7bit` has no `input_filter`, no filter can be
found.  Since we don't actually need to convert to wchar, we encode to
8bit.

Closes GH-7712.
2021-12-03 22:49:47 +01:00
Alex Dowad 28b346bc06 Improve detection accuracy of mb_detect_encoding
Originally, `mb_detect_encoding` essentially just checked all candidate
encodings to see which ones the input string was valid in. However, it
was only able to do this for a limited few of all the text encodings
which are officially supported by mbstring.

In 3e7acf901d, I modified it so it could 'detect' any text encoding
supported by mbstring. While this is arguably an improvement, if the
only text encodings one is interested in are those which
`mb_detect_encoding` could originally handle, the old
`mb_detect_encoding` may have been preferable. Because the new one has
more possible encodings which it can guess, it also has more chances to
get the answer wrong.

This commit adjusts the detection heuristics to provide accurate
detection in a wider variety of scenarios. While the previous detection
code would frequently confuse UTF-32BE with UTF-32LE or UTF-16BE with
UTF-16LE, the adjusted code is extremely accurate in those cases.
Detection for Chinese text in Chinese encodings like GB18030 or BIG5
and for Japanese text in Japanese encodings like EUC-JP or SJIS is
greatly improved. Detection of UTF-7 is also greatly improved. An 8KB
table, with one bit for each codepoint from U+0000 up to U+FFFF, is
used to achieve this.

One significant constraint is that the heuristics are completely based
on looking at each codepoint in a string in isolation, treating some
codepoints as 'likely' and others as 'unlikely'. It might still be
possible to achieve great gains in detection accuracy by looking at
sequences of codepoints rather than individual codepoints. However,
this might require huge tables. Further, we might need a huge corpus
of text in various languages to derive those tables.

Accuracy is still dismal when trying to distinguish single-byte
encodings like ISO-8859-1, ISO-8859-2, KOI8-R, and so on. This is
because the valid bytes in these encodings are basically all the same,
and all valid bytes decode to 'likely' codepoints, so our method of
detection (which is based on rating codepoints as likely or unlikely)
cannot tell any difference between the candidates at all. It just
selects the first encoding in the provided list of candidates.

Speaking of which, if one wants to get good results from
`mb_detect_encoding`, it is important to order the list of candidate
encodings according to your prior belief of which are more likely to
be correct. When the function cannot tell any difference between two
candidates, it returns whichever appeared earlier in the array.
2021-10-19 18:05:51 +02:00
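The per-codepoint scoring and the tie-breaking rule described in 28b346bc06 can be sketched as follows. The size of the table (one bit per codepoint from U+0000 to U+FFFF, 8KB in total) comes from the commit message; the contents of `LIKELY_RANGES`, the penalty formula and the function names are assumptions, and Python's own codecs stand in for mbstring's conversion filters.

```python
from typing import List, Optional

LIKELY_RANGES = [            # hypothetical choice of 'likely' codepoints
    (0x0020, 0x007E),        # printable ASCII
    (0x3040, 0x30FF),        # hiragana and katakana
    (0x4E00, 0x9FFF),        # CJK unified ideographs
]

# One bit per BMP codepoint: 0x10000 bits / 8 = 8192 bytes (8KB).
table = bytearray(0x10000 // 8)
for lo, hi in LIKELY_RANGES:
    for cp in range(lo, hi + 1):
        table[cp >> 3] |= 1 << (cp & 7)


def is_likely(cp: int) -> bool:
    # Codepoints above U+FFFF are outside the table and count as unlikely.
    return cp <= 0xFFFF and bool(table[cp >> 3] & (1 << (cp & 7)))


def detect(data: bytes, candidates: List[str]) -> Optional[str]:
    best, best_penalty = None, None
    for name in candidates:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue                      # not valid in this encoding
        # Each codepoint is rated in isolation, as the commit describes.
        penalty = sum(not is_likely(ord(ch)) for ch in text)
        # Strict '<' keeps the earlier candidate when penalties are equal.
        if best_penalty is None or penalty < best_penalty:
            best, best_penalty = name, penalty
    return best
```

In this sketch, `detect("日本語".encode("utf-16-le"), ["utf-16-be", "utf-16-le"])` picks `"utf-16-le"`, because the big-endian reading decodes to unlikely codepoints. For `b"caf\xe9"` with candidates `["iso-8859-1", "iso-8859-2"]` the two penalties are equal, so the first candidate in the list is returned.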
Alex Dowad c25a1ef8d0 Bug #81390: mb_detect_encoding should not prematurely stop processing input
As a performance optimization, mb_detect_encoding tries to stop
processing the input string early when there is only one 'candidate'
encoding which the input string is valid in. However, the code which
keeps count of how many candidate encodings have already been rejected
was buggy. This caused mb_detect_encoding to prematurely stop
processing the input when it should have continued.

As a result, it did not notice that in the test case provided by Alec,
the input string was not valid in UTF-16.
2021-09-20 11:21:39 +02:00
Alex Dowad 6acd4f7f3a Optimize text encoding detection for speed (eliminate Unicode property lookups)
...By just testing whether the input codepoints fall within a few fixed
ranges instead. This avoids hash lookups in property tables.

From (micro-)benchmarking on my PC, this looks to be a bit less than 4x
faster than the existing code.
2021-09-20 11:20:53 +02:00
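The idea in 6acd4f7f3a of replacing property lookups with fixed range tests can be shown with a short Python comparison. Which properties mbstring actually consulted is not stated in the commit; the choice of control characters (category `Cc`) and private use codepoints (category `Co`) here is an assumption, and both function names are hypothetical.

```python
import unicodedata


def is_suspicious_by_property(cp: int) -> bool:
    # Slow path: consult Unicode property data for every codepoint.
    return unicodedata.category(chr(cp)) in ("Cc", "Co")


def is_suspicious_by_range(cp: int) -> bool:
    # Fast path: a handful of integer comparisons, no table lookup.
    # Only covers the BMP (U+0000 to U+FFFF).
    return (
        cp <= 0x1F                      # C0 controls
        or 0x7F <= cp <= 0x9F           # DEL and C1 controls
        or 0xE000 <= cp <= 0xF8FF       # BMP private use area
    )
```

Within the BMP the two functions agree on every codepoint, so the range version can replace the lookup without changing results.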
Colin O'Dell fe36b81d5e Update Unicode tables to 14.0.0
Closes GH-7502.
2021-09-20 09:58:20 +02:00
Alex Dowad f303fc8a9b Use bool in mbfl_filt_conv_output_hex (rather than int) 2021-08-31 13:41:34 +02:00
Alex Dowad 776296e12f mbstring no longer provides 'long' substitutions for erroneous input bytes
Previously, mbstring had a special mode whereby it would convert
erroneous input byte sequences to output like "BAD+XXXX", where "XXXX"
would be the erroneous bytes expressed in hexadecimal. This mode could
be enabled by calling `mb_substitute_character("long")`.

However, accurately reproducing input byte sequences from the cached
state of a conversion filter is often tricky, and this significantly
complicates the implementation. Further, the means used for passing
the erroneous bytes through to where the "BAD+XXXX" text is generated
only allows for up to 3 bytes to be passed, meaning that some erroneous
byte sequences are truncated anyways.

More to the point, a search of publicly available PHP code indicates
that nobody is really using this feature anyways.

Incidentally, this feature also provided error output like "JIS+XXXX"
if the input 'should have' represented a JISX 0208 codepoint, but it
decodes to a codepoint which does not exist in the JISX 0208 charset.
Similarly, specific error output was provided for non-existent
JISX 0212 codepoints, and likewise for JISX 0213, CP932, and a few
other charsets. All of that is now consigned to the flames.

However, "long" error markers also include a somewhat more useful
"U+XXXX" marker for Unicode codepoints which were successfully
decoded from the input text, but cannot be represented in the output
encoding. Those are still supported.

With this change, there is no need to use a variety of special values
in the high bits of a wchar to represent different types of error
values. We can (and will) just use a single error value. This will be
equal to -1.

One complicating factor: Text conversion functions return an integer to
indicate whether the conversion operation should be immediately
aborted, and the magic 'abort' marker is -1. Also, almost all of these
functions would return the received byte/codepoint to indicate success.
That doesn't work with the new error value; if an input filter detects
an error and passes -1 to the output filter, and the output filter
returns it back, that would be taken to mean 'abort'.

Therefore, amend all these functions to return 0 for success.
2021-08-31 13:41:34 +02:00
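The collision between the new error value and the 'abort' return code described in 776296e12f can be sketched in Python. The values -1 (error marker), -1 (abort) and 0 (success) come from the commit message; the function names and the list-based output are hypothetical simplifications of a filter chain.

```python
ERROR_MARKER = -1      # single wchar value meaning "bad input"
ABORT = -1             # return value meaning "stop converting now"


def output_old(c: int, out: list) -> int:
    # Old convention: echo the received value back to signal success.
    out.append("?" if c == ERROR_MARKER else chr(c))
    return c


def output_new(c: int, out: list) -> int:
    # New convention: always return 0 on success.
    out.append("?" if c == ERROR_MARKER else chr(c))
    return 0


def convert(codepoints, output_fn) -> str:
    out = []
    for c in codepoints:
        if output_fn(c, out) == ABORT:
            break          # caller believes the filter asked to abort
    return "".join(out)


stream = [0x41, ERROR_MARKER, 0x42]   # 'A', one undecodable sequence, 'B'
```

With the old convention, `convert(stream, output_old)` stops after the error marker and returns `"A?"`. With the new one, `convert(stream, output_new)` processes the whole stream and returns `"A?B"`.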
Alex Dowad 97b7fc893c Output illegal character marker for 4-byte illegal characters > 0x7FFFFFFF
Some text encodings supported by mbstring (such as UCS-4) accept 4-byte
characters. When mbstring encounters an illegal byte sequence for the
encoding it is using, it should emit an 'illegal character' marker,
which can either be a single character like '?', an HTML hexadecimal
entity, or a marker string like 'BAD+XXXX'.

Because of the use of signed integers to hold 4-byte characters,
illegal 4-byte sequences with a 'negative' value (one with the high
bit set) were not handled correctly when emitting the illegal char
marker. The result is that such illegal sequences were just skipped
over (and the marker was not emitted to the output). Fix that.
2021-08-30 16:29:58 +02:00
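The signed-integer problem described in 97b7fc893c can be modelled in Python. This is a hypothetical model, not the actual mbstring code path: `decode_ucs4be` and its `signed_bug` switch are invented for illustration, with the buggy branch simply dropping negative values the way the commit says illegal sequences were skipped.

```python
MARKER = "?"


def decode_ucs4be(data: bytes, signed_bug: bool = False) -> str:
    out = []
    # Trailing bytes that do not form a full 4-byte unit are ignored here.
    for i in range(0, len(data) - len(data) % 4, 4):
        # signed=True mimics storing the character in a signed 32-bit int.
        c = int.from_bytes(data[i:i + 4], "big", signed=True)
        if 0 <= c <= 0x10FFFF and not 0xD800 <= c <= 0xDFFF:
            out.append(chr(c))
        elif c < 0 and signed_bug:
            continue                 # buggy variant: negative values vanish
        else:
            out.append(MARKER)       # fixed: every illegal value is marked
    return "".join(out)


data = b"\x00\x00\x00A" + b"\xff\xff\xff\xff" + b"\x00\x00\x00B"
```

`decode_ucs4be(data, signed_bug=True)` returns `"AB"`, silently losing the illegal sequence, while `decode_ucs4be(data)` returns `"A?B"`.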
Nikita Popov 634f2e21d3 Don't expose wchar encoding to users (#7415)
The "wchar" encoding isn't really an encoding -- it's what we
internally use as the representation of decoded characters.

In practice, it tends to behave a lot like the 8bit encoding when
used from userland, because input code units end up being treated
as code points.

This patch removes the wchar encoding from the public encoding
list and reserves it for internal use only.
2021-08-30 11:11:33 +02:00
Nikita Popov 43cb2548f7 Flush filter during non-strict encoding detection
If we reach the end of the string without reducing to a single
encoding, then we should flush to check whether the last character
is incomplete.
2021-08-27 14:48:32 +02:00
Nikita Popov a1c1ee6a48 Don't use opaque for encoding detection score
opaque is used by the htmlentities filter, which means that we
end up trying to free the score value as a pointer. Don't try to
be overly tricky here and simply allocate a separate structure
to hold the number of illegal characters and the score.
2021-07-28 10:54:27 +02:00
Nikita Popov 9d0db2e98a Fixed bug #81298
Creation of the filter may fail for some special encodings, for
which detection is not supported.
2021-07-28 10:11:46 +02:00
Alex Dowad 26fc7c4256 Fix typo in mbfilter.h
As pointed out by Bruno Haible (https://haible.de/bruno).
2021-07-19 12:17:00 +02:00
Alex Dowad e2459857af Remove duplicate implementation of CP932 from mbstring
Sigh. Double sigh. After fruitlessly searching the Internet for information on
this mysterious text encoding called "SJIS-open", I wrote a script to try
converting every Unicode codepoint from 0-0xFFFF and compare the results from
different variants of Shift-JIS, to see which one "SJIS-open" would be most
similar to.

The result? It's just CP932. There is no difference at all. So why do we have
two implementations of CP932 in mbstring?

In case somebody, somewhere is using "SJIS-open" (or its aliases "SJIS-win" or
"SJIS-ms"), add these as aliases to CP932 so existing code will continue to
work.
2021-06-17 13:12:40 +02:00
George Peter Banyard c40231afbf Mark various functions with void arguments.
This fixes a bunch of [-Wstrict-prototypes] warnings,
because in C func() and func(void) have different semantics.
2021-05-12 14:55:53 +01:00
Alex Dowad 319a340843 Simplify code for working with halfwidth/fullwidth kana conversion filter
There's no need to dynamically allocate a struct to hold the 'mode' parameter;
just store it directly in `filt->opaque`. Some other things were also being done
in an unnecessarily roundabout way.

Also, the 'copy' function for CP50220 conversion filters was *both* broken
and unnecessary. Broken, because it malloc'd memory which was never freed by
anything. Unnecessary, because the point of the copy is so that various
algorithms can try running bytes through a conversion filter and see how many
output bytes or characters result, and then back out by restoring the filters
to their previous state. But here's the thing; CP50220 conversion filters don't
hold cached bytes, which is the main thing which would need to be restored to a
previous state.
2021-04-15 15:52:31 +02:00
Alex Dowad a900ec3397 Remove unneeded 'filter_ctor' member from mbfl_convert_filter struct
This function pointer is only called when initializing the struct. After that
nothing is done with it. Therefore, there is no need to keep it in the struct.
2021-04-15 15:52:31 +02:00
Alex Dowad d8c785b894 Update 'East Asian Width' table to comply with Unicode 13.0
Instead of manually maintaining the data in eaw_table.h, it is now automatically
generated by ucgendat/ucgendat.php, using the EastAsianWidth.txt file from
the Unicode Consortium.

Something must be said about the deleted test case. Back in 2004, someone
noticed that `mb_strwidth` didn't comply with Unicode 4.0. A test case was
added to expose the problem. Well, time keeps moving on, and with the changing
years, new Unicodes are born and old Unicodes die. Some characters which were
counted as double-width in Unicode 4.0 are no longer such in Unicode 13.0,
which renders the test case obsolete.

At the same time, make a couple of spelling/grammar fixes in ucgendat.php.
2021-01-19 20:38:44 +02:00
Alex Dowad a06c20a17c Remove useless constant MBFL_ENCTYPE_MBCS
This flag indicated that an encoding was 'multi-byte'; it can use a variable
number of bytes to encode each character. As it turns out, we don't actually
need to check this flag anywhere, so it's better to remove it.
2021-01-15 21:55:41 +02:00
Alex Dowad 34ece40872 Remove useless mbstring encoding 'JIS-ms'
Microsoft invented three encodings very similar to ISO-2022-JP/JIS7/JIS8, called
CP50220, CP50221, and CP50222. All three are supported by mbstring.

Since these encodings are very similar, some code can be shared. Actually,
conversion of CP50220/1/2 to Unicode is exactly the same operation; it's when
converting from Unicode to CP50220/1/2 that some small differences arise in how
certain katakana are handled.

The most important common code was a function called `mbfl_filt_wchar_jis_ms`.
The `jis_ms` part doubtless refers to the fact that these encodings are modified
versions of 'JIS' invented by 'MS'. mbstring also went a step further and exported
'JIS-ms' to userland as a separate encoding from CP50220/1/2. If users requested
'JIS-ms' conversion, they got something like CP50220/1/2, minus their special
ways of handling half-width katakana when converting from Unicode.

But... that 'encoding' is not something which actually exists in the world outside
of mbstring. CP50220/1/2 do exist in Microsoft software, but not 'JIS-ms'.

For a text encoding conversion library, inventing new variant encodings and
implementing them is not very productive. Our interest is in handling text
encodings which real people actually use for... you know, storing actual text
and things like that.
2021-01-15 21:55:41 +02:00
Alex Dowad fcbe45de10 Remove useless mbstring encoding 'CP50220-raw'
CP50220 is a variant of ISO-2022-JP invented by Microsoft, which handles some
Unicode characters which are not representable in ISO-2022-JP by converting
them to similar characters which are representable.

What, then, is CP50220-raw? An Internet search turns up absolutely nothing.
Reference works which I consulted don't say anything about it. Other text
conversion libraries don't support it.

From looking at the code: It's just the same as CP50220, but it accepts
unmapped JIS X 0208 characters passed through from other Japanese encodings
and silently encodes them using the usual ISO-2022-JP escape sequence and
representation for JIS X 0208 characters.

It's hard to see how this could be useful. OK, let me come out and say it:
it's _not_ useful. We can confidently jettison this (mis)feature.
2021-01-15 21:55:41 +02:00
Alex Dowad bbbadae0ae Combine MBFL_ENCTYPE_MWC2{BE,LE} constants
These constants indicate that a text encoding uses 2+ bytes for each character,
and is either big endian or little endian (respectively). But nothing in
mbstring cares about the difference between MBFL_ENCTYPE_MWC2BE and
MBFL_ENCTYPE_MWC2LE.

(Actually, nothing cares about whether these flags are set at all...
maybe we should just remove them?)
2020-11-25 19:52:19 +02:00
Alex Dowad 72660c416a Combine MBFL_ENCTYPE_WCS{2,4}{BE,LE} constants
These flags identify text encodings in mbstring which use a constant number of
bytes per character. While some parts of the code do use these flags, usually
to detect cases which can be optimized due to constant-width encoding, nothing
cares whether the encodings are 'LE' (little-endian) or 'BE' (big-endian).

So we can simplify things by combining constants.
2020-11-25 19:52:19 +02:00
Alex Dowad e169ad3b61 Consolidate all single-byte encodings in one source file
We can squeeze out a lot of duplicated code in this way.
2020-11-11 11:18:59 +02:00
Alex Dowad b05ad5112a Don't redundantly flush mbstring filters multiple times
Each flush function in a chain of mbstring conversion filters always
calls the next flush function in the chain. So it is not necessary to
explicitly flush the second filter in a chain. (Due to this bug, in many
cases, flush functions were actually being called three times.)
2020-11-11 11:18:58 +02:00
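The redundant flush described in b05ad5112a can be reproduced with a two-filter chain. The `Filter` class and `run` helper are hypothetical; the only behaviour taken from the commit is that each flush function calls the next flush function in the chain.

```python
class Filter:
    def __init__(self, name, next_filter=None):
        self.name = name
        self.next = next_filter
        self.flush_count = 0

    def flush(self):
        self.flush_count += 1
        # Every flush propagates to the next filter in the chain.
        if self.next is not None:
            self.next.flush()


def run(explicit_second_flush: bool) -> int:
    encoder = Filter("encoder")
    decoder = Filter("decoder", encoder)
    decoder.flush()                 # already reaches the encoder
    if explicit_second_flush:
        encoder.flush()             # redundant second call
    return encoder.flush_count
```

`run(True)` reports that the second filter was flushed twice, while `run(False)` flushes it exactly once.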
Alex Dowad 3e7acf901d Remove mbstring identify filters
mbstring had an 'identify filter' for almost every supported text encoding
which was used when auto-detecting the most likely encoding for a string.
It would run over the string and set a 'flag' if it saw anything which
did not appear likely to be the encoding in question.

One problem with this scheme was that encodings which merely appeared
less likely to be the correct one were completely rejected, even if there
was no better candidate. Another problem was that the 'identify filters'
had a huge amount of code duplication with the 'conversion filters'.

Eliminate the identify filters. Instead, when auto-detecting text
encoding, use conversion filters to see whether the input string is valid
in candidate encodings or not. At the same time, watch the type of
codepoints which the string decodes to and mark it as less likely if
non-printable characters (ESC, form feed, bell, etc.) or 'private use
area' codepoints are seen.

Interestingly, one old test case in which JIS text was misidentified
as UTF-8 (and this wrong behavior was enshrined in the test) was 'fixed'
and the JIS string is now auto-detected as JIS.
2020-11-09 13:45:17 +02:00
Alex Dowad cc03c54c36 Remove useless byte{2,4}{be,le} encodings from mbstring
There is no meaningful difference between these and UCS-{2,4}. They are
just a little bit more lax about passing errors silently. They also have
no known use.

Alias to UCS-{2,4} in case someone, somewhere is using them.
2020-11-09 13:45:16 +02:00
Alex Dowad 9f5a4b3bd9 Fix mbstring support for ARMSCII-8
- Identify filter was completely wrong.
- Respect `mb_substitute_character` rather than converting invalid bytes to
  Unicode 0xFFFD (generic replacement character).
- Don't convert Unicode 0xFFFD to a valid ARMSCII-8 character.
- When converting ARMSCII-8 to ARMSCII-8, don't pass invalid bytes through
  silently.
2020-11-02 21:31:06 +02:00
Alex Dowad e81458862b Remove dead code from mbfilter_koi8u.c (and do general code cleanup) 2020-11-02 21:31:06 +02:00
Alex Dowad fde7794556 Remove dead code from mbfilter_iso8859_{2,4,5,9,10,13,14,15,16}.c
...Plus some dead code related to ISO-8859-1.
2020-11-02 21:31:06 +02:00
Alex Dowad 0a8ebb36a5 Remove dead code from mbfilter_koi8r.c 2020-11-02 21:31:06 +02:00
Alex Dowad b6e75265d0 Remove dead code from mbfilter_cp850.c (and do general code cleanup)
Since there are no invalid bytes in CP850, these `if` conditions will never
be true.
2020-11-02 21:31:06 +02:00
Alex Dowad 20a404f765 Remove dead code from mbfilter_cp866.c (and do general code cleanup)
Since there are no invalid bytes in CP866, these `if` conditions will never
be true.
2020-11-02 21:31:06 +02:00
Alex Dowad e6d17cfe44 Fix mbstring support for CP1254 encoding
One funny thing: while the original author used Unicode 0xFFFD (generic
replacement character) for invalid bytes in CP1251 and CP1252, for CP1254
they used 0xFFFE, which is not a valid Unicode codepoint at all, but is a
reversed byte-order mark. Probably this was by mistake.

Anyways,

- Fixed identify filter, which was completely wrong.
- Don't convert Unicode 0xFFFE to a random (but valid) CP1254 byte.
- When converting CP1254 to CP1254, don't pass invalid bytes through silently.
2020-11-02 21:31:05 +02:00
Alex Dowad 44bd5804b0 Fix mbstring support for CP1251 encoding
- Identify filter was as wrong as wrong can be.
- Invalid CP1251 byte 0x98 was converted to Unicode 0xFFFD (generic
  replacement character), rather than respecting `mb_substitute_character`.
- Unicode 0xFFFD was converted to some random CP1251 byte.
- When converting CP1251 to CP1251, don't pass invalid bytes through silently.
2020-11-02 21:31:05 +02:00
Alex Dowad 7047e5d2c4 Add identify filter for UTF-32{,BE,LE} 2020-10-27 10:19:01 +02:00
Alex Dowad d8895cd054 Improve error handling for UTF-16{,BE,LE}
Catch various errors such as the first part of a surrogate pair not being
followed by a proper second part, the first part of a surrogate pair appearing
at the end of a string, the second part of a surrogate pair appearing out
of place, and so on.
2020-10-27 10:19:01 +02:00
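The surrogate errors listed in d8895cd054 can be sketched with a small decoder over 16-bit code units. This is a sketch under assumptions: the function name, the '?' marker and the choice to emit exactly one marker per stray surrogate are illustrative, and mbstring's actual output may differ.

```python
def decode_utf16_units(units, marker="?"):
    out = []
    high = None                      # pending high (first) surrogate, if any
    for u in units:
        if high is not None:
            if 0xDC00 <= u <= 0xDFFF:
                # Valid pair: combine into one supplementary codepoint.
                cp = 0x10000 + ((high - 0xD800) << 10) + (u - 0xDC00)
                out.append(chr(cp))
                high = None
                continue
            # First part not followed by a proper second part.
            out.append(marker)
            high = None
            # Fall through and treat u as a fresh code unit.
        if 0xD800 <= u <= 0xDBFF:
            high = u
        elif 0xDC00 <= u <= 0xDFFF:
            out.append(marker)       # second part appearing out of place
        else:
            out.append(chr(u))
    if high is not None:
        out.append(marker)           # first part at the end of the string
    return "".join(out)
```

For example, `decode_utf16_units([0xD83D, 0xDE00])` yields the single character U+1F600, while `decode_utf16_units([0x41, 0xD83D])` yields `"A?"`.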
Alex Dowad 7b9bed0150 Add identify filter for ISO-8859-16 (Latin-10) encoding
Interestingly, it looks like the original author intended to add an identify filter
for this encoding, but never did so. The needed struct is there, but was never added
to the list of identify filters in mbfl_ident.c.
2020-10-16 20:56:45 +02:00
Alex Dowad 648c1cb51e Add identify filter for UCS-2, UCS-2BE, and UCS-2LE encodings 2020-10-13 20:26:14 +02:00
Alex Dowad 374f31e364 Add mbstring identify filter for 'binary' encoding 2020-10-13 20:26:13 +02:00
Alex Dowad 97beecc251 Add identify filter for UTF-16, UTF-16LE, UTF-16BE
There was one faulty test in the suite which only passed before because UTF-16 had no
identify filter. After this was fixed, it exposed the problem with the test.
2020-10-13 20:26:13 +02:00
Alex Dowad 4aa7430f68 Add mbstring identify filter for '7bit' encoding 2020-10-13 06:12:38 +02:00
Alex Dowad 0ffc1f55b3 Refactor mbfl_ident.c, mbfl_encoding.c, mbfl_memory_device.c, mbfl_string.c
- Make everything less gratuitously verbose
- Don't litter the code with lots of unneeded NULL checks (for things which
  will never be NULL)
- Don't return success/failure code from functions which can never fail
- For encoding structs, don't use pointers to pointers to pointers for the
  list of alias strings. Pointers to pointers (2 levels of indirection)
  is what actually makes sense. This gets rid of some extraneous
  dereference operations.
2020-10-13 06:12:38 +02:00
Alex Dowad 3f1851dec2 Avoid compiler warnings related to mbstring flush functions 2020-10-13 06:12:37 +02:00
Remi Collet b1c5532ad1 fix mbfl function prototypes
re-add mbfl_convert_filter_feed API
re-add pointer cast
2020-09-15 15:15:06 +02:00