1
0
mirror of https://github.com/php/php-src.git synced 2026-04-25 08:58:28 +02:00
Commit Graph

221 Commits

Author SHA1 Message Date
Alex Dowad 5f8993bc28 Merge branch 'PHP-8.1'
* PHP-8.1:
  Reintroduce legacy 'SJIS-win' text encoding in mbstring
2022-08-16 20:47:04 +02:00
Alex Dowad 371367ce3e Reintroduce legacy 'SJIS-win' text encoding in mbstring
In e2459857af, I combined mbstring's "SJIS-win" text encoding
into CP932. This was done after doing some testing which appeared
to show that the mappings for "SJIS-win" were the same as those
for "CP932".

Later, it was found that there was actually a small difference
prior to e2459857af when converting Unicode to CP932. The
mappings for the following two codepoints were different:

        CP932  SJIS-win
U+203E  0x7E   0x81 0x50
U+00A5  0x5C   0x81 0x8F

As shown, mbstring's "CP932" mapped Unicode's 'OVERLINE' and
'YEN SIGN' to the ASCII bytes which have conflicting uses in
most legacy Japanese text encodings. "SJIS-win" mapped these
to equivalent JIS X 0208 fullwidth characters.

Since e2459867af was not intended to cause any user-visible
change in behavior, I am rolling back the merge of "CP932"
and "SJIS-win".

It seems doubtful whether these two text encodings should
be kept separate or merged in a future release. An extensive
discussion of the related historical background and
compatibility issues involved can be found in this
GitHub thread:

https://github.com/php/php-src/issues/8308
2022-08-16 20:18:54 +02:00
Alex Dowad 983a29d3c0 Legacy conversion code for '7bit' to '8bit' inserts error markers
The use of a special 'vtbl' for converting between '7bit' and
'8bit' text meant that '7bit' text would not be converted to
wchars before going to '8bit'. This meant that the special
value MBFL_BAD_INPUT, which we use to flag an erroneous byte
sequence in input text (and which is required by functions
like mb_check_encoding), would pass directly to the output,
instead of being converted to the error marker specified
by mb_substitute_character.

This issue dates back to the time when I removed the mbfl
'identify filters' and made encoding validity checking and
encoding detection rely only on the conversion filters.
2022-08-16 16:43:27 +02:00
Alex Dowad a4656895dd Imitate legacy behavior when converting non-encodings using mbstring
Fuzzing revealed that something was missed here when making the new
encoding conversion code match the behavior of the old code. In the
next major release of PHP, support for these non-encodings will be
dropped, but in the meantime, it is better to match the legacy
behavior.
2022-08-16 16:43:27 +02:00
Alex Dowad 7299096095 New implementation of mb_strimwidth
This new implementation of mb_strimwidth uses the new text
encoding conversion filters. Changes from the previous
implementation:

• mb_strimwidth allows a negative 'from' argument, which
should count backwards from the end of the string. However,
the implementation of this feature was buggy (starting right
from when it was first implemented).

It used the following code:

    if ((from < 0) || (width < 0)) {
        swidth = mbfl_strwidth(&string);
    }
    if (from < 0) {
        from += swidth;
    }

Do you see the bug? 'from' is a count of CODEPOINTS, but
'swidth' is a count of TERMINAL COLUMNS. Adding those two
together does not make sense. If there were no fullwidth
characters in the input string, then the two counts coincide
and the feature would work correctly. However, each
fullwidth character would throw the result off by one,
causing more characters to be skipped than was requested.

• mb_strimwidth also allows a negative 'width' argument,
which again counts backwards from the end of the string;
in this case, it is not determining the START of the portion
which we want to extract, but rather, the END of that portion.
Perhaps unsurprisingly, this feature was also buggy.

Code:

    if (width < 0) {
        width = swidth + width - from;
    }

'swidth + width' is fine here; the problem is '- from'.
Again, that is subtracting a count of CODEPOINTS from a
count of TERMINAL COLUMNS. In this case, we really need
to count the terminal width of the string prefix skipped
over by 'from', and subtract that rather than the number
of codepoints which are being skipped.

As a result, if a 'from' count was passed along with a
negative 'width', for every fullwidth character in the
skipped prefix, the result of mb_strimwidth was one
terminal column wider than requested.

Since these situations were covered by unit tests, you
might wonder why the bugs were not caught. Well, as far as
I can see, it looks like the author of the 'tests' just
captured the actual output of mb_strimwidth and defined it
as 'correct'. The tests were written in such a way that it
was difficult to examine them and see whether they made
sense or not; but a careful examination of the inputs and
outputs clearly shows that the legacy tests did not conform
to the documented contract of mb_strimwidth.

• The old implementation would always pass the input string
through decoding/encoding filters before returning it to
the caller, even if it fit within the specified width. This
means that invalid byte sequences would be converted to
error markers. For performance, the new implementation
returns the very same string which was passed in if it
does not exceed the specified width. This means that
erroneous byte sequences are not converted to error markers
unless it is necessary to trim the string.

• The same applies to the 'trim marker' string.

• The old implementation was buggy in the (unusual)
case that the trim marker is wider than the requested
maximum width of the result. It did an unsigned subtraction
of the requested width and the width of the trim marker. If the
width of the trim marker was greater, that subtraction would
underflow and yield a huge number. As a result, mb_strimwidth
would then pass the input string through, even if it was
far wider than the requested maximum width.

In that case, since the input string is wider than the
requested width, and NONE of it will fit together with the
trim marker, the new implementation returns just the trim
marker. This is the one case where the output can be wider
than the requested width: when BOTH the input string and
also the trim marker are too wide.

• Since it passed the input string and trim marker through
decoding/encoding filters, when using "Quoted-Printable" as
the encoding, newlines could be inserted into the trim marker
to maintain the maximum line length for QP.

This is an extremely bizarre use case and I don't think there
is any point in worrying about it. QP will be removed from
mbstring in time, anyways.

PERFORMANCE:

• From micro-benchmarking with various input string lengths and
text encodings, it appears that the new implementation is 2-3x
faster for UTF-8 and UTF-16. For legacy Japanese text encodings
like ISO-2022-JP or SJIS, the new implementation is perhaps 25%
faster.

• Note that correctly implementing negative 'from' and 'width'
arguments imposes a small performance burden in such cases; one
which the old implementation did not pay. This slightly skews
benchmarking results in favor of the old implementation. However,
even so, the new implementation is faster in all cases which I
tested.
2022-08-02 11:07:06 +02:00
Alex Dowad 9ac49c0dd3 New implementation of mb_convert_kana
mb_convert_kana now uses the new text encoding conversion
filters. Microbenchmarking shows speed gains of 50%-150%
across various text encodings and input string lengths.

The behavior is the same as the old mb_convert_kana
except for one fix: if the 'zero codepoint' U+0000 appeared
in the input, the old implementation would sometimes drop
it, not passing it through to the output. This is now
fixed.
2022-07-20 07:44:19 +02:00
Alex Dowad 30bfeef48d mbfl_strwidth does not need to use legacy conversion filters now
...Because we have the new (faster) conversion filters now for
ALL text encodings supported by mbstring.
2022-07-18 15:11:32 +02:00
Alex Dowad 91969e908f New implementation of mb_{de,en}code_numericentity
This new implementation uses the new encoding conversion filters.
Aside from fewer LOC and (hopefully) improved readability,
the differences are as follows:

BEHAVIOR CHANGES:

- The old implementation used signed arithmetic when operating
on the 'convmap'. This meant that results could be surprising when
using convmap entries with 1 in the MSB. Further, types like 'int'
were used rather than those with a specific bit width, such as
'int32_t'. This meant that results could also depend on the
platform width of an 'int'.

Now unsigned arithmetic is used, with explicit bit widths.

- Similarly, while converting decimal numeric entities, the
legacy implementation would ensure that the value never overflowed
INT_MAX, and if it did, the entity would be treated as invalid
and passed through unconverted.

However, that again means that results depend on the platform
size of an 'int'. So now, we use a value with explicit bit width
(32 bits) to hold the value of a deconverted decimal entity, and
ensure that the entity value does not overflow that.

Further, because we are using an UNSIGNED 32-bit value rather
than a signed one, the ceiling for how large a decimal entity
can be is higher now.

All of this will probably not affect anyone, since Unicode
codepoints above U+10FFFF are invalid anyways. To see the
difference, you need to be using a text encoding like UCS-4,
which allows huge 'codepoints'.

- If it saw something which looked like a hex entity, but
turned out not to be a valid numeric entity, the old
implementation would sometimes convert the hexadecimal
digits a-f to A-F (uppercase). The new implementation passes
invalid numeric entities through without performing case
conversion.

- The old implementation of mb_encode_numericentity was
limited in how many decimal/hex digits it could emit.
If a text encoding like UCS-4 was in use, where 'codepoints'
can have huge values (larger than the valid range
stipulated by the Unicode standard), it would not error
out on a 'codepoint' whose value was too large for it,
but would rather mangle the value and emit a numeric
entity which decoded to some other random codepoint.
The new implementation is able to emit enough digits to
express any value which fits in 32 bits.

PERFORMANCE:

Based on micro-benchmarks run on my development machine:

Decoding numeric HTML entities is about 4 times faster, for
both decimal and hexadecimal entities, across a variety of
input string lengths. Encoding is about 3 times faster.
2022-07-18 15:11:30 +02:00
Alex Dowad 880803a21e Use fast conversion filters to implement php_mb_ord
Even for single-character strings, this is about 50% faster for
ASCII, UTF-8, and UTF-16. For long strings, the performance gain is
enormous, since the old code would convert the ENTIRE string, just
to pick out the first codepoint.
2022-06-12 15:24:41 +02:00
Alex Dowad 9468fa7ff2 mbfl_strlen does not need to use old conversion filters any more 2022-06-12 15:24:41 +02:00
Alex Dowad 0154a5ac9f Use fast text conversion filters to implement php_mb_convert_encoding_ex 2022-05-28 21:53:38 +02:00
Alex Dowad e4b9aa1870 Add assertions to help catch buffer overflows in mbstring text conversion code 2022-05-28 21:53:38 +02:00
Alex Dowad 06a15e6395 Implement fast text conversion interface for '8bit' 2022-05-28 21:53:37 +02:00
Christoph M. Becker 20c0eb47df Merge branch 'PHP-8.1'
* PHP-8.1:
  Fix GH-8208: mb_encode_mimeheader: $indent functionality broken
2022-03-17 17:35:06 +01:00
Christoph M. Becker 5003831260 Merge branch 'PHP-8.0' into PHP-8.1
* PHP-8.0:
  Fix GH-8208: mb_encode_mimeheader: $indent functionality broken
2022-03-17 17:34:31 +01:00
Christoph M. Becker d0417ebc93 Fix GH-8208: mb_encode_mimeheader: $indent functionality broken
We also need to factor in the indent, when getting the encoder result.

Closes GH-8213.
2022-03-17 17:31:58 +01:00
Alex Dowad 3c73225125 New internal interface for fast text conversion in mbstring
When converting text to/from wchars, mbstring makes one function call
for each and every byte or wchar to be converted. Typically, each of
these conversion functions contains a state machine, and its state has
to be restored and then saved for every single one of these calls.
It doesn't take much to see that this is grossly inefficient.

Instead of converting one byte or wchar on each call, the new
conversion functions will either fill up or drain a whole buffer of
wchars on each call. In benchmarks, this is about 3-10× faster.

Adding the new, faster conversion functions for all supported legacy
text encodings still needs some work. Also, all the code which uses
the old-style conversion functions needs to be converted to use the
new ones. After that, the old code can be dropped. (The mailparse
extension will also have to be fixed up so it will still compile.)
2021-12-21 08:33:11 +02:00
Christoph M. Becker 97f78b3bb7 Merge branch 'PHP-8.1'
* PHP-8.1:
  Fix #81693: mb_check_encoding(7bit) segfaults
2021-12-03 22:50:27 +01:00
Christoph M. Becker 929d847152 Fix #81693: mb_check_encoding(7bit) segfaults
`php_mb_check_encoding()` now uses conversion to `mbfl_encoding_wchar`.
Since `mbfl_encoding_7bit` has no `input_filter`, no filter can be
found.  Since we don't actually need to convert to wchar, we encode to
8bit.

Closes GH-7712.
2021-12-03 22:49:47 +01:00
Alex Dowad 9962aa9774 Merge branch 'PHP-8.1'
* PHP-8.1:
  mb_detect_encoding will not return non-encodings
  Improve detection accuracy of mb_detect_encoding
2021-10-19 18:11:35 +02:00
Alex Dowad 28b346bc06 Improve detection accuracy of mb_detect_encoding
Originally, `mb_detect_encoding` essentially just checked all candidate
encodings to see which ones the input string was valid in. However, it
was only able to do this for a limited few of all the text encodings
which are officially supported by mbstring.

In 3e7acf901d, I modified it so it could 'detect' any text encoding
supported by mbstring. While this is arguably an improvement, if the
only text encodings one is interested in are those which
`mb_detect_encoding` could originally handle, the old
`mb_detect_encoding` may have been preferable. Because the new one has
more possible encodings which it can guess, it also has more chances to
get the answer wrong.

This commit adjusts the detection heuristics to provide accurate
detection in a wider variety of scenarios. While the previous detection
code would frequently confuse UTF-32BE with UTF-32LE or UTF-16BE with
UTF-16LE, the adjusted code is extremely accurate in those cases.
Detection for Chinese text in Chinese encodings like GB18030 or BIG5
and for Japanese text in Japanese encodings like EUC-JP or SJIS is
greatly improved. Detection of UTF-7 is also greatly improved. An 8KB
table, with one bit for each codepoint from U+0000 up to U+FFFF, is
used to achieve this.

One significant constraint is that the heuristics are completely based
on looking at each codepoint in a string in isolation, treating some
codepoints as 'likely' and others as 'unlikely'. It might still be
possible to achieve great gains in detection accuracy by looking at
sequences of codepoints rather than individual codepoints. However,
this might require huge tables. Further, we might need a huge corpus
of text in various languages to derive those tables.

Accuracy is still dismal when trying to distinguish single-byte
encodings like ISO-8859-1, ISO-8859-2, KOI8-R, and so on. This is
because the valid bytes in these encodings are basically all the same,
and all valid bytes decode to 'likely' codepoints, so our method of
detection (which is based on rating codepoints as likely or unlikely)
cannot tell any difference between the candidates at all. It just
selects the first encoding in the provided list of candidates.

Speaking of which, if one wants to get good results from
`mb_detect_encoding`, it is important to order the list of candidate
encodings according to your prior belief of which are more likely to
be correct. When the function cannot tell any difference between two
candidates, it returns whichever appeared earlier in the array.
2021-10-19 18:05:51 +02:00
Alex Dowad 0b32a15eb0 Optimize mb_str{,im}width for performance
Rather than doing a linear search of a table of fullwidth codepoint
ranges for every input character,

1) Short-cut the search if the codepoint is below the first such range
2) Otherwise, do a binary (rather than linear) search
2021-09-29 18:19:01 +02:00
Alex Dowad f4365d2c26 Remove unused typedef 'mbfl_encoding_id' 2021-09-29 18:19:01 +02:00
Alex Dowad 3bf431969e Don't check for impossible error condition in mb_substr_count 2021-09-29 18:19:01 +02:00
Alex Dowad 8c32deb605 Don't check for impossible error condition in mb_strwidth 2021-09-29 18:19:01 +02:00
Alex Dowad bf78070cbe Don't check for impossible error condition in mb_strlen 2021-09-29 18:19:01 +02:00
Alex Dowad 07c4b3b8c0 Simplify code for handling mbstring language aliases
Rather than using pointers to pointers to pointers (3 levels of indirection), what
makes sense is two levels. This reduces unnecessary pointer dereference operations.
2021-09-20 11:27:54 +02:00
Alex Dowad 2f096c4039 Remove useless constant MBFL_ENCTYPE_MWC2 2021-09-20 11:27:54 +02:00
Alex Dowad 68176fdfb1 Use char literals in HTML numeric entity {en,de}coding functions 2021-09-20 11:27:54 +02:00
Alex Dowad f663344f33 Merge branch 'PHP-8.1'
* PHP-8.1:
  Bug #81390: mb_detect_encoding should not prematurely stop processing input
  mb_detect_encoding with only one candidate encoding uses mb_check_encoding
  Optimize text encoding detection for speed (eliminate Unicode property lookups)
2021-09-20 11:27:07 +02:00
Alex Dowad c25a1ef8d0 Bug #81390: mb_detect_encoding should not prematurely stop processing input
As a performance optimization, mb_detect_encoding tries to stop
processing the input string early when there is only one 'candidate'
encoding which the input string is valid in. However, the code which
keeps count of how many candidate encodings have already been rejected
was buggy. This caused mb_detect_encoding to prematurely stop
processing the input when it should have continued.

As a result, it did not notice that in the test case provided by Alec,
the input string was not valid in UTF-16.
2021-09-20 11:21:39 +02:00
Alex Dowad 6acd4f7f3a Optimize text encoding detection for speed (eliminate Unicode property lookups)
...By just testing the input codepoints if they are within a few fixed
ranges instead. This avoids hash lookups in property tables.

From (micro-)benchmarking on my PC, this looks to be a bit less than 4x
faster than the existing code.
2021-09-20 11:20:53 +02:00
Nikita Popov e740907ec9 Merge branch 'PHP-8.1'
* PHP-8.1:
  Update Unicode tables to 14.0.0
2021-09-20 09:58:32 +02:00
Colin O'Dell fe36b81d5e Update Unicode tables to 14.0.0
Closes GH-7502.
2021-09-20 09:58:20 +02:00
Alex Dowad a312620607 Remove redundant NULL checks in mbstring
Whoever originally wrote mbstring seems to have a deathly fear of NULL
pointers lurking behind every corner. A common pattern is that one
function will check if a pointer is NULL, then pass it to another
function, which will again check if it is NULL, then pass to yet another
function, which will yet again check if it is NULL... it's NULL checks
all the way down.

Remove all the NULL checks in places where pointers could not possibly
be NULL.
2021-09-06 13:16:23 +02:00
Alex Dowad 626f0fec54 Remove some dead code from mbstring
mbstring has a great deal of dead code. Some common types are:

- Default switch clauses which will never be taken
- If clauses intended to convert codepoints which were not present in
  a conversion table... but the codepoint in question *is* in the table,
  so the if clause is not needed.
- Bounds checks in places where it is not possible for a value to ever
  be out of bounds.
- Checks to see if an unmatched Unicode codepoint is in CP932 extension
  range 3... but every codepoint in range 3 is also in range 2, so no
  codepoint will ever be matched and converted by that code.
2021-09-06 13:16:23 +02:00
Alex Dowad f303fc8a9b Use bool in mbfl_filt_conv_output_hex (rather than int) 2021-08-31 13:41:34 +02:00
Alex Dowad 776296e12f mbstring no longer provides 'long' substitutions for erroneous input bytes
Previously, mbstring had a special mode whereby it would convert
erroneous input byte sequences to output like "BAD+XXXX", where "XXXX"
would be the erroneous bytes expressed in hexadecimal. This mode could
be enabled by calling `mb_substitute_character("long")`.

However, accurately reproducing input byte sequences from the cached
state of a conversion filter is often tricky, and this significantly
complicates the implementation. Further, the means used for passing
the erroneous bytes through to where the "BAD+XXXX" text is generated
only allows for up to 3 bytes to be passed, meaning that some erroneous
byte sequences are truncated anyways.

More to the point, a search of publically available PHP code indicates
that nobody is really using this feature anyways.

Incidentally, this feature also provided error output like "JIS+XXXX"
if the input 'should have' represented a JISX 0208 codepoint, but it
decodes to a codepoint which does not exist in the JISX 0208 charset.
Similarly, specific error output was provided for non-existent
JISX 0212 codepoints, and likewise for JISX 0213, CP932, and a few
other charsets. All of that is now consigned to the flames.

However, "long" error markers also include a somewhat more useful
"U+XXXX" marker for Unicode codepoints which were successfully
decoded from the input text, but cannot be represented in the output
encoding. Those are still supported.

With this change, there is no need to use a variety of special values
in the high bits of a wchar to represent different types of error
values. We can (and will) just use a single error value. This will be
equal to -1.

One complicating factor: Text conversion functions return an integer to
indicate whether the conversion operation should be immediately
aborted, and the magic 'abort' marker is -1. Also, almost all of these
functions would return the received byte/codepoint to indicate success.
That doesn't work with the new error value; if an input filter detects
an error and passes -1 to the output filter, and the output filter
returns it back, that would be taken to mean 'abort'.

Therefore, amend all these functions to return 0 for success.
2021-08-31 13:41:34 +02:00
Alex Dowad 97b7fc893c Output illegal character marker for 4-byte illegal characters > 0x7FFFFFFF
Some text encodings supported by mbstring (such as UCS-4) accept 4-byte
characters. When mbstring encounters an illegal byte sequence for the
encoding it is using, it should emit an 'illegal character' marker,
which can either be a single character like '?', an HTML hexadecimal
entity, or a marker string like 'BAD+XXXX'.

Because of the use of signed integers to hold 4-byte characters,
illegal 4-byte sequences with a 'negative' value (one with the high
bit set) were not handled correctly when emitting the illegal char
marker. The result is that such illegal sequences were just skipped
over (and the marker was not emitted to the output). Fix that.
2021-08-30 16:29:58 +02:00
Nikita Popov 634f2e21d3 Don't expose wchar encoding to users (#7415)
The "wchar" encoding isn't really an encoding -- it's what we
internally use as the representation of decoded characters.

In practice, it tends to behave a lot like the 8bit encoding when
used from userland, because input code units end up being treated
as code points.

This patch removes the wchar encoding from the public encoding
list and reserves it for internal use only.
2021-08-30 11:11:33 +02:00
Nikita Popov 43cb2548f7 Flush filter during non-strict encoding detection
If we reach the end of the string without reducing to a single
encoding, then we should flush to check whether the last character
is incomplete.
2021-08-27 14:48:32 +02:00
Nikita Popov a1c1ee6a48 Don't use opaque for encoding detection score
opaque is used by the htmlentities filter, which means that we
end up trying to free the score value as a pointer. Don't try to
be overly tricky here and simply allocate a separate structure
to hold the number of illegal characters and the score.
2021-07-28 10:54:27 +02:00
Nikita Popov 9d0db2e98a Fixed bug #81298
Creation of the filter may fail for some special encodings, for
which detection is not supported.
2021-07-28 10:11:46 +02:00
Alex Dowad 26fc7c4256 Fix typo in mbfilter.h
As pointed out by Bruno Haible (https://haible.de/bruno).
2021-07-19 12:17:00 +02:00
Alex Dowad e2459857af Remove duplicate implementation of CP932 from mbstring
Sigh. Double sigh. After fruitlessly searching the Internet for information on
this mysterious text encoding called "SJIS-open", I wrote a script to try
converting every Unicode codepoint from 0-0xFFFF and compare the results from
different variants of Shift-JIS, to see which one "SJIS-open" would be most
similar to.

The result? It's just CP932. There is no difference at all. So why do we have
two implementations of CP932 in mbstring?

In case somebody, somewhere is using "SJIS-open" (or its aliases "SJIS-win" or
"SJIS-ms"), add these as aliases to CP932 so existing code will continue to
work.
2021-06-17 13:12:40 +02:00
George Peter Banyard c40231afbf Mark various functions with void arguments.
This fixes a bunch of [-Wstrict-prototypes] warning,
because in C func() and func(void) have different semantics.
2021-05-12 14:55:53 +01:00
Alex Dowad 319a340843 Simplify code for working with halfwidth/fullwidth kana conversion filter
There's no need to dynamically allocate a struct to hold the 'mode' parameter;
just store it directly in `filt->opaque`. Some other things were also being done
in an unnecessarily roundabout way.

Also, the 'copy' function for CP50220 conversion filters was *both* broken
and unnecessary. Broken, because it malloc'd memory which was never freed by
anything. Unnecessary, because the point of the copy is so that various
algorithms can try running bytes through a conversion filter and see how many
output bytes or characters result, and then back out by restoring the filters
to their previous state. But here's the thing; CP50220 conversion filters don't
hold cached bytes, which is the main thing which would need to be restored to a
previous state.
2021-04-15 15:52:31 +02:00
Alex Dowad a900ec3397 Remove unneeded 'filter_ctor' member from mbfl_convert_filter struct
This function pointer is only called when initializing the struct. After that
nothing is done with it. Therefore, there is no need to keep it in the struct.
2021-04-15 15:52:31 +02:00
Alex Dowad d8c785b894 Update 'East Asian Width' table to comply with Unicode 13.0
Instead of manually maintaining the data in eaw_table.h, it is now automatically
generated by ucgendat/ucgendat.php, using the EastAsianWidth.txt file from
the Unicode Consortium.

Something must be said about the deleted test case. Back in 2004, someone
noticed that `mb_strwidth` didn't comply with Unicode 4.0. A test case was
added to expose the problem. Well, time keeps moving on, and with the changing
years, new Unicodes are born and old Unicodes die. Some characters which were
counted as double-width in Unicode 4.0 are no longer such in Unicode 13.0,
which renders the test case obsolete.

At the same time, make a couple of spelling/grammar fixes in ucgendat.php.
2021-01-19 20:38:44 +02:00
Alex Dowad a06c20a17c Remove useless constant MBFL_ENCTYPE_MBCS
This flag indicated that an encoding was 'multi-byte'; it can use a variable
number of bytes to encode each character. As it turns out, we don't actually
need to check this flag anywhere, so it's better to remove it.
2021-01-15 21:55:41 +02:00