1
0
mirror of https://github.com/php/php-src.git synced 2026-04-25 08:58:28 +02:00
Commit Graph

409 Commits

Author SHA1 Message Date
Alex Dowad b626e893ff Fix conversion of ISO-2022-KR text (and add test suite)
- Truncated multi-byte characters are treated as an error
- Truncated or unrecognized escape sequences are treated as an error
- ASCII control characters are not allowed to appear in the middle
  of a multi-byte character
2021-07-05 16:28:16 +02:00
Alex Dowad 658db1f6ea Code cleanup in mbfilter_uhc.c 2021-07-05 16:28:16 +02:00
Alex Dowad 0a8c00755d Fix conversion of EUC-JP-2004 text (and add test suite)
- Truncated multi-byte characters are treated as an error now
- Invalid multi-byte characters are treated as an error rather than
  being quietly swallowed
- ASCII control characters are not allowed to appear in the middle
  of a multi-byte character
2021-07-05 16:28:16 +02:00
Alex Dowad ff85ed8adc Fix conversion of EUC-TW text (and add test suite)
- Treat text which ends abruptly in the middle of a multi-byte
  character as erroneous.
- Don't allow ASCII control characters to appear in the middle of a
  multi-byte character.
- If an illegal byte appears in the middle of a multi-byte character,
  go back to the initial state rather than trying to finish the
  multi-byte character.
- There was a bug in the file with the conversion tables, which set the
  'maximum codepoint which can be converted using table A2' using the
  size of table A1, not table A2. This meant that several hundred
  Unicode codepoints which should have been able to be converted to
  EUC-TW were flagged as erroneous instead.
- When a sequence which cannot possibly be a prefix of a valid
  multi-byte character is found, immediately flag it as an error, rather
  than waiting to read more bytes first.
- Allow characters in CNS-11643 plane 1 to be encoded as 4-byte
  sequences (although they can also be encoded as 2-byte sequences).
  This is allowed by the standard for EUC-TW text.
2021-06-29 12:25:21 +02:00
Alex Dowad 8b25e38b21 Fix conversion of EUC-CN text (and add test suite)
- Flag truncated multi-byte characters as erroneous.
- Don't allow ASCII control characters to appear in the middle of a
  multi-byte character.
- There was a bug whereby some unrecognized Unicode codepoints would be
  passed through unchanged to the output when converting Unicode to
  EUC-CN.
- Stick to the original EUC-CN standard, rather than CP936 (an extended
  version invented by MS).
2021-06-29 12:25:21 +02:00
Alex Dowad 69c979aaea Fix conversion of EUC-KR text (and add test suite)
- Treat truncated multi-byte characters as an error.
- Don't allow ASCII control characters to appear in the middle of a
  multi-byte character.
- There was also a bug whereby some unrecognized Unicode codepoints
  would be passed through to the output unchanged when converting
  Unicode to EUC-KR.
2021-06-29 12:25:21 +02:00
Alex Dowad ddea06699b Remove table generation scripts which have not been used for years 2021-06-29 12:25:21 +02:00
Alex Dowad ebae1a4524 Fix conversion of CP936 text (and add test suite)
- Treat truncated multi-byte characters as an error.
- Don't allow ASCII control characters to appear in the middle of a
  multi-byte character.
- Adjust some mappings to match recommendations in conversion table
  from Unicode Consortium.
2021-06-29 12:25:21 +02:00
Alex Dowad 1e5c3c13fd Fix conversion of HZ text (and add test suite)
- Treat truncated multi-byte characters as an error.
- Don't allow ASCII control characters to appear in the middle of a
  multi-byte character.
- Handle ~ escapes according to the HZ standard (RFC 1843).
- Treat unrecognized ~ escapes as an error.
- Multi-byte characters (between ~{ ~} escapes) are GB2312, not CP936.
  (CP936 is an extended version from MicroSoft, but the RFC does not
  state that this extended version of GB should be used.)
2021-06-29 12:25:21 +02:00
Alex Dowad b1ab76f742 Minor formatting tweaks in mbfilter_euc_kr.c 2021-06-17 13:12:40 +02:00
Alex Dowad 958ef47d2b When flushing CP5022x conversion filter, also flush next filter in chain
All the mbstring encoding conversion filters do this. I missed it when
adding a flush function for CP5022x.
2021-06-17 13:12:40 +02:00
Alex Dowad caeaa662ab Strict conversion of UHC text to Unicode
Previously, mbstring would accept a lot of things which were not valid
UHC text. No more.

- Don't allow single-byte control characters to appear where the 2nd
  byte of a multi-byte character should be.
- Validate that the 2nd byte of a multi-byte character is in the
  expected range.
- Treat it as an error if a multi-byte character is truncated.

Also add a test suite to confirm that UHC conversion (both to and from
Unicode) works according to spec.
2021-06-17 13:12:40 +02:00
Alex Dowad 4550036d96 Minor formatting tweaks in mbfilter_uhc.c 2021-06-17 13:12:40 +02:00
Alex Dowad e2459857af Remove duplicate implementation of CP932 from mbstring
Sigh. Double sigh. After fruitlessly searching the Internet for information on
this mysterious text encoding called "SJIS-open", I wrote a script to try
converting every Unicode codepoint from 0-0xFFFF and compare the results from
different variants of Shift-JIS, to see which one "SJIS-open" would be most
similar to.

The result? It's just CP932. There is no difference at all. So why do we have
two implementations of CP932 in mbstring?

In case somebody, somewhere is using "SJIS-open" (or its aliases "SJIS-win" or
"SJIS-ms"), add these as aliases to CP932 so existing code will continue to
work.
2021-06-17 13:12:40 +02:00
Alex Dowad 7502c86342 Add test suite for UTF-{7,8,16,32}
Also fix a couple small problems with UTF-32 and UTF-8 support:

- UTF-32 would pass very large codepoints (>= 0x80000000), which are
  not valid.
- UTF-8 would sometimes emit two error marker characters for a single
  bad input byte.
2021-06-17 13:12:40 +02:00
George Peter Banyard c40231afbf Mark various functions with void arguments.
This fixes a bunch of [-Wstrict-prototypes] warning,
because in C func() and func(void) have different semantics.
2021-05-12 14:55:53 +01:00
KsaR 01b3fc03c3 Update http->https in license (#6945)
1. Update: http://www.php.net/license/3_01.txt to https, as there is anyway server header "Location:" to https.
2. Update few license 3.0 to 3.01 as 3.0 states "php 5.1.1, 4.1.1, and earlier".
3. In some license comments is "at through the world-wide-web" while most is without "at", so deleted.
4. fixed indentation in some files before |
2021-05-06 12:16:35 +02:00
Alex Dowad 7159907d30 Fix mbstring support for ISO-2022-JP-MS encoding
- Treat it as error if multi-byte string or escape sequence is truncated
- Don't allow 'control' characters or escape sequences to appear in the middle
  of a multi-byte char

As with ISO-2022-JP-KDDI, the main reference used to develop the tests was
the behavior of the existing code. It would have been better to have some
independent reference which we could cross-check our code against, but I
couldn't find one.
2021-04-15 15:52:31 +02:00
Alex Dowad 570e89a9f3 Fix mbstring support for ISO-2022-JP-KDDI encoding
- Treat it as an error if a multi-byte character or escape sequence is truncated
- When converting other encodings to ISO-2022-JP-KDDI, don't swallow trailing
  hash characters or digits
- Don't allow 'control' characters to appear in the middle of a multi-byte char

Note: I was not able to find any kind of official or even semi-official
specification for this legacy encoding. Therefore, the test suite for
ISO-2022-JP-KDDI is based largely on the behavior of the existing code.

Verifying the correctness of program code in this way is very questionable.
In a sense, all you are proving is that the code "does what it does". However,
the test suite will still expose any unintended _changes_ to behavior.
2021-04-15 15:52:31 +02:00
Alex Dowad 78dc160e3b Catch and handle errors in mUTF-7 (IMAP) conversion 2021-04-15 15:52:31 +02:00
Alex Dowad cef4b94eef Code cleanup in mbfilter_utf7imap.c 2021-04-15 15:52:31 +02:00
Alex Dowad 8abc5e6827 Catch and handle errors in UTF-7 text conversion 2021-04-15 15:52:31 +02:00
Alex Dowad 689978a63b Code cleanup in mbfilter_utf7.c 2021-04-15 15:52:31 +02:00
Alex Dowad ebe6500a0b Fix error reporting bug for Unicode -> CP50220 conversion
To detect errors in conversion from Unicode to another text encoding, each
mbstring conversion filter object maintains a count of 'bad' characters. After
a conversion operation finishes, this count is checked to see if there was any
error.

The problem with CP50220 was that mbstring used a chain of two conversion filter
objects. The 'bad character count' would be incremented on the second object in
the chain, but this didn't do anything, as only the count on the first such
object is ever checked.

Fix this by implementing the conversion using a single conversion filter object,
rather than a chain of two. This is possible because of the recent refactoring,
which pulled out the needed logic for CP50220 conversion into a helper function.
2021-04-15 15:52:31 +02:00
Alex Dowad 1f130d4e58 Refactor mbfl_filt_tl_jisx0201_jisx0208 by moving kana conversion into helper function
This will enable us to simplify the code for CP50220 conversion, which also relies
on this same kana conversion logic.
2021-04-15 15:52:31 +02:00
Alex Dowad 319a340843 Simplify code for working with halfwidth/fullwidth kana conversion filter
There's no need to dynamically allocate a struct to hold the 'mode' parameter;
just store it directly in `filt->opaque`. Some other things were also being done
in an unnecessarily roundabout way.

Also, the 'copy' function for CP50220 conversion filters was *both* broken
and unnecessary. Broken, because it malloc'd memory which was never freed by
anything. Unnecessary, because the point of the copy is so that various
algorithms can try running bytes through a conversion filter and see how many
output bytes or characters result, and then back out by restoring the filters
to their previous state. But here's the thing; CP50220 conversion filters don't
hold cached bytes, which is the main thing which would need to be restored to a
previous state.
2021-04-15 15:52:31 +02:00
Alex Dowad a900ec3397 Remove unneeded 'filter_ctor' member from mbfl_convert_filter struct
This function pointer is only called when initializing the struct. After that
nothing is done with it. Therefore, there is no need to keep it in the struct.
2021-04-15 15:52:31 +02:00
Alex Dowad affc3076f3 Remove unused 'next_filter' member from mbfl_filt_tl_jisx0201_jisx0208_param struct 2021-04-15 15:52:31 +02:00
Alex Dowad 636251a522 Remove useless function mbfl_filt_tl_jisx0201_jisx0208_init
This constructor function doesn't do anything different than the generic one.
There's no need to invoke it, either, when initializing a CP50220 conversion
filter.
2021-04-15 15:52:31 +02:00
George Peter Banyard 5caaf40b43 Introduce pseudo-keyword ZEND_FALLTHROUGH
And use it instead of comments
2021-04-07 00:46:29 +01:00
Alex Dowad d8c785b894 Update 'East Asian Width' table to comply with Unicode 13.0
Instead of manually maintaining the data in eaw_table.h, it is now automatically
generated by ucgendat/ucgendat.php, using the EastAsianWidth.txt file from
the Unicode Consortium.

Something must be said about the deleted test case. Back in 2004, someone
noticed that `mb_strwidth` didn't comply with Unicode 4.0. A test case was
added to expose the problem. Well, time keeps moving on, and with the changing
years, new Unicodes are born and old Unicodes die. Some characters which were
counted as double-width in Unicode 4.0 are no longer such in Unicode 13.0,
which renders the test case obsolete.

At the same time, make a couple of spelling/grammar fixes in ucgendat.php.
2021-01-19 20:38:44 +02:00
Alex Dowad a06c20a17c Remove useless constant MBFL_ENCTYPE_MBCS
This flag indicated that an encoding was 'multi-byte'; it can use a variable
number of bytes to encode each character. As it turns out, we don't actually
need to check this flag anywhere, so it's better to remove it.
2021-01-15 21:55:41 +02:00
Alex Dowad 6cbeb6476e Remove unused macros from mbfilter_cp51932.c, mbfilter_iso2022jp_mobile.c 2021-01-15 21:55:41 +02:00
Alex Dowad 34ece40872 Remove useless mbstring encoding 'JIS-ms'
MicroSoft invented three encodings very similar to ISO-2022-JP/JIS7/JIS8, called
CP50220, CP50221, and CP50222. All three are supported by mbstring.

Since these encodings are very similar, some code can be shared. Actually,
conversion of CP50220/1/2 to Unicode is exactly the same operation; it's when
converting from Unicode to CP50220/1/2 that some small differences arise in how
certain katakana are handled.

The most important common code was a function called `mbfl_filt_wchar_jis_ms`.
The `jis_ms` part doubtless refers to the fact that these encodings are modified
versions of 'JIS' invented by 'MS'. mbstring also went a step further and exported
'JIS-ms' to userland as a separate encoding from CP50220/1/2. If users requested
'JIS-ms' conversion, they got something like CP50220/1/2, minus their special
ways of handling half-width katakana when converting from Unicode.

But... that 'encoding' is not something which actually exists in the world outside
of mbstring. CP50220/1/2 do exist in MicroSoft software, but not 'JIS-ms'.

For a text encoding conversion library, inventing new variant encodings and
implementing them is not very productive. Our interest is in handling text
encodings which real people actually use for... you know, storing actual text
and things like that.
2021-01-15 21:55:41 +02:00
Alex Dowad fcbe45de10 Remove useless mbstring encoding 'CP50220-raw'
CP50220 is a variant of ISO-2022-JP invented by MicroSoft, which handles some
Unicode characters which are not representable in ISO-2022-JP by converting
them to similar characters which are representable.

What, then, is CP50220-raw? An Internet search turns up absolutely nothing.
Reference works which I consulted don't say anything about it. Other text
conversion libraries don't support it.

From looking at the code: It's just the same as CP50220, but it accepts
unmapped JIS X 0208 characters passed through from other Japanese encodings
and silently encodes them using the usual ISO-2022-JP escape sequence and
representation for JIS X 0208 characters.

It's hard to see how this could be useful. OK, let me come out and say it:
it's _not_ useful. We can confidently jettison this (mis)feature.
2021-01-15 21:55:41 +02:00
Alex Dowad 888f5d7729 CP5022{0,1,2}: treat truncated multibyte characters as error 2021-01-15 21:55:41 +02:00
Alex Dowad 0ec34da8e0 CP5022{0,1,2}: treat unrecognized escapes as error 2021-01-15 08:30:36 +02:00
Alex Dowad a50607d11d CP5022{0,1,2}: use JISX0201 for U+203E (overline)
Same issue as d497c0e96f addressed for JIS7/JIS8, but for CP5022{0,1,2} this time.
2021-01-15 08:30:30 +02:00
Alex Dowad 5e5243ab65 CP5022{0,1,2}: convert Unicode codepoints in 'user' area (0xE000-E757) correctly
Unicode has a range of 'private' codepoints which individual applications can
use for their own purposes. When they were inventing CP932, MicroSoft mapped
these 'private' or 'user' codepoints to ten new rows added to the JIS X 0208
character table. (JIS X 0208 is based on a 94x94 table; MS used rows 95-114
for private characters.)

`mbfl_filt_conv_wchar_jis_ms` converted these private codepoints to rows 85-94
rather than 95-114. The code included a link to a document on the OpenGroup
web site, dating back to 1996 [1], which proposed mapping private codepoints to
these rows. However, that is not consistent with what mbstring does when
converting CP5022x to Unicode.

There seems to be a dearth of information on CP5022x on the web. However, I
did find one (Japanese-language) page on CP50221, which states that it maps
kuten codes 0x7F21-0x927E to the 'private' Unicode codepoints [2].

As a side note, using rows higher than 95 does seem to defeat one purpose of
using an ISO-2022-JP variant: ISO-2022-JP was specifically designed to be
"7-bit clean", but once you go beyond row 95, the ku codes are 0x80 and up,
so 8 bits are needed.

[1] https://web.archive.org/web/20000229180004/http://www.opengroup.or.jp/jvc/cde/ucs-conv.html
[2] https://www.wdic.org/w/WDIC/Microsoft%20Windows%20Codepage%20%3A%2050221
2021-01-15 08:26:46 +02:00
Alex Dowad 6e9c8386cb CP5022{0,1,2}: convert characters in ku 0x2D (13th row) correctly
Essentially, CP5022{0,1,2} are to CP932 as ISO-2022-JP is to Shift-JIS.
As Shift-JIS and ISO-2022-JP both encode characters from the JIS X 0208 charset,
CP932 and CP5022x both encode characters from JIS X 0208 _plus_ extra characters
added as MicroSoft vendor extensions.

Among the added characters are a number of symbols which MS put in the 13th row
of the 94x94 character table. (In JIS X 0208, that row is empty.)

mbfilter_cp50220x.c had an `if` clause which was intended to handle the
conversion of characters in that 13th row, but it was dead code, as the previous
clause was always true in those cases. The solution is to reverse the order of
those two clauses (just as they already appeared in mbfilter_cp932.c).
2021-01-15 08:26:38 +02:00
Alex Dowad cdd0724291 Stricter handling of erroneous input when converting CP5022{0,1,2} text encoding
Don't allow escape sequences to start in the middle of a multibyte character.
Also, don't silently pass through illegal bytes which appear where the 2nd
byte of a multibyte character should be.
2021-01-15 08:25:44 +02:00
Alex Dowad 4299e2de42 JIS7/JIS8 encoding: treat truncated multibyte characters as error 2021-01-14 22:34:16 +02:00
Alex Dowad b67e358e75 JIS7/JIS8 encoding: handle invalid 2nd byte for Kanji correctly
Previously, in ISO-2022-JP/JIS7/JIS8, if an escape sequence (starting with 0x1B)
appeared where the 2nd byte of a multibyte character should have been, mbstring
would forget all about the truncated multibyte character and happily accept the
escape sequence. However, such sequences are not legal and should be flagged as
errors.

Also, any other illegal bytes appearing where the 2nd byte of a multibyte
character was expected were just passed through quietly to the output. Fix that.

Also add a test suite for both ISO-2022-JP and JIS7/JIS8. (These are extremely
similar encodings; JIS7 and JIS8 are variants of ISO-2022-JP. mbstring's 'JIS'
is actually a combination of JIS7 _and_ JIS8, since the extensions which each
one adds to ISO-2022-JP are disjoint.)
2021-01-14 22:31:31 +02:00
Alex Dowad d497c0e96f JIS7/JIS8 encoding: use JISX0201 for U+203E (overline)
In other legacy Japanese encodings like Shift-JIS, we are now using a specific
JISX 0208 character for the Unicode overline (U+203E). Previously, the single
byte 0x7E was used, but an ASCII 0x7E does not represent an overline, so this
was changed.

However, JIS7/JIS8 can represent characters in the JISX 0201 character set as
well. That character set also includes an overline character, which takes less
bytes to encode than the corresponding JISX 0208 character, so we'll use it.

This is what mbstring had been doing for a long time; but it changed as a
side effect of the recent changes to how U+203E is encoded in Shift-JIS, etc.
So change it back.
2021-01-14 22:26:24 +02:00
Alex Dowad 40384da36a JIS7/JIS8 encoding: treat unrecognized escapes as error 2021-01-14 22:26:24 +02:00
Alex Dowad c11e12ffe0 Add comment explaining why ISO-2022-JP-2004, etc strings end with ESC ( B
These encodings have multiple modes which can be selected via escape sequences.
The default starting mode is ASCII. If a string _ends_ in a different mode, we
emit a 'redundant' escape sequence to switch back to ASCII.

If the resulting string is never concatenated with other strings, that extra
escape sequence serves no purpose. But if the resulting string is concatenated
with other strings of the same encoding, it ensures that the resulting string
will be valid.
2021-01-14 22:26:24 +02:00
Alex Dowad 4b95fdf2ca ISO-2022-JP-2004 conversion: handle invalid characters correctly 2021-01-14 22:26:24 +02:00
Alex Dowad e14bdc041a ISO-2022-JP-2004 conversion: treat unrecognized escapes as error 2021-01-14 22:26:24 +02:00
Alex Dowad 4d65c2a992 ISO-2022-JP-2004 conversion: represent backslash and tilde as ASCII
This issue dates back to some commits I merged recently, which made encodings
like Shift-JIS-2004 use appropriate JIS X 0208 characters to represent
backslashes and tildes, rather than single-byte characters which are used in
those encodings with a different meaning (for example, in these encodings,
0x5C is used for a halfwidth Yen sign, rather than a backslash).

There was an unintended side effect: ISO-2022-JP-2004 was also made to
represent backslashes and tildes using JIS X 0208 characters. However,
ISO-2022-JP explicitly includes ASCII as one of its selectable character sets,
and ISO-2022-JP-2004 is just an extension of ISO-2022-JP. So when converting
text to ISO-2022-JP-2004, we can convert Unicode backslashes and tildes to ASCII
rather than using the corresponding JIS X 0208 characters.
2021-01-14 22:26:24 +02:00
Alex Dowad c9fea7db72 Convert U+00AF (MACRON) to 0x8150 (FULLWIDTH MACRON) in some SJIS variants
Except for vanilla Shift-JIS, where 0x7E is a halfwidth overline/macron.
As for Shift-JIS-2004, it has an added character (byte sequence 0x854A)
which was defined as a halfwidth macron in JIS X 0213:2000, so we use that.
2020-11-25 20:51:45 +02:00