mirror of https://github.com/php/php-src.git synced 2026-04-25 08:58:28 +02:00
Commit Graph

122045 Commits

Author SHA1 Message Date
Alex Dowad 4f3bd2e235 Convert U+203E (OVERLINE) to 0x8150 (FULLWIDTH MACRON) in some SJIS variants
Converting U+203E to 0x7E was especially wrong for CP932, where 0x7E
represents a tilde.

For vanilla Shift-JIS and Shift-JIS-2004, converting to 0x7E is acceptable,
since 0x7E does represent an overline/macron in those encodings.

Follow the same principle in CP51932, which is closely related to CP932.
2020-11-25 20:51:45 +02:00
Alex Dowad 0d0029d729 0x7E is not a tilde in Shift-JIS{,-2004} 2020-11-25 20:51:45 +02:00
Alex Dowad e4ee979111 0x5C is not a Yen sign in CP932 (or CP51932)
When Microsoft created CP932 (their version of Shift-JIS), they explicitly
used bytes 0-0x7F to represent ASCII characters rather than JIS X 0201
characters.

So when converting Unicode to CP932, it is not correct to convert U+00A5
to CP932 0x5C. Fortunately, CP932 does have a multi-byte FULLWIDTH YEN SIGN
character which we can use instead.

CP51932 uses the same extended character set as CP932; while CP932 is
Microsoft's extended version of Shift-JIS, CP51932 is their extended version
of EUC-JP. So the same reasoning applies to CP51932.
2020-11-25 20:51:45 +02:00
Alex Dowad 315d48b434 0x5C is not a backslash in Shift-JIS-2004
Shift-JIS-2004 is an extension of Shift-JIS, which uses 0x5C for the Yen
sign. Therefore, it is not correct to convert ASCII 0x5C (backslash) to
Shift-JIS-2004 0x5C (yen sign). JIS X 0208 does have a backslash, so we
can convert ASCII backslash to SJIS-2004 backslash instead.

From time immemorial, there has been confusion around the treatment
of 0x5C bytes on systems using legacy Japanese encodings. JIS X 0201
specified that 0x5C means a yen sign, and thus fonts on Japanese systems,
including early versions of Windows, displayed a 0x5C byte as a yen sign.
This meant that when ASCII text files were displayed on such systems,
what were meant to be backslashes would appear as yen signs. Japanese C
programmers could write character escapes using yen signs, and C compilers
built on the assumption that the input was ASCII would interpret these
escapes as desired. Likewise for shell scripts. Et cetera, et cetera...

Therefore, if the input to `mb_convert_encoding` is (for example) a C
program, and after converting to Shift-JIS-2004, the user wishes to feed
the output into a C compiler, *then* perhaps ASCII 0x5C should be mapped
to SJIS 0x5C. However, this scenario is ridiculous and will never happen.

A more realistic scenario might be: an article written in SJIS-2004 has
embedded Windows file paths (like 'C:\Program Files'), with yen signs used
as a path separator. If we convert SJIS-2004 0x5C to ASCII 0x5C, then the
path separators will be 'fixed' by the conversion.

For general written texts, it is much better to convert backslashes to...
backslashes. And yen signs, to yen signs.
2020-11-25 20:51:44 +02:00
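The mapping principle argued for in the three commits above (backslash to backslash, yen sign to yen sign, overline to overline) can be illustrated in C. This is a minimal sketch, not mbstring's conversion code: the enum, the function name `demo_unicode_to_sjis`, and the lookup structure are hypothetical. The two-byte values 0x818F (FULLWIDTH YEN SIGN) and 0x815F (JIS X 0208 backslash) are assumptions derived from the JIS X 0208 row/cell layout; the commits do not state them.

```c
#include <stdint.h>

/* Hypothetical names for illustration; not mbstring's API. */
enum demo_variant { DEMO_CP932, DEMO_SJIS_2004 };

/* Returns the encoded byte(s) packed into one integer (0x5C, 0x8150, ...),
 * or 0 if this sketch does not cover the codepoint. */
static uint16_t demo_unicode_to_sjis(enum demo_variant v, uint32_t cp)
{
    if (v == DEMO_CP932) {
        /* CP932: bytes 0x00-0x7F are plain ASCII. */
        switch (cp) {
        case 0x005C: return 0x5C;    /* backslash stays a backslash */
        case 0x007E: return 0x7E;    /* tilde stays a tilde */
        case 0x00A5: return 0x818F;  /* yen sign -> FULLWIDTH YEN SIGN (assumed value) */
        case 0x203E: return 0x8150;  /* overline -> FULLWIDTH MACRON */
        default:     break;
        }
    } else {
        /* Shift-JIS-2004: 0x5C is a yen sign, 0x7E an overline. */
        switch (cp) {
        case 0x005C: return 0x815F;  /* backslash -> JIS X 0208 backslash (assumed value) */
        case 0x00A5: return 0x5C;    /* yen sign -> single-byte yen sign */
        case 0x203E: return 0x7E;    /* overline -> single-byte overline */
        default:     break;
        }
    }
    return 0;
}
```

Under these assumptions, `demo_unicode_to_sjis(DEMO_CP932, 0x00A5)` yields 0x818F while `demo_unicode_to_sjis(DEMO_SJIS_2004, 0x00A5)` yields 0x5C: the same Unicode character lands on different bytes because the two encodings disagree about what 0x5C means.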
Alex Dowad 5c805655db Enhance handling of CP51932 encoding
- Don't pass 'control' characters through in the middle of a multi-byte char
- Treat truncated multi-byte characters as an error
2020-11-25 20:51:44 +02:00
Alex Dowad beef597124 Fix mbstring support for SJIS-Mobile (DoCoMo, KDDI, and Softbank variants of Shift-JIS)
Lots of problems here.

- Don't pass 'control' characters through silently in the middle of a
  multi-byte character.
- Treat it as an error if a multi-byte character is truncated.
- For ESC sequences used to encode emoji on earlier Softbank phones, if an
  invalid ESC sequence is found, don't pass it through. Rather, handle it as
  an error and respect `mb_substitute_character`.
- In ranges used by mobile vendors for emoji, if a certain byte sequence
  doesn't map to any emoji, don't emit a mangled value (actually a raw
  (ku*94)+ten value, which may not even be a valid Unicode codepoint at all).
- When converting Unicode to SJIS-Mobile, don't mangle codepoints which fall
  in the 2nd range of Microsoft vendor extensions.

Some vendor-specific emoji have been mapped to standard Unicode codepoints
now, rather than 'private use area' codepoints. When the legacy code was
written, these codepoints may not have existed yet in the Unicode standard
which was current at that time.

Also do a major code cleanup -- remove dead code, rearrange what is left,
use some new macros and helper functions to make the code clearer...
2020-11-25 20:51:44 +02:00
Alex Dowad bbbadae0ae Combine MBFL_ENCTYPE_MWC2{BE,LE} constants
These constants indicate that a text encoding uses 2+ bytes for each character,
and is either big endian or little endian (respectively). But nothing in
mbstring cares about the difference between MBFL_ENCTYPE_MWC2BE and
MBFL_ENCTYPE_MWC2LE.

(Actually, nothing cares about whether these flags are set at all...
maybe we should just remove them?)
2020-11-25 19:52:19 +02:00
Alex Dowad 72660c416a Combine MBFL_ENCTYPE_WCS{2,4}{BE,LE} constants
These flags identify text encodings in mbstring which use a constant number of
bytes per character. While some parts of the code do use these flags, usually
to detect cases which can be optimized due to constant-width encoding, nothing
cares whether the encodings are 'LE' (little-endian) or 'BE' (big-endian).

So we can simplify things by combining constants.
2020-11-25 19:52:19 +02:00
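The kind of optimization the commit above refers to might look like the following C sketch. The flag names, their values, and the helper `demo_fast_strlen` are hypothetical and chosen only for illustration; mbstring's real constants and call sites may differ. The point is that the width matters for the shortcut, while the byte order does not.

```c
#include <stddef.h>

/* Hypothetical flag values and names; mbstring's real constants may differ. */
#define DEMO_ENCTYPE_WCS2 0x01u  /* always 2 bytes per character */
#define DEMO_ENCTYPE_WCS4 0x02u  /* always 4 bytes per character */

/* Returns the character count for a constant-width encoding,
 * or (size_t)-1 when the caller has to decode the string instead.
 * Endianness never enters the calculation. */
static size_t demo_fast_strlen(unsigned flags, size_t byte_len)
{
    if (flags & DEMO_ENCTYPE_WCS2) return byte_len / 2;
    if (flags & DEMO_ENCTYPE_WCS4) return byte_len / 4;
    return (size_t)-1;
}
```

For example, `demo_fast_strlen(DEMO_ENCTYPE_WCS2, 10)` returns 5 whether the 10 bytes are big-endian or little-endian, which is why separate 'BE' and 'LE' flags add nothing here.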
Alex Dowad 5ffcf563bd Don't pass invalid JIS X 0212, JIS X 0213, and Windows-CP932 characters through
Similarly to JIS X 0208, mbstring would pass kuten codes which are not mapped
in the JIS X 0212, JIS X 0213, or CP932 character sets through silently when
converting to another Japanese encoding.
2020-11-25 19:52:19 +02:00
Alex Dowad 8ae0473324 Don't pass invalid JIS X 0208 characters through
Many Japanese encodings, such as JIS7/8, Shift JIS, ISO-2022-JP, EUC-JP, and
so on encode characters from the JIS X 0208 character set. JIS X 0208 is based
on the concept of a 94x94 table, with numbered rows and columns. However,
more than a thousand of the cells in that table are empty; JIS X 0208 does not
actually use all 94x94=8,836 possible kuten codes.

mbstring had a dubious feature whereby, if a Japanese string contained one of
these 'unmapped' kuten codes, and it was being converted to another Japanese
encoding which was also based on JIS X 0208, the non-existent character would
be silently passed through, and the unmapped kuten code would be re-encoded
using the normal encoding method of the target text encoding.

Again, this _only_ happened if converting the text with the funky kuten code
to a Japanese encoding. If one tried converting it to Unicode, mbstring would
treat that as an error.

If somebody, somewhere, made their own private extension to JIS X 0208, and
used the regular Japanese encodings like Shift JIS and EUC-JP to encode this
private character set, then this feature might conceivably be useful. But how
likely is that? If someone is using Shift JIS, EUC-JP, ISO-2022-JP, etc. to
encode a funky version of JIS X 0208 with extra characters added, then that
should be treated as a separate text encoding.

The code which flags such characters with MBFL_WCSPLANE_JIS0208 is retained
solely for error reporting in `mbfl_filt_conv_illegal_output`.
2020-11-25 19:52:19 +02:00
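To make the 94x94 table in the commit above concrete, here is a minimal C sketch of how a kuten code (row `ku`, cell `ten`, both 1-based) could be re-encoded as Shift JIS or EUC-JP bytes. The function names are hypothetical and this is not mbstring's code. The arithmetic is the standard byte layout for the two encodings as I understand it, and it does not check whether a cell is actually assigned, which is exactly why an unmapped cell could be "passed through" mechanically.

```c
#include <stdint.h>

/* Linear position of a kuten code in the 94x94 table (0..8835). */
static int demo_kuten_index(int ku, int ten)
{
    return (ku - 1) * 94 + (ten - 1);
}

/* Encode a kuten code as two Shift JIS bytes packed into one integer.
 * Returns 0 if ku or ten is outside 1..94. */
static uint16_t demo_kuten_to_sjis(int ku, int ten)
{
    if (ku < 1 || ku > 94 || ten < 1 || ten > 94) return 0;
    /* Two rows share one lead byte; rows 63..94 use the 0xE0.. range. */
    int lead = (ku - 1) / 2 + (ku <= 62 ? 0x81 : 0xC1);
    int trail;
    if (ku % 2 == 1)
        trail = ten + (ten <= 63 ? 0x3F : 0x40);  /* skips byte 0x7F */
    else
        trail = ten + 0x9E;
    return (uint16_t)((lead << 8) | trail);
}

/* Encode a kuten code as two EUC-JP bytes packed into one integer.
 * Returns 0 if ku or ten is outside 1..94. */
static uint16_t demo_kuten_to_eucjp(int ku, int ten)
{
    if (ku < 1 || ku > 94 || ten < 1 || ten > 94) return 0;
    return (uint16_t)(((ku + 0xA0) << 8) | (ten + 0xA0));
}
```

With this layout, row 1 cell 17 becomes Shift JIS 0x8150, the FULLWIDTH MACRON byte pair mentioned earlier in this log. Any other (ku, ten) pair in range also produces bytes, assigned or not, so rejecting unmapped cells requires a separate table lookup.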
Alex Dowad 2759874a42 Enhance handling of CP932 text encoding
- Don't allow control characters to appear in the middle of a multi-byte
  character. (This was a strange feature of mbstring; it doesn't make much
  sense, and iconv doesn't allow it.)
- Treat truncated multi-byte characters as an error.
2020-11-25 19:52:19 +02:00
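The two rules in the commit above (no control characters inside a multi-byte character, and truncation is an error) can be sketched as a small C validator. This is an illustration only: the names are hypothetical, and the lead-byte ranges 0x81-0x9F and 0xE0-0xFC and trail-byte range 0x40-0xFC (excluding 0x7F) are assumed for a CP932-like encoding. Every byte that is not a lead byte is treated as a single-byte character, which is looser than a real decoder would be.

```c
#include <stddef.h>

enum demo_status { DEMO_OK, DEMO_BAD_TRAIL, DEMO_TRUNCATED };

/* Assumed lead-byte ranges for a CP932-like encoding. */
static int demo_is_lead(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Assumed trail-byte range; control characters (below 0x40) are rejected. */
static int demo_is_trail(unsigned char b)
{
    return b >= 0x40 && b <= 0xFC && b != 0x7F;
}

/* Scan the buffer and report the first problem found, if any. */
static enum demo_status demo_check(const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (!demo_is_lead(s[i])) continue;               /* single-byte character */
        if (i + 1 == len) return DEMO_TRUNCATED;         /* lead byte at end of input */
        if (!demo_is_trail(s[i + 1])) return DEMO_BAD_TRAIL;
        i++;                                             /* consume the trail byte */
    }
    return DEMO_OK;
}
```

A lead byte followed by a newline (0x0A) is reported as `DEMO_BAD_TRAIL` rather than being passed through, and a lead byte as the final byte is reported as `DEMO_TRUNCATED`.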
Alex Dowad b489c1bc4d Bugfixes for findInvalidChars (helper for mbstring test suite) 2020-11-25 19:52:19 +02:00
Nikita Popov 635e0539a2 Merge branch 'PHP-8.0'
* PHP-8.0:
  Add UPGRADING note for PDO::inTransaction()

[ci skip]
2020-11-25 17:28:38 +01:00
Nikita Popov 306555e11d Add UPGRADING note for PDO::inTransaction()
[ci skip]
2020-11-25 17:28:23 +01:00
Nikita Popov c3d113653d Merge branch 'PHP-8.0'
* PHP-8.0:
  Fixed bug #80411
2020-11-25 17:25:08 +01:00
Nikita Popov 217f247bb5 Merge branch 'PHP-7.4' into PHP-8.0
* PHP-7.4:
  Fixed bug #80411
2020-11-25 17:24:49 +01:00
Nikita Popov 2fb12be84c Fixed bug #80411
References to null-serializations are stored as null, and as such
are part of the reference count.

Reminds me that we really need to deprecate the mess that is
Serializable.
2020-11-25 17:23:42 +01:00
Nikita Popov de8bae4d5c Merge branch 'PHP-8.0'
* PHP-8.0:
  Fix unserialization ref source management, again
2020-11-25 17:05:25 +01:00
Nikita Popov f5b93626a6 Fix unserialization ref source management, again
Handle one case the previous patch did not account for: If
unserialization of data fails, we should still register a ref
source.

Also add an extra test for a reference between two typed properties,
as this used to be handled incorrectly earlier.
2020-11-25 17:04:07 +01:00
Nikita Popov bd1c2ede99 Merge branch 'PHP-8.0'
* PHP-8.0:
  Fixed error reporting in mysqli_stmt::__construct
2020-11-25 16:29:22 +01:00
Nikita Popov 518eb0ca2b Merge branch 'PHP-7.4' into PHP-8.0
* PHP-7.4:
  Fixed error reporting in mysqli_stmt::__construct
2020-11-25 16:29:00 +01:00
Dharman 233f507fe6 Fixed error reporting in mysqli_stmt::__construct
For the sake of simplicity, I've synchronized the implementation
with PHP 8, which means null values are also accepted.

Closes GH-6454.
2020-11-25 16:27:41 +01:00
Nikita Popov 18bc3c1a3d Merge branch 'PHP-8.0'
* PHP-8.0:
  Fix phpt reindentation in tidy script
  Reindent more mysqli tests
2020-11-25 16:08:36 +01:00
Nikita Popov e3e6e1e59b Merge branch 'PHP-7.4' into PHP-8.0
* PHP-7.4:
  Reindent more mysqli tests
2020-11-25 16:08:28 +01:00
Nikita Popov 367f8452c4 Fix phpt reindentation in tidy script
This was missing adjacent SKIPIF/FILE/CLEAN sections.
2020-11-25 16:07:56 +01:00
Nikita Popov e3e67b721f Reindent more mysqli tests
Due to a bug in the tidy script, most tests did not actually get
reindented...
2020-11-25 16:07:16 +01:00
Nikita Popov 5b224fb86c Merge branch 'PHP-8.0'
* PHP-8.0:
  Reindent ext/mysqli tests
  Allow running tidy.php on specific directory
2020-11-25 15:58:39 +01:00
Nikita Popov 308050e94b Merge branch 'PHP-7.4' into PHP-8.0
* PHP-7.4:
  Reindent ext/mysqli tests
2020-11-25 15:58:21 +01:00
Nikita Popov 97d192b444 Reindent ext/mysqli tests
Reindent ext/mysqli tests on PHP-7.4, so they match with the
indentation on PHP-8.0. Otherwise merging test changes across
branches is very unpleasant.
2020-11-25 15:57:11 +01:00
Nikita Popov 373fd61a51 Allow running tidy.php on specific directory 2020-11-25 15:54:26 +01:00
Nikita Popov bb42738040 Merge branch 'PHP-8.0'
* PHP-8.0:
  Fix ref source management during unserialization
2020-11-25 12:28:15 +01:00
Nikita Popov 7a3f25e370 Fix ref source management during unserialization
Only register the slot for adding ref sources later if we didn't
immediately register one. Also avoids leaking a ref source if
it is added early and the assignment fails.

Fixes oss-fuzz #27628.
2020-11-25 12:25:07 +01:00
Nikita Popov 2b7d5eddfc Merge branch 'PHP-8.0'
* PHP-8.0:
  sockets: Fix variable/macro name collision on AIX
2020-11-25 11:55:05 +01:00
Calvin Buckley e074e029ee sockets: Fix variable/macro name collision on AIX
The name "rem_size" is used by a macro in a system header on AIX,
specifically `sys/xmem.h`. Without changing the name, you get the
name mangled like so:

```
In file included from /usr/include/sys/uio.h:92:0,
                 from /QOpenSys/pkgs/lib/gcc/powerpc-ibm-aix6.1.0.0/6.3.0/include-fixed-7.1/sys/socket.h:83,
                 from /usr/include/sys/syslog.h:151,
                 from /usr/include/syslog.h:29,
                 from /home/calvin/rpmbuild/BUILD/php-8.0.0RC5/main/php_syslog.h:27,
                 from /home/calvin/rpmbuild/BUILD/php-8.0.0RC5/main/php.h:318,
                 from /home/calvin/rpmbuild/BUILD/php-8.0.0RC5/ext/sockets/sendrecvmsg.c:17:
/home/calvin/rpmbuild/BUILD/php-8.0.0RC5/ext/sockets/sendrecvmsg.c: In function 'zif_socket_cmsg_space':
/home/calvin/rpmbuild/BUILD/php-8.0.0RC5/ext/sockets/sendrecvmsg.c:298:10: error: expected '=', ',', ';', 'asm' or '__attribute__' before '.' token
   size_t rem_size = ZEND_LONG_MAX - entry->size;
          ^
/home/calvin/rpmbuild/BUILD/php-8.0.0RC5/ext/sockets/sendrecvmsg.c:298:10: error: expected expression before '.' token
/home/calvin/rpmbuild/BUILD/php-8.0.0RC5/ext/sockets/sendrecvmsg.c:299:18: error: 'u2' undeclared (first use in this function)
   size_t n_max = rem_size / entry->var_el_size;
                  ^
/home/calvin/rpmbuild/BUILD/php-8.0.0RC5/ext/sockets/sendrecvmsg.c:299:18: note: each undeclared identifier is reported only once for each function it appears in
```

...because of the declaration in `sys/xmem.h`:

```
```

This just renames the variable so that it won't trip on this
definition. Tested to fix the build on IBM i PASE.

Closes GH-6453.
2020-11-25 11:54:40 +01:00
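The class of problem described in the commit above can be reproduced with a small C sketch. Everything here is a hypothetical stand-in: `struct demo_desc`, the `demo_len` macro, and the helper functions are invented for illustration and are not the contents of `sys/xmem.h` or of `sendrecvmsg.c`. The sketch assumes only that a header defines a field-access macro whose name happens to match a local variable.

```c
#include <stddef.h>

/* Hypothetical stand-in for a system header; NOT the contents of sys/xmem.h. */
struct demo_desc {
    struct { size_t demo_len; } u;
};
#define demo_len u.demo_len  /* convenience macro: d.demo_len -> d.u.demo_len */

/* After the #define above, this declaration would no longer compile:
 *     size_t demo_len = limit - used;   (expands to: size_t u.demo_len = ...)
 * Giving the local a name the header does not claim avoids the clash. */
static size_t demo_max_elements(size_t limit, size_t used, size_t el_size)
{
    size_t remaining = limit - used;  /* renamed local, untouched by the macro */
    return remaining / el_size;
}

/* The macro still works for its intended purpose. */
static size_t demo_desc_len(const struct demo_desc *d)
{
    return d->demo_len;  /* expands to d->u.demo_len */
}
```

Renaming the local is the least invasive fix: it leaves the system header's macro intact for code that relies on it, whereas an `#undef` could break later uses of the header in the same translation unit.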
Nikita Popov 7db29d2186 Merge branch 'PHP-8.0'
* PHP-8.0:
  Fixed bug #80377
2020-11-25 11:48:51 +01:00
Nikita Popov 4633e70ab1 Fixed bug #80377
Make sure the $PHP_THREAD_SAFETY variable is always available
when configuring extensions. It was previously available for
phpized extensions, but for in-tree builds it was being set
too late.

Then, use $PHP_THREAD_SAFETY instead of $enable_zts to check for
ZTS in bundled extensions, which makes sure these checks also
work for phpize builds.
2020-11-25 11:47:05 +01:00
Christopher Jones 30770e576a Merge branch 'PHP-8.0' into master
* PHP-8.0:
  Fix test diff
2020-11-25 16:42:52 +11:00
Christopher Jones 37f96d990c Fix test diff 2020-11-25 16:42:12 +11:00
Dmitry Stogov e2227ddb05 Merge branch 'PHP-8.0'
* PHP-8.0:
  Use different temporary register (%r0 may keep a method address)
2020-11-25 03:50:55 +03:00
Dmitry Stogov cb399d0410 Use different temporary register (%r0 may keep a method address) 2020-11-25 03:49:42 +03:00
Dmitry Stogov c4f4406349 Merge branch 'PHP-8.0'
* PHP-8.0:
  Revert "Fixed bug #80377"
2020-11-25 01:13:21 +03:00
Dmitry Stogov 7fc2a3e15e Revert "Fixed bug #80377"
This reverts commit fc26ad9b12.
2020-11-25 01:10:26 +03:00
Christoph M. Becker b324ab6f15 Merge branch 'PHP-8.0'
* PHP-8.0:
  [ci skip] Fix misspelled method names
2020-11-24 18:19:07 +01:00
Florian Engelhardt cdd5ec7a3c [ci skip] Fix misspelled method names
Closes GH-6452.
2020-11-24 18:18:40 +01:00
Nikita Popov 6ae7bceecf Merge branch 'PHP-8.0'
* PHP-8.0:
  Fix usage of casted string in ReflectionParameter ctor
2020-11-24 16:43:02 +01:00
Nikita Popov 70f59b3416 Merge branch 'PHP-7.4' into PHP-8.0
* PHP-7.4:
  Fix usage of casted string in ReflectionParameter ctor
2020-11-24 16:42:52 +01:00
Nikita Popov 706241f82d Fix usage of casted string in ReflectionParameter ctor
Fixes oss-fuzz #27755.
2020-11-24 16:42:16 +01:00
Nikita Popov 36cfa11198 Merge branch 'PHP-8.0'
* PHP-8.0:
  Fixed bug #80377
2020-11-24 15:53:24 +01:00
Nikita Popov fc26ad9b12 Fixed bug #80377
Use $PHP_THREAD_SAFETY instead of $enable_zts to check for ZTS.
This variable is also available for phpize builds, while enable_zts
is only present for in-tree builds.
2020-11-24 15:52:41 +01:00
Nikita Popov 14eabd9523 Merge branch 'PHP-8.0'
* PHP-8.0:
  Fixed bug #80393
2020-11-24 15:27:50 +01:00