1
0
mirror of https://github.com/php/php-src.git synced 2026-04-30 03:33:17 +02:00
Commit Graph

2088 Commits

Author SHA1 Message Date
Máté Kocsis 8e6e9838b0 Add support for generating MAY_BE_ARRAY_OF_REF func info flag (#7416) 2021-08-30 13:50:34 +02:00
Nikita Popov 634f2e21d3 Don't expose wchar encoding to users (#7415)
The "wchar" encoding isn't really an encoding -- it's what we
internally use as the representation of decoded characters.

In practice, it tends to behave a lot like the 8bit encoding when
used from userland, because input code units end up being treated
as code points.

This patch removes the wchar encoding from the public encoding
list and reserves it for internal use only.
2021-08-30 11:11:33 +02:00
Nikita Popov 43cb2548f7 Flush filter during non-strict encoding detection
If we reach the end of the string without reducing to a single
encoding, then we should flush to check whether the last character
is incomplete.
2021-08-27 14:48:32 +02:00
Máté Kocsis fdc6082902 Generate optimizer func info from stubs for various extensions (#7409)
ext/hash, ext/iconv, ext/mbstring, ext/xml, ext/zlib
2021-08-26 19:52:11 +02:00
Nikita Popov 2dafb0e30f Add comments to grouped character properties
[ci skip]
2021-08-24 22:09:26 +02:00
Nikita Popov 425c2e3ba1 Combine control into one character group
Same as with punct, we're currently not interested in distinguishing
between Cc and Cf, so only store their union.
2021-08-24 20:39:16 +02:00
Nikita Popov f458b16041 Combine punctuation into one character group
We're not currently interested in distinguishing between
individual punctuation types, so just merge everything into one
general category to make the property lookup more efficient.
2021-08-24 19:21:21 +02:00
Nikita Popov d2073179e3 Return bool from php_unicode_is_prop() 2021-08-24 19:21:21 +02:00
Nikita Popov 3be94217f4 Don't use sentinel value for unicode property lookup
0xffff was used to mark character properties without any members.
This made the code unnecessarily complicated, because we need to
check for 0xffff values when looking up the property ranges. We
can simply encode this as an empty set of ranges.
2021-08-24 15:53:43 +02:00
Nikita Popov 14173186db Add EXTENSIONS section 2021-08-11 14:03:18 +02:00
Nikita Popov 28500fe4ef Fixed bug #81349
The ascii to wchar was reporting errors using conv_illegal_output,
while it should have been using WCSGROUP_THROUGH. Effectively that
replaced illegal characters with '?' for the purpose of
identification.
2021-08-11 11:37:02 +02:00
Nikita Popov 89aa42c74b Add missing EXTENSIONS section 2021-07-28 12:38:41 +02:00
Nikita Popov a1c1ee6a48 Don't use opaque for encoding detection score
opaque is used by the htmlentities filter, which means that we
end up trying to free the score value as a pointer. Don't try to
be overly tricky here and simply allocate a separate structure
to hold the number of illegal characters and the score.
2021-07-28 10:54:27 +02:00
Nikita Popov 9d0db2e98a Fixed bug #81298
Creation of the filter may fail for some special encodings, for
which detection is not supported.
2021-07-28 10:11:46 +02:00
Alex Dowad 26fc7c4256 Fix typo in mbfilter.h
As pointed out by Bruno Haible (https://haible.de/bruno).
2021-07-19 12:17:00 +02:00
Alex Dowad 13136a575d Fix conversion of GB18030 text (and add test suite)
- Truncated multi-byte characters are treated as an error
- Reject GB18030 4-byte codes which translate to (non-existent)
  Unicode codepoints above 0x10FFFF
- Add a number of missing mappings from the GB18030 standards
  (These mappings are supported by iconv. I don't know why they were
  missing from mbstring.)
2021-07-19 12:17:00 +02:00
Alex Dowad 340164bcc9 Reduce size of conversion tables for CP936 2021-07-19 12:17:00 +02:00
Alex Dowad 73c6a5b89d Fix conversion of Big5 and CP950 text (and add test suite)
- Truncated multi-byte characters are treated as an error
- Follow recommended mappings from Unicode consortium
2021-07-19 12:17:00 +02:00
Nikita Popov 639015845f Deprecate calling mb_check_encoding() without argument
Part of https://wiki.php.net/rfc/deprecations_php_8_1.
2021-07-08 15:34:49 +02:00
Alex Dowad b626e893ff Fix conversion of ISO-2022-KR text (and add test suite)
- Truncated multi-byte characters are treated as an error
- Truncated or unrecognized escape sequences are treated as an error
- ASCII control characters are not allowed to appear in the middle
  of a multi-byte character
2021-07-05 16:28:16 +02:00
Alex Dowad 658db1f6ea Code cleanup in mbfilter_uhc.c 2021-07-05 16:28:16 +02:00
Alex Dowad 0a8c00755d Fix conversion of EUC-JP-2004 text (and add test suite)
- Truncated multi-byte characters are treated as an error now
- Invalid multi-byte characters are treated as an error rather than
  being quietly swallowed
- ASCII control characters are not allowed to appear in the middle
  of a multi-byte character
2021-07-05 16:28:16 +02:00
Alex Dowad ff85ed8adc Fix conversion of EUC-TW text (and add test suite)
- Treat text which ends abruptly in the middle of a multi-byte
  character as erroneous.
- Don't allow ASCII control characters to appear in the middle of a
  multi-byte character.
- If an illegal byte appears in the middle of a multi-byte character,
  go back to the initial state rather than trying to finish the
  multi-byte character.
- There was a bug in the file with the conversion tables, which set the
  'maximum codepoint which can be converted using table A2' using the
  size of table A1, not table A2. This meant that several hundred
  Unicode codepoints which should have been able to be converted to
  EUC-TW were flagged as erroneous instead.
- When a sequence which cannot possibly be a prefix of a valid
  multi-byte character is found, immediately flag it as an error, rather
  than waiting to read more bytes first.
- Allow characters in CNS-11643 plane 1 to be encoded as 4-byte
  sequences (although they can also be encoded as 2-byte sequences).
  This is allowed by the standard for EUC-TW text.
2021-06-29 12:25:21 +02:00
Alex Dowad 8b25e38b21 Fix conversion of EUC-CN text (and add test suite)
- Flag truncated multi-byte characters as erroneous.
- Don't allow ASCII control characters to appear in the middle of a
  multi-byte character.
- There was a bug whereby some unrecognized Unicode codepoints would be
  passed through unchanged to the output when converting Unicode to
  EUC-CN.
- Stick to the original EUC-CN standard, rather than CP936 (an extended
  version invented by MS).
2021-06-29 12:25:21 +02:00
Alex Dowad 69c979aaea Fix conversion of EUC-KR text (and add test suite)
- Treat truncated multi-byte characters as an error.
- Don't allow ASCII control characters to appear in the middle of a
  multi-byte character.
- There was also a bug whereby some unrecognized Unicode codepoints
  would be passed through to the output unchanged when converting
  Unicode to EUC-KR.
2021-06-29 12:25:21 +02:00
Alex Dowad ddea06699b Remove table generation scripts which have not been used for years 2021-06-29 12:25:21 +02:00
Alex Dowad ebae1a4524 Fix conversion of CP936 text (and add test suite)
- Treat truncated multi-byte characters as an error.
- Don't allow ASCII control characters to appear in the middle of a
  multi-byte character.
- Adjust some mappings to match recommendations in conversion table
  from Unicode Consortium.
2021-06-29 12:25:21 +02:00
Alex Dowad 1e5c3c13fd Fix conversion of HZ text (and add test suite)
- Treat truncated multi-byte characters as an error.
- Don't allow ASCII control characters to appear in the middle of a
  multi-byte character.
- Handle ~ escapes according to the HZ standard (RFC 1843).
- Treat unrecognized ~ escapes as an error.
- Multi-byte characters (between ~{ ~} escapes) are GB2312, not CP936.
  (CP936 is an extended version from MicroSoft, but the RFC does not
  state that this extended version of GB should be used.)
2021-06-29 12:25:21 +02:00
Patrick Allaert aff365871a Fixed some spaces used instead of tabs 2021-06-29 11:30:26 +02:00
Alex Dowad b1ab76f742 Minor formatting tweaks in mbfilter_euc_kr.c 2021-06-17 13:12:40 +02:00
Alex Dowad 958ef47d2b When flushing CP5022x conversion filter, also flush next filter in chain
All the mbstring encoding conversion filters do this. I missed it when
adding a flush function for CP5022x.
2021-06-17 13:12:40 +02:00
Alex Dowad caeaa662ab Strict conversion of UHC text to Unicode
Previously, mbstring would accept a lot of things which were not valid
UHC text. No more.

- Don't allow single-byte control characters to appear where the 2nd
  byte of a multi-byte character should be.
- Validate that the 2nd byte of a multi-byte character is in the
  expected range.
- Treat it as an error if a multi-byte character is truncated.

Also add a test suite to confirm that UHC conversion (both to and from
Unicode) works according to spec.
2021-06-17 13:12:40 +02:00
Alex Dowad 4550036d96 Minor formatting tweaks in mbfilter_uhc.c 2021-06-17 13:12:40 +02:00
Alex Dowad 9868c17368 Mark CP932 and CP51932 encoding tests as 'slow tests' 2021-06-17 13:12:40 +02:00
Alex Dowad e2459857af Remove duplicate implementation of CP932 from mbstring
Sigh. Double sigh. After fruitlessly searching the Internet for information on
this mysterious text encoding called "SJIS-open", I wrote a script to try
converting every Unicode codepoint from 0-0xFFFF and compare the results from
different variants of Shift-JIS, to see which one "SJIS-open" would be most
similar to.

The result? It's just CP932. There is no difference at all. So why do we have
two implementations of CP932 in mbstring?

In case somebody, somewhere is using "SJIS-open" (or its aliases "SJIS-win" or
"SJIS-ms"), add these as aliases to CP932 so existing code will continue to
work.
2021-06-17 13:12:40 +02:00
Alex Dowad 7502c86342 Add test suite for UTF-{7,8,16,32}
Also fix a couple small problems with UTF-32 and UTF-8 support:

- UTF-32 would pass very large codepoints (>= 0x80000000), which are
  not valid.
- UTF-8 would sometimes emit two error marker characters for a single
  bad input byte.
2021-06-17 13:12:40 +02:00
Nikita Popov a06d015e61 Remove unnecessary mbstring skipifs
These functions are always available (if the extension is available
at all).
2021-06-14 15:27:28 +02:00
Nikita Popov 6600ad6067 Add some missing EXTENSIONS sections to misc tests 2021-06-14 14:52:44 +02:00
Nikita Popov 4083600bd5 Port mbstring to use EXTENSIONS 2021-06-11 14:00:43 +02:00
Nikita Popov 39131219e8 Migrate more SKIPIF -> EXTENSIONS (#7139)
This is a mix of more automated and manual migration. It should remove all applicable extension_loaded() checks outside of skipif.inc files.
2021-06-11 12:58:44 +02:00
Nikita Popov 7485978339 Migrate SKIPIF -> EXTENSIONS (#7138)
This is an automated migration of most SKIPIF extension_loaded checks.
2021-06-11 11:57:42 +02:00
Ayesh Karunaratne b8e380ab09 Update deprecation message for incompatible float to int conversion
Updates the deprecation message for implicit incompatible float to int conversion from:

```
Implicit conversion from non-compatible float %.*H to int in %s on line %d
```

to

```
Implicit conversion from float %.*H to int loses precision in %s on line %d
```

Related: #6661
2021-06-07 14:36:11 +02:00
George Peter Banyard b6958bb847 Implement "Deprecate implicit non-integer-compatible float to int conversions" RFC. (#6661)
RFC: https://wiki.php.net/rfc/implicit-float-int-deprecate

Co-authored-by: Nikita Popov <nikita.ppv@gmail.com>
2021-05-31 15:48:45 +01:00
George Peter Banyard e7135cb817 Use zend_string_equals_* API in a couple of more place
Closes GH-6979
2021-05-14 13:45:17 +01:00
George Peter Banyard aca6aefd85 Remove 'register' type qualifier (#6980)
The compiler should be smart enough to optimize this on its own
2021-05-14 13:38:01 +01:00
George Peter Banyard c40231afbf Mark various functions with void arguments.
This fixes a bunch of [-Wstrict-prototypes] warning,
because in C func() and func(void) have different semantics.
2021-05-12 14:55:53 +01:00
KsaR 01b3fc03c3 Update http->https in license (#6945)
1. Update: http://www.php.net/license/3_01.txt to https, as there is anyway server header "Location:" to https.
2. Update few license 3.0 to 3.01 as 3.0 states "php 5.1.1, 4.1.1, and earlier".
3. In some license comments is "at through the world-wide-web" while most is without "at", so deleted.
4. fixed indentation in some files before |
2021-05-06 12:16:35 +02:00
Christoph M. Becker 592cfa309e Merge branch 'PHP-8.0'
* PHP-8.0:
  Fix #81011: mb_convert_encoding removes references from arrays
2021-05-04 18:40:23 +02:00
Christoph M. Becker d1c0cbdcb1 Merge branch 'PHP-7.4' into PHP-8.0
* PHP-7.4:
  Fix #81011: mb_convert_encoding removes references from arrays
2021-05-04 18:39:39 +02:00
Christoph M. Becker 0cafd53d18 Fix #81011: mb_convert_encoding removes references from arrays
We need to dereference references.

Closes GH-6938.
2021-05-04 18:37:40 +02:00