mirror of https://github.com/php/php-src.git synced 2026-04-26 09:28:21 +02:00
1354 Commits

Author SHA1 Message Date
Nikita Popov c98714f19e Merge branch 'PHP-7.2' 2017-08-03 21:57:35 +02:00
Nikita Popov fb9bf5b64b Revert/fix substitution character fallback
The introduced checks were not correct in two respects:
 * It was checked whether the source encoding of the string matches
   the internal encoding, while the actually relevant encoding is
   the *target* encoding.
 * Even if the correct encoding is used, the checks are still too
   conservative. Just because something is not a "Unicode-encoding"
   does not mean that it does not map any non-ASCII characters.

I've reverted the added checks and instead adjusted mbfl_convert
to first try to use the provided substitution character and if
that fails, perform the fallback to '?' at that point. This means
that any codepoint mapped in the target encoding should now be
correctly supported and anything else should fall back to '?'.
2017-08-03 21:53:59 +02:00
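
The fallback order described in this commit can be sketched as follows. This is only an illustration in Python, not the mbfl_convert code: Python's codecs stand in for the target encoding, and the function name `convert_with_substitute` is hypothetical.

```python
def convert_with_substitute(text, target, substitute="?"):
    """Encode text into the target encoding, one character at a time.

    Unmappable characters are replaced by the substitute character; if the
    substitute itself is not mapped in the target encoding, fall back to '?'.
    Assumes a stateless, ASCII-compatible target encoding (an assumption of
    this sketch, not something the commit states).
    """
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(target)
        except UnicodeEncodeError:
            try:
                # First try the substitution character the caller provided.
                out += substitute.encode(target)
            except UnicodeEncodeError:
                # Only now perform the fallback to '?'.
                out += b"?"
    return bytes(out)


if __name__ == "__main__":
    # U+00BF is mapped in Latin-1, so it is used as the substitute.
    print(convert_with_substitute("a\u20acb", "latin-1", "\u00bf"))
    # U+00BF is not mapped in ASCII, so '?' is used instead.
    print(convert_with_substitute("a\u20acb", "ascii", "\u00bf"))
```

The decision is made against the target encoding at conversion time, which is the point of the commit: no up-front check against the internal encoding is needed.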
Nikita Popov 3d948d77d1 Merge branch 'PHP-7.2' 2017-08-03 21:17:26 +02:00
Nikita Popov a8a9e93e9a Revert/fix mb_substitute_character() codepoint checks
The introduced checks did not treat "non-Unicode" encodings correctly,
because they treated the passed integer as encoded in the internal
encoding in that case, while in actuality the substitute character
is always a Unicode codepoint.

Additionally checking the codepoint against the internal encoding
is not correct in any case, because the substitution character must
be mapped in the *target* encoding of the conversion, which does
not necessarily coincide with the internal encoding (the internal
encoding is the default *source* encoding, not *target* encoding).

This reverts the checks back to simple range checks, but in a way
that still resolves #69079: Characters outside the Basic
Multilingual Plane are now accepted and Surrogate Codepoints are
rejected. A distinction between UTF-8 and non-UTF-8 encodings is
not made for surrogate checks (as in the original patch), as
surrogates are always illegal on their own. Specifying a surrogate
as substitution character would only make sense if you could
specify a substitution string with more than one character --
however we do not support that.
2017-08-03 21:12:41 +02:00
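
A simple range check of the kind described here might look like the sketch below. The function name is hypothetical, and the exact bounds used by mb_substitute_character() (in particular the lower bound) are an assumption; only the acceptance of non-BMP codepoints and the rejection of surrogates come from the commit message.

```python
def is_valid_substitute_codepoint(cp):
    """Return True if cp is usable as a substitution codepoint.

    Assumed rule: any Unicode codepoint up to U+10FFFF is accepted,
    except the surrogate range U+D800..U+DFFF, independent of encoding.
    """
    if cp < 0 or cp > 0x10FFFF:
        return False
    if 0xD800 <= cp <= 0xDFFF:
        # Surrogates are never valid on their own.
        return False
    return True


if __name__ == "__main__":
    for cp in (0x3F, 0x1F600, 0xD800, 0x110000):
        print(hex(cp), is_valid_substitute_codepoint(cp))
```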
Nikita Popov 94fe629992 Merge branch 'PHP-7.2' 2017-08-02 18:11:17 +02:00
Nikita Popov 91240073ea Merge branch 'PHP-7.1' into PHP-7.2 2017-08-02 18:11:12 +02:00
Nikita Popov 63607375f5 Merge branch 'PHP-7.0' into PHP-7.1 2017-08-02 18:09:09 +02:00
Fabien Villepinte 2cc1cbf2f4 Fix Bug #75001: Wrong reflection on mb_eregi_replace 2017-08-02 18:08:42 +02:00
Anatol Belski f9c3ee9ae8 fix c89 compat 2017-07-28 22:18:51 +02:00
Nikita Popov f4a1d9c821 Fixed bug #65544 and #71298 2017-07-28 14:57:08 +02:00
Nikita Popov 25b6e68432 Merge branch 'PHP-7.2' 2017-07-28 13:03:35 +02:00
Nikita Popov 5d777e56e2 Merge branch 'PHP-7.1' into PHP-7.2 2017-07-28 13:03:26 +02:00
Nikita Popov c48c638aeb Merge branch 'PHP-7.0' into PHP-7.1 2017-07-28 13:03:02 +02:00
Nikita Popov e3d25e78eb Fixed bug #62934 2017-07-28 13:02:25 +02:00
Nikita Popov 582a65b06f Implement full case mapping
Implement full case mapping according to SpecialCasing.txt and
also full case folding according to CaseFolding.txt (F). There
are a number of caveats:

* Only language-agnostic and unconditional full case mapping
  is implemented. The only language-agnostic conditional case
  mapping rule relates to Greek sigma in final position
  (Final_Sigma). Correctly handling this requires both arbitrary
  lookahead and lookbehind, which would require some larger
  changes to how the case mapping is implemented. This is a
  possible future extension.
* The only language-specific handling that is implemented is
  for Turkish dotted/undotted Is, if the ISO-8859-9 encoding
  is used. This matches the previous behavior and makes sure
  that no codepoints not supported by the encoding are
  produced. A future extension would be to also handle the
  Turkish mappings specified by SpecialCasing.txt based on
  the mbfl internal language.
* Full case folding is implemented, but case-insensitive mb_*
  operations continue to use simple case folding. The reason is
  that full case folding of the haystack string may change the
  position at which a match occurred. This would have to be
  mapped back into the position in the original string.
* mb_convert_case() exposes both the full and the simple case
  mapping / folding, where full is the default. The constants
  are:

   * MB_CASE_LOWER (used by mb_strtolower)
   * MB_CASE_UPPER (used by mb_strtoupper)
   * MB_CASE_TITLE
   * MB_CASE_FOLD
   * MB_CASE_LOWER_SIMPLE
   * MB_CASE_UPPER_SIMPLE
   * MB_CASE_TITLE_SIMPLE
   * MB_CASE_FOLD_SIMPLE (used by case-insensitive operations)
2017-07-28 12:32:50 +02:00
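
The position problem mentioned for full case folding can be illustrated with Python, whose `str.casefold()` also performs full case folding. This is a sketch of the general idea, not of the mbstring implementation; the helper names are hypothetical.

```python
def fold_with_offsets(text):
    """Full-case-fold text and record, for each folded character,
    the index of the original character that produced it."""
    folded = []
    origin = []
    for index, ch in enumerate(text):
        piece = ch.casefold()  # may expand, e.g. U+00DF -> "ss"
        folded.append(piece)
        origin.extend([index] * len(piece))
    return "".join(folded), origin


def find_case_insensitive(haystack, needle):
    """Find needle in haystack under full case folding and return the
    match position in the ORIGINAL haystack, or -1 if there is none."""
    folded, origin = fold_with_offsets(haystack)
    pos = folded.find(needle.casefold())
    if pos < 0:
        return -1
    if pos >= len(origin):  # empty needle matched at the very end
        return len(haystack)
    return origin[pos]


if __name__ == "__main__":
    print("Stra\u00dfe".upper())               # full upper-casing changes the length
    print("\u00dfx".casefold().find("x"))      # position in the folded string
    print(find_case_insensitive("\u00dfX", "x"))  # position in the original string
```

The extra `origin` table is the bookkeeping that simple (one-to-one) case folding avoids, which is the reason the commit gives for keeping simple folding in case-insensitive operations.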
Nikita Popov 9ac7c1e71d Use case-folding for case insensitive comparisons
Instead of using lowercasing.
2017-07-28 12:32:50 +02:00
Nikita Popov 80a0601fe5 Use MPH for case maps
Instead of performing a binary search, use a hashtable to store
the case maps. In particular a minimal perfect hash construction
is used, which does not require collision resolution (but does
use an auxiliary table for the hash perturbation).
2017-07-28 12:32:50 +02:00
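
For readers unfamiliar with the technique, the sketch below shows one common minimal perfect hash construction (hash and displace with a per-bucket seed table). The hash function, the constants and all names are assumptions chosen for illustration; the commit does not show the construction actually used for the case maps.

```python
def _hash(seed, key, size):
    """Mix a 32-bit key with a seed and reduce it to range(size)."""
    x = ((key ^ seed) * 0x9E3779B1) & 0xFFFFFFFF
    return (x >> 16) % size


def build_mph(mapping):
    """Build (seeds, slots) so that every key lands in its own slot.

    seeds is the auxiliary table: one perturbation value per bucket.
    slots has exactly len(mapping) entries, so the hash is minimal.
    """
    n = len(mapping)
    buckets = [[] for _ in range(n)]
    for key in mapping:
        buckets[_hash(0, key, n)].append(key)

    seeds = [0] * n
    slots = [None] * n
    # Place the largest buckets first; they are the hardest to fit.
    by_size = sorted(enumerate(buckets), key=lambda item: len(item[1]), reverse=True)
    for index, keys in by_size:
        if not keys:
            break
        for seed in range(1, 1 << 20):
            targets = {_hash(seed, k, n) for k in keys}
            if len(targets) == len(keys) and all(slots[t] is None for t in targets):
                break
        else:
            raise RuntimeError("no seed found; try a different hash function")
        seeds[index] = seed
        for k in keys:
            slots[_hash(seed, k, n)] = (k, mapping[k])
    return seeds, slots


def lookup(seeds, slots, key):
    """Two table reads and one comparison; no collision resolution."""
    n = len(slots)
    if n == 0:
        return None
    entry = slots[_hash(seeds[_hash(0, key, n)], key, n)]
    if entry is not None and entry[0] == key:
        return entry[1]
    return None


if __name__ == "__main__":
    upper = {ord(c): ord(c.upper()) for c in "abcdefghijklmnopqrstuvwxyz"}
    seeds, slots = build_mph(upper)
    print(chr(lookup(seeds, slots, ord("q"))), lookup(seeds, slots, ord("Q")))
```

A lookup costs a constant number of steps, whereas a binary search over the same table needs a number of comparisons that grows with the logarithm of the table size.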
Nikita Popov f56b0afe6e Avoid some unnecessary mbfl_strlen() calculations 2017-07-28 12:32:50 +02:00
Nikita Popov eacd70f762 Don't store titlecase if same as uppercase
The totitle code already has a fallback for that case.
2017-07-28 12:32:50 +02:00
Nikita Popov cedfc2f426 Drop implementation-specific character properties
No point in keeping around non-standard character properties if
we're not using them and most are not even being populated.
2017-07-28 12:32:50 +02:00
Anatol Belski 98fe82cc05 fix data types 2017-07-25 21:26:25 +02:00
Anatol Belski 13a2629005 size_t fixes 2017-07-25 19:03:33 +02:00
Nikita Popov 8ace7045e9 Handle character ranges in ucgendat generically
In particular, the previous implementation did not account for
Tangut Ideographs and CJK Ideograph extensions C through F.
2017-07-25 18:48:12 +02:00
Nikita Popov 0c0e35fedc Port ucgendat to PHP
Implemented such that the output is identical, including some
quirks that should be fixed subsequently.
2017-07-25 18:48:12 +02:00
Nikita Popov 4bd61ec7ad Fix handling of some special ranges in ucgendat
* Han Ideographs go up to U+9FEA.
* CJK Compatibility Ideographs are no longer specified as a special
  range in remotely recent versions of Unicode.
* Surrogate properties should be assigned to U+D800-U+DFFF, not to
  U+10000-U+1FFFF.
2017-07-25 18:48:12 +02:00
Nikita Popov 445e13b149 Add MBFL_SUBSTR_TO_END mode to mbfl_substr
This takes the substr from the offset to the end of the string.
This avoids pointless searching for the end position and also
saves us a length calculation in the strstr family of functions.
2017-07-23 23:17:12 +02:00
Nikita Popov bff11c382e Remove more obsolete length checks 2017-07-23 19:09:36 +02:00
Nikita Popov 3c6b2512cb Change layout of case mapping table
Previously the case mapping table was segregated by the type of
the character (upper, lower, title) and always stored the other
two variants (key, other1, other2). Now the table is segregated
by the target type (key, other). As only very few characters have
more than one target this only slightly increases the size of the
table.

The advantage of this layout is that we only need to perform a
single table lookup in the case table. Previously, depending on
the case that was hit, either one lookup in the property table,
or two lookups in the property table and one lookup in the case
table were required.

This changes the layout from libunicode in the OpenLDAP project
-- however, the last commit there was over 10 years ago, so I
don't see value in keeping this in sync.
2017-07-23 18:33:15 +02:00
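
A table segregated by target type can be pictured as one (key, other) map per target, so that each conversion is a single lookup. The sketch below uses Python dictionaries and a handful of hand-picked entries; the table names and layout details are hypothetical, and the title-to-upper fallback reflects the "Don't store titlecase if same as uppercase" commit above.

```python
# One map per target type; each entry is (key codepoint -> other codepoint).
TO_UPPER = {0x61: 0x41, 0x01C6: 0x01C4, 0x01C5: 0x01C4}
TO_LOWER = {0x41: 0x61, 0x01C4: 0x01C6, 0x01C5: 0x01C6}
# Only characters whose titlecase differs from their uppercase are stored.
TO_TITLE = {0x01C4: 0x01C5, 0x01C6: 0x01C5}


def to_upper(cp):
    """Single lookup; unmapped codepoints are returned unchanged."""
    return TO_UPPER.get(cp, cp)


def to_lower(cp):
    return TO_LOWER.get(cp, cp)


def to_title(cp):
    """Use the titlecase entry if present, else fall back to uppercase."""
    if cp in TO_TITLE:
        return TO_TITLE[cp]
    return to_upper(cp)


if __name__ == "__main__":
    for cp in (0x61, 0x01C6, 0x31):
        print(hex(cp), hex(to_upper(cp)), hex(to_title(cp)))
```

U+01C5 appears as a key in two maps because it has both an uppercase and a lowercase target, which is the small size increase the commit message refers to.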
Anatol Belski 78944bdfc6 remove cast 2017-07-23 17:38:28 +02:00
Anatol Belski 6809be2090 fix warnings and datatype
ident
2017-07-23 17:36:10 +02:00
Anatol Belski 7496bad2ac adjust datatype, used for position handling 2017-07-23 16:37:31 +02:00
Anatol Belski ea83b69883 Adjust datatypes and reorder which saves 8 bytes on 64-bit 2017-07-23 16:37:30 +02:00
Nikita Popov fe8384fdfd Merge branch 'PHP-7.2' 2017-07-23 16:06:25 +02:00
Nikita Popov 706f0cf8a0 Update Unicode data for Unicode 10 2017-07-23 16:05:39 +02:00
Nikita Popov 24cfbfd56f Update ucgendat for more bidi properties
Handle them the same way as others -- by classifying as Other
Neutral.
2017-07-23 16:03:11 +02:00
Nikita Popov 7077c719db Merge branch 'PHP-7.2' 2017-07-23 15:36:25 +02:00
Nikita Popov 077e61fad3 Fixed bug #69267 completely
ucgendat.c was assuming that a title-case character is a character
that has both lower and upper-case variants. However, there are
title-case characters that only have a lower-case variant. Use the
Lt general character property to determine where in the case map
the character should be placed instead.
2017-07-23 15:30:17 +02:00
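
Classifying by general category rather than by which case variants exist can be sketched with Python's `unicodedata` module. This only illustrates the idea; it is not the ucgendat code, and the helper name `case_kind` is hypothetical.

```python
import unicodedata


def case_kind(ch):
    """Classify a character by its general category:
    Lu -> 'upper', Ll -> 'lower', Lt -> 'title', anything else -> None."""
    kinds = {"Lu": "upper", "Ll": "lower", "Lt": "title"}
    return kinds.get(unicodedata.category(ch))


if __name__ == "__main__":
    # U+01C5 is titlecase and has both an uppercase and a lowercase variant.
    print(case_kind("\u01c5"), hex(ord("\u01c5".lower())))
    # U+1F88 is titlecase but has no single-codepoint uppercase variant.
    print(case_kind("\u1f88"), hex(ord("\u1f88".lower())))
    # Upper-casing U+01C6 must give U+01C4, not the titlecase U+01C5.
    print(hex(ord("\u01c6".upper())))
```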
Nikita Popov c0bcd301d3 Another fix for bug #69267
mb_strtoupper() was converting lowercase characters into
titlecase characters, instead of uppercase characters. Luckily
there are only very few characters with a distinct titlecase
representation, so this mostly worked out okay...
2017-07-23 15:07:02 +02:00
Nikita Popov 0e4af9192f Partial fix for bug #69267
This pulls in 60a25c72ba389f53b0621ca250bc99f3b295d43f from the
OpenLDAP project.
2017-07-23 14:47:21 +02:00
Nikita Popov 698132d6f9 Merge branch 'PHP-7.2' 2017-07-23 12:22:09 +02:00
Nikita Popov 88f752a947 Merge branch 'PHP-7.1' into PHP-7.2 2017-07-23 12:21:51 +02:00
Nikita Popov f116a88592 Merge branch 'PHP-7.0' into PHP-7.1 2017-07-23 12:21:16 +02:00
Christoph M. Becker 418da85f15 Fix #71606: Segmentation fault mb_strcut with HTML-ENTITIES
The HTML decoding filter uses the `opaque` member of mbfl_convert_filter
as a buffer, but there was no copy constructor defined, which caused double
frees when the filter is copied (which happens multiple times in mb_strcut(),
for instance).
2017-07-23 12:19:27 +02:00
Nikita Popov b8ed74ce77 Merge branch 'PHP-7.2' 2017-07-23 11:55:46 +02:00
Nikita Popov 42ff1aa86c Fix overflow checks in mbfl_memory_device
Also prune out some duplicate code and use strlen() and memcpy()
instead of ad-hoc reimplementations. Remove multiplications by
sizeof(unsigned char), which wrongly imply that this can be
anything but 1.
2017-07-23 11:55:43 +02:00
Nikita Popov bd63c0f5b3 Fix bug #73528 2017-07-23 11:55:43 +02:00
Nikita Popov 80463579ce Remove confusing null checks in mb_send_mail
These are required parameters, they cannot be missing.
2017-07-23 11:55:43 +02:00
Nikita Popov 9af5b7f33d Fix use after free in mb_send_mail 2017-07-23 11:55:26 +02:00
Anatol Belski 4fbd7ccba2 touch yet more places for datatypes 2017-07-23 00:47:24 +02:00
Anatol Belski 0eea41b6c4 add missing header 2017-07-23 00:23:02 +02:00