archived-php-src

mirror of https://github.com/php/php-src.git synced 2026-04-28 10:43:30 +02:00

Author	SHA1	Message	Date
Alex Dowad	0b32a15eb0	Optimize mb_str{,im}width for performance Rather than doing a linear search of a table of fullwidth codepoint ranges for every input character, 1) Short-cut the search if the codepoint is below the first such range 2) Otherwise, do a binary (rather than linear) search	2021-09-29 18:19:01 +02:00
Nikita Popov	425c2e3ba1	Combine control into one character group Same as with punct, we're currently not interested in distinguishing between Cc and Cf, so only store their union.	2021-08-24 20:39:16 +02:00
Nikita Popov	f458b16041	Combine punctuation into one character group We're not currently interested in distinguishing between individual punctuation types, so just merge everything into one general category to make the property lookup more efficient.	2021-08-24 19:21:21 +02:00
Nikita Popov	3be94217f4	Don't use sentinel value for unicode property lookup 0xffff was used to mark character properties without any members. This made the code unnecessarily complicated, because we need to check for 0xffff values when looking up the property ranges. We can simply encode this as an empty set of ranges.	2021-08-24 15:53:43 +02:00
Alex Dowad	d8c785b894	Update 'East Asian Width' table to comply with Unicode 13.0 Instead of manually maintaining the data in eaw_table.h, it is now automatically generated by ucgendat/ucgendat.php, using the EastAsianWidth.txt file from the Unicode Consortium. Something must be said about the deleted test case. Back in 2004, someone noticed that `mb_strwidth` didn't comply with Unicode 4.0. A test case was added to expose the problem. Well, time keeps moving on, and with the changing years, new Unicodes are born and old Unicodes die. Some characters which were counted as double-width in Unicode 4.0 are no longer such in Unicode 13.0, which renders the test case obsolete. At the same time, make a couple of spelling/grammar fixes in ucgendat.php.	2021-01-19 20:38:44 +02:00
Peter Kokot	36c7946522	Move ucgendata README to generator file header	2019-04-20 22:35:25 +02:00
Peter Kokot	37c329d715	Trim trailing whitespace in source code files	2018-10-13 14:17:28 +02:00
Peter Kokot	02294f0c84	Make PHP development tools files and scripts executable This patch makes several scripts and PHP development tools files executable and adds proper shebangs to the PHP scripts. The `#!/usr/bin/env php` shebang allows running the script via `./script.php` and uses env to locate PHP on the system. At the same time, it still allows running the script with a user-defined PHP location using `php script.php`.	2018-08-29 20:58:17 +02:00
Nikita Popov	f4a1d9c821	Fixed bug #65544 and #71298	2017-07-28 14:57:08 +02:00
Nikita Popov	582a65b06f	Implement full case mapping Implement full case mapping according to SpecialCasing.txt and also full case folding according to CaseFolding.txt (F). There are a number of caveats: * Only language-agnostic and unconditional full case mapping is implemented. The only language-agnostic conditional case mapping rule relates to Greek sigma in final position (Final_Sigma). Correctly handling this requires both arbitrary lookahead and lookbehind, which would require some larger changes to how the case mapping is implemented. This is a possible future extension. * The only language-specific handling that is implemented is for Turkish dotted/undotted Is, if the ISO-8859-9 encoding is used. This matches the previous behavior and makes sure that no codepoints not supported by the encoding are produced. A future extension would be to also handle the Turkish mappings specified by SpecialCasing.txt based on the mbfl internal language. * Full case folding is implemented, but case-insensitive mb_* operations continue to use simple case folding. The reason is that full case folding of the haystack string may change the position at which a match occurred. This would have to be mapped back into the position in the original string. * mb_convert_case() exposes both the full and the simple case mapping / folding, where full is the default. The constants are: * MB_CASE_LOWER (used by mb_strtolower) * MB_CASE_UPPER (used by mb_strtoupper) * MB_CASE_TITLE * MB_CASE_FOLD * MB_CASE_LOWER_SIMPLE * MB_CASE_UPPER_SIMPLE * MB_CASE_TITLE_SIMPLE * MB_CASE_FOLD_SIMPLE (used by case-insensitive operations)	2017-07-28 12:32:50 +02:00
Nikita Popov	9ac7c1e71d	Use case-folding for case insensitive comparisons Instead of using lowercasing.	2017-07-28 12:32:50 +02:00
Nikita Popov	80a0601fe5	Use MPH for case maps Instead of performing a binary search, use a hashtable to store the case maps. In particular a minimal perfect hash construction is used, which does not require collision resolution (but does use an auxiliary table for the hash perturbation).	2017-07-28 12:32:50 +02:00
Nikita Popov	eacd70f762	Don't store titlecase if same as uppercase The totitle code already has a fallback for that case.	2017-07-28 12:32:50 +02:00
Nikita Popov	cedfc2f426	Drop implementation-specific character properties No point in keeping around non-standard character properties if we're not using them and most are not even being populated.	2017-07-28 12:32:50 +02:00
Nikita Popov	8ace7045e9	Handle character ranges in ucgendat generically In particular, the previous implementation did not account for Tangut Ideographs and CJK Ideograph extensions C through F.	2017-07-25 18:48:12 +02:00
Nikita Popov	0c0e35fedc	Port ucgendat to PHP Implemented such that the output is identical, including some quirks that should be fixed subsequently.	2017-07-25 18:48:12 +02:00

16 Commits
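
The lookup strategy described in commit 0b32a15eb0 (skip the search when the codepoint is below the first range, otherwise binary-search the sorted ranges) can be sketched roughly as follows. This is an illustration in Python, not the mbstring C code: `WIDE_RANGES`, `is_wide`, and `str_width` are hypothetical names, and the table is a small hand-picked subset of wide and fullwidth ranges, not the generated contents of eaw_table.h.

```python
# Hypothetical, abbreviated table of (first, last) codepoint ranges.
# It must be sorted and non-overlapping. This is NOT the real eaw_table.h data.
WIDE_RANGES = [
    (0x1100, 0x115F),  # Hangul Jamo initial consonants
    (0x3000, 0x303E),  # CJK symbols and punctuation
    (0x4E00, 0x9FFF),  # CJK unified ideographs
    (0xAC00, 0xD7A3),  # Hangul syllables
    (0xFF01, 0xFF60),  # fullwidth ASCII variants
]


def is_wide(cp):
    """Return True if codepoint cp falls inside one of the ranges."""
    # Short-cut: everything below the first range is narrow, so ASCII and
    # most Latin text never enters the search loop.
    if cp < WIDE_RANGES[0][0]:
        return False

    # Binary search over the sorted ranges.
    lo, hi = 0, len(WIDE_RANGES) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        first, last = WIDE_RANGES[mid]
        if cp < first:
            hi = mid - 1
        elif cp > last:
            lo = mid + 1
        else:
            return True
    return False


def str_width(s):
    """Count 2 columns for each wide character and 1 for everything else."""
    return sum(2 if is_wide(ord(ch)) else 1 for ch in s)
```

With this sketch, `str_width("ab")` counts two columns and `str_width("日本")` counts four, because both ideographs fall in the 0x4E00 to 0x9FFF range of the assumed table.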