1
0
mirror of https://github.com/php/php-src.git synced 2026-04-05 07:02:33 +02:00
Commit Graph

21 Commits

Author SHA1 Message Date
Alex Dowad
58d0aad75c mb_detect_encoding recognizes all letters in Hungarian alphabet 2022-05-25 08:22:07 +02:00
Alex Dowad
6a4b6d2344 mb_detect_encoding recognizes all letters in Czech alphabet 2022-05-25 07:52:39 +02:00
Alex Dowad
9bb97ee8bc Fix mb_detect_encoding's recognition of Slavic names
Thanks to Côme Chilliet for reporting that mb_detect_encoding was not
detecting the desired text encoding for strings containing š or Ž.
These characters are used in Czech, Serbian, Croatian, Bosnian,
Macedonian, etc. names.
2022-05-24 15:32:20 +02:00
Alex Dowad
1a2c608053 Add unit tests for mb_detect_encoding on Polish text 2021-11-26 17:42:53 +02:00
Alex Dowad
a2bc57e0e5 mb_detect_encoding will not return non-encodings
Among the text encodings supported by mbstring are several which are
not really 'text encodings'. These include Base64, QPrint, UUencode,
HTML entities, '7 bit', and '8 bit'.

Rather than providing an explicit list of text encodings which they are
interested in, users may pass the output of mb_list_encodings to
mb_detect_encoding. Since Base64, QPrint, and so on are included in
the output of mb_list_encodings, mb_detect_encoding can return one of
these as its 'detected encoding' (and in fact, this often happens).
Before mb_detect_encoding was enhanced so it could detect any of the
supported text encodings, this did not happen, and it is never desired.
2021-10-19 18:05:52 +02:00
Alex Dowad
28b346bc06 Improve detection accuracy of mb_detect_encoding
Originally, `mb_detect_encoding` essentially just checked all candidate
encodings to see which ones the input string was valid in. However, it
was only able to do this for a limited few of all the text encodings
which are officially supported by mbstring.

In 3e7acf901d, I modified it so it could 'detect' any text encoding
supported by mbstring. While this is arguably an improvement, if the
only text encodings one is interested in are those which
`mb_detect_encoding` could originally handle, the old
`mb_detect_encoding` may have been preferable. Because the new one has
more possible encodings which it can guess, it also has more chances to
get the answer wrong.

This commit adjusts the detection heuristics to provide accurate
detection in a wider variety of scenarios. While the previous detection
code would frequently confuse UTF-32BE with UTF-32LE or UTF-16BE with
UTF-16LE, the adjusted code is extremely accurate in those cases.
Detection for Chinese text in Chinese encodings like GB18030 or BIG5
and for Japanese text in Japanese encodings like EUC-JP or SJIS is
greatly improved. Detection of UTF-7 is also greatly improved. An 8KB
table, with one bit for each codepoint from U+0000 up to U+FFFF, is
used to achieve this.

One significant constraint is that the heuristics are completely based
on looking at each codepoint in a string in isolation, treating some
codepoints as 'likely' and others as 'unlikely'. It might still be
possible to achieve great gains in detection accuracy by looking at
sequences of codepoints rather than individual codepoints. However,
this might require huge tables. Further, we might need a huge corpus
of text in various languages to derive those tables.

Accuracy is still dismal when trying to distinguish single-byte
encodings like ISO-8859-1, ISO-8859-2, KOI8-R, and so on. This is
because the valid bytes in these encodings are basically all the same,
and all valid bytes decode to 'likely' codepoints, so our method of
detection (which is based on rating codepoints as likely or unlikely)
cannot tell any difference between the candidates at all. It just
selects the first encoding in the provided list of candidates.

Speaking of which, if one wants to get good results from
`mb_detect_encoding`, it is important to order the list of candidate
encodings according to your prior belief of which are more likely to
be correct. When the function cannot tell any difference between two
candidates, it returns whichever appeared earlier in the array.
2021-10-19 18:05:51 +02:00
Alex Dowad
c25a1ef8d0 Bug #81390: mb_detect_encoding should not prematurely stop processing input
As a performance optimization, mb_detect_encoding tries to stop
processing the input string early when there is only one 'candidate'
encoding which the input string is valid in. However, the code which
keeps count of how many candidate encodings have already been rejected
was buggy. This caused mb_detect_encoding to prematurely stop
processing the input when it should have continued.

As a result, it did not notice that in the test case provided by Alec,
the input string was not valid in UTF-16.
2021-09-20 11:21:39 +02:00
Nikita Popov
39131219e8 Migrate more SKIPIF -> EXTENSIONS (#7139)
This is a mix of more automated and manual migration. It should remove all applicable extension_loaded() checks outside of skipif.inc files.
2021-06-11 12:58:44 +02:00
Alex Dowad
3e7acf901d Remove mbstring identify filters
mbstring had an 'identify filter' for almost every supported text encoding
which was used when auto-detecting the most likely encoding for a string.
It would run over the string and set a 'flag' if it saw anything which
did not appear likely to be the encoding in question.

One problem with this scheme was that encodings which merely appeared
less likely to be the correct one were completely rejected, even if there
was no better candidate. Another problem was that the 'identify filters'
had a huge amount of code duplication with the 'conversion filters'.

Eliminate the identify filters. Instead, when auto-detecting text
encoding, use conversion filters to see whether the input string is valid
in candidate encodings or not. At the same type, watch the type of
codepoints which the string decodes to and mark it as less likely if
non-printable characters (ESC, form feed, bell, etc.) or 'private use
area' codepoints are seen.

Interestingly, one old test case in which JIS text was misidentified
as UTF-8 (and this wrong behavior was enshrined in the test) was 'fixed'
and the JIS string is now auto-detected as JIS.
2020-11-09 13:45:17 +02:00
Nikita Popov
cafceea742 Update mbstring parameter names
Closes GH-6207.
2020-09-28 09:51:58 +02:00
Alex Dowad
73dcfb6faa Fix typos in mbstring tests
Man, I can be pedantic sometimes. Tiny little things like misspelled words just
hurt me inside. So while it's not really a big deal, I couldn't leave these typos
alone...
2020-09-02 20:48:22 +02:00
George Peter Banyard
fa3b8c75fb Promote unknown encoding throws in encoding array/string list
For the string list we emit still emit a warning by comparing arg_num to 0

Closes GH-5337
2020-04-03 10:58:46 +02:00
Nikita Popov
cd5a29b820 Properly report unknown encoding in encoding lists
And clean up the related array and list parsing code.
2020-03-30 14:46:59 +02:00
Nikita Popov
852485d8ec Adjust tests for zpp TypeError change 2019-03-11 11:32:20 +01:00
Nikita Popov
4bd18db8cc Remove custom error handler in mbstring tests
To make it more obvious what is tested and what the error messages
are.
2019-03-05 11:41:53 +01:00
Peter Kokot
d679f02295 Sync leading and final newlines in *.phpt sections
This patch adds missing newlines, trims multiple redundant final
newlines into a single one, and trims redundant leading newlines in all
*.phpt sections.

According to POSIX, a line is a sequence of zero or more non-' <newline>'
characters plus a terminating '<newline>' character. [1] Files should
normally have at least one final newline character.

C89 [2] and later standards [3] mention a final newline:
"A source file that is not empty shall end in a new-line character,
which shall not be immediately preceded by a backslash character."

Although it is not mandatory for all files to have a final newline
fixed, a more consistent and homogeneous approach brings less of commit
differences issues and a better development experience in certain text
editors and IDEs.

[1] http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206
[2] https://port70.net/~nsz/c/c89/c89-draft.html#2.1.1.2
[3] https://port70.net/~nsz/c/c99/n1256.html#5.1.1.2
2018-10-15 04:33:09 +02:00
Peter Kokot
d7a3edd45d Trim trailing whitespace in *.phpt 2018-10-14 19:46:15 +02:00
Moriyoshi Koizumi
85d7751097 - Fix include_path. 2008-07-17 16:30:32 +00:00
Moriyoshi Koizumi
91bf8e5dc9 Explicitly specify mbstring.language. 2003-09-26 14:42:37 +00:00
Moriyoshi Koizumi
a705a8b597 Clean up. 2002-10-30 08:06:52 +00:00
Moriyoshi Koizumi
bce3d0cf7d Renamed the test cases. 2002-10-21 19:19:05 +00:00