mirror of
https://github.com/php/pftt2.git
synced 2026-03-24 09:12:17 +01:00
[Test Issue] Mbstring tests charset issue #45
Originally created by @hollyhuiLi on GitHub (Aug 27, 2019).
The mbstring tests below failed. I suspect a charset issue, since all of them contain non-English characters.
ext/mbstring/tests/bug62934.phpt
ext/mbstring/tests/bug69267.phpt
ext/mbstring/tests/bug71298.phpt
ext/mbstring/tests/bug76532.phpt
ext/mbstring/tests/bug76958.phpt
ext/mbstring/tests/casefolding.phpt
ext/mbstring/tests/casemapping.phpt
ext/mbstring/tests/mb_ereg_dupnames.phpt
ext/mbstring/tests/mb_ereg_named_subpatterns.phpt
ext/mbstring/tests/mb_ereg_search_named_subpatterns.phpt
For example, the diff of ext/mbstring/tests/casemapping.phpt shows:
@@ -1,3 +1,3 @@
+String: ß
+Lower: ß
+Lower Simple: ß
-String: Ã
-Lower: Ã
-Lower Simple: Ã
@@ -5,1 +5,1 @@
Lower: Ã
Lower Simple: Ã
Upper: SS
+Upper Simple: ß
-Upper Simple: Ã
@@ -7,1 +7,1 @@
Upper: SS
Upper Simple: Ã
Title: Ss
+Title Simple: Ă
-Title Simple: Ã
@@ -9,1 +9,1 @@
Title: Ss
Title Simple: Ã
Fold: ss
+Fold Simple: Ă
-Fold Simple: Ã
@@ -11,3 +11,3 @@
Fold: ss
Fold Simple: Ã
+String: ďŹ
+Lower: ďŹ
+Lower Simple: ďŹ
-String: ï¬
-Lower: ï¬
-Lower Simple: ï¬
@@ -15,1 +15,1 @@
Lower: ï¬
Lower Simple: ï¬
Upper: FF
+Upper Simple: ďŹ
-Upper Simple: ï¬
@@ -17,1 +17,1 @@
Upper: FF
Upper Simple: ï¬
Title: Ff
+Title Simple: ďŹ
-Title Simple: ï¬
@@ -19,1 +19,1 @@
Title: Ff
Title Simple: ï¬
Fold: ff
+Fold Simple: ďŹ
-Fold Simple: ï¬
@@ -22,1 +22,1 @@
Fold Simple: ï¬
String: İ
+Lower: iĚ
-Lower: iÌ
@@ -28,1 +28,1 @@
[Truncated, please reference the original xml file ...]
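The "Ã" and "ï¬" sequences in the diff above are consistent with UTF-8 bytes being decoded with a single-byte charset. The following is a minimal sketch of that effect using only the JDK; the class and method names (MojibakeDemo, misdecode) are hypothetical and are not part of PFTT.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode the text as UTF-8, then decode those bytes as ISO-8859-1,
    // i.e. deliberately with the wrong charset.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+00DF (ß) is the two bytes C3 9F in UTF-8.
        String garbled = misdecode("\u00DF");
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
        // prints U+00C3 (Ã) followed by U+009F (an invisible control character)
    }
}
```

The same mechanism turns U+FB00 (the ff ligature, bytes EF AC 80) into "ï¬" plus a control character, which matches the remaining hunks.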
@cmb69 commented on GitHub (Aug 27, 2019):
I can confirm this issue. Setting a breakpoint at this line shows that charset is detected as ISO-8859-1, which causes a MultiCharsetByLineReader to be instantiated, which is clearly wrong, since ISO-8859-1 is not a multibyte charset. Changing charset == null to true makes the test pass for me.
@cmb69 commented on GitHub (Oct 23, 2019):
Well, most relevant is likely this:
332325c239/src/com/github/mattficken/io/AbstractDetectingCharsetReader.java (L29)
Currently, the EXPRESS_RECOGNIZERS don't include UTF-8 at all.
@cmb69 commented on GitHub (Nov 13, 2019):
For debugging purposes I've applied the following patch:
Testing ext/mbstring/tests/mb_ereg_search_named_subpatterns.phpt basically yields for the EXPECT section:
and for the actual test output:
This shows that PFTT currently tries to detect the charset for every line of the EXPECT section, which is highly unlikely to give correct results; from the docs:
However, many of these lines are only a few bytes long, and the detected character sets are far from accurate (at least for this test case). Using the high-level CharsetDetector ("new") yields somewhat better results, but it still makes some errors and sometimes fails to detect the charset at all.
At least for now, I think we're better off just dropping the attempt to detect the charset. While that would yield harder-to-read diffs, it should at least prevent several false positives, and ideally, we shouldn't have to read diffs of failing tests at all.
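One possible reading of "dropping detection" is to decode each section as a whole with a fixed charset instead of guessing per line. The sketch below is only an illustration using the JDK's CharsetDecoder: it tries strict UTF-8 and falls back to ISO-8859-1 when the bytes are not valid UTF-8. The fallback policy and the names (SectionDecoder, decodeSection) are assumptions of this sketch, not something proposed in the thread or present in PFTT.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class SectionDecoder {
    // Decode a whole section at once: strict UTF-8 first, and only if the
    // bytes are not valid UTF-8, treat every byte as one ISO-8859-1 character.
    static String decodeSection(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Assumed fallback: ISO-8859-1 maps every byte, so this never fails.
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }
}
```

Because the expected and the actual output would go through the same function, both sides of a comparison are decoded consistently, even when the chosen charset is not the one the test author intended.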