[Test Issue] Mbstring tests charset issue #45

Open
opened 2026-01-24 11:40:29 +01:00 by admin · 3 comments
Owner

Originally created by @hollyhuiLi on GitHub (Aug 27, 2019).

Below mbstring tests failed, suspect charset issue since all contains non-english charactors in the test.

ext/mbstring/tests/bug62934.phpt
ext/mbstring/tests/bug69267.phpt
ext/mbstring/tests/bug71298.phpt
ext/mbstring/tests/bug76532.phpt
ext/mbstring/tests/bug76958.phpt
ext/mbstring/tests/casefolding.phpt
ext/mbstring/tests/casemapping.phpt
ext/mbstring/tests/mb_ereg_dupnames.phpt
ext/mbstring/tests/mb_ereg_named_subpatterns.phpt
ext/mbstring/tests/mb_ereg_search_named_subpatterns.phpt

For example, diff of ext/mbstring/tests/casemapping.phpt shows:

@@ -1,3 +1,3 @@
+String: ß
+Lower: ß
+Lower Simple: ß
-String: ß
-Lower: ß
-Lower Simple: ß
@@ -5,1 +5,1 @@
Lower: ß
Lower Simple: ß
Upper: SS
+Upper Simple: ß
-Upper Simple: ß
@@ -7,1 +7,1 @@
Upper: SS
Upper Simple: ß
Title: Ss
+Title Simple: ß
-Title Simple: ß
@@ -9,1 +9,1 @@
Title: Ss
Title Simple: ß
Fold: ss
+Fold Simple: ß
-Fold Simple: ß
@@ -11,3 +11,3 @@
Fold: ss
Fold Simple: ß

+String: ff
+Lower: ff
+Lower Simple: ff
-String: ff
-Lower: ff
-Lower Simple: ff
@@ -15,1 +15,1 @@
Lower: ff
Lower Simple: ff
Upper: FF
+Upper Simple: ff
-Upper Simple: ff
@@ -17,1 +17,1 @@
Upper: FF
Upper Simple: ff
Title: Ff
+Title Simple: ff
-Title Simple: ff
@@ -19,1 +19,1 @@
Title: Ff
Title Simple: ff
Fold: ff
+Fold Simple: ff
-Fold Simple: ff
@@ -22,1 +22,1 @@
Fold Simple: ff

String: İ
+Lower: i̇
-Lower: i̇
@@ -28,1 +28,1 @&#64
[Trancated, please reference the original xml file ...]

Originally created by @hollyhuiLi on GitHub (Aug 27, 2019). Below mbstring tests failed, suspect charset issue since all contains non-english charactors in the test. ext/mbstring/tests/bug62934.phpt ext/mbstring/tests/bug69267.phpt ext/mbstring/tests/bug71298.phpt ext/mbstring/tests/bug76532.phpt ext/mbstring/tests/bug76958.phpt ext/mbstring/tests/casefolding.phpt ext/mbstring/tests/casemapping.phpt ext/mbstring/tests/mb_ereg_dupnames.phpt ext/mbstring/tests/mb_ereg_named_subpatterns.phpt ext/mbstring/tests/mb_ereg_search_named_subpatterns.phpt For example, diff of ext/mbstring/tests/casemapping.phpt shows: @@ -1,3 +1,3 @@ +String: ß +Lower: ß +Lower Simple: ß -String: ß -Lower: ß -Lower Simple: ß @@ -5,1 +5,1 @@ Lower: ß Lower Simple: ß Upper: SS +Upper Simple: ß -Upper Simple: ß @@ -7,1 +7,1 @@ Upper: SS Upper Simple: ß Title: Ss +Title Simple: ĂŸ -Title Simple: ß @@ -9,1 +9,1 @@ Title: Ss Title Simple: ß Fold: ss +Fold Simple: ĂŸ -Fold Simple: ß @@ -11,3 +11,3 @@ Fold: ss Fold Simple: ß +String:  +Lower:  +Lower Simple:  -String: ff -Lower: ff -Lower Simple: ff @@ -15,1 +15,1 @@ Lower: ff Lower Simple: ff Upper: FF +Upper Simple:  -Upper Simple: ff @@ -17,1 +17,1 @@ Upper: FF Upper Simple: ff Title: Ff +Title Simple:  -Title Simple: ff @@ -19,1 +19,1 @@ Title: Ff Title Simple: ff Fold: ff +Fold Simple:  -Fold Simple: ff @@ -22,1 +22,1 @@ Fold Simple: ff String: İ +Lower: i̇ -Lower: i̇ @@ -28,1 +28,1 @&#64 [Trancated, please reference the original xml file ...]
admin added the bug label 2026-01-24 11:40:29 +01:00
Author
Owner

@cmb69 commented on GitHub (Aug 27, 2019):

I can confirm this issue. Setting a breakpoint at this line shows that charset is detected as ISO-8859-1, which causes a MultiCharsetByLineReader to be instantiated, which is clearly wrong, since ISO-8859-1 is not a multibyte charset. Changing charset == null to true makes the test pass for me.

@cmb69 commented on GitHub (Aug 27, 2019): I can confirm this issue. Setting a breakpoint at [this line](https://github.com/php/pftt2/blob/e4aaefd28d0e5a51645657593995768964d59f33/src/com/mostc/pftt/host/LocalHost.java#L537) shows that charset is detected as ISO-8859-1, which causes a MultiCharsetByLineReader to be instantiated, which is clearly wrong, since ISO-8859-1 is not a multibyte charset. Changing `charset == null` to `true` makes the test pass for me.
Author
Owner

@cmb69 commented on GitHub (Oct 23, 2019):

Well, most relevant is likely this: 332325c239/src/com/github/mattficken/io/AbstractDetectingCharsetReader.java (L29)
Currently, the EXPRESS_RECOGNIZERS don't include UTF-8 at all.

@cmb69 commented on GitHub (Oct 23, 2019): Well, most relevant is likely this: https://github.com/php/pftt2/blob/332325c23940184aaa90ef97c76583a66b034767/src/com/github/mattficken/io/AbstractDetectingCharsetReader.java#L29 Currently, the `EXPRESS_RECOGNIZERS` don't include UTF-8 at all.
Author
Owner

@cmb69 commented on GitHub (Nov 13, 2019):

For debugging purposes I've applied the following patch:

 .../io/AbstractDetectingCharsetReader.java           | 20 ++++++++++++++++++++
 src/com/ibm/icu/text/CharsetDetector.java            |  6 +++---
 2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/src/com/github/mattficken/io/AbstractDetectingCharsetReader.java b/src/com/github/mattficken/io/AbstractDetectingCharsetReader.java
index c20c104a85..5971bceaf4 100644
--- a/src/com/github/mattficken/io/AbstractDetectingCharsetReader.java
+++ b/src/com/github/mattficken/io/AbstractDetectingCharsetReader.java
@@ -29,6 +29,7 @@ public abstract class AbstractDetectingCharsetReader extends AbstractReader {
 	public CharsetRecognizer[] recogs = CharsetDeciderDecoder.EXPRESS_RECOGNIZERS;//.ALL_RECOGNIZERS; // TODO
 	CharsetRec hc_cm = null; // TODO usually start over for each #detectCharset call
 	protected void detectCharset(byte[] bytes, int off, int len) {
+		byte[] orig_bytes = java.util.Arrays.copyOfRange(bytes, off, len+off);;
 		bytes = IOUtil.ensureLeftShifted(bytes, off, len);
 		
 		
@@ -60,6 +61,25 @@ public abstract class AbstractDetectingCharsetReader extends AbstractReader {
 		//
 		
 		this.cs = cm.cs;
+
+		com.ibm.icu.text.CharsetDetector charsetDetector = new com.ibm.icu.text.CharsetDetector();
+		charsetDetector.setText(orig_bytes);
+		com.ibm.icu.text.CharsetMatch charsetMatch = charsetDetector.detect();
+		if (charsetMatch != null) {
+			try {
+				System.out.println("[CHARSET] txt: " + charsetMatch.getString());
+			} catch (java.io.IOException ex) {
+				System.out.println("[CHARSET] txt: UNKOWN");
+			}
+		} else {
+			System.out.println("[CHARSET] txt: CAN'T CONVERT");
+		}
+		System.out.println("[CHARSET] old: " + cm.cs.name());
+		if (charsetMatch != null) {
+			System.out.println("[CHARSET] new: " + charsetMatch.getName());
+		} else {
+			System.out.println("[CHARSET] new: UNKNOWN");
+		}
 	}
 	
 	protected String convertLine(byte[] bytes, int off, int len, boolean end_of_input) {
diff --git a/src/com/ibm/icu/text/CharsetDetector.java b/src/com/ibm/icu/text/CharsetDetector.java
index 6c157285ce..97627b68c0 100644
--- a/src/com/ibm/icu/text/CharsetDetector.java
+++ b/src/com/ibm/icu/text/CharsetDetector.java
@@ -188,16 +188,16 @@ public class CharsetDetector {
         ArrayList<CharsetMatch>         matches = new ArrayList<CharsetMatch>();
         
         MungeInput();  // Strip html markup, collect byte stats.
-        System.out.println(fCSRecognizers);
+        //System.out.println(fCSRecognizers);
         //  Iterate over all possible charsets, remember all that
         //    give a match quality > 0.
         for (i=0; i<fCSRecognizers.size(); i++) {
             csr = fCSRecognizers.get(i);
             detectResults = csr.match(this);
-            System.out.println("197 "+csr+" "+detectResults);
+            //System.out.println("197 "+csr+" "+detectResults);
             confidence = detectResults & 0x000000ff;
             if (confidence > 0) {
-            	System.out.println(csr.getName()); // TODO
+            	//System.out.println(csr.getName()); // TODO
                 CharsetMatch  m = new CharsetMatch(this, csr, confidence);
                 matches.add(m);
             }

Testing ext/mbstring/tests/mb_ereg_search_named_subpatterns.phpt basically yields for the EXPECT section:

[CHARSET] txt: array(7) {
[CHARSET] old: ISO-8859-9
[CHARSET] new: Big5
[CHARSET] txt: CAN'T CONVERT
[CHARSET] old: ISO-8859-9
[CHARSET] new: UNKNOWN
[CHARSET] txt:   string(11) "  中��"
[CHARSET] old: ISO-8859-9
[CHARSET] new: ISO-8859-1
[CHARSET] txt: CAN'T CONVERT
[CHARSET] old: ISO-8859-9
[CHARSET] new: UNKNOWN
[CHARSET] txt:   string(2) "  "
[CHARSET] old: ISO-8859-1
[CHARSET] new: ISO-8859-1
[CHARSET] txt: CAN'T CONVERT
[CHARSET] old: ISO-8859-1
[CHARSET] new: UNKNOWN
[CHARSET] txt:   string(6) "中�"
[CHARSET] old: ISO-8859-1
[CHARSET] new: ISO-8859-1
[CHARSET] txt: CAN'T CONVERT
[CHARSET] old: ISO-8859-1
[CHARSET] new: UNKNOWN
[CHARSET] txt:   string(3) "ï¼?"
[CHARSET] old: ISO-8859-1
[CHARSET] new: ISO-8859-1
[CHARSET] txt:   ["punct"]=>
[CHARSET] old: ISO-8859-1
[CHARSET] new: Big5
[CHARSET] txt:   string(3) "ï¼?"
[CHARSET] old: ISO-8859-1
[CHARSET] new: ISO-8859-1
[CHARSET] txt:   ["wsp"]=>
[CHARSET] old: ISO-8859-1
[CHARSET] new: Big5
[CHARSET] txt:   string(2) "  "
[CHARSET] old: ISO-8859-1
[CHARSET] new: ISO-8859-1
[CHARSET] txt:   ["word"]=>
[CHARSET] old: ISO-8859-1
[CHARSET] new: Big5
[CHARSET] txt:   string(6) "中�"
[CHARSET] old: ISO-8859-1
[CHARSET] new: ISO-8859-1
[CHARSET] txt: CAN'T CONVERT
[CHARSET] old: ISO-8859-1
[CHARSET] new: UNKNOWN

and for the actual test output:

[CHARSET] txt: array(7) {
[CHARSET] old: Shift_JIS
[CHARSET] new: Big5
[CHARSET] txt: CAN'T CONVERT
[CHARSET] old: Shift_JIS
[CHARSET] new: UNKNOWN
[CHARSET] txt:   string(11) "  中��"
[CHARSET] old: windows-1252
[CHARSET] new: ISO-8859-1
[CHARSET] txt: CAN'T CONVERT
[CHARSET] old: windows-1252
[CHARSET] new: UNKNOWN
[CHARSET] txt:   string(2) "  "
[CHARSET] old: ISO-8859-1
[CHARSET] new: ISO-8859-1
[CHARSET] txt: CAN'T CONVERT
[CHARSET] old: ISO-8859-1
[CHARSET] new: UNKNOWN
[CHARSET] txt:   string(6) "中�"
[CHARSET] old: ISO-8859-1
[CHARSET] new: ISO-8859-1
[CHARSET] txt: CAN'T CONVERT
[CHARSET] old: ISO-8859-1
[CHARSET] new: UNKNOWN
[CHARSET] txt:   string(3) "ï¼?"
[CHARSET] old: ISO-8859-1
[CHARSET] new: ISO-8859-1
[CHARSET] txt:   ["punct"]=>
[CHARSET] old: ISO-8859-1
[CHARSET] new: Big5
[CHARSET] txt:   string(3) "ï¼?"
[CHARSET] old: ISO-8859-1
[CHARSET] new: ISO-8859-1
[CHARSET] txt:   ["wsp"]=>
[CHARSET] old: ISO-8859-1
[CHARSET] new: Big5
[CHARSET] txt:   string(2) "  "
[CHARSET] old: ISO-8859-1
[CHARSET] new: ISO-8859-1
[CHARSET] txt:   ["word"]=>
[CHARSET] old: ISO-8859-1
[CHARSET] new: Big5
[CHARSET] txt:   string(6) "中�"
[CHARSET] old: ISO-8859-1
[CHARSET] new: ISO-8859-1
[CHARSET] txt: CAN'T CONVERT
[CHARSET] old: ISO-8859-1
[CHARSET] new: UNKNOWN

This shows that PFTT currently tries to detect the charset for every line of the EXPECT section, which is highly unlikely to give correct results; from the docs:

For best accuracy in charset detection, the input data should be primarily in a single language, and a minimum of a few hundred bytes worth of plain text in the language are needed.

However, many of these lines are only a few bytes long, and the detected character sets are far from accurate (at least for this test case). Using the high-level CharsetDetector ("new") yields somewhat better results, but still has some errors or fails to detect the charset sometimes at all.

At least for now, I think we're better off to just drop that attempt to detect the charset. While that would yield harder to read diffs, it should at least prevent several false positives, and ideally, we shouldn't have to read diffs of failing tests at all.

@cmb69 commented on GitHub (Nov 13, 2019): For debugging purposes I've applied the following patch: ````diff .../io/AbstractDetectingCharsetReader.java | 20 ++++++++++++++++++++ src/com/ibm/icu/text/CharsetDetector.java | 6 +++--- 2 files changed, 23 insertions(+), 3 deletions(-) diff --git a/src/com/github/mattficken/io/AbstractDetectingCharsetReader.java b/src/com/github/mattficken/io/AbstractDetectingCharsetReader.java index c20c104a85..5971bceaf4 100644 --- a/src/com/github/mattficken/io/AbstractDetectingCharsetReader.java +++ b/src/com/github/mattficken/io/AbstractDetectingCharsetReader.java @@ -29,6 +29,7 @@ public abstract class AbstractDetectingCharsetReader extends AbstractReader { public CharsetRecognizer[] recogs = CharsetDeciderDecoder.EXPRESS_RECOGNIZERS;//.ALL_RECOGNIZERS; // TODO CharsetRec hc_cm = null; // TODO usually start over for each #detectCharset call protected void detectCharset(byte[] bytes, int off, int len) { + byte[] orig_bytes = java.util.Arrays.copyOfRange(bytes, off, len+off);; bytes = IOUtil.ensureLeftShifted(bytes, off, len); @@ -60,6 +61,25 @@ public abstract class AbstractDetectingCharsetReader extends AbstractReader { // this.cs = cm.cs; + + com.ibm.icu.text.CharsetDetector charsetDetector = new com.ibm.icu.text.CharsetDetector(); + charsetDetector.setText(orig_bytes); + com.ibm.icu.text.CharsetMatch charsetMatch = charsetDetector.detect(); + if (charsetMatch != null) { + try { + System.out.println("[CHARSET] txt: " + charsetMatch.getString()); + } catch (java.io.IOException ex) { + System.out.println("[CHARSET] txt: UNKOWN"); + } + } else { + System.out.println("[CHARSET] txt: CAN'T CONVERT"); + } + System.out.println("[CHARSET] old: " + cm.cs.name()); + if (charsetMatch != null) { + System.out.println("[CHARSET] new: " + charsetMatch.getName()); + } else { + System.out.println("[CHARSET] new: UNKNOWN"); + } } protected String convertLine(byte[] bytes, int off, int len, boolean end_of_input) { diff --git a/src/com/ibm/icu/text/CharsetDetector.java b/src/com/ibm/icu/text/CharsetDetector.java index 6c157285ce..97627b68c0 100644 --- a/src/com/ibm/icu/text/CharsetDetector.java +++ b/src/com/ibm/icu/text/CharsetDetector.java @@ -188,16 +188,16 @@ public class CharsetDetector { ArrayList<CharsetMatch> matches = new ArrayList<CharsetMatch>(); MungeInput(); // Strip html markup, collect byte stats. - System.out.println(fCSRecognizers); + //System.out.println(fCSRecognizers); // Iterate over all possible charsets, remember all that // give a match quality > 0. for (i=0; i<fCSRecognizers.size(); i++) { csr = fCSRecognizers.get(i); detectResults = csr.match(this); - System.out.println("197 "+csr+" "+detectResults); + //System.out.println("197 "+csr+" "+detectResults); confidence = detectResults & 0x000000ff; if (confidence > 0) { - System.out.println(csr.getName()); // TODO + //System.out.println(csr.getName()); // TODO CharsetMatch m = new CharsetMatch(this, csr, confidence); matches.add(m); } ```` Testing ext/mbstring/tests/mb_ereg_search_named_subpatterns.phpt basically yields for the EXPECT section: ```` [CHARSET] txt: array(7) { [CHARSET] old: ISO-8859-9 [CHARSET] new: Big5 [CHARSET] txt: CAN'T CONVERT [CHARSET] old: ISO-8859-9 [CHARSET] new: UNKNOWN [CHARSET] txt: string(11) " 中��" [CHARSET] old: ISO-8859-9 [CHARSET] new: ISO-8859-1 [CHARSET] txt: CAN'T CONVERT [CHARSET] old: ISO-8859-9 [CHARSET] new: UNKNOWN [CHARSET] txt: string(2) " " [CHARSET] old: ISO-8859-1 [CHARSET] new: ISO-8859-1 [CHARSET] txt: CAN'T CONVERT [CHARSET] old: ISO-8859-1 [CHARSET] new: UNKNOWN [CHARSET] txt: string(6) "中�" [CHARSET] old: ISO-8859-1 [CHARSET] new: ISO-8859-1 [CHARSET] txt: CAN'T CONVERT [CHARSET] old: ISO-8859-1 [CHARSET] new: UNKNOWN [CHARSET] txt: string(3) "�" [CHARSET] old: ISO-8859-1 [CHARSET] new: ISO-8859-1 [CHARSET] txt: ["punct"]=> [CHARSET] old: ISO-8859-1 [CHARSET] new: Big5 [CHARSET] txt: string(3) "�" [CHARSET] old: ISO-8859-1 [CHARSET] new: ISO-8859-1 [CHARSET] txt: ["wsp"]=> [CHARSET] old: ISO-8859-1 [CHARSET] new: Big5 [CHARSET] txt: string(2) " " [CHARSET] old: ISO-8859-1 [CHARSET] new: ISO-8859-1 [CHARSET] txt: ["word"]=> [CHARSET] old: ISO-8859-1 [CHARSET] new: Big5 [CHARSET] txt: string(6) "中�" [CHARSET] old: ISO-8859-1 [CHARSET] new: ISO-8859-1 [CHARSET] txt: CAN'T CONVERT [CHARSET] old: ISO-8859-1 [CHARSET] new: UNKNOWN ```` and for the actual test output: ```` [CHARSET] txt: array(7) { [CHARSET] old: Shift_JIS [CHARSET] new: Big5 [CHARSET] txt: CAN'T CONVERT [CHARSET] old: Shift_JIS [CHARSET] new: UNKNOWN [CHARSET] txt: string(11) " 中��" [CHARSET] old: windows-1252 [CHARSET] new: ISO-8859-1 [CHARSET] txt: CAN'T CONVERT [CHARSET] old: windows-1252 [CHARSET] new: UNKNOWN [CHARSET] txt: string(2) " " [CHARSET] old: ISO-8859-1 [CHARSET] new: ISO-8859-1 [CHARSET] txt: CAN'T CONVERT [CHARSET] old: ISO-8859-1 [CHARSET] new: UNKNOWN [CHARSET] txt: string(6) "中�" [CHARSET] old: ISO-8859-1 [CHARSET] new: ISO-8859-1 [CHARSET] txt: CAN'T CONVERT [CHARSET] old: ISO-8859-1 [CHARSET] new: UNKNOWN [CHARSET] txt: string(3) "�" [CHARSET] old: ISO-8859-1 [CHARSET] new: ISO-8859-1 [CHARSET] txt: ["punct"]=> [CHARSET] old: ISO-8859-1 [CHARSET] new: Big5 [CHARSET] txt: string(3) "�" [CHARSET] old: ISO-8859-1 [CHARSET] new: ISO-8859-1 [CHARSET] txt: ["wsp"]=> [CHARSET] old: ISO-8859-1 [CHARSET] new: Big5 [CHARSET] txt: string(2) " " [CHARSET] old: ISO-8859-1 [CHARSET] new: ISO-8859-1 [CHARSET] txt: ["word"]=> [CHARSET] old: ISO-8859-1 [CHARSET] new: Big5 [CHARSET] txt: string(6) "中�" [CHARSET] old: ISO-8859-1 [CHARSET] new: ISO-8859-1 [CHARSET] txt: CAN'T CONVERT [CHARSET] old: ISO-8859-1 [CHARSET] new: UNKNOWN ```` This shows that PFTT currently tries to detect the charset for every line of the EXPECT section, which is highly unlikely to give correct results; from the [docs](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/CharsetDetector.html): > For best accuracy in charset detection, the input data should be primarily in a single language, and a minimum of a few hundred bytes worth of plain text in the language are needed. However, many of these lines are only a few bytes long, and the detected character sets are far from accurate (at least for this test case). Using the high-level CharsetDetector ("new") yields somewhat better results, but still has some errors or fails to detect the charset sometimes at all. At least for now, I think we're better off to just drop that attempt to detect the charset. While that would yield harder to read diffs, it should at least prevent several false positives, and ideally, we shouldn't have to read diffs of failing tests at all.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: php/pftt2#45