@HeRaNO

HeRaNO commented Nov 28, 2025

GB18030-2022 is the current official standard, superseding the previous 2005 and 2000 versions. It is essential for modern Chinese text processing for the following reasons:

  1. Superset Relationship: GB18030 is a strict superset of CP936 (GBK) and EUC-CN (GB2312). Using GB18030 as the detection target covers all characters in these older encodings while enabling support for a much wider range of characters.
  2. Extended Character Coverage: The 2022 standard includes significant updates, covering over 87,000 characters. It adds support for CJK Extensions (C, D, E, F, G) and updates mappings for rare characters that were previously mapped to the Private Use Area (PUA) in the 2005 version. This is critical for correctly handling names containing rare characters (e.g., in banking or government data).
  3. Backward Compatibility: It is safe to promote GB18030-2022 as the preferred encoding. Files encoded in EUC-CN or CP936 are valid GB18030 streams.

This PR adds GB18030-2022 to the default encoding list for CN.
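The superset claim in points 1 and 3 can be spot-checked from PHP. What follows is a minimal sketch and not part of the PR: it encodes one sample string as EUC-CN and checks that the same bytes validate and decode to the same text under CP936 and GB18030. The helper name `gb18030Name()` and the sample text are illustrative assumptions. The helper falls back to plain `GB18030` on builds where `GB18030-2022` is not a registered mbstring encoding name.

```php
<?php
// Hypothetical helper: prefer the 2022 variant when this mbstring build knows it.
function gb18030Name(): string
{
    return in_array('GB18030-2022', mb_list_encodings(), true)
        ? 'GB18030-2022'
        : 'GB18030';
}

// Sample text U+4E2D U+6587, written with escapes so the encoding of this
// source file does not matter.
$utf8 = "\u{4E2D}\u{6587}";
$eucCnBytes = mb_convert_encoding($utf8, 'EUC-CN', 'UTF-8');

$results = [];
foreach (['EUC-CN', 'CP936', gb18030Name()] as $encoding) {
    $results[$encoding] = [
        // Are the EUC-CN bytes a valid stream in this encoding?
        'valid'     => mb_check_encoding($eucCnBytes, $encoding),
        // Do they decode back to the original text?
        'roundtrip' => mb_convert_encoding($eucCnBytes, 'UTF-8', $encoding) === $utf8,
    ];
}

echo bin2hex($eucCnBytes), "\n";
foreach ($results as $encoding => $r) {
    printf(
        "%-13s valid=%s same_text=%s\n",
        $encoding,
        var_export($r['valid'], true),
        var_export($r['roundtrip'], true)
    );
}
```

A single sample only illustrates the claim. It does not show that the relation holds for every byte sequence.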

@alexdowad
Contributor

@HeRaNO Thank you for the PR! We appreciate it.

I believe that adding this text encoding to the default identify list for CN shouldn't cause any problem.

We should have tests which confirm the desired change in behavior actually occurs. It is better for these tests to be very thorough, so we are sure of exactly what will change and what will not change. (Please note that mbstring has a large number of existing users, and even very small changes often break somebody's code, often in ways which are very difficult to predict. So we are generally very cautious about making changes to existing APIs. In short: do more testing than you think is necessary.)

The NEWS and UPGRADING files also need to be updated.

@youkidearitai
Contributor

@HeRaNO Thank you very much!
In fact, I'm glad to receive this pull request since we hadn't received any information from China.

For now, I'd just like to say thank you.

@HeRaNO
Author

HeRaNO commented Nov 28, 2025

Hi @alexdowad, and thanks for your comment. I added the message to NEWS and UPGRADING (I hope I'm putting it in the right place).

As for the tests, adding them is feasible for me, but I'm not sure what to test. GB18030-2022 will work as a "fallback" option when detecting encoding, and GB18030-2022 is a superset of GBK (CP936), so I think it will hardly affect users. Maybe a test like

// Set language to zh-CN
// Decode sentence in EUC-CN
// Decode sentence in CP936
// Decode sentence in GB18030-2022

is good enough? WDYT?
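One way the outline above could look as a script is sketched here. This is an illustration only, not the test that was proposed or merged. The detect list is passed explicitly instead of relying on the language setting, the helper `gb18030Name()` is hypothetical, and the sample code points are arbitrary picks (U+9F8D is assumed to lie outside GB2312, and U+20000 needs a four-byte GB18030 sequence).

```php
<?php
// Hypothetical helper: prefer the 2022 variant when this mbstring build knows it.
function gb18030Name(): string
{
    return in_array('GB18030-2022', mb_list_encodings(), true)
        ? 'GB18030-2022'
        : 'GB18030';
}

// Detect order as it would look with the PR applied (assumption).
$detectList = ['ASCII', 'UTF-8', 'EUC-CN', 'CP936', gb18030Name()];

// One UTF-8 sample per target encoding, written with escapes.
$samples = [
    'EUC-CN'      => "\u{4E2D}\u{6587}",
    'CP936'       => "\u{9F8D}",
    gb18030Name() => "\u{20000}",
];

$encoded = [];
$detected = [];
foreach ($samples as $encoding => $utf8) {
    $encoded[$encoding] = mb_convert_encoding($utf8, $encoding, 'UTF-8');
    // strict = true: only encodings in which the bytes are fully valid qualify.
    $detected[$encoding] = mb_detect_encoding($encoded[$encoding], $detectList, true);
    printf(
        "encoded as %-13s bytes=%s detected=%s\n",
        $encoding,
        bin2hex($encoded[$encoding]),
        var_export($detected[$encoding], true)
    );
}
```

The script prints what is detected rather than asserting it, because the result depends on the detect order and the mbstring version under test.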

@youkidearitai
Contributor

youkidearitai commented Nov 28, 2025

@HeRaNO How about a test like this?

$ cat ext/mbstring/tests/gb18030-2022-is-default-encoding-in-cn.phpt
--TEST--
GB18030-2022 is default encoding in Simplified Chinese
--INI--
mbstring.language=Simplified Chinese
--FILE--
<?php
var_dump(mb_detect_order());
?>
--EXPECT--
array(5) {
  [0]=>
  string(5) "ASCII"
  [1]=>
  string(5) "UTF-8"
  [2]=>
  string(6) "EUC-CN"
  [3]=>
  string(5) "CP936"
  [4]=>
  string(12) "GB18030-2022"
}

@alexdowad
Contributor

> For the test issue, I think adding tests is feasible for me, but I'm not sure what to test. The GB18030-2022 will work as a "fallback" option when detecting encoding, and GB18030-2022 is a superset of GBK (CP936), so I think it will hardly affect users...

Dear @HeRaNO, because we have a large number of users depending on mbstring, when we merge code changes, we would like to know exactly how the observable behavior of the library will change.

Changes to the detect order should primarily affect mb_detect_encoding. Is there any input to mb_detect_encoding for which the "detected" (guessed) text encoding will be different? If so, are there many such inputs? Or just a few? If you don't know, then please find out.

One good start would be to write a little PHP program which exhaustively tests all 4 billion 4-byte strings. This should take less than an hour to run. Make it print out a list of all byte sequences for which mb_detect_encoding returns GB18030-2022, and see how many there are among strings of 4 bytes or less.

That may give us a starting point for assessing the impact of the change, and how to test it.
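Before committing to the full 4-byte run, the same comparison can be tried on all strings of one and two bytes, which finishes in seconds. This is a rough sketch under assumptions: the two detect lists are passed explicitly rather than compiled in, and `gb18030Name()` and `allByteStrings()` are hypothetical helper names.

```php
<?php
// Hypothetical helper: prefer the 2022 variant when this mbstring build knows it.
function gb18030Name(): string
{
    return in_array('GB18030-2022', mb_list_encodings(), true)
        ? 'GB18030-2022'
        : 'GB18030';
}

// Yield every byte string of exactly $length bytes, in numeric order.
function allByteStrings(int $length): Generator
{
    $total = 256 ** $length;
    for ($n = 0; $n < $total; $n++) {
        $s = '';
        for ($i = 0, $v = $n; $i < $length; $i++, $v >>= 8) {
            $s = chr($v & 0xFF) . $s; // least significant byte ends up last
        }
        yield $s;
    }
}

$before = ['ASCII', 'UTF-8', 'EUC-CN', 'CP936'];
$after  = [...$before, gb18030Name()];

$checked = 0;
$changed = []; // one line of text per input whose strict result differs
foreach ([1, 2] as $length) {
    foreach (allByteStrings($length) as $bytes) {
        $checked++;
        $old = mb_detect_encoding($bytes, $before, true);
        $new = mb_detect_encoding($bytes, $after, true);
        if ($old !== $new) {
            $changed[] = bin2hex($bytes) . ': '
                . var_export($old, true) . ' -> ' . var_export($new, true);
        }
    }
}

printf("Checked %d strings, %d changed\n", $checked, count($changed));
foreach (array_slice($changed, 0, 10) as $line) {
    echo $line, "\n";
}
```

Passing the lists explicitly makes the comparison possible on a single build, but it is not a substitute for checking the compiled-in default order.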

@alexdowad
Contributor

> One good start would be to write a little PHP program which exhaustively tests all 4 billion 4-byte strings. This should take less than an hour to run. Make it print out a list of all byte sequences for which mb_detect_encoding returns GB18030-2022, and see how many there are among strings of 4 bytes or less.

For clarity, if you write a script like I am suggesting, make sure to run it on your own version of PHP, with the detect order change compiled in!

@HeRaNO
Author

HeRaNO commented Nov 28, 2025

Well, something went wrong.

<?php

function generate4ByteStrings(): Generator
{
    for ($i = 0; $i < 256; $i++) {
        for ($j = 0; $j < 256; $j++) {
            for ($k = 0; $k < 256; $k++) {
                for ($l = 0; $l < 256; $l++) {
                    $string = chr($i) . chr($j) . chr($k) . chr($l);

                    yield $string;
                }
            }
        }
    }
}

set_time_limit(0);

$oriEncodings = ['ASCII', 'UTF-8', 'EUC-CN', 'CP936'];
$newEncodings = ['ASCII', 'UTF-8', 'EUC-CN', 'CP936', 'GB18030-2022'];

$stringGenerator = generate4ByteStrings();

$eq = 0;
$new_detect = 0;
$wrong_detect = 0;
$changed_detect = 0;
$changed_detect_to_gb18030 = 0;
$pass = 0;
$fail = 0;
$count = 0;

foreach ($stringGenerator as $currentString) {
    $count++;

    // strict = false, will guess an encoding
    $ori = mb_detect_encoding($currentString, $oriEncodings);
    $new = mb_detect_encoding($currentString, $newEncodings);

    // strict = true, will return `false` if encoding is not in list
    $ori_strict = mb_detect_encoding($currentString, $oriEncodings, true);
    $new_strict = mb_detect_encoding($currentString, $newEncodings, true);

    if ($ori === $new) {
        $eq++; // equal guessing, we're happy
    } else {
        if ($new === $new_strict) {
            $new_detect++; // not equal guessing, but guess it right in the new way, also happy
        } else {
            if ($ori === $ori_strict) {
                $wrong_detect++; // not equal guessing, the encoding is definite,
                                 // and guess it wrong in the new way, not happy
            } else {
                $changed_detect++; // not equal guessing, the guessing changed, sigh...
                if ($new === 'GB18030-2022') {
                    $changed_detect_to_gb18030++; // guessing changed to GB18030
                }
            }
        }
    }
    
    if ($ori_strict === $new_strict) {
        $pass++;
    } else {
        if ($new_strict === 'GB18030-2022' && $ori_strict === false) {
            $pass++;
        } else {
            echo "String: " . bin2hex($currentString) . " from: " . $ori_strict . " to: " . $new_strict . "\n";
            $fail++;
        }
    }
    
}

echo "Checked " . number_format($count) . " strings\n";
echo "Equal: " . number_format($eq) . "\n";
echo "New detected: " . number_format($new_detect) . "\n";
echo "Wrong detected: " . number_format($wrong_detect) . "\n";
echo "Changed detected: " . number_format($changed_detect) . "\n";
echo "  Changed to GB18030: " . number_format($changed_detect_to_gb18030) . "\n";
echo "Pass: " . number_format($pass) . "\n";
echo "Fail: " . number_format($fail) . "\n";
?>

Outputs:

Checked 4,294,967,296 strings
Equal: 4,269,809,975
New detected: 2,155,227
Wrong detected: 0
Changed detected: 23,002,094
  Changed to GB18030: 20,422,724
Pass: 4,293,900,065
Fail: 1,067,231

Fail counter increases at:

String: 0000a2e3 from: CP936 to: GB18030-2022
String: 0000a6d9 from: CP936 to: GB18030-2022
String: 0000a6da from: CP936 to: GB18030-2022
String: 0000a6db from: CP936 to: GB18030-2022
String: 0000a6dc from: CP936 to: GB18030-2022
String: 0000a6dd from: CP936 to: GB18030-2022
String: 0000a6de from: CP936 to: GB18030-2022
String: 0000a6df from: CP936 to: GB18030-2022
String: 0000a6ec from: CP936 to: GB18030-2022
String: 0000a6ed from: CP936 to: GB18030-2022
String: 0000a6f3 from: CP936 to: GB18030-2022
String: 0001a2e3 from: CP936 to: GB18030-2022
String: 0001a6d9 from: CP936 to: GB18030-2022
String: 0001a6da from: CP936 to: GB18030-2022
String: 0001a6db from: CP936 to: GB18030-2022
String: 0001a6dc from: CP936 to: GB18030-2022
String: 0001a6dd from: CP936 to: GB18030-2022
String: 0001a6de from: CP936 to: GB18030-2022
String: 0001a6df from: CP936 to: GB18030-2022
String: 0001a6ec from: CP936 to: GB18030-2022
String: 0001a6ed from: CP936 to: GB18030-2022
String: 0001a6f3 from: CP936 to: GB18030-2022
String: 0002a2e3 from: CP936 to: GB18030-2022
String: 0002a6d9 from: CP936 to: GB18030-2022
String: 0002a6da from: CP936 to: GB18030-2022
String: 0002a6db from: CP936 to: GB18030-2022
String: 0002a6dc from: CP936 to: GB18030-2022
String: 0002a6dd from: CP936 to: GB18030-2022
String: 0002a6de from: CP936 to: GB18030-2022
String: 0002a6df from: CP936 to: GB18030-2022
String: 0002a6ec from: CP936 to: GB18030-2022
String: 0002a6ed from: CP936 to: GB18030-2022
String: 0002a6f3 from: CP936 to: GB18030-2022
String: 0003a2e3 from: CP936 to: GB18030-2022
String: 0003a6d9 from: CP936 to: GB18030-2022
String: 0003a6da from: CP936 to: GB18030-2022
String: 0003a6db from: CP936 to: GB18030-2022
String: 0003a6dc from: CP936 to: GB18030-2022
String: 0003a6dd from: CP936 to: GB18030-2022
String: 0003a6de from: CP936 to: GB18030-2022
String: 0003a6df from: CP936 to: GB18030-2022
String: 0003a6ec from: CP936 to: GB18030-2022
String: 0003a6ed from: CP936 to: GB18030-2022
String: 0003a6f3 from: CP936 to: GB18030-2022
String: 0004a2e3 from: CP936 to: GB18030-2022
String: 0004a6d9 from: CP936 to: GB18030-2022
String: 0004a6da from: CP936 to: GB18030-2022
String: 0004a6db from: CP936 to: GB18030-2022
String: 0004a6dc from: CP936 to: GB18030-2022
String: 0004a6dd from: CP936 to: GB18030-2022
String: 0004a6de from: CP936 to: GB18030-2022
String: 0004a6df from: CP936 to: GB18030-2022
String: 0004a6ec from: CP936 to: GB18030-2022
String: 0004a6ed from: CP936 to: GB18030-2022
String: 0004a6f3 from: CP936 to: GB18030-2022
String: 0005a2e3 from: CP936 to: GB18030-2022
String: 0005a6d9 from: CP936 to: GB18030-2022
String: 0005a6da from: CP936 to: GB18030-2022
String: 0005a6db from: CP936 to: GB18030-2022
String: 0005a6dc from: CP936 to: GB18030-2022
String: 0005a6dd from: CP936 to: GB18030-2022
String: 0005a6de from: CP936 to: GB18030-2022
String: 0005a6df from: CP936 to: GB18030-2022
String: 0005a6ec from: CP936 to: GB18030-2022
String: 0005a6ed from: CP936 to: GB18030-2022
String: 0005a6f3 from: CP936 to: GB18030-2022
String: 0006a2e3 from: CP936 to: GB18030-2022
String: 0006a6d9 from: CP936 to: GB18030-2022
String: 0006a6da from: CP936 to: GB18030-2022
String: 0006a6db from: CP936 to: GB18030-2022
String: 0006a6dc from: CP936 to: GB18030-2022
String: 0006a6dd from: CP936 to: GB18030-2022
String: 0006a6de from: CP936 to: GB18030-2022
String: 0006a6df from: CP936 to: GB18030-2022
String: 0006a6ec from: CP936 to: GB18030-2022
String: 0006a6ed from: CP936 to: GB18030-2022
String: 0006a6f3 from: CP936 to: GB18030-2022
String: 0007a2e3 from: CP936 to: GB18030-2022
String: 0007a6d9 from: CP936 to: GB18030-2022
String: 0007a6da from: CP936 to: GB18030-2022
String: 0007a6db from: CP936 to: GB18030-2022
String: 0007a6dc from: CP936 to: GB18030-2022
String: 0007a6dd from: CP936 to: GB18030-2022
String: 0007a6de from: CP936 to: GB18030-2022
String: 0007a6df from: CP936 to: GB18030-2022
String: 0007a6ec from: CP936 to: GB18030-2022
String: 0007a6ed from: CP936 to: GB18030-2022
String: 0007a6f3 from: CP936 to: GB18030-2022
...

Feel free to close this PR.

@alexdowad
Contributor

@HeRaNO, I think it's too early to say that we should close the PR. Let's make sure we understand exactly what the effect of the PR is.

We can see that there are a number of strings which were previously detected as CP936, but are now detected as GB18030-2022. The question is: Does that make sense? Do these strings actually appear more like GB18030-2022? Or do they appear more like CP936?

Any comment? Personally, I may have to refresh my memory on how CP936 works, since it's been a long time since I worked on the code for that encoding.

@HeRaNO
Author

HeRaNO commented Nov 28, 2025

Unfortunately, I'm not an expert on encoding. I tried to analyze a certain string in the log.

String: 0007a6ed from: CP936 to: GB18030-2022

The bytes are: 00 07 A6 ED.

I tried to decode the string as both CP936 and GB18030-2022. In CP936, the first two bytes decode to themselves, and the last two bytes are not a valid CP936 character [1]. In GB18030-2022 the result is the same, and the decoded strings are identical. So the string could be either CP936 or GB18030-2022.
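The manual decoding can be cross-checked against what mbstring itself reports. The sketch below is an assumption-laden illustration: `inspectBytes()` and `gb18030Name()` are hypothetical helpers, and no output is shown because it depends on the mbstring version in use. For each encoding it reports whether the bytes are valid and which code points they decode to.

```php
<?php
// Hypothetical helper: prefer the 2022 variant when this mbstring build knows it.
function gb18030Name(): string
{
    return in_array('GB18030-2022', mb_list_encodings(), true)
        ? 'GB18030-2022'
        : 'GB18030';
}

// For each encoding, report validity of $bytes and the decoded code points.
function inspectBytes(string $bytes, array $encodings): array
{
    $report = [];
    foreach ($encodings as $encoding) {
        // Invalid sequences are replaced by mbstring's substitute character.
        $utf8 = mb_convert_encoding($bytes, 'UTF-8', $encoding);
        $report[$encoding] = [
            'valid'      => mb_check_encoding($bytes, $encoding),
            'codepoints' => array_map(
                fn (string $ch): string => sprintf('U+%04X', mb_ord($ch, 'UTF-8')),
                mb_str_split($utf8, 1, 'UTF-8')
            ),
        ];
    }
    return $report;
}

$bytes = hex2bin('0007a6ed'); // the string picked from the log
$report = inspectBytes($bytes, ['ASCII', 'EUC-CN', 'CP936', gb18030Name()]);

foreach ($report as $encoding => $info) {
    printf(
        "%-13s valid=%-5s %s\n",
        $encoding,
        var_export($info['valid'], true),
        implode(' ', $info['codepoints'])
    );
}
```

If the `valid` column differs from the manual reading of the code chart, that difference is worth reporting in the thread, since strict detection only considers encodings in which the whole string is valid.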

I think I can change the logic, which allows EUC-CN and CP936 to be detected as GB18030-2022. Is it the right thing to do?

Footnotes

  1. https://www.khngai.com/chinese/charmap/tblgbk.php?page=2

@youkidearitai
Contributor

In my opinion, this would be fine if GB18030-2022 is the standard in China and Chinese users are comfortable with the change.

Hmm... Sorry, I'm not sure...

@alexdowad
Contributor

Just thinking about this. Something is strange.

In @HeRaNO's test script, he counts a test case as "fail" if: with strict encoding detection, the original detect order list resulted in "CP936", but now it results in "GB18030-2022". He picked a random "failing" case for further investigation, and found that it appears to be invalid in both CP936 and GB18030-2022.

If that is so, then strict detection should have returned false both before and after the change. From the docs:

strict
Controls the behaviour when string is not valid in any of the listed encodings. If strict is set to false, the closest matching encoding will be returned; if strict is set to true, false will be returned.
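As a small illustration of the documented behaviour, the sketch below uses a deliberately simple candidate list (an assumption made for clarity, not the CN detect order) and compares a valid input with one that is invalid in every listed encoding.

```php
<?php
$list = ['ASCII', 'UTF-8'];

$plain   = 'abc';      // valid in both candidates
$invalid = "\xFF\xFF"; // the byte 0xFF never occurs in valid ASCII or UTF-8

$strictPlain   = mb_detect_encoding($plain, $list, true);
$strictInvalid = mb_detect_encoding($invalid, $list, true);
// Non-strict mode returns the closest match; printed here without
// assuming which candidate that is.
$looseInvalid  = mb_detect_encoding($invalid, $list, false);

var_dump($strictPlain, $strictInvalid, $looseInvalid);
```

By the same rule, a string that is invalid in both CP936 and GB18030-2022 should not produce either name under strict detection, which is why the logged "failing" cases look surprising.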

I think we need to look at this a bit more closely and make sure we understand what is going on.
