-
Notifications
You must be signed in to change notification settings - Fork 8k
Add GB18030-2022 to default encoding list for CN #20604
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
@HeRaNO Thank you for the PR! We appreciate it. I believe that adding this text encoding to the default identify list for CN shouldn't cause any problem. We should have tests which confirm the desired change in behavior actually occurs. It is better for these tests to be very thorough, so we are sure of exactly what will change and what will not change. (Please note that The NEWS and UPGRADING files also need to be updated. |
|
@HeRaNO Thank you very much! For now, I'd just like to say thank you. |
|
Hi @alexdowad, and thanks for your comment. I added the message into For the test issue, I think adding tests is feasible for me, but I'm not sure what to test. The GB18030-2022 will work as a "fallback" option when detecting encoding, and GB18030-2022 is a superset of GBK (CP936), so I think it will hardly affect users. Maybe a test like is good enough? WDYT? |
|
@HeRaNO How about a test like this? |
Dear @HeRaNO, because we have a large number of users depending on Changes to the detect order should primarily affect One good start would be to write a little PHP program which exhaustively tests all 4 billion 4-byte strings. This should take less than an hour to run. Make it print out a list of all byte sequences for which That may give us a starting point for assessing the impact of the change, and how to test it. |
For clarity, if you write a script like I am suggesting, make sure to run it on your own version of PHP, with the detect order change compiled in! |
|
Well, something went wrong. <?php
function generate4ByteStrings(): Generator
{
for ($i = 0; $i < 256; $i++) {
for ($j = 0; $j < 256; $j++) {
for ($k = 0; $k < 256; $k++) {
for ($l = 0; $l < 256; $l++) {
$string = chr($i) . chr($j) . chr($k) . chr($l);
yield $string;
}
}
}
}
}
set_time_limit(0);
$oriEncodings = ['ASCII', 'UTF-8', 'EUC-CN', 'CP936'];
$newEncodings = ['ASCII', 'UTF-8', 'EUC-CN', 'CP936', 'GB18030-2022'];
$stringGenerator = generate4ByteStrings();
$eq = 0;
$new_detect = 0;
$wrong_detect = 0;
$changed_detect = 0;
$changed_detect_to_gb18030 = 0;
$pass = 0;
$fail = 0;
$count = 0;
foreach ($stringGenerator as $currentString) {
$count++;
// strict = false, will guess an encoding
$ori = mb_detect_encoding($currentString, $oriEncodings);
$new = mb_detect_encoding($currentString, $newEncodings);
// strict = true, will return `false` if encoding is not in list
$ori_strict = mb_detect_encoding($currentString, $oriEncodings, true);
$new_strict = mb_detect_encoding($currentString, $newEncodings, true);
if ($ori === $new) {
$eq++; // equal guessing, we're happy
} else {
if ($new === $new_strict) {
$new_detect++; // not equal guessing, but guess it right in the new way, also happy
} else {
if ($ori === $ori_strict) {
$wrong_detect++; // not equal guessing, the encoding is definite,
// and guess it wrong in the new way, not happy
} else {
$changed_detect++; // not equal guessing, the guessing changed, sigh...
if ($new == 'GB18030-2022') {
$changed_detect_to_gb18030++; // guessing changed to GB18030
}
}
}
}
if ($ori_strict === $new_strict) {
$pass++;
} else {
if ($new_strict === 'GB18030-2022' && $ori_strict === false) {
$pass++;
} else {
echo "String: " . bin2hex($currentString) . " from: " . $ori_strict . " to: " . $new_strict . "\n";
$fail++;
}
}
}
echo "Checked " . number_format($count) . " strings\n";
echo "Equal: " . number_format($eq) . "\n";
echo "New detected: " . number_format($new_detect) . "\n";
echo "Wrong detected: " . number_format($wrong_detect) . "\n";
echo "Changed detected: " . number_format($changed_detect) . "\n";
echo " Changed to GB18030: " . number_format($changed_detect_to_gb18030) . "\n";
echo "Pass: " . number_format($pass) . "\n";
echo "Fail: " . number_format($fail) . "\n";
?>Outputs:
Feel free to close this PR. |
|
@HeRaNO, I think it's too early to say that we should close the PR. Let's make sure we understand exactly what the effect of the PR is. We can see that there are a number of strings which were previously detected as CP-936, but are now detected as GB-18030-2022. The question is: Does that make sense? Do these strings actually appear more like GB-18030-2022? Or do they appear more like CP936? Any comment? Personally, I may have to refresh my memory on how CP936 works, since it's been a long time since I worked on the code for that encoding. |
|
Unfortunately, I'm not an expert on encoding. I tried to analyze a certain string in the log. The string is like: I tried to decode the string in both CP936 and GB18030-2022. For CP936, the first two bytes will remain as they are, and the last two bytes are not a valid CP936 character1. For GB18030-2022, things are the same as CP936, and the resulting decoded strings are the same. So the encoding of the string can be both CP936 and GB18030-2022. I think I can change the logic, which allows EUC-CN and CP936 to be detected as GB18030-2022. Is it the right thing to do? Footnotes |
|
In my opinion, I think it would be fine if GB18030-2022 were the standard in China. Hmm... Sorry, I'm not sure... |
|
Just thinking about this. Something is strange. In @HeRaNO's test script, he counts a test case as "fail" if: with strict encoding detection, the original detect order list resulted in "CP936", but now it results in "GB18030-2022". He picked a random "failing" case for further investigation, and found that it appears to be invalid in both CP936 and GB18030-2022. If that is so, then strict detection should have returned
I think we need to look at this a bit more closely and make sure we understand what is going on. |
GB18030-2022 is the current official standard, superseding the previous 2005 and 2000 versions. It is essential for modern Chinese text processing for the following reasons:
This PR adds GB18030-2022 to the default encoding list for CN.