Commit 1f3fe93

Add GB18030-2022 to default encoding list for zh-CN (#20604)
GB18030-2022 is the current official standard, superseding the previous 2005 and 2000 versions. It is essential for modern Chinese text processing for the following reasons:

1. Superset Relationship: GB18030 is a strict superset of CP936 (GBK) and EUC-CN (GB2312). Using GB18030 as the detection target covers all characters in these older encodings while enabling support for a much wider range of characters.
2. Extended Character Coverage: The 2022 standard includes significant updates, covering over 87,000 characters. It adds support for CJK Extensions (C, D, E, F, G) and updates mappings for rare characters that were previously mapped to the Private Use Area (PUA) in the 2005 version. This is critical for correctly handling names containing rare characters (e.g., in banking or government data).
3. Backward Compatibility: It is safe to promote GB18030-2022 as the preferred encoding. Files encoded in EUC-CN or CP936 are valid GB18030 streams.

This PR adds GB18030-2022 to the default encoding list for CN.
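The backward-compatibility point can be illustrated with a byte-level check. The following is a minimal sketch in C, not the libmbfl implementation: the function names `gb18030_well_formed` and `euc_cn_well_formed` are hypothetical, and the check covers only the byte-range structure of each encoding (one-, two-, and four-byte forms), not whether a given sequence is actually assigned a character in GB18030-2022.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if buf[0..len) is structurally
 * well-formed GB18030, else 0. Only byte ranges are checked; whether
 * each sequence is assigned a character is NOT checked. */
static int gb18030_well_formed(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b1 = buf[i];
        if (b1 <= 0x7F) {                 /* one-byte form (ASCII range) */
            i += 1;
            continue;
        }
        if (b1 < 0x81 || b1 > 0xFE || i + 1 >= len)
            return 0;                     /* not a lead byte, or truncated */
        unsigned char b2 = buf[i + 1];
        if ((b2 >= 0x40 && b2 <= 0x7E) || (b2 >= 0x80 && b2 <= 0xFE)) {
            i += 2;                       /* two-byte form */
            continue;
        }
        if (b2 >= 0x30 && b2 <= 0x39) {   /* four-byte form */
            if (i + 3 >= len)
                return 0;                 /* truncated */
            unsigned char b3 = buf[i + 2];
            unsigned char b4 = buf[i + 3];
            if (b3 < 0x81 || b3 > 0xFE || b4 < 0x30 || b4 > 0x39)
                return 0;
            i += 4;
            continue;
        }
        return 0;                         /* invalid second byte */
    }
    return 1;
}

/* Hypothetical helper: returns 1 if buf[0..len) is structurally
 * well-formed EUC-CN (GB2312): ASCII, or a lead byte 0xA1-0xF7
 * followed by a trail byte 0xA1-0xFE. */
static int euc_cn_well_formed(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b1 = buf[i];
        if (b1 <= 0x7F) {
            i += 1;
            continue;
        }
        if (b1 < 0xA1 || b1 > 0xF7 || i + 1 >= len)
            return 0;
        unsigned char b2 = buf[i + 1];
        if (b2 < 0xA1 || b2 > 0xFE)
            return 0;
        i += 2;
    }
    return 1;
}
```

Because the EUC-CN lead range 0xA1-0xF7 and trail range 0xA1-0xFE both fall inside the GB18030 two-byte ranges, any input accepted by `euc_cn_well_formed` is also accepted by `gb18030_well_formed`. For example, the bytes D6 D0 CE C4 (the GB2312 encoding of 中文) pass both checks, while the four-byte sequence 95 32 82 36 (U+20000, the first CJK Extension B code point) passes only the GB18030 check.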
1 parent 19deb91 commit 1f3fe93

4 files changed: +30 -1 lines changed

NEWS

Lines changed: 1 addition & 0 deletions
@@ -25,6 +25,7 @@ PHP NEWS
 - Mbstring:
   . ini_set() with mbstring.detect_order changes the order of mb_detect_order
     as intended, since mbstring.detect_order is an INI_ALL setting. (tobee94)
+  . Added GB18030-2022 to default encoding list for zh-CN. (HeRaNO)
 
 - Opcache:
   . Fixed bug GH-20051 (apache2 shutdowns when restart is requested during

UPGRADING.INTERNALS

Lines changed: 3 additions & 0 deletions
@@ -64,6 +64,9 @@ PHP 8.6 INTERNALS UPGRADE NOTES
   . Removed the XML_GetCurrentByteCount() libxml compatibility wrapper,
     as it was unused and could return the wrong result.
 
+- ext/mbstring:
+  . Added GB18030-2022 to default encoding list for zh-CN.
+
 ========================
 4. OpCode changes
 ========================

ext/mbstring/mbstring.c

Lines changed: 2 additions & 1 deletion
@@ -116,7 +116,8 @@ static const enum mbfl_no_encoding php_mb_default_identify_list_cn[] = {
 	mbfl_no_encoding_ascii,
 	mbfl_no_encoding_utf8,
 	mbfl_no_encoding_euc_cn,
-	mbfl_no_encoding_cp936
+	mbfl_no_encoding_cp936,
+	mbfl_no_encoding_gb18030_2022
 };
 
 static const enum mbfl_no_encoding php_mb_default_identify_list_tw_hk[] = {
Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+--TEST--
+Default encodings in Simplified Chinese
+--EXTENSIONS--
+mbstring
+--INI--
+mbstring.language=Simplified Chinese
+--FILE--
+<?php
+var_dump(mb_detect_order());
+
+?>
+--EXPECT--
+array(5) {
+  [0]=>
+  string(5) "ASCII"
+  [1]=>
+  string(5) "UTF-8"
+  [2]=>
+  string(6) "EUC-CN"
+  [3]=>
+  string(5) "CP936"
+  [4]=>
+  string(12) "GB18030-2022"
+}
