How Python's JSON Library Handles UTF-8 Characters

Published: 2023-10-13 16:09:09 | Author: BuckyI

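Default behavior

By default, json.dump in the json module escapes every non-ASCII character (Chinese characters, for example) as a Unicode escape sequence, so the resulting file contains only ASCII characters.

Take the following code: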

import json

with open("text.json", "w") as f:
    data = {'1':111,'2':"你好", '3':"Hello", '4':"🎃"}
    json.dump(data, f)

The resulting text.json contains:

{"1": 111, "2": "\u4f60\u597d", "3": "Hello", "4": "\ud83c\udf83"}
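The same default can be observed in memory with json.dumps, which takes the same ensure_ascii parameter as json.dump. A minimal sketch (the variable names are illustrative only):

```python
import json

data = {"2": "你好", "4": "\U0001f383"}  # "\U0001f383" is the pumpkin emoji

escaped = json.dumps(data)                  # default: ensure_ascii=True
raw = json.dumps(data, ensure_ascii=False)  # keep non-ASCII characters as-is

print(escaped.isascii())  # True: only ASCII characters in the output
print(raw.isascii())      # False: the original characters are preserved

# Both forms decode back to the same Python object
print(json.loads(escaped) == json.loads(raw) == data)  # True
```

Note that the emoji lies outside the Basic Multilingual Plane, so the escaped form represents it as a surrogate pair (two \uXXXX sequences).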

As permitted, though not required, by the RFC, this module’s serializer sets ensure_ascii=True by default, thus escaping the output so that the resulting strings only contain ASCII characters.
source: https://docs.python.org/3.11/library/json.html#character-encodings

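RFC 7159 specifies the JSON data format and states that the default text encoding is UTF-8, whereas Python chooses to escape everything to ASCII characters by default. A possible reason is discussed later in this post.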

JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default encoding is UTF-8, and JSON texts that are encoded in UTF-8 are interoperable in the sense that they will be read successfully by the maximum number of implementations; there are many implementations that cannot successfully read texts in other encodings (such as UTF-16 and UTF-32).
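Python's ASCII-only output does not conflict with the UTF-8 default in the RFC, because ASCII is a subset of UTF-8. A small sketch to illustrate the point (names are illustrative only):

```python
import json

text = json.dumps({"2": "你好"})  # ASCII-only because ensure_ascii=True by default

# ASCII is a subset of UTF-8, so both encodings give identical bytes
as_ascii = text.encode("ascii")
as_utf8 = text.encode("utf-8")
print(as_ascii == as_utf8)  # True

# A UTF-8 reader therefore decodes the escaped output without problems
print(json.loads(as_utf8.decode("utf-8")) == {"2": "你好"})  # True
```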

Writing characters as-is

To write the characters unescaped, specify the encoding when opening the file with open, and disable ensure_ascii:

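import json

# Open the file as UTF-8 explicitly and turn off ASCII escaping
with open("text.json", "w", encoding="utf-8") as f:
    data = {'1':111,'2':"你好", '3':"Hello", '4':"🎃"}
    json.dump(data, f, ensure_ascii=False)

The resulting text.json contains:

{"1": 111, "2": "你好", "3": "Hello", "4": "🎃"}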
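To check that such a file round-trips, one option is to write it with encoding="utf-8" and ensure_ascii=False, then read it back with the same encoding. The sketch below assumes a temporary directory so that no real file is touched; the path and variable names are illustrative only.

```python
import json
import os
import tempfile

data = {"1": 111, "2": "你好", "3": "Hello", "4": "\U0001f383"}

# A temporary directory keeps the sketch from touching real files
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "text.json")

    # Write with an explicit encoding so the result does not depend on the platform
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False)

    # Read back with the same encoding
    with open(path, encoding="utf-8") as f:
        restored = json.load(f)

    # Inspect the raw bytes that were written
    with open(path, "rb") as f:
        raw_bytes = f.read()

print(restored == data)     # True
print(b"\\u" in raw_bytes)  # False: no \uXXXX escapes in the file
```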

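Additional note: why specify the encoding explicitly?

The default encoding used by Python's open depends on the platform and is not always UTF-8. On Windows, for example, it is the "ANSI code page". On my machine that is gbk, so writing an emoji character raised an error: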

UnicodeEncodeError: 'gbk' codec can't encode character '\U0001f383' in position 1: illegal multibyte sequence
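The platform default can be inspected with locale.getpreferredencoding(False), and the failure can be reproduced on any platform by encoding with gbk explicitly. A minimal sketch (variable names are illustrative only):

```python
import locale

# Encoding that open() falls back to when no encoding argument is given
# (unless UTF-8 mode is enabled); the value depends on the platform
print(locale.getpreferredencoding(False))

# Reproduce the failure on any platform by encoding with gbk explicitly
pumpkin = "\U0001f383"
try:
    pumpkin.encode("gbk")
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print(exc.reason)

# Characters that gbk covers encode without error
encoded = "你好".encode("gbk")
print(len(encoded))  # 4: two bytes per character
```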

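I suspect this is also why Python's JSON library escapes characters by default: since it cannot guarantee that every file it writes to uses the same encoding, it settles on ASCII-only output.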

Manually decoding escaped characters

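import codecs
import json

s = r"\u4f60\u597d" # 你好
# Decode the escapes with the unicode-escape codec
print(codecs.decode(s, 'unicode-escape'))
# Evaluate as a Python string literal (eval is only safe on trusted input)
print(eval('"' + s + '"'))

# A surrogate pair needs json.loads, which joins the two halves into one character
s = r"\ud83c\udf83" # 🎃
print(json.loads('"' + s + '"'))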
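The emoji case goes through json.loads for a reason: the unicode-escape codec decodes a surrogate pair into two separate surrogate code points, while the JSON decoder joins them into one character. The sketch below illustrates the difference; the UTF-16 round trip with the surrogatepass error handler is one possible way to repair such a string, not something the original code uses.

```python
import codecs
import json

s = r"\ud83c\udf83"  # escaped surrogate pair for U+1F383

via_codec = codecs.decode(s, "unicode-escape")  # two separate surrogate code points
via_json = json.loads('"' + s + '"')            # one combined character

print(len(via_codec), len(via_json))  # 2 1

# Round-tripping through UTF-16 with surrogatepass joins the two halves
joined = via_codec.encode("utf-16", "surrogatepass").decode("utf-16")
print(joined == via_json)  # True
```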