Encoding in Python 2

In Python 2, both str and unicode are subclasses of basestring.

A str is the sequence of bytes you get by encoding a unicode string.

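A UTF-8 encoded str uses a variable number of bytes per character (1 to 4), which saves space. A unicode object stores every character in a fixed-width unit in memory (2 or 4 bytes, depending on how Python was built), which wastes space for mostly-ASCII text. That is why UTF-8 is the usual choice for network transmission and for saving files.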
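As a rough illustration of the size difference, the sketch below (Python 3 syntax, with sample characters picked here purely for illustration) compares UTF-8 with UTF-32, a fixed-width encoding used as a stand-in for the fixed-width in-memory form:

```python
# Compare a variable-width encoding (UTF-8) with a fixed-width one (UTF-32).
# The sample characters are illustrative choices, not taken from the article:
# Latin "a", e-acute, the CJK character U+4F60, and an emoji outside the BMP.
samples = ["a", "\u00e9", "\u4f60", "\U0001F600"]

# "utf-32-le" is used so that no byte-order mark is added to the output.
utf8_sizes = [len(ch.encode("utf-8")) for ch in samples]
utf32_sizes = [len(ch.encode("utf-32-le")) for ch in samples]

for ch, n8, n32 in zip(samples, utf8_sizes, utf32_sizes):
    # {ch!a} prints an ASCII-safe representation of the character.
    print(f"{ch!a}: {n8} byte(s) in UTF-8, {n32} bytes in UTF-32")
```

The UTF-8 sizes grow with the code point, while the UTF-32 size is the same for every character.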

             basestring
                /  \
               /    \ 
              /      \
             /        \
            /          \
           /            \
          /   decode     \    
        str  -------->  unicode
             <--------
              encode('utf-8')

The following code shows the relationship between str and unicode.

In [20]: s = 'hello'

In [21]: isinstance(s, str)
Out[21]: True

In [24]: new_s = s.decode()

In [25]: isinstance(s, unicode)
Out[25]: False

In [26]: isinstance(new_s, unicode)
Out[26]: True

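Note that when you work with strings, you should make sure, as far as possible, that they are either all str or all unicode. Otherwise you may run into encoding errors like the one below.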

In [14]: s
Out[14]: u'hello{}'

In [15]: s.format('你好')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-15-f4ff66fca54d> in <module>()
----> 1 s.format('你好')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

In [16]: s.format(u'你好')
Out[16]: u'hello\u4f60\u597d'

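The code above formats a unicode template with a non-ASCII str argument. Python 2 implicitly decodes the str with the ASCII codec, which fails. It is an error that most Python programmers run into sooner or later.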
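One way to avoid the error is to decode the byte string explicitly before formatting, instead of letting Python 2 fall back to ASCII. A minimal sketch, assuming the bytes are UTF-8 encoded (the variable names are illustrative); it is written so that it should run on both Python 2 and Python 3:

```python
template = u'hello{}'

# UTF-8 bytes for the two characters U+4F60 U+597D. In Python 2 this is
# what a plain (non-unicode) literal holds when the source file is UTF-8.
raw = b'\xe4\xbd\xa0\xe5\xa5\xbd'

# Decode explicitly so that both operands are unicode text.
greeting = template.format(raw.decode('utf-8'))
print(repr(greeting))
```

The point of the sketch is that the decoding step names the codec, so nothing is left to the ASCII default.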

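As the output shows, when a unicode string that contains Chinese characters is displayed through its repr, the characters show up as \uXXXX escapes. One optional workaround is to use a regular expression to replace the escapes with the characters they stand for.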

In [17]: def zhprint(obj):
    ...:     import re
    ...:     print re.sub(r"\\u([a-f0-9]{4})",
    ...:                  lambda mg: unichr(int(mg.group(1), 16)),
    ...:                  repr(obj))
    ...: 

In [18]: zhprint(s.format(u'你好'))
u'hello你好'

# 涛吴@https://www.zhihu.com/question/20413029/answer/15064222
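
For readers on Python 3, a rough port of the same idea is sketched below. It is an adaptation, not the original author's code: the name `unescape_repr` is hypothetical, `ascii()` stands in for the Python 2 `repr`, and `chr` replaces `unichr`. Python 3 already shows non-ASCII characters in `repr()`, so the helper mainly illustrates how the regular expression works.

```python
import re


def unescape_repr(obj):
    """Return ascii(obj) with 4-digit unicode escapes turned back into characters."""
    escaped = ascii(obj)  # non-ASCII characters become backslash-u escapes
    return re.sub(
        r"\\u([a-f0-9]{4})",                         # backslash, "u", four hex digits
        lambda match: chr(int(match.group(1), 16)),  # hex digits -> character
        escaped,
    )


# Hypothetical usage: the result keeps the surrounding quotes from ascii().
result = unescape_repr("hello\u4f60\u597d")
```

Like the original, the pattern only handles 4-digit lowercase escapes, so characters outside the Basic Multilingual Plane are left escaped.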

Encoding in Python 3

In Python 3, strings are Unicode text. The separate str and unicode types are gone: there is only str, and it holds Unicode code points.

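A str can be encoded into bytes with the encode() method. Network transmission and file storage use bytes, which saves space. bytes can be converted back into a str with the decode() method.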

UTF-8 is the recommended encoding for these conversions. Note that str and bytes are Python objects, whereas UTF-8 is an encoding.
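
A minimal round-trip sketch in Python 3 (the sample text is an illustrative choice) shows the two directions and how the lengths differ:

```python
text = "hello \u4f60\u597d"       # a str holding 8 Unicode code points

data = text.encode("utf-8")      # str -> bytes
restored = data.decode("utf-8")  # bytes -> str

# len() counts code points for str and bytes for bytes.
print(type(text).__name__, len(text))
print(type(data).__name__, len(data))
print(restored == text)
```

The two Chinese characters take three bytes each in UTF-8, which is why the bytes object is longer than the str.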

It is a good idea to put an encoding declaration like this at the top of the file:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

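In fact, Python only looks for the #, the word coding, and the encoding name; the other characters are there purely for looks. The line tells the Python interpreter which encoding the source file uses. Without it, Python 2 assumes ASCII and cannot parse a file that contains Chinese characters. Python 3 defaults to UTF-8, so there the declaration is optional.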
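
To see which parts of the line actually matter, the Python 3 standard library exposes the same detection through `tokenize.detect_encoding`. The sketch below wraps it in a small helper; the name `declared_encoding` and the sample sources are hypothetical, chosen only for illustration.

```python
import io
import tokenize


def declared_encoding(source):
    """Return the encoding Python 3 would use to read these source bytes."""
    # detect_encoding reads at most the first two lines via a readline callable
    # and returns (encoding_name, lines_already_read).
    encoding, _lines = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


decorated = b"#!/usr/bin/env python3\n# -*- coding: gbk -*-\nx = 1\n"
plain = b"# coding=gbk\nx = 1\n"
missing = b"x = 1\n"

print(declared_encoding(decorated))  # the -*- decoration is not required
print(declared_encoding(plain))
print(declared_encoding(missing))    # falls back to the Python 3 default
```

The decorated and the plain form are detected identically, and a file without any declaration falls back to UTF-8 on Python 3.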

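At PyCon 2012 there was a talk titled "Pragmatic Unicode, or, How Do I Stop the Pain?" that covers the topic in some detail. The talk puts forward five facts and three pieces of advice for dealing with Unicode: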

There are five facts we cannot ignore:

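All input to and output from a program is bytes.
The world's text needs more than 256 symbols to represent it.
Your program has to deal with both bytes and unicode.
A stream of bytes does not carry its encoding with it.
A declared encoding can be wrong.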
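
The last two facts can be sketched in a few lines of Python 3. The sample bytes are an illustrative choice; the point is that the same bytes decode to different text, or fail outright, depending on which encoding is assumed.

```python
data = "\u4f60\u597d".encode("utf-8")   # six bytes with no encoding label attached

as_utf8 = data.decode("utf-8")          # the intended two characters
as_latin1 = data.decode("latin-1")      # succeeds, but yields six wrong characters

try:
    data.decode("ascii")                # a wrong guess can also fail outright
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

print(len(as_utf8), len(as_latin1), ascii_failed)
```

Decoding as Latin-1 never raises, so a wrong declaration of that kind goes unnoticed until someone reads the garbled text.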

Here are three pieces of advice for keeping Unicode clean in your code:

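Unicode sandwich: as far as possible, make all the text your program handles Unicode.
Know your strings. You should know which values in your program are unicode and which are bytes, and for the byte strings you should know what their encoding is.
Test your Unicode support. Use some unusual symbols to check that you have actually done the above.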
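
The Unicode sandwich can be sketched as follows in Python 3: decode at the input edge, work on str in the middle, and encode at the output edge. The function `shout` and its default encoding are hypothetical, chosen only to illustrate the shape.

```python
def shout(raw, encoding="utf-8"):
    """Upper-case text that arrives as bytes and leaves as bytes."""
    text = raw.decode(encoding)        # bottom slice: bytes -> str on input
    processed = text.upper()           # filling: all processing happens on str
    return processed.encode(encoding)  # top slice: str -> bytes on output


# Test with unusual symbols, as the third piece of advice suggests.
incoming = "h\u00e9llo \u4f60\u597d".encode("utf-8")
outgoing = shout(incoming)
print(outgoing.decode("utf-8") == "H\u00c9LLO \u4f60\u597d")
```

Keeping the encoding as an explicit parameter follows the second piece of advice: the caller has to know what the bytes are encoded as.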

Unicode also came up at PyCon 2014 and PyCon 2015; if you are interested, those two talk videos are worth watching.
