Decoding a Python 3 str that has escaped bytes mixed in

Problem: "abc\\xe2\\x86\\x92" is a str that contains some wrongly escaped characters, and it needs to be decoded to "abc→". For reference, b'\xe2\x86\x92'.decode() yields →.

TL;DR

In [27]: "abc\\xe2\\x86\\x92".encode().decode('unicode-escape').encode('latin1').decode('utf-8')
Out[27]: 'abc→'


Explanation

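"abc\\xe2\\x86\\x92" is a str in which \\ is a single character: the first backslash escapes the second. That makes 15 characters in total. What I actually want is for the last 12 characters to be read as the UTF-8 encoding of a single Unicode character.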

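Python provides the unicode-escape codec, which encodes and decodes the literal (escaped) form of Unicode text. When decoding, an escape such as \xe2, which occupies four bytes, becomes a single character. When encoding, a Unicode character that takes three bytes in UTF-8 is first expanded into its escape form, for example \u4f60; counting the backslash, each character of that form occupies one byte, six bytes in all.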
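The sizes mentioned above can be checked directly. This is a small sketch; the variable names are only illustrative.

```python
# The escape sequence written out as text: backslash, 'x', 'e', '2'.
escaped = b"\\xe2"
print(len(escaped))  # 4

# Decoding with unicode-escape collapses those four bytes into one character.
decoded = escaped.decode("unicode-escape")
print(len(decoded), hex(ord(decoded)))  # 1 0xe2

# Encoding goes the other way: one character expands into its escape form.
encoded = "你".encode("unicode-escape")
print(encoded, len(encoded))  # b'\\u4f60' 6

# For comparison, the same character takes three bytes in UTF-8.
print(len("你".encode("utf-8")))  # 3
```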

The Python documentation describes this codec as follows: "Encoding suitable as the contents of a Unicode literal in ASCII-encoded Python source code, except that quotes are not escaped. Decodes from Latin-1 source code. Beware that Python source code actually uses UTF-8 by default."

unicode-escape actually does two things:

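Escaping. When encoding, it turns one Unicode character into several characters that spell out its literal form. When decoding, it turns those several characters back into one Unicode character.
The usual encode/decode job of converting between str and bytes, but using only the Latin-1 encoding.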

In [29]: "你好".encode("unicode-escape")
Out[29]: b'\\u4f60\\u597d'

In [30]: '\\u4f60\\u597d'.encode().decode("unicode-escape")
Out[30]: '你好'
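The example above shows the escaping half. The Latin-1 half can be seen with a byte that has no backslash escape at all; a minimal sketch, with illustrative variable names:

```python
# A single raw byte 0xE9, not an escape sequence.
raw = b"\xe9"

# unicode-escape maps an unescaped byte to the code point with the same
# value, which is exactly what Latin-1 decoding does.
as_escape = raw.decode("unicode-escape")
as_latin1 = raw.decode("latin-1")
print(as_escape == as_latin1)  # True

# The same byte on its own is not valid UTF-8.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)  # False
```

This is why the result of a unicode-escape decode has to be reinterpreted when the escaped bytes were really UTF-8.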


The second half of "abc\\xe2\\x86\\x92" really represents bytes, so the first step is to convert the str to bytes.

In [39]: b = "abc\\xe2\\x86\\x92".encode()

In [40]: len(b)
Out[40]: 15

In [41]: b
Out[41]: b'abc\\xe2\\x86\\x92'


This is now 15 bytes. The last 12 should become a single character, so decode them with unicode-escape.

In [42]: b.decode('unicode-escape')
Out[42]: 'abcâ\x86\x92'
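The output looks garbled, but inspecting the code points shows that nothing was lost: each of the last three characters carries the value of one of the intended UTF-8 bytes. A short sketch (variable names are illustrative):

```python
b = "abc\\xe2\\x86\\x92".encode()
s = b.decode("unicode-escape")

# Six characters now: 'a', 'b', 'c', plus one character per escaped byte.
print(len(s))  # 6

# The last three code points are exactly the byte values 0xE2, 0x86, 0x92.
code_points = [ord(c) for c in s[3:]]
print([hex(cp) for cp in code_points])  # ['0xe2', '0x86', '0x92']

# Those byte values are the UTF-8 encoding of the arrow character.
print(bytes(code_points) == "→".encode("utf-8"))  # True
```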


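As mentioned above, the decode step produced a str by treating each byte as Latin-1, but the bytes are really UTF-8. So encode the str back to bytes with latin-1, then decode those bytes as UTF-8.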

In [43]: _.encode('latin-1').decode('utf-8')
Out[43]: 'abc→'
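The whole chain can be wrapped in a helper. This is only a sketch: the function name unescape_utf8_bytes is made up for illustration, and it assumes the escaped bytes are valid UTF-8 and that the input contains no \u escapes for characters above U+00FF (those would fail at the latin-1 step).

```python
def unescape_utf8_bytes(text: str) -> str:
    """Turn literal \\xNN escapes in text into the UTF-8 characters they spell."""
    # Step 1: str -> bytes. The backslash escapes stay as plain ASCII text.
    raw = text.encode("utf-8")
    # Step 2: collapse each \xNN escape into one character with that value;
    # unescaped bytes are read as Latin-1.
    latin1_text = raw.decode("unicode-escape")
    # Step 3: map each character back to a byte with the same value,
    # then decode the resulting bytes as UTF-8.
    return latin1_text.encode("latin-1").decode("utf-8")


print(unescape_utf8_bytes("abc\\xe2\\x86\\x92"))  # abc→

# Real non-ASCII characters already in the str survive the round trip,
# because their UTF-8 bytes pass through steps 2 and 3 unchanged.
print(unescape_utf8_bytes("你\\xe2\\x86\\x92"))  # 你→
```

If the escaped bytes do not form valid UTF-8, the final decode raises UnicodeDecodeError, so callers may want to catch that.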

