How do I decode this string in Python so that it displays correctly?

u'\xe8\xb4\xa2\xe5\x8a\xa1/\xe9\x87\x91\xe8\x9e\x8d/\xe4\xbf\x9d\xe9\x99\xa9'

As shown above. Could someone who knows this well please advise?

12 replies • 2016-07-20 13:40:25 +08:00

felixzhu

Jul 19, 2016

Here you go:
print ''.join([chr(ord(i)) for i in u'\xe8\xb4\xa2\xe5\x8a\xa1/\xe9\x87\x91\xe8\x9e\x8d/\xe4\xbf\x9d\xe9\x99\xa9'])

jerryrong

Jul 19, 2016

财务/金融/保险 (finance/banking/insurance)?

lowzoom

Jul 19, 2016

>>> print('\xe8\xb4\xa2\xe5\x8a\xa1/\xe9\x87\x91\xe8\x9e\x8d/\xe4\xbf\x9d\xe9\x99\xa9'.decode('utf-8'))
财务/金融/保险

Yanickkk

Jul 19, 2016

http://stackoverflow.com/questions/5649407/hexadecimal-string-to-byte-array-in-python

lightning1141

Jul 19, 2016

a = u'\xe8\xb4\xa2\xe5\x8a\xa1/\xe9\x87\x91\xe8\x9e\x8d/\xe4\xbf\x9d\xe9\x99\xa9'
print a.encode('latin1').decode('utf8')

http://stackoverflow.com/questions/9845842/bytes-in-a-unicode-python-string

I think the real problem to solve is where this mis-encoded string came from in the first place :D

changshu

Jul 19, 2016

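What the OP is asking is why input of the form u'unicode data' turns into u'the byte values of that unicode data'. A shell opened with code.interact on Windows has this problem; it seems to be related to the default encoding.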

zungmou

Jul 19, 2016

@yannxia

Thanks!

x = u'\xe8\xb4\xa2\xe5\x8a\xa1/\xe9\x87\x91\xe8\x9e\x8d/\xe4\xbf\x9d\xe9\x99\xa9'
y = '\xe8\xb4\xa2\xe5\x8a\xa1/\xe9\x87\x91\xe8\x9e\x8d/\xe4\xbf\x9d\xe9\x99\xa9'

Given these two forms, why does print fail to display x correctly? If I want x to display correctly, what should I do?

zungmou

Jul 19, 2016

@lightning1141 Thank you very much! How did you know this string was encoded as latin1?

lightning1141

Jul 19, 2016

@zungmou
latin1 is used here only as a conversion step. It takes advantage of latin1 being a single-byte encoding; you can look up the details.
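
The single-byte property mentioned above can be checked directly. The following is a minimal sketch written for Python 3 (an assumption; the thread itself uses Python 2), where `str` plays the role of Python 2's `unicode`. The variable names are illustrative only.

```python
# Python 3 sketch: latin-1 maps code points U+0000..U+00FF to the bytes
# 0x00..0xFF one-to-one, so a round trip through it loses nothing.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")
assert [ord(ch) for ch in as_text] == list(range(256))
assert as_text.encode("latin-1") == all_bytes

# The string from the question: every character is below U+0100, so
# encoding as latin-1 simply recovers the underlying byte values.
garbled = '\xe8\xb4\xa2\xe5\x8a\xa1/\xe9\x87\x91\xe8\x9e\x8d/\xe4\xbf\x9d\xe9\x99\xa9'
raw = garbled.encode("latin-1")

# Those bytes are UTF-8 in this case, so decoding them gives the real text.
fixed = raw.decode("utf-8")
print(fixed)
```

latin1 is not a guess about where the data came from; it is just a lossless way to turn code points below U+0100 back into bytes.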

Magic347

Jul 19, 2016

http://stackoverflow.com/questions/14539807/convert-unicode-with-utf-8-string-as-content-to-str
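The tricky part of solving this is that it relies on the following property:
Unicode codepoints U+0000 to U+00FF all map one-on-one with the latin-1 encoding

First, encode the unicode string as latin1. The encoded result preserves the same byte values as the original code points.
In this question, that byte sequence happens to be valid UTF-8, so decoding it as utf8 recovers the original unicode characters.
Note, however, that you need to know whether bytes such as \xe8\xb4\xa2 are UTF-8 or some other encoding such as GBK.
This matters just as much for recovering the unicode characters correctly.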

>>> x = u'\xe8\xb4\xa2\xe5\x8a\xa1/\xe9\x87\x91\xe8\x9e\x8d/\xe4\xbf\x9d\xe9\x99\xa9'

>>> x.encode("latin1")
'\xe8\xb4\xa2\xe5\x8a\xa1/\xe9\x87\x91\xe8\x9e\x8d/\xe4\xbf\x9d\xe9\x99\xa9'

>>> x.encode("latin1").decode("utf8")
u'\u8d22\u52a1/\u91d1\u878d/\u4fdd\u9669'

>>> print x.encode("latin1").decode("utf8")
财务/金融/保险
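
The caveat about UTF-8 versus GBK can be explored with a small experiment. This is only a sketch for Python 3 (an assumption, since the session above is Python 2), and `decode_candidates` is a hypothetical helper, not a library function. It recovers the bytes via latin1 and then tries each candidate encoding, keeping the ones that decode without error.

```python
def decode_candidates(garbled, encodings=("utf-8", "gbk")):
    """Hypothetical helper: try each candidate encoding on the recovered bytes."""
    # Raises UnicodeEncodeError if the input has characters above U+00FF.
    raw = garbled.encode("latin-1")
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding; skip it.
            continue
    return results


x = '\xe8\xb4\xa2\xe5\x8a\xa1/\xe9\x87\x91\xe8\x9e\x8d/\xe4\xbf\x9d\xe9\x99\xa9'
results = decode_candidates(x)
for name, text in results.items():
    print(name, repr(text))
```

A decode that succeeds is not proof that the encoding is the right one. More than one candidate may decode without raising, so the results still have to be checked against what the text is expected to say.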

zungmou

Jul 19, 2016

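@Magic347 Thanks! Python's handling of encodings feels a bit complicated. If you do not understand it well it is easy to run into problems, but once you do, it is more flexible than other languages!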

lilydjwg

Jul 20, 2016

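@zungmou A mis-encoded string like yours is fairly complicated to handle in any language. Python 2 is harder to reason about because it converts between encodings automatically. Python 3 explicitly separates encoded data from unencoded Unicode strings, which makes it easier to understand.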
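
As a rough illustration of that last point, here is a minimal Python 3 sketch (variable names are illustrative). It shows that `str` and `bytes` do not mix implicitly, and how a string like the one in the question can come about when UTF-8 bytes are decoded with the wrong codec.

```python
text = "财务/金融/保险"        # str: Unicode text
data = text.encode("utf-8")    # bytes: encoded data
assert isinstance(data, bytes)
assert data.decode("utf-8") == text

# Python 3 refuses to combine text and bytes implicitly.
mixed_error = None
try:
    text + data
except TypeError as exc:
    mixed_error = type(exc).__name__
print(mixed_error)

# One way the garbled string can arise: UTF-8 bytes decoded as latin-1.
garbled = data.decode("latin-1")
assert garbled.encode("latin-1").decode("utf-8") == text
```

Whether the string in the question was actually produced this way is not known from the thread; the sketch only shows one mechanism that yields the same result.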