What should I do about garbled text when using pyquery?

The code is:
3.py
# -*- coding=utf-8 -*-
from pyquery import PyQuery as pq
import codecs

d = pq(url='http://www.baidu.com/')
sep = d('body').html().decode('gb18030').encode('utf-8')
file = codecs.open('new.html', 'w', 'utf-8')
file.write(sep)
file.close()

Running python 3.py gives the following error:
Traceback (most recent call last):
  File "3.py", line 6, in <module>
    sep = d('body').html().decode('gb18030').encode('utf-8')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 146-149: ordinal not in range(128)

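Addendum 1 · 2013-01-30 00:05:00 +08:00

After replacing the original 'http://www.baidu.com/' with 'http://tieba.baidu.com/f?kw=宋时行', the output is garbled again. On inspection, the Baidu homepage is gb2312 while Tieba is gbk. Is pyquery unable to decode gbk automatically?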
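
One way to sidestep charset detection, sketched here in Python 3 with only the standard library: decode the raw bytes yourself with gb18030, which is a superset of both gbk and gb2312, and only then hand the text to the parser. The helper name `decode_chinese_page` and the sample string are made up for illustration; whether pyquery's own detection copes with gbk is not settled in this thread.

```python
def decode_chinese_page(raw_bytes):
    """Decode bytes from a page declared as gb2312 or gbk.

    gb18030 is a superset of both, so one codec covers either page.
    """
    return raw_bytes.decode('gb18030')


# '時' is a traditional character: encodable in gbk, absent from gb2312.
sample = u'宋時行'
raw = sample.encode('gbk')

# A strict gb2312 decode is expected to reject these bytes.
try:
    raw.decode('gb2312')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

text = decode_chinese_page(raw)
print(strict_ok, text == sample)
```

The decoded text, rather than a URL, could then be passed to the parser so that no guessing is involved.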

utf

HTML

15 replies

wong2

2013-01-29 21:40:54 +08:00

d('body').html() already returns unicode, so just call encode('utf-8') on it directly.
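
Why the original line fails, as a rough sketch: in Python 2, calling .decode() on a unicode object first encodes it to a byte string with the default ascii codec, and that hidden step is what raises UnicodeEncodeError. Python 3 removed str.decode entirely, so the snippet below (Python 3, sample text invented) spells the hidden step out explicitly instead of reproducing the Python 2 behaviour verbatim.

```python
html_text = u'<p>百度一下</p>'  # invented stand-in for d('body').html()

# The hidden step Python 2 performs before .decode('gb18030') can run:
try:
    html_text.encode('ascii')
    hidden_step_failed = False
except UnicodeEncodeError as exc:
    hidden_step_failed = True
    print(exc)

# What wong2 suggests instead: encode the text once, explicitly.
utf8_bytes = html_text.encode('utf-8')
print(utf8_bytes.decode('utf-8') == html_text)
```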

shanshuise

2013-01-29 22:19:54 +08:00

@wong2 After removing .decode('gb18030') it still raises an error:

Traceback (most recent call last):
  File "3.py", line 7, in <module>
    file.write(sep)
  File "/usr/lib/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 146: ordinal not in range(128)

swulling

2013-01-29 22:22:18 +08:00

@shanshuise Remove all the encode, decode, and codecs calls...

Plain open is enough. Why use codecs.open?

spark

2013-01-29 22:28:29 +08:00

Add:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

If that doesn't work:
input_file = unicode(open(file).read(), 'utf-8')
source = pq(input_file)

swulling

2013-01-29 22:29:51 +08:00

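@shanshuise Either of the following two approaches works.
html() returns unicode. With codecs.open the text is encoded to utf-8 automatically on write, whereas with plain open you have to encode it to utf-8 yourself.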

sep = d('body').html()
file = codecs.open('new.html', 'w', 'utf-8')
file.write(sep)
file.close()

sep = d('body').html()
file = open('new.html', 'w')
file.write(sep.encode('utf8'))
file.close()
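
For readers on Python 3, a minimal sketch of the same two approaches. The sample string, the temporary directory, and the file names `new1.html` and `new2.html` are assumptions made for illustration, and plain `open` needs binary mode here because the bytes are encoded by hand.

```python
import codecs
import os
import tempfile

sep = u'<p>新闻</p>'  # invented stand-in for d('body').html()
out_dir = tempfile.mkdtemp()

# Approach 1: codecs.open encodes on write, so pass it text.
path1 = os.path.join(out_dir, 'new1.html')
with codecs.open(path1, 'w', 'utf-8') as f:
    f.write(sep)

# Approach 2: open in binary mode and encode by hand, exactly once.
path2 = os.path.join(out_dir, 'new2.html')
with open(path2, 'wb') as f:
    f.write(sep.encode('utf-8'))

# Read both files back as raw bytes to compare what was written.
with open(path1, 'rb') as f1, open(path2, 'rb') as f2:
    bytes1, bytes2 = f1.read(), f2.read()
print(bytes1 == bytes2)
```

Either way the text is encoded exactly once, which is the point of both approaches.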

swulling

2013-01-29 22:31:06 +08:00

And if you encode to utf-8 by hand and the file then encodes it automatically a second time, you get an error.
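
A sketch of that double conversion (Python 3, sample text invented): in Python 2, handing an already-encoded str to a codecs.open writer makes it decode the bytes back with the ascii codec before encoding again, and that decode is the step that fails. The snippet performs the decode explicitly, since Python 3 does not do it implicitly.

```python
sep = u'新闻'                     # invented sample text
utf8_bytes = sep.encode('utf-8')   # the manual conversion

try:
    utf8_bytes.decode('ascii')     # the step Python 2 performs implicitly
    error = None
except UnicodeDecodeError as exc:
    error = exc
    print(exc)
```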

013231

2013-01-29 22:33:11 +08:00

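@shanshuise For a file opened with `codecs.open`, you should write unicode rather than str. The characters are encoded automatically on write, using the encoding you specified.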

shanshuise

2013-01-29 22:33:27 +08:00

import sys
reload(sys)
sys.setdefaultencoding('utf8')

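After adding the statements above, it works. I had assumed that using codecs.open meant I no longer needed them, so I'm not sure what is going on. Encoding really is complicated; my head is spinning.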

013231

2013-01-29 22:37:35 +08:00

@shanshuise If you use `setdefaultencoding` to solve this, `codecs.open` is pointless. Just use `open`.

shanshuise

2013-01-29 22:38:09 +08:00

@swulling Just saw this. The error was indeed caused by encoding twice.

But encoding to utf8 twice and also adding the statements below actually succeeded.

import sys
reload(sys)
sys.setdefaultencoding('utf8')

swulling

2013-01-29 22:38:40 +08:00

@shanshuise the use of sys.setdefaultencoding() has always been discouraged

Don't use it. Both approaches in my reply above are clean, so there is no need for this kind of dirty hack.

shanshuise

2013-01-29 22:39:09 +08:00

@013231 I understand now. Using codecs.open, removing setdefaultencoding, and removing the encode from the original code works correctly.

shanshuise

2013-01-29 22:40:05 +08:00

@swulling OK, learned something. Thanks a lot.

spark

2013-01-30 20:22:28 +08:00

@swulling Nice, learned something.

54dev

2013-09-12 17:21:05 +08:00

@swulling Does pyquery return unicode for all of its output?