A bug in lxml's xpath: has anyone else run into it?


I've recently been crawling user profile pages on NetEase Cloud Music,
and I've run into a problem:
https://music.163.com/user/home?id=29879272
https://music.163.com/user/home?id=132128

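These are two user profile pages. For 29879272, etree.HTML(source) parses the HTML completely.

For 132128 it breaks: after etree.HTML(source), the parsed HTML turns out to be truncated.

It gets cut off at the double dash ("——") in the description field of the page source. Really bizarre.

Could someone more experienced take a look? I think this is a bug.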


7 replies • 2018-11-14 17:22:25 +08:00

itskingname

Nov 10, 2018 via iPhone

With lxml, don't use etree.HTML. Use this instead:

from lxml.html import fromstring

selector = fromstring(source)
selector.xpath(...)

smallgoogle

Nov 10, 2018

@itskingname Same result. It still gets truncated at that double dash.

ioven

Nov 11, 2018

>音乐人。不定义,不局限。 —\\u0000 —\\u0000 微博

It is being truncated by the null character (\u0000), not by the dash. Replace it and you're fine.
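One way to confirm this diagnosis is to scan the raw source for NUL characters and print what surrounds each one. This is only a minimal sketch: `find_nul_context` and the `sample` string are hypothetical names made up for illustration, and it assumes the page source is already a decoded `str`.

```python
def find_nul_context(source: str, width: int = 20):
    """Return (offset, surrounding text) for each NUL character in source."""
    hits = []
    start = source.find("\u0000")
    while start != -1:
        lo = max(0, start - width)
        # Keep `width` characters on each side of the NUL for context.
        hits.append((start, source[lo:start + width + 1]))
        start = source.find("\u0000", start + 1)
    return hits


# Hypothetical sample standing in for the real page source.
sample = "<p>before \u0000 after</p>"
for offset, context in find_nul_context(sample):
    print(offset, repr(context))
```

If the offsets line up with the point where the parsed tree stops, the NUL character is the likely culprit.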

smallgoogle

Nov 12, 2018

@ioven So I need to do the replacement first, before parsing?

ioven

Nov 12, 2018

@smallgoogle selector = fromstring(source.replace('\u0000', ''))

After that, use it as usual.
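The same fix can be wrapped in a small helper so that every page goes through the cleanup before parsing. This is a sketch, not the one right way to do it: `strip_nul` and `parse_page` are hypothetical names, and the `bytes` branch is an assumption for the case where the source comes straight from an HTTP response body rather than as a decoded string.

```python
def strip_nul(source):
    """Remove NUL characters from str or bytes input."""
    if isinstance(source, bytes):
        return source.replace(b"\x00", b"")
    return source.replace("\u0000", "")


def parse_page(source):
    """Parse HTML after stripping NUL characters."""
    # lxml is third-party; importing it here keeps strip_nul usable without it.
    from lxml.html import fromstring
    return fromstring(strip_nul(source))
```

`parse_page(source).xpath(...)` can then be used in place of calling `fromstring(source)` directly.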

rocketman13

Nov 13, 2018

I recently hit the same thing when writing to a PostgreSQL database: \u0000 could not be parsed. Replacing it out of the string fixed it.
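For the database case, a rough sketch of cleaning a whole record before it is written might look like the following. PostgreSQL text values cannot contain the NUL character, so the idea is to strip it from every string in the row. `clean_value` is a hypothetical helper, and the record layout in the usage line is invented for illustration.

```python
def clean_value(value):
    """Recursively strip NUL characters from strings inside a record."""
    if isinstance(value, str):
        return value.replace("\u0000", "")
    if isinstance(value, dict):
        return {key: clean_value(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        # Rebuild the same container type with cleaned items.
        return type(value)(clean_value(item) for item in value)
    return value  # numbers, None, etc. pass through unchanged


# Hypothetical scraped record.
row = clean_value({"nickname": "abc\u0000", "tags": ["x\u0000y", 1]})
```

The cleaned `row` can then be passed to whatever database driver is in use.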

canwushuang

Nov 14, 2018

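This happens because the Python module is written in C (lxml wraps the C library libxml2). \u0000 is the Unicode notation for the NUL character, and C uses that character as the end-of-string marker.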
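The difference can be seen from Python itself with the standard `ctypes` module, which exposes a C-style view of a byte string. This is only an illustration of NUL-terminated strings in general, not of what lxml does internally; the variable names are made up for the example.

```python
import ctypes

# Python stores the length explicitly, so the NUL is just another byte.
data = "abc\u0000def".encode("utf-8")
python_length = len(data)

# A C char pointer is read up to the first NUL byte, so the rest is lost.
c_view = ctypes.c_char_p(data).value

print(python_length)
print(c_view)
```

Any C code that treats the page source as a NUL-terminated string will see only the part before the first \u0000, which would match the truncation described in the original post.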