Can lxml.etree not handle incomplete HTML? If I really need to run XPath against an extracted snippet of code, how should I do it?

  qazwsxkevin · 2019-07-05 10:07:27 +08:00 · 3551 views
    
        from lxml import etree

        html = etree.parse(htmlStr, etree.HTMLParser())  # htmlStr comes from the complete HTML file; this step works
        result = html.xpath('//*[@div="info"]')

        tmpStr = ''
        for st in result:
            divSetion = etree.tostring(st, encoding="unicode", pretty_print=True, method="html")
            if (xxxxxxx) in divSetion:
                tmpStr = divSetion  # successfully got the snippet
            else:
                exit(0)

        # At this point tmpStr definitely has content; if the condition is met, I want to run an XPath query against this snippet
        # html = etree.parse(tmpStr, etree.HTMLParser())
        html = etree.parse(tmpStr)  # this step fails
        result = html.xpath('//*[@class="homeinfo"]')
        for st in result:  # test output to see whether anything matched
            print(st)
    

    Excerpt of the error output from PyCharm:

    
    Traceback (most recent call last):
      File "D:/Mycode/tedital.py", line 55, in <module>
        html = etree.parse(MatchDetailed)
      File "src\lxml\etree.pyx", line 3435, in lxml.etree.parse
      File "src\lxml\parser.pxi", line 1840, in lxml.etree._parseDocument
      File "src\lxml\parser.pxi", line 1866, in lxml.etree._parseDocumentFromURL
      File "src\lxml\parser.pxi", line 1770, in lxml.etree._parseDocFromFile
      File "src\lxml\parser.pxi", line 1163, in lxml.etree._BaseParser._parseDocFromFile
      File "src\lxml\parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
      File "src\lxml\parser.pxi", line 711, in lxml.etree._handleParseResult
      File "src\lxml\parser.pxi", line 638, in lxml.etree._raiseParseError
    OSError: Error reading file '
    
    6 replies · 2019-07-08 22:15:55 +08:00
    #1 · congeec · 2019-07-05 12:09:17 +08:00
    The line html = etree.parse(MatchDetailed) is not in the code you posted, so nobody can help you debug it.

    That exit(0) left me completely baffled.
    #2 · qazwsxkevin (OP) · 2019-07-05 14:37:32 +08:00
    @congeec Oops, I mixed up what I pasted with a different test. The exit(0) is a habit left over from C++: if there is no point in continuing, why not just quit with exit(0)? 6666666
    #3 · qazwsxkevin (OP) · 2019-07-05 14:39:21 +08:00
    Here is the correct version of the error output:

    ```
    Traceback (most recent call last):
    File "D:/Mycode/tedital.py", line 55, in <module>
    html = etree.parse(tmpStr)
    File "src\lxml\etree.pyx", line 3435, in lxml.etree.parse
    File "src\lxml\parser.pxi", line 1840, in lxml.etree._parseDocument
    File "src\lxml\parser.pxi", line 1866, in lxml.etree._parseDocumentFromURL
    File "src\lxml\parser.pxi", line 1770, in lxml.etree._parseDocFromFile
    File "src\lxml\parser.pxi", line 1163, in lxml.etree._BaseParser._parseDocFromFile
    File "src\lxml\parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
    File "src\lxml\parser.pxi", line 711, in lxml.etree._handleParseResult
    File "src\lxml\parser.pxi", line 638, in lxml.etree._raiseParseError
    OSError: Error reading file '
    ```
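    #4 · limuyan44 · 2019-07-05 16:09:35 +08:00 via Android
    Looking at your code, why not locate the target in a single step? You should be able to express this logic in XPath. Otherwise, just switch to regular expressions.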
    #5 · lc1450 · 2019-07-05 17:43:18 +08:00
    `The parse() function is used to parse from files and file-like objects.`

    Don't you read the docs?
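
To illustrate the quoted sentence: `etree.parse()` treats a plain string as a file name or URL, which matches the `_parseDocumentFromURL` frame in the traceback. The sketch below, using a made-up fragment as a stand-in for `tmpStr`, shows two ways that should work for in-memory text: wrapping the string in `StringIO` so `parse()` gets a file-like object, or calling `etree.fromstring()` with an HTML parser.

```python
from io import StringIO
from lxml import etree

# Hypothetical fragment standing in for tmpStr from the question.
fragment = '<div class="info"><span class="homeinfo">Home FC</span></div>'

# etree.parse(fragment) would treat the string as a file name or URL and fail.
# Wrapping it in StringIO gives parse() the file-like object it expects.
tree = etree.parse(StringIO(fragment), etree.HTMLParser())
via_parse = tree.xpath('//*[@class="homeinfo"]')

# etree.fromstring() parses a string directly and returns the root element.
root = etree.fromstring(fragment, etree.HTMLParser())
via_fromstring = root.xpath('//*[@class="homeinfo"]')

print([el.text for el in via_parse])
print([el.text for el in via_fromstring])
```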
    #6 · limyel · 2019-07-08 22:15:55 +08:00 via iPhone
    You need to use etree.HTML, I think.
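
A short sketch of the `etree.HTML` suggestion, assuming a made-up fragment with a deliberately unclosed `<div>`. `etree.HTML()` parses a string with the HTML parser, which recovers from incomplete markup and wraps the fragment in `html` and `body` elements, so absolute XPath expressions starting with `//` still find the nested nodes.

```python
from lxml import etree

# Hypothetical incomplete snippet: the outer <div> is never closed.
fragment = '<div class="info"><span class="homeinfo">Home FC</span>'

# etree.HTML() takes the markup as a string, not a file name.
root = etree.HTML(fragment)
print(root.tag)  # html

matches = root.xpath('//*[@class="homeinfo"]')
print([el.text for el in matches])
```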