Can lxml.etree not handle incomplete HTML? If I insist on running XPath against an excerpted fragment of the markup, how should I do it?


    html = etree.parse(htmlStr, etree.HTMLParser())  # htmlStr comes from the complete HTML file; this step works
    result = html.xpath('//*[@div="info"]')

    tmpStr = ''
    for st in result:
        divSection = etree.tostring(st, encoding="unicode", pretty_print=True, method="html")
        if (xxxxxxx) in divSection:
            tmpStr = divSection  # fragment obtained successfully
        else:
            exit(0)

    # At this point tmpStr definitely has content; if the condition holds, I want to run XPath against this fragment
    # html = etree.parse(tmpStr, etree.HTMLParser())
    html = etree.parse(tmpStr)  # this step fails
    result = html.xpath('//*[@class="homeinfo"]')
    for st in result:  # test output to see whether anything matched
        print(st)

Excerpt of the PyCharm error output:


Traceback (most recent call last):
  File "D:/Mycode/tedital.py", line 55, in <module>
    html = etree.parse(MatchDetailed)
  File "src\lxml\etree.pyx", line 3435, in lxml.etree.parse
  File "src\lxml\parser.pxi", line 1840, in lxml.etree._parseDocument
  File "src\lxml\parser.pxi", line 1866, in lxml.etree._parseDocumentFromURL
  File "src\lxml\parser.pxi", line 1770, in lxml.etree._parseDocFromFile
  File "src\lxml\parser.pxi", line 1163, in lxml.etree._BaseParser._parseDocFromFile
  File "src\lxml\parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src\lxml\parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src\lxml\parser.pxi", line 638, in lxml.etree._raiseParseError
OSError: Error reading file '

6 replies • 2019-07-08 22:15:55 +08:00

congeec

2019-07-05 12:09:17 +08:00

The line `html = etree.parse(MatchDetailed)` isn't in the code you posted, so nobody can help you debug it.

That exit(0) left me completely baffled.

qazwsxkevin

2019-07-05 14:37:32 +08:00

@congeec Oops, I mixed up what I pasted with another test. exit(0) is a habit left over from C++: if there's no point in continuing, isn't it fine to just exit(0)? 6666666

qazwsxkevin

2019-07-05 14:39:21 +08:00

Error output from the correct version:

```
Traceback (most recent call last):
  File "D:/Mycode/tedital.py", line 55, in <module>
    html = etree.parse(tmpStr)
  File "src\lxml\etree.pyx", line 3435, in lxml.etree.parse
  File "src\lxml\parser.pxi", line 1840, in lxml.etree._parseDocument
  File "src\lxml\parser.pxi", line 1866, in lxml.etree._parseDocumentFromURL
  File "src\lxml\parser.pxi", line 1770, in lxml.etree._parseDocFromFile
  File "src\lxml\parser.pxi", line 1163, in lxml.etree._BaseParser._parseDocFromFile
  File "src\lxml\parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src\lxml\parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src\lxml\parser.pxi", line 638, in lxml.etree._raiseParseError
OSError: Error reading file '
```

limuyan44

2019-07-05 16:09:35 +08:00 via Android

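Looking at your code, why not locate the element in one step? You should be able to express this logic in XPath. Otherwise, just switch to regex.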

atx

2019-07-05 17:43:18 +08:00

`The parse() function is used to parse from files and file-like objects.`

Don't you read the docs?
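To illustrate the quoted sentence: `etree.parse()` treats a plain string as a filename or URL, which is why a string of markup ends in "Error reading file". A minimal sketch, assuming the fragment is already in memory as a `str`, is to wrap it in `io.StringIO` so that it becomes a file-like object. The fragment below is invented for the example.

```python
from io import StringIO
from lxml import etree

# Hypothetical fragment standing in for tmpStr from the question.
fragment = '<div class="info"><span class="homeinfo">Home FC</span></div>'

# Passing the markup string directly makes lxml look for a file with that name.
# The thread's traceback shows OSError; XMLSyntaxError is caught as well in
# case a given lxml build reports the failure that way.
try:
    etree.parse(fragment, etree.HTMLParser())
    parse_error = None
except (OSError, etree.XMLSyntaxError) as exc:
    parse_error = exc

# Wrapping the string in StringIO gives parse() the file-like object it expects.
tree = etree.parse(StringIO(fragment), etree.HTMLParser())
hits = tree.xpath('//*[@class="homeinfo"]')
hit_texts = [el.text for el in hits]
print(type(parse_error).__name__, hit_texts)
```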

limyel

2019-07-08 22:15:55 +08:00 via iPhone

You should use etree.HTML, I think.
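A small sketch of that suggestion, assuming `tmpStr` holds a markup string rather than a path. `etree.HTML()` parses from a string with lxml's recovering HTML parser, so an incomplete fragment is accepted and wrapped in `html`/`body` elements. The fragment here is invented and deliberately left unclosed.

```python
from lxml import etree

# Hypothetical incomplete fragment: the last <p> and the <div> are never closed.
tmp_str = (
    '<div class="info">'
    '<p class="homeinfo">Home FC</p>'
    '<p class="awayinfo">Away United'
)

# etree.HTML() takes the markup itself, not a filename, and returns the root element.
root = etree.HTML(tmp_str)

hits = root.xpath('//*[@class="homeinfo"]')
texts = [el.text for el in hits]
print(root.tag, texts)
```

`etree.fromstring(tmp_str, etree.HTMLParser())` should behave the same way, if passing the parser explicitly is preferred.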