html_doc = """
<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
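from bs4 import BeautifulSoup
# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html_doc, 'html.parser')
def has_class_but_no_id(tag):
    # True for tags that define a class attribute but no id attribute.
    return tag.has_attr('class') and not tag.has_attr('id')
soup.find_all(has_class_but_no_id)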
# [<p class="title"><b>The Dormouse's story</b></p>,
# <p class="story">Once upon a time there were...</p>,
# <p class="story">...</p>]
But my result is:
[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]
I'm confused. Thanks in advance for any pointers.
1
thekoc 2016-08-18 22:36:56 +08:00
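What does your whole program look like? The documentation is saying that you can pass a function to find_all as an argument, and that function defines the condition a tag has to satisfy. Is the function you passed in exactly the same as the one in the example?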
|
2
emric 2016-08-18 22:48:42 +08:00
The parsed result is correct. Note the ellipsis at the end of `<p class="story">Once upon a time there were...</p>`.
|
3
redhatping OP @emric Why is it the p tag? Why wasn't the a tag taken into account? The a tags have an id attribute.
|
4
kxxoling 2016-08-18 23:03:09 +08:00
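Ugh, it took me a while to see what was going on... The issue is how bs prints a tag. Your result and the documentation's are both a list of length 3. The example just replaces the tags in the middle with an ellipsis, whereas in your output the second element of the list is printed in full as ``<p class="story">Once upon a time there were three little sisters; and their names were <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>``. If you don't believe it, print the length of the result and see.
|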
5
redhatping OP @kxxoling OK, thanks. But I still don't understand: why is the p tag printed, and why is there no check on the a tags?
|
6
kxxoling 2016-08-18 23:16:40 +08:00
@redhatping I don't understand your question. Could you ask it a different way?
|
7
cheneydog 2016-08-18 23:17:17 +08:00
and not tag.has_attr('id')
|
8
redhatping OP @kxxoling Why weren't <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a> filtered out?
|
9
kxxoling 2016-08-18 23:29:05 +08:00
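@redhatping The second item in the list is a p tag, and these 3 a tags are part of that p tag's content. They do not appear as separate items in the filtered result.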
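A minimal sketch of the check suggested above, assuming the `html_doc` from the original post and the built-in `html.parser` (the thread does not say which parser was actually used). The names `result` and `first_link` are only illustrative. It looks at the length of the result, the tag names in it, and where the first a tag sits in the tree.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Assumption: html.parser; other parsers may add wrapper tags such as <body>.
soup = BeautifulSoup(html_doc, "html.parser")


def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")


result = soup.find_all(has_class_but_no_id)

print(len(result))                   # 3
print([tag.name for tag in result])  # ['p', 'p', 'p']

# The first <a> is not an item of the list; its parent is the second <p>.
first_link = soup.find("a")
print(first_link in result)            # False
print(first_link.parent is result[1])  # True
```

So the a tags only show up on screen because printing the second p prints everything nested inside it.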
|
10
emric 2016-08-18 23:31:13 +08:00
@redhatping Because this p contains the a tags. Print `soup.find_all(True)` and you'll see.
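A rough illustration of that suggestion, again assuming the `html_doc` from the original post and `html.parser` (an assumption; the parser used in the thread is unknown). `find_all(True)` returns every tag in the document, and applying the filter function to each one by hand shows which tags pass. The names `all_tags` and `passed` are illustrative.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")


def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")


# True matches every tag, so this is the full list of candidates.
all_tags = soup.find_all(True)

# Apply the filter to each tag individually and show the verdict.
for tag in all_tags:
    print(tag.name, tag.get("class"), tag.get("id"), has_class_but_no_id(tag))

passed = [tag.name for tag in all_tags if has_class_but_no_id(tag)]
print(passed)  # ['p', 'p', 'p']
```

Each a tag is listed as a candidate in its own right and fails the test because it has an id; the p tags pass regardless of what they contain.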
|
11
skydiver 2016-08-19 02:26:16 +08:00 via iPad
The documentation just abbreviated it with an ellipsis, that's all...
|
12
redhatping OP @skydiver That's not right.
|
13
skydiver 2016-08-19 13:20:01 +08:00
@redhatping Several people above have already explained it to you. If you still don't get it, there's nothing more we can do.
|
14
redhatping OP @skydiver That's not what this is about. Please look at the official documentation.
|
15
amustart 2016-08-23 14:43:45 +08:00
return tag.has_attr('class') and not tag.has_attr('id')
This returns tags that have a class attribute but no id attribute. The a tags have an id attribute, so they are passed over.
|
16
amustart 2016-08-23 15:15:46 +08:00
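@amustart I posted that without thinking and then realized it wasn't quite right, so I typed the code out and ran it. find_all found three elements. The a tags are children of the p, and `has_class_but_no_id(tag)` does not recursively look at the p tag's children. (That is the answer to your question about why there is no a tag.)
Below I added a few line breaks between the found elements to show them more clearly:
"""
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
The official documentation did indeed leave out the contents of the second p.
|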
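One possible way to reproduce the spaced-out listing above, assuming the `html_doc` from the original post and `html.parser` (an assumption, so attribute order in the printed markup may differ from the output quoted in the thread). It prints each match separated by a blank line and checks that the link markup appears only as part of the second p's own text. The names `second_markup` and `top_level_ids` are illustrative.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")


def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")


result = soup.find_all(has_class_but_no_id)

# A blank line between matches makes the three elements easy to tell apart.
print("\n\n".join(str(tag) for tag in result))

# Converting a tag to a string includes all of its descendants,
# which is why the <a> markup shows up inside the second match.
second_markup = str(result[1])
print('id="link1"' in second_markup)  # True

# None of the matched tags themselves carries an id attribute.
top_level_ids = [tag.get("id") for tag in result]
print(top_level_ids)  # [None, None, None]
```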