I don't understand this example from the official BeautifulSoup documentation:


html_doc = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
# Name the parser explicitly; without it bs4 picks one itself and emits a warning.
soup = BeautifulSoup(html_doc, 'html.parser')

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]

But my result is:

[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]

I'm confused. Thanks in advance for any pointers.

16 replies • 2016-08-23 15:15:46 +08:00

thekoc

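August 18, 2016

What does your whole program look like? The point of the example is that you can pass a function to find_all as an argument, and that function defines the condition a tag must satisfy. Is the function you passed in exactly the same as the one in the example?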

emric

August 18, 2016

The parsed result is correct. Note the ellipsis after `Once upon a time there were...`.

redhatping

August 18, 2016

@emric Why is it the p tag? Why aren't the a tags taken into account? The a tags have an id attribute.

kxxoling

August 18, 2016

It took me a while to see what was going on. The issue is how bs prints a tag. Your result, like the one in the docs, is a list of length 3. The example just uses an ellipsis in place of the tags in the middle, whereas in your output the second element of the list prints as ``Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.``. If you don't believe me, print the length of the result.
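The length check suggested above can be sketched roughly as follows. It reuses the document and the filter function from the question; the explicit `"html.parser"` argument and the variable name `result` are assumptions added for this sketch and are not in the original post.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Assumption: the parser is named explicitly; the original post omitted it.
soup = BeautifulSoup(html_doc, "html.parser")


def has_class_but_no_id(tag):
    # True only for tags that carry a class attribute and no id attribute.
    return tag.has_attr("class") and not tag.has_attr("id")


result = soup.find_all(has_class_but_no_id)

print(len(result))                   # 3
print([tag.name for tag in result])  # ['p', 'p', 'p']
```

All three matches are p tags. The a tags only show up in the printed text because printing the second p renders everything nested inside it.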

redhatping

August 18, 2016

@kxxoling OK, thanks, but I still don't understand: why are the p tags printed, with no check applied to the a tags?

kxxoling

August 18, 2016

@redhatping I don't understand your question. Could you phrase it another way?

cheneydog

August 18, 2016

and not tag.has_attr('id')

redhatping

August 18, 2016

@kxxoling Why weren't these filtered out? <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

kxxoling

August 18, 2016

@redhatping The second item in the list is a p tag, and these three a tags are content inside that p tag. They do not appear on their own in the filtered result.

emric

August 18, 2016

@redhatping Because this p contains the a tags. Print `soup.find_all(True)` and you'll see.
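A minimal sketch of that suggestion, under some assumptions: the document is shortened to one link to keep the output small, the parser is named explicitly, and the name `verdicts` is made up for this illustration. It lists every tag returned by `soup.find_all(True)` next to the filter's answer for that tag.

```python
from bs4 import BeautifulSoup

# Assumption: a shortened version of the document from the question
# (one link instead of three), used only to keep the output small.
html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
lived at the bottom of a well.</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")


def has_class_but_no_id(tag):
    # True only for tags that carry a class attribute and no id attribute.
    return tag.has_attr("class") and not tag.has_attr("id")


# find_all(True) returns every tag in the document, nested ones included,
# in document order.
verdicts = [(tag.name, has_class_but_no_id(tag)) for tag in soup.find_all(True)]

for name, matched in verdicts:
    print(name, matched)
# p True
# b False
# p True
# a False
```

The a tag is a tag in its own right and fails the filter because it has an id. It still appears in the printed result because it sits inside a p that passes.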

skydiver

August 19, 2016 via iPad

The docs just abbreviated the output with an ellipsis, that's all...

redhatping

August 19, 2016

@skydiver That's not right.

skydiver

August 19, 2016

@redhatping Several people above have already explained it to you. If you still don't get it, there's nothing more we can do.

redhatping

August 19, 2016

@skydiver That's not the point. Please look at the official documentation.

amustart

August 23, 2016

return tag.has_attr('class') and not tag.has_attr('id')

This returns the tags that have a class attribute but no id attribute. The a tags have an id attribute, so they are skipped.

amustart

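August 23, 2016

@amustart I typed that without thinking, then realized it wasn't right, so I ran the code myself. It finds three tags. The a tags are children of the p tag, and `has_class_but_no_id(tag)` only checks the attributes of the tag it is given; it does not recurse into the p tag's children. (This is the answer to your question about why there are no a tags.)

Below I added a few line breaks between the found elements to make the output clearer: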

"""
The Dormouse's story

Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.

...

"""
The official docs did indeed elide the contents of the second p.
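To make the parent/child point concrete, here is a rough sketch. It assumes a shortened one-link version of the document and an explicit `"html.parser"` argument; the names `story`, `links_inside` and `matches` are hypothetical and exist only for this illustration.

```python
from bs4 import BeautifulSoup

# Assumption: a shortened version of the document from the question.
html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
lived at the bottom of a well.</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")


def has_class_but_no_id(tag):
    # Looks only at the attributes of the tag it is given, not at its children.
    return tag.has_attr("class") and not tag.has_attr("id")


story = soup.find("p", class_="story")
links_inside = story.find_all("a")
matches = soup.find_all(has_class_but_no_id)

print(len(links_inside))                       # 1
print("<a " in str(story))                     # True
print(any(tag.name == "a" for tag in matches)) # False
```

The link is nested inside the p, and converting the p to a string renders that nested markup too, yet no a tag is itself an element of the result list.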