How do I extract the text of every link with BeautifulSoup4?

from bs4 import BeautifulSoup
soup = BeautifulSoup(open('tmp1.txt'), 'html.parser')
list_a=[]
list_a = soup.find_all('a')

list_a_text = []
list_a_text = soup.a.get_text()  # only the first link's text gets assigned

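I want to collect every link into list_a, and separately collect the text of each link (the string inside the tag) into list_a_text. soup.a.get_text() only returns the text of the first link. What should I do to get the text of all the links?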

8 replies • 2016-04-11 14:13:06 +08:00

yangbin9317

2016-04-10 19:05:47 +08:00

from bs4 import BeautifulSoup
soup = BeautifulSoup(open('tmp1.txt'), 'html.parser')
list_a = soup.find_all('a')

list_a_text = []

# collect the text of each <a> tag in document order
for link in list_a:
    list_a_text.append(link.text)
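
For readers wondering why the original attempt returned a single string: soup.a is shorthand for soup.find('a'), so it only ever refers to the first matching tag. A minimal sketch of the difference, using a made-up inline snippet instead of tmp1.txt (the markup and the name first_text are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the contents of tmp1.txt.
html = '<p><a href="/one">One</a> and <a href="/two">Two</a></p>'
soup = BeautifulSoup(html, 'html.parser')

# soup.a is equivalent to soup.find('a'): it yields only the first <a> tag.
first_text = soup.a.get_text()

# find_all returns every <a> tag, so the text can be collected per tag.
list_a = soup.find_all('a')
list_a_text = [link.get_text() for link in list_a]

print(first_text)
print(list_a_text)
```

The list comprehension does the same work as the for loop with append; which one to use is a matter of taste.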

yangbin9317

2016-04-10 19:06:27 +08:00

Are you still hiring people to write crawlers?

omg21

2016-04-10 19:46:28 +08:00

@yangbin9317 Thanks, so that's how it's done.
Uh... we're not hiring at the moment, but I hope we get a chance to work together in the future.

gitb

2016-04-10 23:31:45 +08:00 via Android

I'd recommend parsing with lxml; the built-in parser is slow.
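
A rough sketch of how the parser choice might be wired in, assuming lxml may or may not be installed. The helper name make_soup is hypothetical; BeautifulSoup raises FeatureNotFound when the requested parser is unavailable, so the sketch falls back to the built-in html.parser in that case:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml when available, else with the built-in parser.

    make_soup is a hypothetical helper for illustration, not part of bs4.
    """
    try:
        return BeautifulSoup(markup, 'lxml')
    except FeatureNotFound:
        # lxml is not installed: fall back to the pure-Python parser.
        return BeautifulSoup(markup, 'html.parser')


soup = make_soup('<a href="/one">One</a><a href="/two">Two</a>')
list_a_text = [link.get_text() for link in soup.find_all('a')]
print(list_a_text)
```

The rest of the code stays the same whichever parser ends up being used.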

omg21

2016-04-10 23:40:29 +08:00

@gitb OK, I've already noticed that problem. It gets extremely slow when there is a lot of data.

Rithard

2016-04-10 23:41:57 +08:00 via Android

from bs4 import BeautifulSoup

# the 'lxml' parser only needs the lxml package installed; no explicit import is required
soup = BeautifulSoup(open('tmp1.txt'), 'lxml')
list_a = soup.find_all('a')

list_a_text = []

for link in list_a:
    list_a_text.append(link.text)

chevalier

2016-04-11 01:15:02 +08:00

from bs4 import BeautifulSoup
soup = BeautifulSoup(open('tmp1.txt'), 'lxml')

list_a = [tag.get('href') for tag in soup.select('a[href]')]

list_a now holds every hyperlink (href value) on the page.

# Looking for part-time crawler work of any kind

xlzd

2016-04-11 14:13:06 +08:00

list_a, list_a_text = (lambda l: ([_['href'] for _ in l], [_.getText() for _ in l]))(getattr(__import__('bs4'), 'BeautifulSoup')(open('tmp1.txt'), 'lxml').find_all('a'))

The code above collects every link in tmp1.txt into list_a and puts the text of each link into list_a_text.
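
One caveat with the one-liner: indexing a tag with ['href'] raises KeyError for an <a> tag that has no href attribute (a named anchor, for example). A more readable sketch of the same idea, using made-up inline markup instead of tmp1.txt, with tag.get('href') so that such tags yield None instead of failing:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the second <a> deliberately has no href.
html = '<a href="/a">First</a><a name="x">No href</a><a href="/b"> Second </a>'
soup = BeautifulSoup(html, 'html.parser')
links = soup.find_all('a')

# .get('href') returns None when the attribute is missing.
list_a = [tag.get('href') for tag in links]

# strip=True trims surrounding whitespace from each link's text.
list_a_text = [tag.get_text(strip=True) for tag in links]

print(list_a)
print(list_a_text)
```

Both lists keep one entry per <a> tag, so list_a[i] and list_a_text[i] still refer to the same link.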