This topic was created 3688 days ago; the information in it may be outdated.
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('tmp1.txt'), 'html.parser')
list_a=[]
list_a = soup.find_all('a')
list_a_text = []
list_a_text = soup.a.get_text()  # only gets the text of the first link
I want to extract all the links into list_a, and also extract each link's text (the string inside the tag) into list_a_text. soup.a.get_text() only returns the text of the first link. What should I do to get the text of every link?
8 replies • 2016-04-11 14:13:06 +08:00
#1 yangbin9317 Apr 10, 2016
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('tmp1.txt'), 'html.parser')
list_a = soup.find_all('a')
list_a_text = []
for link in list_a:
    list_a_text.append(link.text)
#4 gitb Apr 10, 2016 via Android
I recommend parsing with lxml; the built-in parser is slow.
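As a sketch of that suggestion: BeautifulSoup accepts the parser name as its second argument, so switching to lxml is a one-word change when the lxml package is installed. The sample HTML below is made up for illustration; the fallback to 'html.parser' is my addition, not from the thread.

```python
from bs4 import BeautifulSoup

# Prefer the faster lxml parser when available; fall back to the
# built-in html.parser otherwise. (Sample HTML is hypothetical.)
try:
    import lxml  # noqa: F401
    parser = 'lxml'
except ImportError:
    parser = 'html.parser'

html = "<a href='http://example.com/1'>one</a><a href='http://example.com/2'>two</a>"
soup = BeautifulSoup(html, parser)
list_a = soup.find_all('a')
list_a_text = [a.get_text() for a in list_a]
print(list_a_text)  # ['one', 'two']
```

Either parser produces the same lists here; the difference only shows up as speed on large documents.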
#5 omg21 Apr 10, 2016
@gitb Right, I have already run into that problem; it gets extremely slow with large amounts of data.
#6 Rithard Apr 10, 2016 via Android
from bs4 import BeautifulSoup  # the lxml package must be installed for the 'lxml' parser

soup = BeautifulSoup(open('tmp1.txt'), 'lxml')
list_a = soup.find_all('a')
list_a_text = []
for link in list_a:
    list_a_text.append(link.text)
#7 chevalier Apr 11, 2016
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('tmp1.txt'), 'lxml')
list_a = [tag.get('href') for tag in soup.select('a[href]')]
list_a now holds all the hyperlinks on the page.
# Looking for freelance web-scraping work of any kind
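Combining this reply with the question, both lists can be filled in a single pass over the matched tags, which keeps the hrefs and texts aligned index-by-index. The sample HTML below is made up; 'html.parser' is used so the sketch runs without extra packages.

```python
from bs4 import BeautifulSoup

# Sketch: collect hrefs and link texts together, skipping <a> tags
# that have no href attribute. (Sample HTML is hypothetical.)
html = (
    "<a href='http://a.example/'>first</a>"
    "<a href='http://b.example/'>second</a>"
    "<a>no href</a>"
)
soup = BeautifulSoup(html, 'html.parser')
list_a, list_a_text = [], []
for tag in soup.select('a[href]'):  # CSS selector: only anchors with href
    list_a.append(tag['href'])
    list_a_text.append(tag.get_text())
print(list_a)       # ['http://a.example/', 'http://b.example/']
print(list_a_text)  # ['first', 'second']
```

Using `a[href]` instead of `find_all('a')` avoids a KeyError on anchors that lack an href.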
#8 xlzd Apr 11, 2016
list_a, list_a_text = (lambda l: ([_['href'] for _ in l], [_.getText() for _ in l]))(getattr(__import__('bs4'), 'BeautifulSoup')(open('tmp1.txt'), 'lxml').find_all('a'))
The code above extracts all the links in tmp1.txt into list_a and puts the text of each link into list_a_text.
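That one-liner packs the import, the parse, and both list builds into a single expression. An equivalent, readable version might look like the sketch below; the sample HTML is made up, and 'html.parser' stands in for 'lxml' so it runs without extra packages.

```python
from bs4 import BeautifulSoup

# Readable equivalent of the one-liner: find all <a> tags once,
# then split them into hrefs and texts. (Sample input is hypothetical.)
html = "<a href='/x'>x</a><a href='/y'>y</a>"
tags = BeautifulSoup(html, 'html.parser').find_all('a')
list_a = [t['href'] for t in tags]
list_a_text = [t.get_text() for t in tags]
print(list_a, list_a_text)  # ['/x', '/y'] ['x', 'y']
```

Note that, like the one-liner's `_['href']`, this raises a KeyError if any anchor lacks an href.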