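Please help me modify this crawler: can it be made multithreaded? I'm new to Python and the source code is someone else's, so any help is appreciated! - V2EX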

This topic was created 4199 days ago; the information mentioned may have changed or developed since.

# -*- coding: utf-8 -*-
import urllib2
import sys
# Keep this import for BeautifulSoup 3; for BeautifulSoup 4 use: from bs4 import BeautifulSoup
from BeautifulSoup import BeautifulSoup

reload(sys)
sys.setdefaultencoding("utf-8")

def getcontent(url):
    print url
    req = urllib2.Request(url)
    res = urllib2.urlopen(req)
    magnetlist = []
    html = res.read()
    res.close()
    soup = BeautifulSoup(html)
    # Keep findAll for BeautifulSoup 3; for BeautifulSoup 4 use soup.find_all('a')
    allentry = soup.findAll('a')
    for link in allentry:
        # <a> tags without an href return None, so fall back to '' before slicing
        if "magnet:" == (link.get('href') or '')[0:7]:
            magnetlist.append(link.get('href'))
    magnetlist = [line + '\n' for line in magnetlist]
    f = open("magnet.txt", "a")
    f.writelines(magnetlist)
    f.close()

def main():
    site = "http://bt.shousibaocai.com/search/"
    keyword = "地心引力"  # search term: the film "Gravity"
    keyword = urllib2.quote(keyword)
    # How many of the first pages to fetch in total
    page = 3
    # Note: range(1, page) stops before `page`, so this fetches pages 1 and 2
    for i in range(1, page):
        searchurl = site + keyword + "/" + str(i)
        getcontent(searchurl)

if __name__ == '__main__':
    main()
#end Jarett

27 replies • 2015-01-16 15:31:44 +08:00

1

tuuuz

Jan 15, 2015

1

This is exhausting to read.

2

POP

OP

Jan 15, 2015

（T_T）

3

mhycy

Jan 15, 2015

1

You're a beginner and it's someone else's source code, so why not modify it yourself? It's all part of the learning process~

4

Kilerd

Jan 15, 2015

1

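The Python indentation got completely scrambled as soon as it was pasted. How is anyone supposed to read this? Ridiculous.
OP, you clearly don't know Python at all. No handouts for people who just ask others to do the work.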

5

POP

OP

Jan 15, 2015

@Kilerd Never mind, I'll go and modify it myself... I honestly hadn't noticed the indentation problem in the pasted code.

6

POP

OP

Jan 15, 2015

@mhycy Lesson learned.

7

justjavac

Jan 15, 2015

Blocked.

8

bigtan

Jan 15, 2015

1

http://segmentfault.com/blog/caspar/1190000000414339

Take a look at this.

I've used it once myself and it worked very well.

9

POP

OP

Jan 15, 2015

@bigtan Thanks, I'm trying to rewrite it with a thread pool.
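
A thread-pool version of the script's page loop might look roughly like the sketch below. It assumes Python 3's `concurrent.futures` (the original script is Python 2), and `crawl_pages`, `save_links`, and the `fetch_page` argument are hypothetical names: `fetch_page` stands in for a function that, like `getcontent`, downloads one URL but returns the magnet links instead of writing them. Writing the file once from the main thread avoids several threads appending to magnet.txt at the same time.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_pages(urls, fetch_page, max_workers=4):
    """Fetch every URL in a thread pool and return all links, in URL order.

    fetch_page is a hypothetical callable: it takes one URL and returns
    a list of the magnet links found on that page.
    """
    links = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs fetch_page concurrently but yields the results
        # in the same order as the input URLs.
        for page_links in pool.map(fetch_page, urls):
            links.extend(page_links)
    return links


def save_links(links, path="magnet.txt"):
    """Append the links to a file from a single thread, one per line."""
    with open(path, "a") as f:
        f.writelines(link + "\n" for link in links)
```

With a suitable `fetch_page`, the loop in `main()` would reduce to building the list of search URLs and calling `save_links(crawl_pages(urls, fetch_page))`.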

10

virusdefender

Jan 15, 2015

1

http://www.virusdefender.net/index.php/archives/84/

Just replace the job function at the bottom of my code, and the function that adds tasks in a loop, with your own~

11

Septembers

Jan 15, 2015

http://my.oschina.net/leejun2005/blog/194270

12

POP

OP

Jan 15, 2015

@virusdefender I'll try modifying it myself first... Thanks.

13

lxkaka

Jan 15, 2015

1

Python's multithreading can't max out the CPU, can it? You need multiprocessing for that.

14

POP

OP

Jan 15, 2015

@lxkaka I'll modify the source code first... :(

15

libo26

Jan 15, 2015

Personally, I use requests instead of urllib2.

16

langxuan

Jan 15, 2015

This wouldn't be someone at the "bear factory" (Baidu) going through Good Coder, would it... 囧

17

CodeDrift

Jan 15, 2015

1

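Why is there no indentation... I'm a beginner too, and Python cares a lot about indentation. I usually use one tab per level. I'm not sure whether that's right, since everyone else seems to use 4 spaces.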
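
On the tabs-versus-spaces question: PEP 8 recommends 4 spaces per indentation level, and Python 3 refuses code that mixes tabs and spaces inconsistently. The sketch below (Python 3; `indentation_error` is a hypothetical helper name) compiles a few small snippets to show which ones the interpreter rejects, including one with its indentation stripped like the script pasted above.

```python
def indentation_error(source):
    """Compile source and return the name of the indentation error, or None."""
    try:
        compile(source, "<pasted>", "exec")
    except IndentationError as exc:  # TabError is a subclass of IndentationError
        return type(exc).__name__
    return None


spaces_only = "if True:\n    x = 1\n    y = 2\n"
tabs_only = "if True:\n\tx = 1\n\ty = 2\n"
mixed = "if True:\n\tx = 1\n        y = 2\n"  # one tab, then eight spaces
flattened = "if True:\nx = 1\n"  # indentation lost entirely

for name, source in [("spaces_only", spaces_only), ("tabs_only", tabs_only),
                     ("mixed", mixed), ("flattened", flattened)]:
    print(name, indentation_error(source))
```

Consistent tabs and consistent spaces both compile; only the mixed and flattened snippets are rejected.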

18

O21

Jan 15, 2015

@Anybfans Tab user passing by..

19

POP

OP

Jan 15, 2015

@Anybfans It was my first time pasting code. I didn't notice...

20

surewen

Jan 15, 2015

If only V2EX supported Markdown.

21

Delbert

Jan 16, 2015

@surewen It already does. The entry point on the right uses Markdown; the normal entry point doesn't. To paste code, use a gist.

22

pandada8

Jan 16, 2015

@lxkaka Crawlers are mostly I/O-bound, so multithreading is generally enough.
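
Both replies fit together: in CPython the global interpreter lock keeps threads from running Python bytecode on several cores at once, but a thread that is waiting on the network releases the lock, so the waits can overlap. The sketch below (Python 3) illustrates this with `time.sleep` standing in for a download; `fake_download` and `timed` are hypothetical names, and no real request is made.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(seconds):
    """Stand-in for a network request: the thread just waits."""
    time.sleep(seconds)  # sleeping releases the GIL, as blocking I/O does
    return seconds


def timed(func):
    """Return how long func() takes, in seconds."""
    start = time.perf_counter()
    func()
    return time.perf_counter() - start


delays = [0.2] * 4

# One after another: roughly the sum of all the waits.
sequential = timed(lambda: [fake_download(d) for d in delays])

# In four threads: the waits overlap, so roughly one wait in total.
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = timed(lambda: list(pool.map(fake_download, delays)))

print("sequential: %.2fs, threaded: %.2fs" % (sequential, threaded))
```

For CPU-heavy work such as parsing very large pages the picture changes, which is where multiple processes become relevant.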

23

wezzard

Jan 16, 2015 via iPhone

You've all fallen for it: the OP is here to make Python look bad!

24

icedx

Jan 16, 2015 via Android

If the OP doesn't want to dig into the philosophy of multi-threading/multi-processing, try map().
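
One reading of this suggestion is the `map()` method of a worker pool. As a minimal sketch under that assumption, `multiprocessing.dummy.Pool` offers the multiprocessing pool API backed by threads; `page_url` and `map_pages` are hypothetical helper names, and `worker` would be a function such as `getcontent`.

```python
from multiprocessing.dummy import Pool  # thread-backed pool, multiprocessing API


def page_url(site, keyword, page):
    """Build a search URL the same way main() does in the original script."""
    return site + keyword + "/" + str(page)


def map_pages(worker, urls, threads=4):
    """Apply worker to each URL using a pool of threads.

    The returned list keeps the order of the input URLs.
    """
    pool = Pool(threads)
    try:
        return pool.map(worker, urls)
    finally:
        pool.close()
        pool.join()
```

Swapping the import for `from multiprocessing import Pool` would use processes instead of threads with the same `map()` call, provided the worker is a top-level function that can be pickled.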

25

ivanlw

Jan 16, 2015

So I'm not the only one who finds code without indentation exhausting to read... No wonder I'm a natural fit for Python...

26

xylophone21

Jan 16, 2015

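Change the two arguments 1 and page in range(1,page): so that they are read from the command line (argc).
Then launch several instances from the shell.
If you don't know shell, starting them by hand works too; you won't want that many processes anyway.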
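
In Python the command-line arguments arrive through `sys.argv` rather than `argc`. A minimal sketch of this approach using the standard `argparse` module follows (Python 3); `parse_page_range` and the script name `crawler.py` are hypothetical, and the end page is treated as inclusive here, which differs from the original `range(1, page)`.

```python
import argparse


def parse_page_range(argv=None):
    """Read the first and last page to crawl from the command line.

    Hypothetical usage: python crawler.py 1 3
    """
    parser = argparse.ArgumentParser(description="Crawl a range of result pages")
    parser.add_argument("start", type=int, help="first page to fetch")
    parser.add_argument("end", type=int, help="last page to fetch (inclusive)")
    args = parser.parse_args(argv)  # argv=None means: read sys.argv[1:]
    if args.start < 1 or args.end < args.start:
        parser.error("expected 1 <= start <= end")
    # range() excludes its upper bound, so add 1 to make `end` inclusive
    return range(args.start, args.end + 1)
```

The loop in `main()` would then iterate over `parse_page_range()`, and running `python crawler.py 1 5` and `python crawler.py 6 10` in separate shells would split ten pages between two processes.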

27

POP

OP

Jan 16, 2015

@xylophone21 That works too! Impressive!!
