How can PySpider crawl a known, continuously generated list of URLs? - V2EX

This topic was created 3007 days ago; the information in it may have since evolved or changed.

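Situation:

I have about 7 million URLs in a fixed format, e.g. "http://example/item/%d" % (i)

I don't want to build a list that large. The sensible approach would be a generator: produce 1000 URLs, yield them, produce the next 1000, and so on until the job is done.

I can't figure out how to override URL generation in PySpider. From the docs and the source, it looks like BaseHandler's crawl method is what reads a list of URLs.

What I want to achieve:

Automatically generate a fixed number of URLs as tasks each time
Remember the current progress (e.g. the running taskid), so that after a pause or restart, new URLs are generated from that position (e.g. starting from 23333)

Has anyone done this before?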

def urls(self):
    i=1
    while i<7000000:
        yield "http://example/item/%d" % (i)
        i += 1

@every(minutes=24 * 60)
def on_start(self):
    self.crawl(self.urls(), callback=self.index_page)
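
The batching idea described above can be sketched in plain Python, independent of PySpider. This is only an illustration: `url_batches`, `URL_TEMPLATE`, and the `start` argument are hypothetical names, and the resume position 23333 is the example from the question.

```python
from itertools import islice

URL_TEMPLATE = "http://example/item/%d"  # format taken from the question


def url_batches(start, stop, batch_size=1000):
    """Yield lists of at most batch_size URLs for ids in [start, stop)."""
    ids = iter(range(start, stop))
    while True:
        # islice pulls the next batch_size ids without building the full list
        batch = [URL_TEMPLATE % i for i in islice(ids, batch_size)]
        if not batch:
            return
        yield batch


# Resuming from a saved position (hypothetically 23333) only changes `start`.
batches = url_batches(23333, 7000000)
first = next(batches)
print(first[0], first[-1], len(first))
# http://example/item/23333 http://example/item/24332 1000
```

Only one batch is held in memory at a time, so the full 7 million URLs are never materialized.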

6 replies • 2018-12-29 09:55:50 +08:00

1

binux

2016-08-31 06:23:11 +08:00

1

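```
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    def on_start(self):
        # 10000 lightweight "data:" tasks, each carrying its step index in save
        for i in range(10000):
            self.crawl('data:,step%d' % i, callback=self.gen_url, save=i)

    @config(priority=0)
    def gen_url(self, respond):
        # Each step expands into 700 item URLs (10000 * 700 = 7,000,000)
        for i in range(respond.save * 700, (respond.save + 1) * 700):
            self.crawl("http://example/item/%d" % i, callback=self.parse_page)

    @config(priority=1)
    def parse_page(self, respond):
        # Item pages get the higher priority, ahead of the remaining gen_url steps
        pass
```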
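
The step arithmetic in the snippet above can be checked in plain Python. This is a minimal sketch, not part of PySpider: `step_range` and `step_for_id` are hypothetical helpers, and the constants mirror the 700 and 10000 used in the reply.

```python
STEP_SIZE = 700      # URLs generated per step, as in the reply
STEP_COUNT = 10000   # number of "data:,stepN" tasks, as in the reply


def step_range(step):
    """Return the half-open id range [first, last) that one step expands to."""
    return step * STEP_SIZE, (step + 1) * STEP_SIZE


def step_for_id(item_id):
    """Return the step whose range contains item_id (useful when resuming)."""
    return item_id // STEP_SIZE


print(step_range(0))        # (0, 700)
print(step_range(9999))     # (6999300, 7000000)
print(step_for_id(23333))   # 33
```

Note that these steps cover ids 0 through 6,999,999, while the loop in the question starts at 1, so the ranges may need shifting by one depending on the site.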

2

PythonAnswer

OP

2016-08-31 08:08:58 +08:00

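@binux Very clever, I didn't know about the save parameter. Thanks a lot!!

One more question: what if the URL set keeps growing? It's 7 million now, but it grows every day. What's a good way to keep up with it?

Thanks in advance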

3

binux

2016-08-31 08:29:38 +08:00

1

@PythonAnswer Just keep submitting the new ones with self.crawl.
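
One way to read this for a growing id range is to work out which steps have not been submitted yet and submit only those. A rough sketch in plain Python, under assumptions: `new_steps`, `steps_done`, and `max_id` are hypothetical, and how the current largest id is discovered depends on the site.

```python
def new_steps(steps_done, max_id, step_size=700):
    """Return step indexes not yet submitted that are needed to cover ids 0..max_id.

    steps_done and max_id are hypothetical inputs: the number of steps already
    submitted, and the largest item id currently known to exist.
    """
    steps_needed = max_id // step_size + 1
    return list(range(steps_done, steps_needed))


# Day 1: ids 0..6,999,999 were covered by 10000 steps of 700.
# Day 2: suppose the largest id has grown to 7,001,500.
print(new_steps(10000, 7001500))  # [10000, 10001, 10002]
```

In a handler, each returned index would presumably be passed to self.crawl with save=i, the same way on_start does in the earlier reply.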

4

PythonAnswer

OP

2016-08-31 09:18:02 +08:00

Thanks, I'll keep digging. I'll read the source carefully and explore what else it can do.

Really nice software~

5

figofuture

2016-08-31 09:53:43 +08:00

mark

6

ddzzhen

2018-12-29 09:55:50 +08:00

mark. A great way to use it.
