How do I fix scrapy failing to crawl multiple pages when scraping Douban Books?

This topic was created 2886 days ago; the information in it may have since evolved or changed.

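I'm new to Python and am practicing by crawling Douban Books with scrapy. So far, crawling the books on a single page works fine.

While practicing multi-page crawling with scrapy's Rule and LinkExtractor, I can never get the results from the next page. I've struggled with it for a whole day without finding a solution.
The main spider code is below. Could you take a look and tell me where the problem is? Thanks!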

https://gist.github.com/loricheung/b51503a835aa8b8af238b99a4104fb21

Scrapy

Douban

pages

crawl

9 replies • 2017-03-09 00:34:26 +08:00

bazingaterry

2017-03-08 09:36:44 +08:00

return book --> yield book

freestyle

2017-03-08 09:44:10 +08:00

The regex in your LinkExtractor is wrong; try r'/tag/小说\?start=\d+'

freestyle

2017-03-08 09:44:32 +08:00

@bazingaterry return works too

freestyle

2017-03-08 09:47:43 +08:00

I don't know whether you're on Python 2; if so, you may need to add the u prefix: ur'/tag/小说\?start=\d+'
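
For anyone who wants to check a pattern like this outside scrapy, here is a minimal sketch using Python 3's standard `re` module. In Python 3 a plain `r'...'` string is already Unicode, and the `ur` prefix is a syntax error there. The sample hrefs are made up for illustration and are not copied from Douban. The third one assumes the tag might show up percent-encoded in a URL, in which case the literal `小说` would not match.

```python
import re
from urllib.parse import quote

# Pattern suggested in the reply above (Python 3: str patterns are Unicode).
PAGE_LINK = re.compile(r'/tag/小说\?start=\d+')

# Hypothetical hrefs for illustration only; not copied from the real site.
sample_hrefs = [
    '/tag/小说?start=20&type=T',                   # literal tag with a page offset
    '/tag/小说',                                   # no ?start= query at all
    '/tag/' + quote('小说') + '?start=40&type=T',  # percent-encoded tag (assumption)
]

# True where the pattern is found anywhere in the href.
matches = [bool(PAGE_LINK.search(href)) for href in sample_hrefs]
for href, matched in zip(sample_hrefs, matches):
    print(matched, href)
```

If the encoded form turns out to be what the extractor sees, matching on the `\?start=\d+` part alone might be a more forgiving choice.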

congminghaoxue92

2017-03-08 10:08:29 +08:00

You haven't added pagination, have you? Add a check for the next page.

anguslg

2017-03-08 13:23:41 +08:00

@freestyle I'm using Python 3. I tested the regex, and it correctly matches the corresponding link text.

anguslg

2017-03-08 13:25:15 +08:00

@congminghaoxue92 The scrapy framework already does that for you

nicevar

2017-03-09 00:20:35 +08:00

The problem is in rules. Just write a new, separate function for the callback; as long as you don't override parse, it works
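
Scrapy's documentation warns against using parse as a rule callback, because CrawlSpider uses parse itself to implement its logic. The sketch below is only a toy model of that effect in plain Python, not Scrapy's actual code. `ToyCrawlSpider`, `SITE`, `rule_callback`, `crawl`, and the URLs are all hypothetical names invented for the illustration.

```python
# Toy model only: a simplified stand-in for a rule-driven crawler,
# NOT Scrapy's actual implementation. All names here are hypothetical.

# Fake site: each page maps to the links found on it.
SITE = {
    "/tag?start=0": ["/tag?start=20"],
    "/tag?start=20": ["/tag?start=40"],
    "/tag?start=40": [],
}


class ToyCrawlSpider:
    start_url = "/tag?start=0"
    rule_callback = None  # name of the method that handles crawled pages

    def parse(self, url):
        # The base class reserves parse() for link following: it runs the
        # rule callback on the page, then requests every linked page.
        if self.rule_callback is not None:
            yield ("item", getattr(self, self.rule_callback)(url))
        for link in SITE[url]:
            yield ("follow", link)


class OverridesParse(ToyCrawlSpider):
    def parse(self, url):
        # Replaces the link-following logic, so no further page is requested.
        yield ("item", "books from " + url)


class SeparateCallback(ToyCrawlSpider):
    rule_callback = "parse_books"

    def parse_books(self, url):
        return "books from " + url


def crawl(spider):
    """Run the toy crawl and return the URLs that produced items."""
    queue, seen, scraped = [spider.start_url], set(), []
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        for kind, value in spider.parse(url):
            if kind == "follow":
                queue.append(value)
            else:
                scraped.append(url)
    return scraped


broken = crawl(OverridesParse())    # only the start page yields items
fixed = crawl(SeparateCallback())   # all three pages yield items
print(broken)
print(fixed)
```

In Scrapy itself the start page is handled by parse_start_url rather than by the rule callback, so the toy model simplifies that detail. The point it tries to show is the same: once parse is replaced, nothing is left to apply the rules.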

anguslg

2017-03-09 00:34:26 +08:00

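@nicevar That was indeed the problem. It's strange, though: when I first started using Rule to crawl multiple pages, I had already written a separate callback function, yet at that point it still could only crawl a single page...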