How do I hook Scrapy up to Selenium?

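Scraping with Selenium alone is really slow. But I can't manage to reverse-engineer the JS, so I'm thinking of combining Scrapy + Selenium + Redis into a distributed crawler to speed things up. Does anyone know how to implement this, or is there a project I could look at for reference?
Thanks a lot.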

12 replies • 2019-05-12 09:00:35 +08:00

la2la

May 11, 2019

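Wouldn't Selenium break Scrapy's asynchronous model? I doubt it would be much faster. If you really do need Selenium, use it in a downloader middleware: return a response object from there and bypass the default downloader.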

zgoing

May 11, 2019 via iPhone

What usually gets paired with Scrapy seems to be Splash.

aquariumm

May 11, 2019 via Android

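In my experience, you should reverse-engineer the JS wherever you can and go straight for the underlying requests; that is by far the most efficient.
Or use a JS rendering library. I'm not sure about Scrapy, but there are JS rendering libraries that work with requests.

Reverse-engineering JS is actually pretty easy: the data comes either from an XHR or from a URL embedded in the script, and both are easy to find.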

XSugar

May 11, 2019 via iPhone

Swap it out in the middleware.

Jaho

May 11, 2019

In a middleware.
Also:
http://jaho.fun/google.jpg

911speedstar

May 11, 2019

@zgoing I tried it, but I need pagination, and Splash couldn't handle that.

911speedstar

May 11, 2019

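@aquariumm Reverse-engineering the JS would indeed be much more efficient. But I'm not very familiar with JavaScript, and as soon as I see a long blob of JS I have no idea where to start...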

911speedstar

May 11, 2019

@la2la It probably won't be that much faster. My plan is to make it distributed and run 2-4 drivers, which should be somewhat faster than plain Selenium.
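
The "2-4 drivers" idea can be sketched on a single machine with a small pool of reusable drivers and a thread pool. This is in-process threading rather than the Redis-based distribution asked about in the thread, and `render_all` and `make_driver` are hypothetical names. `make_driver` is any zero-argument callable that returns a Selenium-style driver, for example `lambda: webdriver.Chrome()`.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def render_all(urls, make_driver, workers=3):
    """Render each URL with one of `workers` reusable drivers.

    Returns a dict mapping each URL to its rendered page source.
    """
    pool = queue.Queue()
    for _ in range(workers):
        pool.put(make_driver())

    def render(url):
        driver = pool.get()  # borrow a driver; blocks if all are busy
        try:
            driver.get(url)
            return url, driver.page_source
        finally:
            pool.put(driver)  # hand the driver back for the next URL

    try:
        with ThreadPoolExecutor(max_workers=workers) as executor:
            return dict(executor.map(render, urls))
    finally:
        # Shut down every browser once all URLs are done (or on error).
        while not pool.empty():
            pool.get().quit()
```

Each browser is started once and reused, which matters because browser startup usually costs more than a single page load.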

911speedstar

May 11, 2019

@XSugar I'll give it a try...

aquariumm

May 11, 2019 via Android

@911speedstar For pagination, just capture the network traffic; it is most likely implemented with XHR.
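
If the pagination does turn out to be an XHR call, the pages can often be requested directly by changing one query parameter. A small sketch, assuming the captured endpoint takes a numeric page parameter; the parameter name `page` and the helper `page_urls` are assumptions, and real sites may use offsets, cursors, or signed tokens instead.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_urls(xhr_url, last_page, page_param="page", first_page=1):
    """Yield the captured XHR URL once per page, varying only `page_param`."""
    parts = urlsplit(xhr_url)
    # Keep every other query parameter exactly as captured.
    base_query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != page_param
    ]
    for page in range(first_page, last_page + 1):
        query = urlencode(base_query + [(page_param, str(page))])
        yield urlunsplit(parts._replace(query=query))
```

Each generated URL can then be fed to an ordinary Scrapy `Request`, with no browser involved.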

smallgoogle

May 11, 2019

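You can just download the JS, load it in Python, and do the decryption there. That way you only need to find the encryption/decryption functions in the JS.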

exip

May 12, 2019 via Android

Have Selenium save the cookies and pass them to Scrapy; when Selenium is needed again, Scrapy passes the cookies back.
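
A rough sketch of the cookie handoff in both directions. Both helper names are hypothetical. The first trims the dicts returned by Selenium's `driver.get_cookies()` to the keys that Scrapy's `Request(cookies=...)` accepts. The second parses a `Cookie` request header (for example `response.request.headers.get("Cookie")`, assuming Scrapy's cookie middleware set it) into dicts that can be passed one by one to `driver.add_cookie()`.

```python
from http.cookies import SimpleCookie


def selenium_to_scrapy_cookies(selenium_cookies):
    """Keep only the keys Scrapy's Request(cookies=...) understands."""
    keep = ("name", "value", "domain", "path")
    return [
        {key: cookie[key] for key in keep if key in cookie}
        for cookie in selenium_cookies
    ]


def cookie_header_to_selenium(cookie_header):
    """Turn a 'Cookie: a=1; b=2' header value into driver.add_cookie() dicts."""
    if isinstance(cookie_header, bytes):
        cookie_header = cookie_header.decode("utf-8")
    jar = SimpleCookie()
    jar.load(cookie_header)
    return [{"name": morsel.key, "value": morsel.value}
            for morsel in jar.values()]
```

Note that Selenium only accepts a cookie for the domain of the page currently loaded, so the driver has to open a page on the target site before `add_cookie()` is called.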