A v2ph crawler in 40 lines of code

This topic was created 1377 days ago; the information mentioned may have changed since then.

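Last week I saw someone on the forum asking about lazy loading when crawling, and a friend happens to be working on this crawler as a practice project, so I'm sharing it here. It is for reference only; discussion of crawling techniques and use cases is welcome.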

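It uses web automation and drives the user's own browser, so it is less likely to be detected by anti-crawling measures. If the data volume is small, there is also no need for a distributed setup, so this is a good option.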
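For readers wondering how browser automation usually copes with lazy loading, here is a minimal sketch, not the code from the linked repository. It assumes a Selenium-style `driver` object that exposes `execute_script()`; the function name `scroll_until_stable` and its parameters are made up for illustration.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=30):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is assumed to be a Selenium-style object with execute_script().
    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy-loaded images time to be requested
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new was loaded by the last scroll
        last_height = new_height
    return max_rounds
```

Once the height stops changing, the image elements on the page should carry their real URLs and can be collected in one pass. The `max_rounds` cap is there so an endlessly growing page cannot hang the crawler.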

https://github.com/ReaganScott/v2ph

crawler

v2ph

distributed

automation

10 replies • 2022-09-14 21:25:40 +08:00

i8k

Sep 11, 2022

The images aren't sorted into per-article directories.

automation2022

Sep 11, 2022

@i8k Right, that part is fairly simple: get the album's name and create a subdirectory for it under picture.
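A minimal sketch of that idea, assuming the album name has already been scraped as a string. The helper name `album_dir` is hypothetical, and replacing unsafe characters with `_` is just one possible choice.

```python
import re
from pathlib import Path


def album_dir(album_name, root="picture"):
    """Return (and create) root/<album_name>, with unsafe characters replaced."""
    # Replace characters that are not allowed in file names on common systems.
    safe = re.sub(r'[\\/:*?"<>|]', "_", album_name).strip().rstrip(".")
    target = Path(root) / (safe or "untitled")
    target.mkdir(parents=True, exist_ok=True)
    return target
```

Each image can then be saved as `album_dir(name) / filename` instead of directly under `picture`.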

i8k

Sep 11, 2022

@automation2022 OK, I'll add that myself.

websql

Sep 11, 2022

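1. You need an IP pool, otherwise the site will easily block the crawler's IP.
2. When an image download fails, delete the local file and download the image again.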
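Point 2 could look roughly like the sketch below. The `fetch` argument is a hypothetical callable that returns the image bytes for a URL; how the bytes are actually fetched (browser, urllib, something else) is left open.

```python
from pathlib import Path


def download_with_retry(fetch, url, dest, attempts=3):
    """Save fetch(url) to dest; on failure delete the partial file and retry.

    `fetch` is any callable returning the image bytes (hypothetical here).
    Returns True on success, False once all attempts have failed.
    """
    dest = Path(dest)
    for _ in range(attempts):
        try:
            data = fetch(url)
            if not data:
                raise ValueError("empty response")
            dest.write_bytes(data)
            return True
        except (OSError, ValueError):
            # Remove whatever was written so a broken image is never kept.
            dest.unlink(missing_ok=True)
    return False
```

With the standard library, `fetch` could be something like `lambda u: urllib.request.urlopen(u, timeout=10).read()`, since network errors raised by urllib are subclasses of `OSError`.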

Puteulanus

Sep 11, 2022

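For a practice project I'd recommend crawling by hand rather than driving a browser. Driving a browser looks simple, but you also get less practice out of it.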

automation2022

Sep 11, 2022

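@websql Yes, you're right.
If the data volume is small, you could set up proxies and switch between proxy servers automatically, though I haven't tested that.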
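As an untested illustration of automatic switching for plain HTTP downloads (not for the browser itself), the sketch below cycles through a list of proxies with the standard library. The function name `make_opener_cycle` is made up, and the proxy addresses in the usage note are placeholders.

```python
import itertools
import urllib.request


def make_opener_cycle(proxies):
    """Yield (proxy, opener) pairs forever, moving to the next proxy each time."""
    for proxy in itertools.cycle(proxies):
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        yield proxy, urllib.request.build_opener(handler)
```

Usage would be along the lines of `openers = make_opener_cycle(["http://127.0.0.1:8001", "http://127.0.0.1:8002"])`, then `proxy, opener = next(openers)` before each request and `opener.open(url, timeout=10).read()` to fetch through that proxy.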

Dart

Sep 13, 2022

Impressive! I learned quite a lot.

cy1027

Sep 13, 2022

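Switching proxies in selenium is too much trouble; all I know how to do is delete the instance, create a new one, and set the proxy on that. If you really want to learn, I'd suggest digging into reverse engineering and the like; browser simulation alone still falls short.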
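The "delete the instance and create a new one" approach can be sketched as follows. Both function names are hypothetical, and `make_driver` is an assumed factory that takes a list of Chrome arguments and returns a new browser instance; `--proxy-server` is the Chrome command-line switch for setting a proxy.

```python
def chrome_proxy_argument(proxy):
    """Build the Chrome command-line switch that routes traffic through proxy."""
    return f"--proxy-server={proxy}"


def restart_with_proxy(old_driver, proxy, make_driver):
    """Quit the old browser and start a new one configured with `proxy`.

    `make_driver` is a hypothetical factory taking a list of Chrome arguments.
    """
    if old_driver is not None:
        old_driver.quit()  # the old instance keeps its old proxy, so drop it
    return make_driver([chrome_proxy_argument(proxy)])
```

With Selenium, the factory could create a `webdriver.ChromeOptions()`, call `add_argument()` for each entry, and return `webdriver.Chrome(options=options)`. Cookies and page state are lost on every restart, which is part of why this feels clumsy.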

automation2022

Sep 14, 2022

@Dart Happy to exchange ideas anytime.

Dart

Sep 14, 2022

That said, v2ph.com isn't great. It feels pretty poor, worse than other image sites, and I don't know what there is worth crawling...