Scrapy: why are some Crawled (200) pages never Scraped?

  •   dylanhu · Apr 2, 2019 · 8340 views

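    I've recently been running a crawler built on the Scrapy framework. The data was fine for the first several days, but over the last few days the amount scraped has dropped noticeably. It shouldn't be a problem with the code. Looking at the logs, some URLs were not scraped. What is going on?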

    2019-04-01 00:00:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yellowpages.com/search?search_terms=Tires&geo_location_terms=Dundalk%2C+MD&page=4> (referer: https://www.yellowpages.com/search?search_terms=Tires&geo_location_terms=Dundalk%2C+MD)
    2019-04-01 00:00:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yellowpages.com/search?search_terms=Tires&geo_location_terms=Dundalk%2C+MD&page=3> (referer: https://www.yellowpages.com/search?search_terms=Tires&geo_location_terms=Dundalk%2C+MD)
    2019-04-01 00:00:06 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yellowpages.com/search?search_terms=Tires&geo_location_terms=Dundalk%2C+MD&page=3>
    

    As shown above, page=3 was scraped, while page=4 was only crawled and never scraped. Why is that? There are many cases like this.
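    To see how widespread the pattern is, the log itself can be scanned for URLs that have a "Crawled (200)" line but no matching "Scraped from" line. The following is a minimal sketch, assuming the log format matches the excerpt above; the function name `crawled_but_not_scraped` is hypothetical and not part of Scrapy.

```python
import re

# Patterns modeled on the log lines quoted in this thread (an assumption:
# a customized LOG_FORMAT or log formatter may produce different text).
CRAWLED = re.compile(r"Crawled \((\d+)\) <GET (\S+)>")
SCRAPED = re.compile(r"Scraped from <\d+ (\S+)>")


def crawled_but_not_scraped(log_lines):
    """Return URLs crawled with status 200 that never produced an item."""
    crawled = []      # keeps the order in which pages were crawled
    scraped = set()
    for line in log_lines:
        match = CRAWLED.search(line)
        if match:
            if match.group(1) == "200":
                crawled.append(match.group(2))
            continue
        match = SCRAPED.search(line)
        if match:
            scraped.add(match.group(1))
    return [url for url in crawled if url not in scraped]
```

    Passing the lines of the log file to this function (for example `crawled_but_not_scraped(open("scrapy.log", encoding="utf-8"))`, where the file name is a placeholder) would list the pages worth inspecting by hand.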

    2 replies    2019-04-07 22:50:53 +08:00
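    dylanhu (OP) · #1 · Apr 2, 2019
    The key point is that this hardly ever happened a few days ago; only in the last two days has the amount of data dropped sharply.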
    huyu · #2 · Apr 7, 2019 via Android
    @dylanhu Try printing response.text and see what it actually contains!
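    The suggestion above can be narrowed so that only pages which yielded no items are dumped. A minimal sketch follows; the helper `describe_empty_response` and the selector `div.result` are hypothetical, while `response.url`, `response.text`, `response.css` and `self.logger` are the usual Scrapy attributes.

```python
def describe_empty_response(url, body_text, item_count, max_chars=500):
    """Return a diagnostic message if a page produced no items, else None."""
    if item_count > 0:
        return None
    # Only the start of the body is kept, to avoid flooding the log.
    snippet = body_text[:max_chars]
    return f"No items from {url} (body length {len(body_text)}): {snippet!r}"


# Intended use inside a spider's parse callback (selector is a placeholder):
#
#     def parse(self, response):
#         rows = response.css("div.result")
#         message = describe_empty_response(response.url, response.text, len(rows))
#         if message:
#             self.logger.warning(message)
#         for row in rows:
#             ...  # build and yield items as before
```

    If the logged body turns out to be something other than the expected result list, the extraction selectors simply matched nothing on that page, which would produce a Crawled (200) line with no Scraped line.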