
Sharing a crawler "from" baidu. I'm not a programmer and it isn't well written, so please go easy on me

  •   Sequencer · Apr 1, 2016 · 3854 views
    import urllib.request
    import webbrowser
    from time import sleep

    # Mobile Safari User-Agent sent with every request.
    HEADERS = {
        'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 8_2 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12D508 Safari/600.1.4'}
    # A page that contains '来自' ("from") is treated as a hit.
    MARKER = '来自'


    class Webpage:
        def __init__(self, url):
            # Fetch the page; open it in the browser if the marker is found.
            self.url = url
            self.webpage = ''
            request = urllib.request.Request(url, headers=HEADERS)
            sleep(2)  # throttle: wait two seconds before each request
            try:
                with urllib.request.urlopen(request, timeout=10) as response:
                    self.webpage = response.read().decode('utf-8', errors='replace')
            except OSError as err:
                # URLError, HTTPError and socket timeouts all derive from OSError.
                print('skipped', url, err)
                return
            if MARKER in self.webpage:
                webbrowser.open(url)


    def Page(start, stop):
        # Yield one share-home URL per user id (uk) in [start, stop).
        for i in range(start, stop):
            yield 'http://yun.baidu.com/share/home?uk=' + str(i)


    if __name__ == '__main__':
        for url in Page(1, 5000):
            Webpage(url)
    
    7 replies    2016-04-01 13:50:05 +08:00
    knightdf · #1 · Apr 1, 2016
    webbrowser.open(url)
    leavic · #2 · Apr 1, 2016
    Ten years ago, we called something that opens 5000 web pages at once a malicious script. It really is well suited to April Fools' Day.
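For readers worried about the browser tabs, here is a minimal sketch of one alternative: collect the matching URLs in a list rather than calling `webbrowser.open` on each. It is not the original poster's code. The function name `collect_hits` and the injectable `fetch` callable are hypothetical, introduced only so the logic can be tried without touching the network.

```python
from typing import Callable, Iterable, List


def collect_hits(urls: Iterable[str],
                 fetch: Callable[[str], str],
                 marker: str = '来自') -> List[str]:
    """Return the URLs whose page text contains the marker."""
    hits = []
    for url in urls:
        try:
            text = fetch(url)
        except OSError:
            # Network failures (URLError, timeouts) are skipped, not fatal.
            continue
        if marker in text:
            hits.append(url)
    return hits
```

In real use, `fetch` would wrap `urllib.request.urlopen` and return the decoded body; the resulting list can then be printed or saved and reviewed by hand.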
    Tink (PRO) · #3 · Apr 1, 2016
    I finished downloading it last night and am importing it into a database now.
    Sequencer (OP) · #4 · Apr 1, 2016 via iPhone
    @knightdf @leavic There is a conditional check, and there is also a sleep. Out of the 5000 you might find one.
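The two safeguards mentioned here, the check and the sleep, can be sketched as separate small functions. This is an illustration under assumptions, not the posted script: `is_hit` and `throttled` are hypothetical names, and the `sleep` parameter exists only so the delay can be replaced during testing.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar('T')


def is_hit(page_text: str, marker: str = '来自') -> bool:
    """The check: a page counts as a hit when it contains the marker."""
    return marker in page_text


def throttled(items: Iterable[T], delay: float,
              sleep: Callable[[float], None] = time.sleep) -> Iterator[T]:
    """Yield items, pausing `delay` seconds between consecutive ones."""
    first = True
    for item in items:
        if not first:
            sleep(delay)  # no pause before the very first item
        first = False
        yield item
```

A loop such as `for url in throttled(urls, 2): ...` then spaces requests two seconds apart, and only pages for which `is_hit` returns `True` need any further action.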
    Sequencer (OP) · #5 · Apr 1, 2016
    @Tink Did you crawl it with a distributed setup?
    Tink (PRO) · #6 · Apr 1, 2016
    @Sequencer I downloaded it manually from mega....
    aksoft · #7 · Apr 1, 2016
    Today is April Fools' Day..