
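Baidu search-result URL crawler: how can I crawl a specified range of result pages, 1 through 10? (Right now it only crawls the URLs from one fixed page.)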

  •   Superbin · 2017-12-06 15:01:39 +08:00 · 2755 views

    #coding=utf-8
    import urllib2
    import urllib
    import sys
    import re
    #from selenium import webdriver
    #from selenium.webdriver.common.keys import Keys
    import time

    #url = "href = \"http://www.baidu.com/link?url=bu4fsa-txw7aHhz0LEu-Ej8ON__uS6btmV_mo7nI2O0_qKtfc-3rJHSyXnYOINHSgDASX4R1V6GcjE2UBGFdjZ9ahmEbG2gsGGW6MVW7pQm\""
    #print url
    # Capture Baidu redirect links; the "?" is escaped so it matches literally.
    pattern = re.compile(r"href = \"(http://www\.baidu\.com/link\?url=.+?)\"")
    #rehh = re.findall(pattern, url)

    #for i in rehh:
    #    print i

    with open('data.txt','a+') as f:
        key_word = []
        with open('key_word.txt','r') as kf:
            for line in kf:
                # Strip the trailing newline and percent-encode the keyword
                # before putting it into the query string.
                wd = urllib.quote(line.strip().decode('gbk').encode('utf-8'))
                # pn=0 is hard-coded, so only one fixed results page is fetched.
                request = urllib2.Request('http://www.baidu.com/s?wd='+wd+'&pn=0')
                response = urllib2.urlopen(request)

                #print response.read()
                #pattern = re.compile(r"href = \"(.+?)\"")
                rehh = re.findall(pattern, response.read())

                for i in rehh:
                    # Follow each redirect link to get the real target URL.
                    request2 = urllib2.Request(i)
                    response2 = urllib2.urlopen(request2)

                    print response2.geturl()
                    f.write(response2.geturl())
                    f.write('\n')

    # No explicit f.close() / kf.close() needed: the with blocks close both files.

    1 · cyrbuzz · 2017-12-06 19:04:10 +08:00
    The formatting is really something.

    2 · shawndev · 2017-12-07 11:11:24 +08:00
    selenium

    3 · shawndev · 2017-12-07 11:12:19 +08:00
    pn=0; pn stands for page number.
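
Following up on the pn hint above, here is a minimal sketch of how the fixed `&pn=0` could be replaced by a loop over pages 1 to 10. It is written in Python 3 (`urllib.parse`), not the Python 2 `urllib2` of the original post. The helper name `page_urls` and the constant `RESULTS_PER_PAGE` are made up for illustration. It also assumes that `pn` behaves as a zero-based result offset that advances by 10 per page (so page 2 would be `pn=10`); the reply above only says that pn means page number, so check this against the live site before relying on it.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.baidu.com/s"  # same endpoint as in the original post
RESULTS_PER_PAGE = 10                # assumption: pn advances by 10 per page


def page_urls(keyword, first_page=1, last_page=10):
    """Return one search URL per result page, first_page..last_page inclusive."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    urls = []
    for page in range(first_page, last_page + 1):
        # Page 1 -> pn=0, page 2 -> pn=10, ... (assumed offset semantics).
        offset = (page - 1) * RESULTS_PER_PAGE
        # urlencode percent-encodes the keyword (UTF-8) and joins the parameters.
        query = urlencode({"wd": keyword.strip(), "pn": offset})
        urls.append(BASE_URL + "?" + query)
    return urls


if __name__ == "__main__":
    for url in page_urls("python"):
        print(url)
```

In the original script, the single request per keyword would become an inner loop over `page_urls(line)`, with the regex extraction and redirect resolution run once per page. A short `time.sleep` between requests is probably wise to avoid being throttled.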