
Scrapy: problem scraping list pages and detail pages

  •   willme105 · Jan 14, 2015 · 4769 views

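    I want to scrape the thumbnail and title from the list page, and the category and tags from the detail page.
    However, the scrape never succeeds. How should I write this spider?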

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import Selector

    from my.items import MyItem
    import re
    from scrapy.http import Request
    from scrapy.selector import HtmlXPathSelector

    class MySpider(CrawlSpider):
        name = 'xxx'
        allowed_domains = ['xxxx.com']
        start_urls = ['http://xxxxxx.com']

        def parse(self, response):
            item = MyItem()
            sel = Selector(response)
            videos = sel.xpath('//ul[@class="listThumbs"]/li')
            for v in videos:
                item['img'] = v.xpath('a[@class="thumb"]/img/@src').extract()[0]
                item['title'] = v.xpath('a[@class="title"]/text()').extract()[0]
                item['url'] = v.xpath('a[@class="thumb"]/@href').extract()[0]
                yield item

        def parse_page(self, response):
            item = MyItem()
            hxs = Selector(response)
            cate = hxs.xpath('//div[@class="multiTag"]/ul/li')
            for c in cate:
                item['category'] = c.xpath('a/text()').extract()[0]
                yield item
    
    2 replies    2015-01-19 04:03:23 +08:00
        1
    huangguoji  
       Jan 14, 2015
    def parse(self, response):
        item = MyItem()
        sel = Selector(response)
        videos = sel.xpath('//ul[@class="listThumbs"]/li')
        for v in videos:
            item['img'] = v.xpath('a[@class="thumb"]/img/@src').extract()[0]
            item['title'] = v.xpath('a[@class="title"]/text()').extract()[0]
            item['url'] = v.xpath('a[@class="thumb"]/@href').extract()[0]
            yield Request(item['url'], callback=self.parse_page, meta={"item": item})

    def parse_page(self, response):
        item = response.meta["item"]
        hxs = Selector(response)
        cate = hxs.xpath('//div[@class="multiTag"]/ul/li')
        for c in cate:
            item['category'] = c.xpath('a/text()').extract()[0]
            yield item
        2
    willme105  
    OP
       Jan 19, 2015
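    @huangguoji This still has a problem: the list-page fields come out identical for every item, and they are all from the last entry. The information from the detail page is correct, though. What should I do?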
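
A likely cause of the symptom described above is that `parse` creates `item` once, before the loop, so every `Request` carries the same object in `meta`. By the time the `parse_page` callbacks run, the loop has usually finished and that one object holds the values of the last list entry. The sketch below reproduces the effect without Scrapy, using plain dicts in place of `MyItem` and `Request`; the function names `parse_shared` and `parse_fresh` are hypothetical and only illustrate the idea.

```python
def parse_shared(rows):
    """Mimics the reply's parse(): one item object reused for every row."""
    item = {}  # created once, outside the loop
    for row in rows:
        item["title"] = row  # overwrites the same dict each time
        yield {"meta": {"item": item}}  # stand-in for Request(..., meta={"item": item})


def parse_fresh(rows):
    """Same loop, but a new item object is created for every row."""
    for row in rows:
        item = {"title": row}  # created inside the loop
        yield {"meta": {"item": item}}


rows = ["a", "b", "c"]

# list() drains each generator first, mimicking callbacks that only run
# after the list-page loop has already finished.
shared_titles = [r["meta"]["item"]["title"] for r in list(parse_shared(rows))]
fresh_titles = [r["meta"]["item"]["title"] for r in list(parse_fresh(rows))]

print(shared_titles)  # ['c', 'c', 'c']
print(fresh_titles)   # ['a', 'b', 'c']
```

Applied to the spider, this would mean moving `item = MyItem()` inside the `for v in videos:` loop so that each request gets its own item. The same reasoning suggests that `parse_page` overwrites `item['category']` on each pass of its loop, so collecting the categories into a list and yielding the item once after the loop may be closer to what is wanted.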