Question: filtering image URLs out of a site with a Python crawler - V2EX

This topic was created 3787 days ago; the information mentioned may have changed since then.

The snippet below was captured from an image site. With a Python crawler, how can I filter out the image URLs for the different sizes listed under images?

"images": [{
            "size": "watermark",
            "url": "https:\/\/drscdn.500px.org\/photo\/original\/store\/67585293\/m%3D900_k%3D2_b%3D2_dpi%3D300_attachment%3D1_tags%3D1\/e47f93b520c772b2612bef1ff2fa77ae"
        },
        {
            "size": "280",
            "url": "https:\/\/drscdn.500px.org\/photo\/67585293\/w%3D280_s%3D1\/a967861eaf97496c9243c9aaccb63502"
        },
        {
            "size": "560",
            "url": "https:\/\/drscdn.500px.org\/photo\/67585293\/w%3D560_s%3D1\/90e4feec7585c24a5f1d45b1ee21262b"
        },
        {
            "size": "600",
            "url": "https:\/\/drscdn.500px.org\/photo\/67585293\/w%3D600_s%3D1\/aa4e4a084ea37fa1ce8a82c3865ce43a"
        },
        {
            "size": "115",
            "url": "https:\/\/drscdn.500px.org\/photo\/67585293\/w%3D115_h%3D115_s%3D1\/8fae531a821e375dda4f4938b0b5829f"
        },
        {
            "size": "160",
            "url": "https:\/\/drscdn.500px.org\/photo\/67585293\/w%3D160_h%3D160_s%3D1\/0e4a9829237dd66700a27ebed6d8f761"
        },
        {
            "size": "2048",
            "url": "https:\/\/drscdn.500px.org\/photo\/67585293\/w%3D2048\/d8142c223fa99fc99b9bd4a1b44462eb"
        }],


12 replies • 2016-02-19 11:55:29 +08:00

1

yahoo21cn

Feb 19, 2016

Regex.

2

mgna17

Feb 19, 2016

Something like this, perhaps?
re.findall(r'(?<="url":\s")[^"]+', your_text)

3

jarlyyn

Feb 19, 2016

First, this is clearly part of a JSON document.

Second, 500px has a public API of its own.

4

Ncer

Feb 19, 2016

For data in this format, just parse it as JSON.

5

annielong

Feb 19, 2016

It's clearly JSON. Just format it and print it out; no need for regex, I think.

6

popok

Feb 19, 2016

@annielong Right, just work with the JSON directly. Whether you format it or not only affects how readable it is for you.

7

magicdawn

Feb 19, 2016

Get the JSON string, find the index of the opening curly brace in `"images": [`, work out the index of the matching closing curly brace, slice, then json.loads

8

magicdawn

Feb 19, 2016

Square brackets, I mean...
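
The find-and-slice idea above could be sketched roughly as follows. Instead of counting brackets by hand, this version lets `json.JSONDecoder.raw_decode` work out where the array ends, which is a swapped-in technique and not literally what the reply describes. The function name `extract_images` and the sample `page_text` are made up for illustration.

```python
import json

def extract_images(text, key='"images":'):
    """Return the list that follows `key` in `text`, or None if absent."""
    start = text.find(key)
    if start == -1:
        return None
    # Move to the opening square bracket of the array.
    bracket = text.find("[", start + len(key))
    if bracket == -1:
        return None
    # raw_decode parses one JSON value starting at `bracket` and ignores
    # whatever follows it, so no manual bracket counting is needed.
    images, _end = json.JSONDecoder().raw_decode(text, bracket)
    return images

# Hypothetical page fragment: JSON embedded in surrounding text.
page_text = r'''var data = {"id": 1, "images": [{"size": "280",
  "url": "https:\/\/example.com\/w%3D280\/abc"}], "name": "x"};'''

images = extract_images(page_text)
print(images[0]["url"])
```

Because the decoder handles the `\/` escape itself, the URLs come back with plain slashes.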

9

Koge

OP

Feb 19, 2016

@magicdawn How do I parse the HTML and get the JSON string out of it in the first place?

10

imn1

Feb 19, 2016

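Regex is efficient at run time; JSON is efficient at development time.
If the JSON is not a standalone file or XHR response but is embedded in the page or in some JS file, I'd still suggest regex as the quicker route.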
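
For the embedded case, a regex can pull `size`/`url` pairs straight out of the raw page text without isolating the JSON first. This is only a sketch: it assumes each object lists `"size"` immediately before `"url"`, as in the snippet posted at the top, and `PAIR_RE`, `urls_by_size` and `raw` are illustrative names.

```python
import re

# Assumes "size" comes directly before "url" inside each object.
PAIR_RE = re.compile(r'"size":\s*"([^"]+)",\s*"url":\s*"([^"]+)"')

def urls_by_size(text):
    """Map each size label to its URL, undoing the escaped slashes."""
    return {size: url.replace("\\/", "/") for size, url in PAIR_RE.findall(text)}

# Hypothetical page source with the JSON embedded in a script tag.
raw = r'''<script>window.photo = {"images": [{
    "size": "280",
    "url": "https:\/\/example.com\/w%3D280\/abc"
}, {
    "size": "2048",
    "url": "https:\/\/example.com\/w%3D2048\/def"
}]};</script>'''

print(urls_by_size(raw))
```

If the site ever reorders the keys, this pattern silently stops matching, which is the usual trade-off against a real JSON parse.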

11

Frapples

Feb 19, 2016

import json
img_info = json.loads(json_str)
# json_str is the string you scraped; judging by what you posted, it is JSON. This function parses it into Python data structures.
# Then you can print(img_info) to take a look.
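
Once the string is parsed, picking a URL by size is plain dict and list work. The sketch below assumes the scraped text is a complete JSON document; the fragment in the question starts at `"images":`, so here it is wrapped in braces, and the URLs are shortened placeholders. The names `url_for`, `numeric` and `largest` are illustrative.

```python
import json

# Shortened stand-in for the scraped string, wrapped in braces so that
# it is a complete JSON document. The URLs are placeholders.
json_str = r'''{"images": [
    {"size": "watermark", "url": "https:\/\/example.com\/original\/aaa"},
    {"size": "280", "url": "https:\/\/example.com\/w%3D280\/bbb"},
    {"size": "2048", "url": "https:\/\/example.com\/w%3D2048\/ccc"}
]}'''

img_info = json.loads(json_str)

# Index the entries by their "size" label.
url_for = {item["size"]: item["url"] for item in img_info["images"]}

# Keep only the numeric sizes and pick the widest one.
numeric = {int(size): url for size, url in url_for.items() if size.isdigit()}
largest = numeric[max(numeric)]

print(url_for["280"])
print(largest)
```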

12

mikezhang0515

Feb 19, 2016

Replace \/ with / and then match the URLs.
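
A minimal sketch of that replace-then-match approach, assuming a URL runs until the next double quote or whitespace. `find_urls` and `sample` are hypothetical names, and the URL is a placeholder.

```python
import re

def find_urls(text):
    """Unescape JSON-style slashes, then return every http(s) URL found."""
    cleaned = text.replace("\\/", "/")
    # A URL here is taken to run until the next double quote or whitespace.
    return re.findall(r'https?://[^"\s]+', cleaned)

sample = r'"url": "https:\/\/example.com\/photo\/1\/w%3D280_s%3D1\/abc"'
print(find_urls(sample))
```

Note that this returns every URL in the text without telling you which size each one belongs to.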
