Downloading a Toutiao article with Python (not a crawler): the response cannot be decoded. Where exactly does it go wrong?

    dididaren · Feb 20, 2021 · 3586 views

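    Over the weekend I set out to write a script that takes a Toutiao article link and automatically downloads the images in that article to local disk.

    I hit a snag.

    Python requests fetches the article data successfully, and a Charles capture also shows that requests.get received the article data. But when I decode it locally, whether I use response.text, response.content.decode, or write the received bytes straight to a file in binary mode, I get either garbled text or an error. (The same data looks normal when viewed in the Charles capture.)

    For decoding the response I tried gbk, utf-8, and iso; for the errors argument of decode I tried ignore and replace. None of it helped.

    Searching Google and Baidu did not turn up a working solution either.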

    Supplement 1  ·  Feb 20, 2021
    import requests
    
    verify=False
    
    headers = '''
    accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
    accept-encoding: gzip, deflate, br
    accept-language: zh-CN,zh;q=0.9,zh-TW;q=0.8,en;q=0.7
    cache-control: no-cache
    cookie: tt_webid=6930433366274721287; csrftoken=50dc3846aa3e01cf8e7cc3cacd3cb664; tt_webid=6930433366274721287; _tea_utm_cache_2256={%22utm_source%22:%22copy_link%22%2C%22utm_medium%22:%22toutiao_ios%22%2C%22utm_campaign%22:%22client_share%22}; s_v_web_id=verify_kld3dq66_Wvi5ATKw_nvWF_4CHH_9aBi_wzhdhfONq7HB; _ga=GA1.2.1782351744.1613793452; _gid=GA1.2.504008286.1613793452; csrftoken=50dc3846aa3e01cf8e7cc3cacd3cb664; __ac_nonce=06030995f005059b3fa30; __ac_signature=_02B4Z6wo00f01bNanKAAAIDB2pGskn.jLh2zfpgAAAznlJs4jAp.SxXUbLBYvi0aqQmu7OBPOm6M5vvKPJGKjxqNVGt0Cc2hHsiQ0bkQbsLQv.C4ZZyAHsoKG9tZtyfcvupCp.4rC56vmyGcb8; MONITOR_WEB_ID=72226c84-9c57-4b0e-931c-6a99c225be9d; tt_scid=w3cIoL.syHtp3f6lhdBFt08cmgEdko8ObFMxYA7uMNdRp-aVs1w0x852q6pFS7bybc1a
    dnt: 1
    pragma: no-cache
    sec-ch-ua: "Chromium";v="88", "Google Chrome";v="88", ";Not A Brand";v="99"
    sec-ch-ua-mobile: ?0
    sec-fetch-dest: document
    sec-fetch-mode: navigate
    sec-fetch-site: none
    sec-fetch-user: ?1
    upgrade-insecure-requests: 1
    user-agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.10240
    '''
    
    def todict(headers):
        return dict([ l.split(': ') for l in headers.split('\n') if l])
    
    
    def downloadImages(url,headers):
    
        headers = todict(headers)
        res = requests.get(url=url,headers=headers,verify=False).content.decode('utf-8','ignore')
    
        print(res)
    
    
    
    if __name__ == '__main__':
    
    
        url = 'https://www.toutiao.com/a6931116005670371853/'
    
        downloadImages(url,headers)
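The `todict` helper above splits every line on `': '`, so it breaks on lines with leading whitespace, on a header with no space after the colon, or on a value that itself contains `': '`. A slightly more defensive sketch follows; `parse_headers` is a hypothetical name, not part of the original script, and it is only one possible way to parse a pasted header block.

```python
def parse_headers(raw: str) -> dict:
    """Parse a pasted 'name: value' header block into a dict.

    Hypothetical helper for illustration; it is not from the original post.
    """
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first colon only, so values may contain colons.
        name, sep, value = line.partition(":")
        if not sep or not name.strip():
            continue  # skip lines that are not 'name: value'
        headers[name.strip()] = value.strip()
    return headers
```

It can be used in place of `todict(headers)` when building the request headers.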
    
    12 replies    2021-02-20 21:39:56 +08:00
    #1 · maobukui · Feb 20, 2021 via iPhone · ❤️ 1
    Post the URL and an example of the parameters so we can take a look.
    #2 · imn1 · Feb 20, 2021 · ❤️ 1
    Write the response to a file on disk and check whether a zip tool can open it.
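A similar check can be done without an archive tool by looking at the first bytes of the saved data. This is a minimal sketch, assuming the raw body is available as bytes; `guess_compression` is a hypothetical helper and the zlib test is only a heuristic. Note that Brotli streams have no magic number, so they cannot be recognised this way.

```python
import gzip
import zlib


def guess_compression(data: bytes) -> str:
    """Guess the compression format from the leading bytes (heuristic only)."""
    # gzip streams start with the magic bytes 1f 8b.
    if data[:2] == b"\x1f\x8b":
        return "gzip"
    # zlib streams: the low nibble of byte 0 is 8 (deflate), and the first
    # two bytes read as a big-endian integer are a multiple of 31.
    if len(data) >= 2 and data[0] & 0x0F == 8 and ((data[0] << 8) | data[1]) % 31 == 0:
        return "zlib"
    # Plain text, Brotli, and everything else end up here.
    return "unknown"
```

If this reports `unknown` for a body that still looks like binary noise, a format without a signature (such as Brotli) is one possible explanation.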
    #3 · est · Feb 20, 2021 · ❤️ 1
    My guess is that the OP is on win98 + py2.
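    #4 · limuyan44 · Feb 20, 2021 · ❤️ 1
    Just post your code so it is easy to test. Otherwise anyone who wants to help has to write the code themselves, and people who have not scraped this site can only guess.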
    #5 · dididaren (OP) · Feb 20, 2021
    py3. Code attached.
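    #6 · Arrowing · Feb 20, 2021 via Android · ❤️ 1
    The content has probably been compressed (gzip, compress, etc.), so decompress it. Browsers usually decompress automatically; when scraping you have to decompress it yourself.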
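Decompressing by hand could look roughly like the sketch below. It assumes you have the raw, still-compressed body and the value of the response's `Content-Encoding` header; `decode_body` is a hypothetical helper. gzip and deflate are handled with the standard library, while the `br` branch assumes the third-party `brotli` package is installed.

```python
import gzip
import zlib


def decode_body(raw: bytes, content_encoding: str) -> bytes:
    """Decompress a response body according to its Content-Encoding value."""
    encoding = content_encoding.strip().lower()
    if encoding in ("", "identity"):
        return raw  # not compressed
    if encoding == "gzip":
        return gzip.decompress(raw)
    if encoding == "deflate":
        try:
            return zlib.decompress(raw)  # zlib-wrapped deflate
        except zlib.error:
            # Some servers send raw deflate without the zlib header.
            return zlib.decompress(raw, -zlib.MAX_WBITS)
    if encoding == "br":
        import brotli  # third-party package, assumed to be installed

        return brotli.decompress(raw)
    raise ValueError(f"unsupported Content-Encoding: {content_encoding!r}")
```

The decompressed bytes can then be decoded as text, for example with `.decode('utf-8')`.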
    #7 · Ptu2sha · Feb 20, 2021 · ❤️ 1
    Remove the line accept-encoding: gzip, deflate, br
    #8 · dididaren (OP) · Feb 20, 2021
    @Arrowing Thanks.
    #9 · dididaren (OP) · Feb 20, 2021
    @Ptu2sha Thanks.
    #10 · fucUup · Feb 20, 2021
    Chrome/42.0? You are running an antique.
    #11 · omph · Feb 20, 2021
    Environment: Linux, default locale UTF-8.
    The code prints normal output. Redirecting it to an HTML file also works, and the file opens fine.
    #12 · vone · Feb 20, 2021
    #6 has the right answer. The problem is br compression: by default, requests cannot decode a br-compressed body.

    Remove br from the accept-encoding request header and it works.
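The fix can also be applied in code rather than by editing the pasted header block. A minimal sketch, assuming the headers have already been turned into a dict (for instance with the OP's `todict`); `drop_brotli` is a hypothetical helper name.

```python
def drop_brotli(headers: dict) -> dict:
    """Return a copy of headers with 'br' removed from Accept-Encoding."""
    cleaned = dict(headers)
    for name in list(cleaned):
        if name.lower() != "accept-encoding":
            continue
        codings = [c.strip() for c in cleaned[name].split(",")]
        # Compare only the coding name, ignoring any ';q=' weight.
        kept = [c for c in codings if c and c.split(";")[0].strip().lower() != "br"]
        if kept:
            cleaned[name] = ", ".join(kept)
        else:
            del cleaned[name]  # nothing left to advertise
    return cleaned
```

With the OP's script this would be used as `requests.get(url, headers=drop_brotli(todict(headers)))`. As far as I know, installing the `brotli` package is an alternative, since the urllib3 layer underneath requests can then decode br responses itself, but treat that as something to verify in your own environment.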