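Stuck at home with nothing to do, I decided to write a crawler for practice and picked pixabay.com. The site loads fine in my browser, so I copied the browser's headers and tried to fetch the page with curl: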
<?php
$ch = curl_init('https://pixabay.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the response as a string instead of printing it
curl_setopt($ch, CURLOPT_HEADER, true);          // include the response headers in the returned string
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // skips certificate verification; acceptable for a local test only
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
// Request headers copied from the browser
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    "accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng;q=0.8,application/signed-exchange;v=b3",
    "Accept-Language:en-US,en;q=0.5",
    "Accept-encoding: identity",
    "User-agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36",
));
$a = curl_exec($ch);
curl_close($ch);
echo $a;
?>
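It immediately returns a Cloudflare 403. What technique is behind this? The page loads fine in my browser, which shows that my IP has not been banned.
I'm new to PHP, so any pointers would be much appreciated.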
1
webshe11 2020-01-26 10:11:53 +08:00
Capture both requests with BurpSuite and check whether they are really identical.
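
One way to do that comparison in PHP itself, as a rough sketch: paste the header lines shown by the browser (or by BurpSuite) into one array, the lines curl sent into another, and list what is missing or different. The function names `parseHeaderLines` and `diffHeaders` are made up for this illustration, and the sketch only compares names and values, not header order or anything at the TLS level.

```php
<?php
// Parse "Name: value" lines into a map keyed by lower-cased header name.
// Lines without a colon (such as "GET / HTTP/1.1") are skipped.
function parseHeaderLines(array $lines): array
{
    $map = [];
    foreach ($lines as $line) {
        $parts = explode(':', $line, 2);
        if (count($parts) !== 2) {
            continue;
        }
        $map[strtolower(trim($parts[0]))] = trim($parts[1]);
    }
    return $map;
}

// Report browser headers that curl did not send or sent with another value.
function diffHeaders(array $browserLines, array $curlLines): array
{
    $browser = parseHeaderLines($browserLines);
    $curl = parseHeaderLines($curlLines);
    $diff = [];
    foreach ($browser as $name => $value) {
        if (!array_key_exists($name, $curl)) {
            $diff[$name] = 'missing in curl';
        } elseif ($curl[$name] !== $value) {
            $diff[$name] = 'differs';
        }
    }
    return $diff;
}
```

To get the lines curl actually sent, set `curl_setopt($ch, CURLINFO_HEADER_OUT, true)` before `curl_exec()` and then split `curl_getinfo($ch, CURLINFO_HEADER_OUT)` on `"\r\n"`.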
|
2
ClarkAbe 2020-01-26 10:15:17 +08:00 via Android
Capture the server's Set-Cookie, then send the request again with the CfCookie attached.....
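
A minimal sketch of that idea, assuming the first response is fetched with CURLOPT_HEADER enabled so the raw header block is available as a string. The helper names `cookiesFromResponseHeaders` and `cookieHeader` are hypothetical, and whether replaying the cookies gets past the 403 is not confirmed here.

```php
<?php
// Collect name => value pairs from the Set-Cookie lines of a raw header block.
function cookiesFromResponseHeaders(string $rawHeaders): array
{
    $cookies = [];
    foreach (preg_split('/\r\n|\n/', $rawHeaders) as $line) {
        // Header names are case-insensitive (HTTP/2 sends them lower-cased).
        if (stripos($line, 'Set-Cookie:') !== 0) {
            continue;
        }
        // Keep only the leading name=value; drop attributes such as path or expires.
        $pair = trim(explode(';', substr($line, strlen('Set-Cookie:')), 2)[0]);
        $eq = strpos($pair, '=');
        if ($eq === false || $eq === 0) {
            continue;
        }
        $cookies[substr($pair, 0, $eq)] = substr($pair, $eq + 1);
    }
    return $cookies;
}

// Build a Cookie request header line from that map.
function cookieHeader(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return 'Cookie: ' . implode('; ', $pairs);
}
```

The resulting line can be appended to the CURLOPT_HTTPHEADER array of the second request. Alternatively, pointing CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE at the same file lets libcurl store and replay cookies by itself.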
|
3
kisshere OP @ClarkAbe With cookies cleared and no referer, the browser can still open pixabay directly; only curl fails.
|
4
chzzzy 2020-01-26 10:21:05 +08:00 via iPhone
Sometimes I also get a 403 when I visit through a proxy.
|
5
tqyq88 2020-01-26 10:23:57 +08:00
Tested, this works:
curl 'https://pixabay.com/' -H 'authority: pixabay.com' -H 'cache-control: max-age=0' -H 'upgrade-insecure-requests: 1' -H 'user-agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36' -H 'sec-fetch-user: ?1' -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3' -H 'sec-fetch-site: cross-site' -H 'sec-fetch-mode: navigate' -H 'referer: https://v2ex.com/t/640340' -H 'accept-language: zh-CN,zh;q=0.9' -H 'cookie: __cfduid=d449a6f26015565933edbff1e02f115151580005222; anonymous_user_id=f886998b-0a14-45fb-a0f4-77c58f2f627b; _ga=GA1.2.1192310497.1580005225; _gid=GA1.2.1957918665.1580005225; _gat_UA-20223345-1=1; is_human=1; _sp_id.aded=c8cdb6f1-c436-4cd6-8434-7055755d8aba.1580005226.1.1580005226.1580005226.5a63ed0f-69dc-4111-ba9b-a09f3f22a098; _sp_ses.aded=*; client_width=1399'
|
6
0x400 2020-01-26 12:01:30 +08:00 via Android
|
7
baobao1270 2020-01-26 12:57:56 +08:00
Maybe the cookie is refreshed on every request...
|
9
bloggergo 2020-01-26 19:09:18 +08:00
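The site may be using Cloudflare's hotlink protection, which requires the referer to be same-origin (an empty referer is also accepted); any other domain gets a 403. Also, curl's "Accept-encoding: identity" differs from what a user's browser sends.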
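
To try those two changes against the original script without retyping the whole header list, something like the following could be used. It is only a sketch: `withHeader` is a made-up helper, and the replacement Accept-Encoding value is an assumption about what a browser would send, not something taken from this thread.

```php
<?php
// Return a copy of $headers with $name set to $value, or removed when $value is null.
// Header names are matched case-insensitively.
function withHeader(array $headers, string $name, ?string $value): array
{
    $out = [];
    foreach ($headers as $line) {
        $current = trim(explode(':', $line, 2)[0]);
        if (strcasecmp($current, $name) !== 0) {
            $out[] = $line; // keep every header except the one being replaced
        }
    }
    if ($value !== null) {
        $out[] = $name . ': ' . $value;
    }
    return $out;
}

// Example: swap the encoding header and make sure no referer is sent.
$headers = ['Accept-encoding: identity', 'Accept-Language:en-US,en;q=0.5'];
$headers = withHeader($headers, 'Accept-Encoding', 'gzip, deflate, br'); // assumed browser-like value
$headers = withHeader($headers, 'Referer', null);
```

If a compressed encoding is requested this way, the body comes back compressed; setting CURLOPT_ENCODING to an empty string instead asks libcurl to advertise the encodings it supports and decode the response itself.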
|
10
haha370104 2020-03-20 10:32:48 +08:00
@baobao1270 The _sp_id.aded value in the cookie changes every time; it contains a timestamp.
|