I looked at its challenge code; it seems to fill in and submit a hidden form after a delay.
The hidden form looks like this:
<form id="challenge-form" action="/cdn-cgi/l/chk_jschl" method="get">
  <input type="hidden" name="jschl_vc" value="8638db33f4d888e137a518e882e7c8e3">
  <input type="hidden" name="pass" value="1475845385.911-E7QOUU3JrO">
  <input type="hidden" id="jschl-answer" name="jschl_answer">
</form>
And the script that computes the answer and submits the form:
<script type="text/javascript">
//<![CDATA[
(function(){
  var a = function() {
    try { return !!window.addEventListener; }
    catch (e) { return !1; }
  },
  b = function(b, c) {
    a() ? document.addEventListener("DOMContentLoaded", b, c)
        : document.attachEvent("onreadystatechange", b);
  };
  b(function(){
    var a = document.getElementById('yjs-content');
    a.style.display = 'block';
    setTimeout(function(){
      var s,t,o,p,b,r,e,a,k,i,n,g,f, tEGADKc={"UOLHSZfuv":+((!+[]+!![]+[])+(!+[]+!![]))};

      t = document.createElement('div');
      t.innerHTML = "<a href='/'>x</a>";
      t = t.firstChild.href;
      r = t.match(/https?:\/\//)[0];
      t = t.substr(r.length);
      t = t.substr(0, t.length - 1);
      a = document.getElementById('jschl-answer');
      f = document.getElementById('challenge-form');

      tEGADKc.UOLHSZfuv += +((!+[]+!![]+!![]+!![]+!![]+[])+(+[]));
      tEGADKc.UOLHSZfuv += +((!+[]+!![]+!![]+[])+(+!![]));
      tEGADKc.UOLHSZfuv += +((!+[]+!![]+!![]+[])+(!+[]+!![]+!![]+!![]+!![]+!![]));
      tEGADKc.UOLHSZfuv *= +((!+[]+!![]+!![]+!![]+[])+(!+[]+!![]+!![]));
      tEGADKc.UOLHSZfuv -= !+[]+!![]+!![]+!![]+!![]+!![]+!![];
      tEGADKc.UOLHSZfuv *= +((!+[]+!![]+!![]+!![]+!![]+[])+(+[]));

      a.value = parseInt(tEGADKc.UOLHSZfuv, 10) + t.length;
      '; 121'
      f.submit();
    }, 4000);
  }, false);
})();
//]]>
</script>
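Decoded, that variable soup is plain arithmetic: `+[]` coerces to 0, `!+[]` and `!![]` both coerce to true (1 under addition), and a trailing `+[]` turns the number into a string so the next group concatenates as a digit. For this particular challenge the steps work out as follows (a hand-decoded sketch; Cloudflare regenerates the constants on every request):

```python
# Hand-decoded form of the obfuscated arithmetic above (constants are
# specific to this one challenge page).
n = 22    # +((!+[]+!![]+[])+(!+[]+!![]))  ->  +("2" + 2)  ->  22
n += 50   # +("5" + 0)
n += 31   # +("3" + 1)
n += 36   # +("3" + 6)
n *= 43   # +("4" + 3)
n -= 7    # seven truthy terms summed
n *= 50   # +("5" + 0)

# The t = ... lines recover the host name ("ips.chacuo.net", 14 chars);
# its length is added before the result is written into jschl_answer.
print n + len("ips.chacuo.net")  # 298514
```

The 4000 ms setTimeout is the "checking your browser" delay before the form gets submitted.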
Some searching showed that this is Cloudflare's anti-bot page at work, and there is a module on GitHub that deals with exactly this:
cloudflare-scrape
A simple Python module to bypass Cloudflare's anti-bot page, using Requests.
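Per its README, basic usage is roughly the following (a sketch of the module's documented API; `create_scraper()` returns a drop-in replacement for a `requests.Session`):

```python
import cfscrape

# create_scraper() wraps a requests session and tries to solve the
# JavaScript challenge transparently on the first request.
scraper = cfscrape.create_scraper()
print scraper.get("http://ips.chacuo.net/").content
```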
But following the Installation and Usage instructions, it still fails to collect the tokens:
print cfscrape.get_cookie_string("http://ips.chacuo.net/")
'http://ips.chacuo.net/' returned an error. Could not collect tokens.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/cfscrape/__init__.py", line 146, in get_cookie_string
tokens, user_agent = cls.get_tokens(url, user_agent=user_agent)
File "/usr/local/lib/python2.7/dist-packages/cfscrape/__init__.py", line 119, in get_tokens
resp.raise_for_status()
File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 862, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 503 Server Error: Service Temporarily Unavailable for url: http://ips.chacuo.net/
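(The challenge page itself is served with HTTP status 503, so when the solver fails to get past it, `raise_for_status()` trips on that same response.)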
#1 icedx 2016-10-07 23:52:33 +08:00
Selenium
#2 ifishman OP
@icedx Thanks for the reply. Following your hint I looked into Selenium, which is a web automation testing tool, but how do I integrate it with pyspider? Sorry, I've only just started with Python and only know the basic syntax.
#3 icedx 2016-10-08 12:53:31 +08:00
#4 1130335361 2016-10-09 10:13:33 +08:00
cc @binux
#6 binux 2016-10-09 18:32:47 +08:00
1. Pass the check once in a real browser, copy the cookies, and crawl with those cookies (the cookie may be tied to your IP); see the sketch after this reply.
2. Or solve it inside pyspider:

```python
class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://ips.chacuo.net/', callback=self.chk_jschl,
                   fetch_type='js', js_script=r'''
function() {
    var script = document.querySelector('script').textContent;
    console.log(script);
    script = script.match(new RegExp("setTimeout\\(function\\(\\){([\\s\\S\n]+)f.submit", 'm'))[1] + 'a.value;';
    console.log(script);
    return eval(script);
}
''')

    @catch_status_code_error
    @config(age=10 * 24 * 60 * 60)
    def chk_jschl(self, response):
        print response.cookies
        print response.js_script_result
        self.crawl('http://ips.chacuo.net/cdn-cgi/l/chk_jschl', params={
            'jschl_vc': response.doc('input[name=jschl_vc]').val(),
            'pass': response.doc('input[name=pass]').val(),
            'jschl-answer': response.js_script_result
        }, callback=self.detail_page, cookies=response.cookies, headers={
            'Referer': 'http://ips.chacuo.net/'
        })
```
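For option 1, the flow with plain Requests would look roughly like this (a sketch; `cf_clearance` and `__cfduid` are the cookie names Cloudflare used at the time, and the placeholder values must be copied from a browser that already passed the check on the same IP):

```python
import requests

# Cookies copied from the browser's dev tools after passing the check.
# cf_clearance is the token issued once the challenge succeeds; the
# User-Agent should match the browser that obtained it.
cookies = {
    '__cfduid': '<value copied from browser>',
    'cf_clearance': '<value copied from browser>',
}
headers = {'User-Agent': '<same User-Agent string as the browser>'}

print requests.get('http://ips.chacuo.net/', cookies=cookies, headers=headers).status_code
```

Option 2 works by letting the js fetcher load the page, cutting the body of the setTimeout out of the challenge script, appending 'a.value;' in place of the final f.submit() so that eval returns the computed answer, and then submitting jschl_vc, pass, and that answer to /cdn-cgi/l/chk_jschl with the session cookies and a matching Referer.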
#9 1130335361 2016-10-10 10:01:27 +08:00
#10 ifishman OP
@1130335361 Tested it; it crawls normally now.
#11 binux 2016-10-11 00:37:07 +08:00
@1130335361 The check is bound to your IP. If you test on demo.pyspider.org, note that it has two IPs, so try a few times.
#12 feigle 2017-07-26 22:16:47 +08:00
I went through that page (http://ips.chacuo.net/) and couldn't find this hidden form anywhere.