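I need to use the spider-flow framework to crawl the following three pages: https://price.21food.cn/product/939.html https://price.21food.cn/product/1505.html https://price.21food.cn/product/196.html
I have already implemented a crawler for one of these URLs. The three pages differ only in their data, so they could all be handled by a single crawler. In Selenium I solved this by building a collection of URLs and iterating over it with a for loop, but I am having trouble doing the same in spider-flow.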
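My idea is to define a URL collection first and then crawl it in a loop, so I built the following flow:
The first node defines a variable, urlList, holding the collection of three addresses: ["https://price.21food.cn/product/939.html","https://price.21food.cn/product/1505.html","https://price.21food.cn/product/196.html"]
The second node is a loop with an index variable named urlIndex; its loop count is the length of urlList, written as ${list.length(urlList)}
The third node defines a variable named url with the value ${urlList[urlIndex]}, which simply picks the specific URL out of the collection above
The fourth node, the one that starts the crawl, uses that url as its request URL, with the value ${url}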
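For comparison, here is a minimal Python sketch of the control flow those four nodes are meant to express. It is not spider-flow code and not my original Selenium script; `fetch_html` and `crawl_all` are hypothetical names introduced only for illustration, and the standard-library `urllib` stands in for the request node.

```python
from typing import Callable, Dict, List
from urllib.request import urlopen

# Node 1: the URL collection (same three addresses as urlList).
URL_LIST: List[str] = [
    "https://price.21food.cn/product/939.html",
    "https://price.21food.cn/product/1505.html",
    "https://price.21food.cn/product/196.html",
]


def fetch_html(url: str) -> str:
    """Hypothetical stand-in for the request node: download one page."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl_all(url_list: List[str],
              fetch: Callable[[str], str] = fetch_html) -> Dict[str, str]:
    """Visit every URL by index and return the page body for each one."""
    pages: Dict[str, str] = {}
    # Node 2: loop with an index, once per element of the collection.
    for url_index in range(len(url_list)):
        # Node 3: pick the current URL, like ${urlList[urlIndex]}.
        url = url_list[url_index]
        # Node 4: request that URL, like ${url} in the request node.
        pages[url] = fetch(url)
    return pages
```

Calling `crawl_all(URL_LIST)` would request the three pages in order; passing a different `fetch` function lets the loop be checked without touching the network.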
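Everything after that is the data-extraction logic, which works; I have tested it before. The flow looks fine to me, but when I actually run it, execution ends right after the first variable-definition node.
I searched through a lot of tutorials online but could not find any tutorial or example covering this requirement, and for some reason the official documentation will not open for me. I am out of options, so I am asking here. If anyone knows how to do this, I would be grateful for any pointers. Thanks in advance.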
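Gitee repository for the spider-flow framework: https://gitee.com/ssssssss-team/spider-flow
Download the project and open it in IDEA, run the db.sql script provided with the project against your database, and set the database address in the configuration file; it should then run correctly. The default address is localhost:8088
Below is the crawler I built. To run it, paste the content into spider-flow through the "XML 编辑" (XML edit) option.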
<mxGraphModel>
<root>
<mxCell id="0">
<JsonProperty as="data">
{"spiderName":"食品商务网爬虫(未整合多个网址)","submit-strategy":"random","threadCount":""}
</JsonProperty>
</mxCell>
<mxCell id="1" parent="0"/>
<mxCell id="2" value="开始" style="start" parent="1" vertex="1">
<mxGeometry x="300" y="80" width="32" height="32" as="geometry"/>
<JsonProperty as="data">
{"shape":"start"}
</JsonProperty>
</mxCell>
<mxCell id="3" value="开始抓取" style="request" parent="1" vertex="1">
<mxGeometry x="490" y="80" width="32" height="32" as="geometry"/>
<JsonProperty as="data">
{"value":"开始抓取","loopVariableName":"","method":"GET","sleep":"","timeout":"","response-charset":"","retryCount":"","retryInterval":"","body-type":"none","body-content-type":"text/plain","loopCount":"","url":"${url}","proxy":"","request-body":"","follow-redirect":"1","tls-validate":"1","cookie-auto-set":"1","repeat-enable":"0","shape":"request"}
</JsonProperty>
</mxCell>
<mxCell id="4" value="定义变量" style="variable" parent="1" vertex="1">
<mxGeometry x="620" y="80" width="32" height="32" as="geometry"/>
<JsonProperty as="data">
{"value":"定义变量","loopVariableName":"","variable-name":["dataList"],"variable-description":[""],"loopCount":"","variable-value":["${extract.xpaths(resp.html,'/html/body/div[2]/div[3]/div/div[2]/div[1]/div[2]/div[2]/ul/li')}"],"shape":"variable"}
</JsonProperty>
</mxCell>
<mxCell id="9" value="" style="strokeWidth=2;sharp=1;" parent="1" source="3" target="4" edge="1">
<mxGeometry relative="1" as="geometry"/>
<JsonProperty as="data">
{"value":"","exception-flow":"0","lineWidth":"2","line-style":"sharp","lineColor":"black","condition":"","transmit-variable":"1"}
</JsonProperty>
</mxCell>
<mxCell id="11" value="循环" style="loop" parent="1" vertex="1">
<mxGeometry x="620" y="170" width="32" height="32" as="geometry"/>
<JsonProperty as="data">
{"value":"循环","loopItem":"","loopVariableName":"index","loopCount":"${list.length(dataList)}","loopStart":"0","loopEnd":"-1","shape":"loop"}
</JsonProperty>
</mxCell>
<mxCell id="12" value="" style="strokeWidth=2;sharp=1;" parent="1" source="4" target="11" edge="1">
<mxGeometry relative="1" as="geometry"/>
<JsonProperty as="data">
{"value":"","exception-flow":"0","lineWidth":"2","line-style":"sharp","lineColor":"black","condition":"","transmit-variable":"1"}
</JsonProperty>
</mxCell>
<mxCell id="13" value="输出" style="output" parent="1" vertex="1">
<mxGeometry x="790" y="334" width="32" height="32" as="geometry"/>
<JsonProperty as="data">
{"value":"输出","loopVariableName":"","tableName":"","csvName":"","csvEncoding":"GBK","output-name":["产品名","市场","规格","最高价格","平均价格","最低价格","日期"],"loopCount":"","output-value":["${name}","${market}","${specifications}","${top}","${avg}","${low}","${dataDate}"],"output-all":"0","output-database":"0","output-csv":"0","shape":"output"}
</JsonProperty>
</mxCell>
<mxCell id="15" value="定义变量" style="variable" parent="1" vertex="1">
<mxGeometry x="620" y="250" width="32" height="32" as="geometry"/>
<JsonProperty as="data">
{"value":"定义变量","loopVariableName":"","variable-name":["name","market","specifications","top","avg","low","dataDate"],"variable-description":["","","","","","",""],"loopCount":"","variable-value":["${dataList[index].selectors('table tbody tr td a')[0].text()}","${dataList[index].selectors('table tbody tr td a')[1].text()}","${dataList[index].selectors('table tbody tr td span')[0].text()}","${dataList[index].selectors('table tbody tr td span')[1].text()}","${dataList[index].selectors('table tbody tr td span')[3].text()}","${dataList[index].selectors('table tbody tr td span')[2].text()}","${dataList[index].selectors('table tbody tr td span')[4].text()}"],"shape":"variable"}
</JsonProperty>
</mxCell>
<mxCell id="16" value="" style="strokeWidth=2;sharp=1;" parent="1" source="11" target="15" edge="1">
<mxGeometry relative="1" as="geometry"/>
<JsonProperty as="data">
{"value":"","exception-flow":"0","lineWidth":"2","line-style":"sharp","lineColor":"black","condition":"","transmit-variable":"1"}
</JsonProperty>
</mxCell>
<mxCell id="18" value="" style="strokeWidth=2;sharp=1;" parent="1" source="15" target="13" edge="1">
<mxGeometry relative="1" as="geometry"/>
<JsonProperty as="data">
{"value":"","exception-flow":"0","lineWidth":"2","line-style":"sharp","lineColor":"black","condition":"","transmit-variable":"1"}
</JsonProperty>
</mxCell>
<mxCell id="27" value="定义变量" style="variable" parent="1" vertex="1">
<mxGeometry x="90" y="440" width="32" height="32" as="geometry"/>
<JsonProperty as="data">
{"value":"定义变量","loopVariableName":"","variable-name":["urlList"],"variable-description":[""],"loopCount":"","variable-value":["[\"https://price.21food.cn/product/939.html\",\"https://price.21food.cn/product/1505.html\",\"https://price.21food.cn/product/196.html\"]"],"shape":"variable"}
</JsonProperty>
</mxCell>
<mxCell id="29" value="循环" style="loop" parent="1" vertex="1">
<mxGeometry x="180" y="440" width="32" height="32" as="geometry"/>
<JsonProperty as="data">
{"value":"循环","loopItem":"","loopVariableName":"urlIndex","loopCount":"${list.length(urlList)}","loopStart":"0","loopEnd":"-1","shape":"loop"}
</JsonProperty>
</mxCell>
<mxCell id="31" value="定义变量" style="variable" parent="1" vertex="1">
<mxGeometry x="262" y="440" width="32" height="32" as="geometry"/>
<JsonProperty as="data">
{"value":"定义变量","loopVariableName":"","variable-name":["url"],"variable-description":[""],"loopCount":"","variable-value":["${urlList[urlIndex]}"],"shape":"variable"}
</JsonProperty>
</mxCell>
<mxCell id="42" value="" style="strokeWidth=2;sharp=1;" edge="1" parent="1" source="27" target="29">
<mxGeometry relative="1" as="geometry"/>
<JsonProperty as="data">
{"value":"","exception-flow":"0","lineWidth":"2","line-style":"sharp","lineColor":"black","condition":"","transmit-variable":"1"}
</JsonProperty>
</mxCell>
<mxCell id="43" value="" style="strokeWidth=2;sharp=1;" edge="1" parent="1" source="29" target="31">
<mxGeometry relative="1" as="geometry"/>
<JsonProperty as="data">
{"value":"","exception-flow":"0","lineWidth":"2","line-style":"sharp","lineColor":"black","condition":"","transmit-variable":"1"}
</JsonProperty>
</mxCell>
<mxCell id="44" value="" style="strokeWidth=2;sharp=1;" edge="1" parent="1" source="2" target="27">
<mxGeometry relative="1" as="geometry"/>
<JsonProperty as="data">
{"value":"","exception-flow":"0","lineWidth":"2","line-style":"sharp","lineColor":"black","condition":"","transmit-variable":"1"}
</JsonProperty>
</mxCell>
<mxCell id="45" value="" style="strokeWidth=2;sharp=1;" edge="1" parent="1" source="31" target="3">
<mxGeometry relative="1" as="geometry"/>
<JsonProperty as="data">
{"value":"","exception-flow":"0","lineWidth":"2","line-style":"sharp","lineColor":"black","condition":"","transmit-variable":"1"}
</JsonProperty>
</mxCell>
</root>
</mxGraphModel>
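To make the field mapping in the second variable-definition node easier to check, here is a small Python sketch of which cell feeds which output column. The tag and index pairs are copied from the XML above; `FIELD_SOURCES` and `row_to_record` are hypothetical names, and the cell texts in the usage note are placeholders, not real data from the site.

```python
from typing import Dict, List, Tuple

# (output field, element type, position) as used in the XML, e.g.
# avg comes from span[3] and low from span[2].
FIELD_SOURCES: List[Tuple[str, str, int]] = [
    ("name", "a", 0),
    ("market", "a", 1),
    ("specifications", "span", 0),
    ("top", "span", 1),
    ("avg", "span", 3),
    ("low", "span", 2),
    ("dataDate", "span", 4),
]


def row_to_record(a_texts: List[str], span_texts: List[str]) -> Dict[str, str]:
    """Build one output record from the <a> and <span> texts of a row."""
    cells = {"a": a_texts, "span": span_texts}
    return {field: cells[tag][pos] for field, tag, pos in FIELD_SOURCES}
```

For example, `row_to_record(["A0", "A1"], ["S0", "S1", "S2", "S3", "S4"])` puts `"S3"` in `avg` and `"S2"` in `low`, mirroring the order used in the flow.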
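tiRolin OP One more question: how do I simulate a click in this framework? In the examples, a new page is opened by extracting a URL, concatenating it, and starting a new crawl for it.
But for some of the pages I want to crawl, the target URL is not present in the HTML; a click has to be performed before the page jumps to the new address. In code I can do this with Selenium, but how should I do it in spider-flow?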