What tools do you use to quickly parse HTML text?

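I have a large pile of HTML and need to pull certain text and fields out of it.

Right now I use two approaches: (1) regular expressions, and (2) parsing the HTML into a DOM and querying it with jQuery.

To the crawler experts here: how do you handle this? Are there any efficient tools?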

6 replies • 2020-06-25 10:49:43 +08:00

Vegetable

Jun 24, 2020

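The description is too vague. What tech stack are you on? Whatever you pick, you can't really get around the idea behind your second approach: parse the document first, then query it with XPath or something similar.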
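To make the "parse first, then query with XPath" idea concrete, here is a minimal sketch in Python using only the standard library. It assumes the markup is well-formed XHTML, which real-world pages often are not (a lenient parser such as lxml is the usual choice there), and `xml.etree.ElementTree` supports only a subset of XPath. The sample fragment and its class names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment; real pages usually need a lenient HTML parser.
fragment = """<div>
  <ul id="books">
    <li class="book"><a href="/b/1">Dune</a></li>
    <li class="book"><a href="/b/2">Neuromancer</a></li>
    <li class="ad"><a href="/promo">Sale</a></li>
  </ul>
</div>"""

root = ET.fromstring(fragment)

# XPath subset: every <a> that is a direct child of an <li class="book">, at any depth.
links = root.findall(".//li[@class='book']/a")

titles = [a.text for a in links]
hrefs = [a.get("href") for a in links]
print(titles)
print(hrefs)
```

The query names the structure you care about rather than the exact characters around it, so it keeps working when whitespace or unrelated attributes change.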

jorneyr

Jun 24, 2020

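Just use jQuery selectors directly; regular expressions are error-prone:
// Parse the HTML string into a jQuery object, then search its descendants.
var $html = $(htmlContent);
var target = $html.find(selector);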

duan602728596

Jun 24, 2020

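jsdom: parses an HTML string, lets you manipulate it through a subset of the BOM and DOM APIs, and can serialize the result back to an HTML string.
parse5: parses an HTML string into an AST, and can also generate an HTML string from an AST.
cheerio: in my experience, not as pleasant to use as jsdom.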

bigboNed3

Jun 24, 2020

Python: BeautifulSoup
Java: jsoup
Pretty much every language has its own "soup" library.
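For the Python route, the underlying idea can be sketched with just the standard library's `html.parser`; BeautifulSoup wraps the same parse-then-select approach in a much friendlier API. This is only an illustration: the `TextByClass` helper, the sample markup, and the `price` class are all hypothetical, and the depth tracking assumes reasonably well-nested tags.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextByClass(HTMLParser):
    """Collect the text of every <tag> whose class list contains cls (hypothetical helper)."""

    def __init__(self, tag, cls):
        super().__init__()
        self.tag, self.cls = tag, cls
        self.depth = 0      # > 0 while inside a matching element
        self.parts = []     # text fragments of the current match
        self.results = []   # text of finished matches

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1  # nested element inside a match
        elif tag == self.tag and self.cls in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.parts = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:  # the matching element just closed
            self.results.append("".join(self.parts).strip())

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


sample = """
<ul>
  <li class="item"><span class="name">Widget</span> <span class='price sale'>9.99</span></li>
  <li class="item"><span class="name">Gadget</span> <span class="price">19.50<br></span></li>
</ul>
"""

parser = TextByClass("span", "price")
parser.feed(sample)
parser.close()
print(parser.results)
```

Note that the second `span` uses different quoting and an extra class, and the first contains a `<br>`; a parser handles these variations without special cases, which is where hand-written regular expressions tend to break.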

fivesmallq

Jun 24, 2020

A small tool I wrote back when I was building crawlers.

https://github.com/fivesmallq/web-data-extractor

Extracting and parsing structured data with jQuery Selector, XPath or JsonPath from common web format like HTML, XML and JSON.

omph

Jun 25, 2020

https://github.com/benibela/xidel