
How to parse HTML pages using Node.js

Posted 2024-12-04 07:46:56

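I need to parse a large number of HTML pages on the server side.
We can all agree that regular expressions are not the right tool here.
JavaScript seems to me to be the native way to parse an HTML page, but that assumption depends on server-side code having all the DOM capabilities that JavaScript has in a browser.

Does Node.js have this capability built in?
Is there a better way to approach the problem of parsing HTML on the server side?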


初见终念 2024-12-11 07:46:56

You can use the npm modules jsdom and htmlparser to create and parse a DOM in Node.js.

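Other options include:

BeautifulSoup for Python
Converting the HTML to XHTML and using XSLT
HTMLAgilityPack for .NET
CsQuery for .NET (my new favorite)
The SpiderMonkey and Rhino JS engines have native E4X support. This is probably only useful if you convert your HTML to XHTML.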

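Of all these options, I prefer the Node.js option because it uses the standard W3C DOM accessor methods and I can reuse code on both the client and the server. I wish BeautifulSoup's methods were more similar to the W3C DOM, and I think converting HTML to XHTML just to write XSLT is outright sadistic.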
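To illustrate the code-reuse point, here is a minimal sketch of a function written only against standard DOM methods, so it does not care where the document came from. The name `collectLinks` is hypothetical, and it is an assumption that whatever server-side document you pass in implements `querySelectorAll`, `textContent`, and `getAttribute`.

```javascript
// Works with any object implementing the standard DOM Document interface:
// the browser's global `document`, or a document built on the server.
function collectLinks(doc) {
  return Array.from(doc.querySelectorAll("a[href]"), (a) => ({
    text: a.textContent.trim(),
    href: a.getAttribute("href"),
  }));
}

// Export for Node (CommonJS); in a plain browser script the function is
// simply available as a global.
if (typeof module !== "undefined" && module.exports) {
  module.exports = { collectLinks };
}
```

In a browser you would call `collectLinks(document)`. On the server you would pass a document produced by a DOM library such as jsdom.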


纵山崖 2024-12-11 07:46:56

Use Cheerio. It isn't as strict as jsdom and is optimized for scraping. As a bonus, it uses the jQuery selectors you already know.

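❤ Familiar syntax: Cheerio implements a subset of core jQuery. Cheerio removes all the DOM inconsistencies and browser cruft from the jQuery library, revealing its truly gorgeous API.
ϟ Blazingly fast: Cheerio works with a very simple, consistent DOM model. As a result, parsing, manipulating, and rendering are incredibly efficient. Preliminary end-to-end benchmarks suggest that cheerio is about 8x faster than JSDOM.
❁ Insanely flexible: Cheerio wraps around @FB55's forgiving htmlparser. Cheerio can parse nearly any HTML or XML document.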


↙温凉少女 2024-12-11 07:46:56

November 2020 Update

I searched for the top Node.js HTML parser libraries.

Because my use cases didn't require a library with many features, I could focus on stability and performance.

By stability I mean that the library has been used by the community long enough for bugs to be found, that it is still maintained, and that open issues get closed.

It's hard to predict the future of an open source library, but I put together a small summary based on the top 10 libraries on Openbase.

I divided them into two groups according to the last commit (within each group, the order is by GitHub stars):

Last commit within the last 6 months:

Name	Last commit	Open issues	GitHub stars
jsdom	3 months	331	14.9K
htmlparser2	8 days	2	2.7K
parse5	2 months	2	2.5K
swagger-parser	2 months	48	663
html-parse-stringify	4 months	3	215
node-html-parser	7 days	15	205

Last commit 6 months ago or more:

Name	Last commit	Open issues	GitHub stars
cheerio	1 year	174	22.9K
koa-bodyparser	6 months	9	1.1K
sax-js	3 years	65	941
draftjs-to-html	1 year	27	233
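The grouping used for the two tables can be sketched in a few lines. The figures are copied from the snapshot above. Converting "months" and "years" into approximate days, and the names `libraries` and `groupByActivity`, are assumptions made purely for illustration.

```javascript
// Snapshot figures from the tables; ages converted to approximate days
// (1 month ≈ 30 days, 1 year ≈ 365 days) only so they can be compared.
const libraries = [
  { name: "jsdom", lastCommitDays: 90, openIssues: 331, stars: 14900 },
  { name: "htmlparser2", lastCommitDays: 8, openIssues: 2, stars: 2700 },
  { name: "parse5", lastCommitDays: 60, openIssues: 2, stars: 2500 },
  { name: "swagger-parser", lastCommitDays: 60, openIssues: 48, stars: 663 },
  { name: "html-parse-stringify", lastCommitDays: 120, openIssues: 3, stars: 215 },
  { name: "node-html-parser", lastCommitDays: 7, openIssues: 15, stars: 205 },
  { name: "cheerio", lastCommitDays: 365, openIssues: 174, stars: 22900 },
  { name: "koa-bodyparser", lastCommitDays: 180, openIssues: 9, stars: 1100 },
  { name: "sax-js", lastCommitDays: 1095, openIssues: 65, stars: 941 },
  { name: "draftjs-to-html", lastCommitDays: 365, openIssues: 27, stars: 233 },
];

const SIX_MONTHS_IN_DAYS = 180;

// Split into "recent" (last commit under the cutoff) and "older" (at or over
// the cutoff), each sorted by GitHub stars in descending order.
function groupByActivity(libs, cutoffDays = SIX_MONTHS_IN_DAYS) {
  const byStarsDesc = (a, b) => b.stars - a.stars;
  return {
    recent: libs.filter((l) => l.lastCommitDays < cutoffDays).sort(byStarsDesc),
    older: libs.filter((l) => l.lastCommitDays >= cutoffDays).sort(byStarsDesc),
  };
}

const groups = groupByActivity(libraries);
console.log(groups.recent.map((l) => l.name));
console.log(groups.older.map((l) => l.name));
```

Since the numbers are a point-in-time snapshot, rerunning this against fresh data may well reorder the groups.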

I picked node-html-parser because it seems quite fast and is very active at the moment.

(*) Openbase provides much more information about each library, such as the number of contributors (with 3+ commits), weekly downloads, monthly commits, version, etc.

(**) The tables above are a snapshot taken at a specific date and time. I would check the reference again and, as a first step, look at the level of recent activity before diving into the smaller details.
