How to parse HTML pages with Node.js

Posted on 2024-12-04 07:46:56

I need to parse a large number of HTML pages on the server side.
We all agree that regular expressions are not the way to go here.
It seems to me that JavaScript is the native way to parse an HTML page, but that assumption relies on server-side code having all the DOM capabilities that JavaScript has inside a browser.

Does Node.js have that ability built in?
Is there a better approach to parsing HTML on the server side?

Comments (6)

初见终念 2024-12-11 07:46:56

You can use the npm modules jsdom and htmlparser to create and parse a DOM in Node.js.

Other options include:

  • BeautifulSoup for Python
  • you can convert your HTML to XHTML and use XSLT
  • HTMLAgilityPack for .NET
  • CsQuery for .NET (my new favorite)
  • The SpiderMonkey and Rhino JS engines have native E4X support. This may be useful, but only if you convert your HTML to XHTML.

Out of all these options, I prefer the Node.js option, because it uses the standard W3C DOM accessor methods and I can reuse code on both the client and the server. I wish BeautifulSoup's methods were more similar to the W3C DOM, and I think converting your HTML to XHTML just to write XSLT is plain sadistic.

纵山崖 2024-12-11 07:46:56

Use Cheerio. It isn't as strict as jsdom and is optimized for scraping. As a bonus, it uses the jQuery selectors you already know.

❤ Familiar syntax: Cheerio implements a subset of core jQuery. Cheerio
removes all the DOM inconsistencies and browser cruft from the jQuery
library, revealing its truly gorgeous API.

ϟ Blazingly fast: Cheerio works with a very simple, consistent DOM
model. As a result parsing, manipulating, and rendering are incredibly
efficient. Preliminary end-to-end benchmarks suggest that cheerio is
about 8x faster than JSDOM.

❁ Insanely flexible: Cheerio wraps around @FB55's forgiving
htmlparser. Cheerio can parse nearly any HTML or XML document.

↙温凉少女 2024-12-11 07:46:56

November 2020 Update

I searched for the top Node.js HTML parser libraries.

Because my use cases didn't require a library with many features, I could focus on stability and performance.

By stability I mean that the library has been used by the community long enough for bugs to be found, that it is still maintained, and that open issues get closed.

It's hard to predict the future of an open-source library, but I put together a small summary based on the top 10 libraries on Openbase.

I divided them into two groups according to the last commit (within each group, the order follows GitHub stars):

Last commit within the last 6 months:

Name                  Last commit  Open issues  GitHub stars
jsdom                 3 months     331          14.9K
htmlparser2           8 days       2            2.7K
parse5                2 months     2            2.5K
swagger-parser        2 months     48           663
html-parse-stringify  4 months     3            215
node-html-parser      7 days       15           205

Last commit 6 months ago or more:

Name                  Last commit  Open issues  GitHub stars
cheerio               1 year       174          22.9K
koa-bodyparser        6 months     9            1.1K
sax-js                3 years      65           941
draftjs-to-html       1 year       27           233

I picked node-html-parser because it seems quite fast and is very active at the moment.

(*) Openbase provides much more information about each library, such as the number of contributors (with 3+ commits), weekly downloads, monthly commits, version, etc.

(**) The tables above are a snapshot taken at a specific time and date. I would check the reference again, starting with the level of recent activity, and then dive into the smaller details.
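
The triage described in this answer (split libraries by how recent the last commit is, then order each group by GitHub stars) could be sketched roughly as follows. The figures are copied from the November 2020 snapshot above; the day counts are approximations introduced for this sketch (1 month ≈ 30 days, 1 year ≈ 365 days), and the field names are hypothetical.

```javascript
// Snapshot data from the tables above. `lastCommitDays` is an approximation
// made for this sketch (1 month ~ 30 days, 1 year ~ 365 days).
const libraries = [
  { name: "cheerio", lastCommitDays: 365, openIssues: 174, stars: 22900 },
  { name: "draftjs-to-html", lastCommitDays: 365, openIssues: 27, stars: 233 },
  { name: "html-parse-stringify", lastCommitDays: 120, openIssues: 3, stars: 215 },
  { name: "htmlparser2", lastCommitDays: 8, openIssues: 2, stars: 2700 },
  { name: "jsdom", lastCommitDays: 90, openIssues: 331, stars: 14900 },
  { name: "koa-bodyparser", lastCommitDays: 180, openIssues: 9, stars: 1100 },
  { name: "node-html-parser", lastCommitDays: 7, openIssues: 15, stars: 205 },
  { name: "parse5", lastCommitDays: 60, openIssues: 2, stars: 2500 },
  { name: "sax-js", lastCommitDays: 1095, openIssues: 65, stars: 941 },
  { name: "swagger-parser", lastCommitDays: 60, openIssues: 48, stars: 663 },
];

const SIX_MONTHS_IN_DAYS = 180;
const byStarsDescending = (a, b) => b.stars - a.stars;

// Group 1: last commit less than six months ago, most-starred first.
const recentlyActive = libraries
  .filter((lib) => lib.lastCommitDays < SIX_MONTHS_IN_DAYS)
  .sort(byStarsDescending)
  .map((lib) => lib.name);

// Group 2: last commit six months ago or more, most-starred first.
const lessActive = libraries
  .filter((lib) => lib.lastCommitDays >= SIX_MONTHS_IN_DAYS)
  .sort(byStarsDescending)
  .map((lib) => lib.name);

console.log({ recentlyActive, lessActive });
```

Re-running this kind of check with fresh numbers is the "first step" suggested in the note above; the values here will be out of date.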

就是爱搞怪 2024-12-11 07:46:56

Use htmlparser2; it's way faster and pretty straightforward. See this usage example:

https://www.npmjs.org/package/htmlparser2#usage

And the live demo here:

http://demos.forbeslindesay.co.uk/htmlparser2/

他不在意 2024-12-11 07:46:56

Htmlparser2 by FB55 seems to be a good alternative.

離殇 2024-12-11 07:46:56

jsdom is too strict for any real screen scraping, whereas BeautifulSoup doesn't choke on bad markup.

node-soupselect is a port of Python's BeautifulSoup to Node.js, and it works beautifully.
