How to parse HTML pages using Node.js
I need to parse a large number of HTML pages on the server side.
We all agree that regular expressions are not the way to go here.
It seems to me that JavaScript is the native way of parsing an HTML page, but that assumption relies on the server-side code having all the DOM capabilities JavaScript has inside a browser.
Does Node.js have that ability built in?
Is there a better approach to parsing HTML on the server side?
You can use the npm modules jsdom and htmlparser to create and parse a DOM in Node.js.
Other options include:
Out of all these options, I prefer the Node.js option, because it uses the standard W3C DOM accessor methods and I can reuse code on both the client and the server. I wish BeautifulSoup's methods were more similar to the W3C DOM, and I think converting your HTML to XHTML to write XSLT is just plain sadistic.
Use Cheerio. It isn't as strict as jsdom and is optimized for scraping. As a bonus, it uses the jQuery selectors you already know.
November 2020 Update
I searched for the top Node.js HTML parser libraries.
Because my use cases didn't require a library with many features, I could focus on stability and performance.
By stability I mean that the library has been used by the community long enough for bugs to be found, that it is still maintained, and that open issues get closed.
It's hard to predict the future of an open source library, but I did a small summary based on the top 10 libraries in openbase.
I divided them into 2 groups according to the last commit (within each group, the order is by GitHub stars):
Last commit within the last 6 months:
Last commit more than 6 months ago:
I picked Node-html-parser because it seems quite fast and is very active at the moment.
(*) Openbase provides much more information about each library, such as the number of contributors (with +3 commits), weekly downloads, monthly commits, version, etc.
(**) The table above is a snapshot taken at a specific time and date. I would check the reference again, first looking at the level of recent activity and then diving into the smaller details.
Use htmlparser2; it's much faster and pretty straightforward. See this usage example:
https://www.npmjs.org/package/htmlparser2#usage
And the live demo here:
http://demos.forbeslindesay.co.uk/htmlparser2/
Htmlparser2 by FB55 seems to be a good alternative.
jsdom is too strict for any real screen scraping, whereas BeautifulSoup doesn't choke on bad markup.
node-soupselect is a port of Python's BeautifulSoup to Node.js, and it works beautifully.