htmlparser2: a fast and forgiving HTML/XML/RSS parser

Published 2021-07-31 22:36:55

htmlparser2 is a fast and forgiving HTML/XML/RSS parser. The parser can handle streams and provides a callback interface.

Installation

npm install htmlparser2

Usage

const parser = new htmlparser.Parser(handler /*: Object */, options /*?: Object */);

Example

var htmlparser = require("htmlparser2");
var parser = new htmlparser.Parser({
  onopentag: function(name, attribs){
    if(name === "script" && attribs.type === "text/javascript"){
      console.log("JS! Hooray!");
    }
  },
  ontext: function(text){
    console.log("-->", text);
  },
  onclosetag: function(tagname){
    if(tagname === "script"){
      console.log("That's it?!");
    }
  }
}, {decodeEntities: true});
parser.write("Xyz <script type='text/javascript'>var foo = '<<bar>>';</ script>");
parser.end();

Output:

--> Xyz 
JS! Hooray!
--> var foo = '<<bar>>';
That's it?!

Usage with streams

While the Parser interface closely resembles Node.js streams, it's not a 100% match. Use the WritableStream interface to process a streaming input:

const fs = require("fs");
const { WritableStream } = require("htmlparser2/lib/WritableStream");
const parserStream = new WritableStream({
  ontext(text) {
    console.log("Streaming:", text);
  },
});

const htmlStream = fs.createReadStream("./my-file.html");
htmlStream.pipe(parserStream).on("finish", () => console.log("done"));

Getting a DOM

The DomHandler produces a DOM (document object model) that can be manipulated using the DomUtils helper.

const htmlparser2 = require("htmlparser2");

// parseDocument takes the markup to parse as its first argument.
const dom = htmlparser2.parseDocument("<p>Hello, world!</p>");

The DomHandler, while still bundled with this module, was moved to its own module. Have a look at that for further information.

Parsing RSS/RDF/Atom Feeds

const feed = htmlparser2.parseFeed(content, options);

Note: While the provided feed handler works for most feeds, you might want to use danmactough/node-feedparser, which is much better tested and actively maintained.

Performance

After having some artificial benchmarks for some time, @AndreasMadsen published his htmlparser-benchmark, which benchmarks HTML parsers based on real-world websites.

At the time of writing, the latest versions of all supported parsers show the following performance characteristics on Travis CI (please note that Travis doesn't guarantee equal conditions for all tests):

gumbo-parser   : 34.9208 ms/file ± 21.4238
html-parser    : 24.8224 ms/file ± 15.8703
html5          : 419.597 ms/file ± 264.265
htmlparser     : 60.0722 ms/file ± 384.844
htmlparser2-dom: 12.0749 ms/file ± 6.49474
htmlparser2    : 7.49130 ms/file ± 5.74368
hubbub         : 30.4980 ms/file ± 16.4682
libxmljs       : 14.1338 ms/file ± 18.6541
parse5         : 22.0439 ms/file ± 15.3743
sax            : 49.6513 ms/file ± 26.6032

Events

The handler may define the following keys. Note: only functions are allowed as values, otherwise the parser will fail:

  • onopentag(name /*: string */, attributes /*: { [attributeName: string]: string } */)
  • onopentagname(name /*: string */)
  • onattribute(name /*: string */, value /*: string */)
  • ontext(text /*: string */)
  • onclosetag(name /*: string */)
  • onprocessinginstruction(name /*: string */, data /*: string */)
  • oncomment(data /*: string */)
  • oncommentend()
  • oncdatastart()
  • oncdataend()
  • onerror(error /*: Error */)
  • onreset()
  • onend()

Methods

write (alias: parseChunk)

Parses a chunk of data and calls the corresponding callbacks.

end (alias: done)

Parses the end of the buffer and clears the stack, calls onend.

reset

Resets buffer & stack, calls onreset.

parseComplete

Resets the parser, parses the data & calls end.

Option: xmlMode

Controls whether special tags (<script> and <style>) get special treatment and whether "empty" tags (e.g. <br>) can have children. If false, the content of special tags is treated as text only.

For feeds and other XML content (documents that don't consist of HTML), set this to true. Default: false.

Option: decodeEntities

If set to true, entities within the document will be decoded. Defaults to true.

Option: lowerCaseTags

If set to true, all tags will be lowercased. If xmlMode is disabled, this defaults to true.

Option: lowerCaseAttributeNames

If set to true, all attribute names will be lowercased. This has a noticeable impact on speed, so it defaults to false.

Option: recognizeCDATA

If set to true, CDATA sections will be recognized as text even if the xmlMode option is not enabled. NOTE: If xmlMode is set to true then CDATA sections will always be recognized as text.

Option: recognizeSelfClosing

If set to true, self-closing tags will trigger the onclosetag event even if xmlMode is not set to true. NOTE: If xmlMode is set to true then self-closing tags will always be recognized.

Project repository: https://github.com/fb55/htmlparser2

About the author

JSmiles
