htmlparser2: a fast and forgiving HTML/XML/RSS parser
htmlparser2 is a fast and forgiving HTML/XML/RSS parser. The parser can handle streams and provides a callback interface.
Installation
npm install htmlparser2
Usage
const parser = new htmlparser.Parser(handler /*: Object */, options /*?: Object */);
Example
var htmlparser = require("htmlparser2");
var parser = new htmlparser.Parser({
    onopentag: function(name, attribs){
        if(name === "script" && attribs.type === "text/javascript"){
            console.log("JS! Hooray!");
        }
    },
    ontext: function(text){
        console.log("-->", text);
    },
    onclosetag: function(tagname){
        if(tagname === "script"){
            console.log("That's it?!");
        }
    }
}, {decodeEntities: true});
parser.write("Xyz <script type='text/javascript'>var foo = '<<bar>>';</ script>");
parser.end();
Output:
--> Xyz
JS! Hooray!
--> var foo = '<<bar>>';
That's it?!
Usage with streams
While the Parser interface closely resembles Node.js streams, it's not a 100% match. Use the WritableStream interface to process a streaming input:
const fs = require("fs");
const { WritableStream } = require("htmlparser2/lib/WritableStream");
const parserStream = new WritableStream({
    ontext(text) {
        console.log("Streaming:", text);
    },
});
// Pipe a file through the parser; "finish" fires once all input is parsed.
const htmlStream = fs.createReadStream("./my-file.html");
htmlStream.pipe(parserStream).on("finish", () => console.log("done"));
Getting a DOM
The DomHandler produces a DOM (document object model) that can be manipulated using the DomUtils helper.
const htmlparser2 = require("htmlparser2");
// htmlString is a placeholder for the markup you want to parse.
const dom = htmlparser2.parseDocument(htmlString);
The DomHandler, while still bundled with this module, was moved to its own module. Have a look at that for further information.
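To give a feel for the resulting tree, here is a minimal sketch that walks it by hand. It assumes that nodes carry `name`, `attribs`, and `children` properties, as the nodes produced by DomHandler do. `findByTagName` and `linksIn` are hypothetical helpers written for illustration only; in practice the DomUtils helper covers this kind of traversal.

```javascript
// Hypothetical helper: depth-first search for elements with a given tag name.
// Assumes DomHandler-style nodes: elements have `name`, `attribs` and
// `children`; text nodes have neither `name` nor `children`.
function findByTagName(node, tagName, found = []) {
  if (node.name === tagName) found.push(node);
  for (const child of node.children || []) {
    findByTagName(child, tagName, found);
  }
  return found;
}

// Example wiring: list the href of every <a> element in a document.
function linksIn(html) {
  // Required here so findByTagName stays free of dependencies.
  const { parseDocument } = require("htmlparser2");
  const dom = parseDocument(html);
  return findByTagName(dom, "a").map((a) => a.attribs.href);
}
```

Keeping the traversal separate from the parsing step means the same walker can be reused on any subtree, not just the document root.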
Parsing RSS/RDF/Atom Feeds
const feed = htmlparser2.parseFeed(content, options);
Note: While the provided feed handler works for most feeds, you might want to use danmactough/node-feedparser, which is much better tested and actively maintained.
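As a rough sketch of what can be done with the parsed feed, the helper below turns it into one line per entry. It assumes the returned object exposes an `items` array whose entries may carry `title` and `link` fields, and that nothing usable is returned when no feed is recognised. `summarizeFeed` and `summarizeFeedSource` are hypothetical names, not part of htmlparser2.

```javascript
// Hypothetical helper: one summary line per feed entry.
// Assumes `feed.items` is an array of objects with optional `title` and
// `link` fields; missing fields fall back to placeholders.
function summarizeFeed(feed) {
  if (!feed) return []; // nothing recognisable was parsed
  const items = feed.items || [];
  return items.map(
    (item) => `${item.title || "(untitled)"} -> ${item.link || "(no link)"}`
  );
}

// Example wiring: parse raw feed text, then summarise it.
function summarizeFeedSource(content) {
  // Required here so summarizeFeed stays free of dependencies.
  const htmlparser2 = require("htmlparser2");
  // xmlMode is set explicitly because feeds are XML, not HTML.
  const feed = htmlparser2.parseFeed(content, { xmlMode: true });
  return summarizeFeed(feed);
}
```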
Performance
After having some artificial benchmarks for some time, @AndreasMadsen published his htmlparser-benchmark, which benchmarks HTML parsers based on real-world websites.
At the time of writing, the latest versions of all supported parsers show the following performance characteristics on Travis CI (please note that Travis doesn't guarantee equal conditions for all tests):
gumbo-parser : 34.9208 ms/file ± 21.4238
html-parser : 24.8224 ms/file ± 15.8703
html5 : 419.597 ms/file ± 264.265
htmlparser : 60.0722 ms/file ± 384.844
htmlparser2-dom: 12.0749 ms/file ± 6.49474
htmlparser2 : 7.49130 ms/file ± 5.74368
hubbub : 30.4980 ms/file ± 16.4682
libxmljs : 14.1338 ms/file ± 18.6541
parse5 : 22.0439 ms/file ± 15.3743
sax : 49.6513 ms/file ± 26.6032
Events
The following keys are available for the handler. Note: only functions are valid values; otherwise the parser will fail:
onopentag(name /*: string */, attributes /*: { [attributeName: string]: string } */)
onopentagname(name /*: string */)
onattribute(name /*: string */, value /*: string */)
ontext(text /*: string */)
onclosetag(name /*: string */)
onprocessinginstruction(name /*: string */, data /*: string */)
oncomment(data /*: string */)
oncommentend()
oncdatastart()
oncdataend()
onerror(error /*: Error */)
onreset()
onend()
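As a rough sketch of how several of these callbacks can work together, the handler below records an indented outline of the tags, text, and comments it sees. `createOutlineHandler` and `outlineOf` are hypothetical names chosen for this example and are not part of htmlparser2.

```javascript
// Hypothetical helper: builds a handler that records an indented outline.
function createOutlineHandler() {
  const lines = [];
  let depth = 0;
  const indent = () => "  ".repeat(depth);

  const handler = {
    onopentag(name, attributes) {
      const attrs = Object.keys(attributes)
        .map((key) => ` ${key}="${attributes[key]}"`)
        .join("");
      lines.push(`${indent()}<${name}${attrs}>`);
      depth += 1;
    },
    ontext(text) {
      // Skip whitespace-only text so the outline stays compact.
      if (text.trim() !== "") lines.push(`${indent()}"${text.trim()}"`);
    },
    oncomment(data) {
      lines.push(`${indent()}<!--${data}-->`);
    },
    onclosetag() {
      depth = Math.max(0, depth - 1);
    },
    onerror(error) {
      lines.push(`error: ${error.message}`);
    },
  };

  return { handler, lines };
}

// Example wiring: run the handler over a complete document.
function outlineOf(html) {
  // Required here so createOutlineHandler stays free of dependencies.
  const { Parser } = require("htmlparser2");
  const { handler, lines } = createOutlineHandler();
  const parser = new Parser(handler);
  parser.write(html);
  parser.end();
  return lines;
}
```

Every value in the handler object is a function, in line with the note above; callbacks that are not needed are simply left out.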
Methods
write (alias: parseChunk)
Parses a chunk of data and calls the corresponding callbacks.
end (alias: done)
Parses the end of the buffer and clears the stack, calls onend.
reset
Resets buffer & stack, calls onreset.
parseComplete
Resets the parser, parses the data & calls end.
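The methods above can be combined as in the following sketch, which feeds a document to a parser in several chunks. `feedChunks` and `textFromChunks` are hypothetical helpers written for this example; `feedChunks` relies only on the `write`, `end`, and `reset` methods described above.

```javascript
// Hypothetical helper: pushes chunks through any object exposing the
// write/end/reset methods listed above (for example an htmlparser2 Parser).
function feedChunks(parser, chunks) {
  parser.reset(); // start from a clean buffer and stack
  for (const chunk of chunks) {
    parser.write(chunk); // callbacks fire as each chunk is parsed
  }
  parser.end(); // parse what is left in the buffer and call onend
}

// Example wiring with a real parser: gather all text from chunked input.
function textFromChunks(chunks) {
  // Required here so feedChunks stays free of dependencies.
  const { Parser } = require("htmlparser2");
  const pieces = [];
  const parser = new Parser({
    ontext(text) {
      pieces.push(text);
    },
  });
  feedChunks(parser, chunks);
  return pieces.join("");
}
```

Text that spans a chunk boundary may arrive in more than one `ontext` call, which is why the sketch collects the pieces and joins them at the end.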
Option: xmlMode
Indicates whether special tags (<script> and <style>) should get special treatment and if "empty" tags (e.g. <br>) can have children. If false, the content of special tags will be text only.
For feeds and other XML content (documents that don't consist of HTML), set this to true. Default: false.
Option: decodeEntities
If set to true, entities within the document will be decoded. Defaults to true.
Option: lowerCaseTags
If set to true, all tags will be lowercased. If xmlMode is disabled, this defaults to true.
Option: lowerCaseAttributeNames
If set to true, all attribute names will be lowercased. This has a noticeable impact on speed, so it defaults to false.
Option: recognizeCDATA
If set to true, CDATA sections will be recognized as text even if the xmlMode option is not enabled. NOTE: If xmlMode is set to true then CDATA sections will always be recognized as text.
Option: recognizeSelfClosing
If set to true, self-closing tags will trigger the onclosetag event even if xmlMode is not set to true. NOTE: If xmlMode is set to true then self-closing tags will always be recognized.
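The options described above can be bundled per kind of input, as in this minimal sketch. `optionsFor` and `createParser` are hypothetical helpers, and the particular combinations are one possible choice rather than recommended settings.

```javascript
// Hypothetical helper: picks parser options for a kind of input.
function optionsFor(kind) {
  if (kind === "xml") {
    // Feeds and other non-HTML documents. With xmlMode on, CDATA sections
    // and self-closing tags are always recognized, so those flags are omitted.
    return { xmlMode: true, decodeEntities: true };
  }
  // HTML: keep xmlMode off but opt in to CDATA and self-closing handling.
  return {
    xmlMode: false,
    decodeEntities: true,
    lowerCaseTags: true,
    recognizeCDATA: true,
    recognizeSelfClosing: true,
  };
}

// Example wiring: create a parser configured for the given kind of input.
function createParser(handler, kind) {
  // Required here so optionsFor stays free of dependencies.
  const { Parser } = require("htmlparser2");
  return new Parser(handler, optionsFor(kind));
}
```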