Xerces-C: Parsing JavaScript inside HTML
I want to parse websites for their meta tags, and I use Xerces-C for this.
shared_ptr<SAX2XMLReader> parser(XMLReaderFactory::createXMLReader());
//Create and set callback handler with the given callback functions
Handler handler(startElement,endElement,characters);
parser->setContentHandler(&handler);
parser->setErrorHandler(&handler);
//Parse the file with the given callback handler
parser->parse(filename.c_str());
Some websites contain JavaScript. Inside the script tags, the JavaScript uses the operator && for logical AND.
Xerces-C interprets this as an entity reference (for example &nbsp;) and throws an exception, because it doesn't know the entity reference &&.
Is there a way to read this correctly as text?
Or, if not, is there a way to ignore all characters inside script tags? I don't need them anyway; I just want to parse the meta tags.
Basically, HTML is not necessarily well-formed XML, but you can, for instance, preprocess it with tidy before feeding it to an XML parser.
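If only the script contents get in the way, a cruder alternative to tidy is to cut the script elements out of the document text before handing it to the parser. The following is a minimal sketch of that idea using only the C++ standard library. The helper names `toLowerCopy` and `stripScriptElements` are hypothetical (they are not part of Xerces-C or tidy), the matching is a naive substring search rather than a real HTML tokenizer, and the result is still only parseable if the rest of the page happens to be well-formed XML.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Returns a lower-cased copy so tag names can be matched case-insensitively.
// Byte-wise lower-casing keeps the length, so indices stay aligned with the input.
static std::string toLowerCopy(const std::string& s) {
    std::string out(s);
    std::transform(out.begin(), out.end(), out.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return out;
}

// Hypothetical helper (not part of Xerces-C): removes every
// <script ...>...</script> element, including its content, from `html`.
// Assumption: a naive search for "<script" and "</script" is good enough;
// an unterminated script element causes the rest of the input to be dropped.
std::string stripScriptElements(const std::string& html) {
    const std::string lower = toLowerCopy(html);
    const std::string openTag = "<script";
    const std::string closeTag = "</script";

    std::string result;
    std::size_t pos = 0;
    while (true) {
        const std::size_t open = lower.find(openTag, pos);
        if (open == std::string::npos) {
            // No further script element: keep the remainder unchanged.
            result.append(html, pos, std::string::npos);
            break;
        }
        // Keep everything before the opening tag.
        result.append(html, pos, open - pos);

        // Skip ahead to the '>' that ends the matching closing tag.
        const std::size_t close = lower.find(closeTag, open);
        if (close == std::string::npos) break;
        const std::size_t end = lower.find('>', close);
        if (end == std::string::npos) break;
        pos = end + 1;
    }
    return result;
}
```

The cleaned string could then be parsed from memory instead of from a file name, for example by wrapping it in a Xerces-C `MemBufInputSource` and passing that to `parser->parse(...)`. For arbitrary real-world pages, running tidy first remains the more robust route, since unclosed tags and unquoted attributes would still trip up the XML parser.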