Xerces-C: Parsing JavaScript in HTML

Posted on 2024-12-21 19:35:36


I want to parse websites for their meta tags, and I am using Xerces-C for this.

shared_ptr<SAX2XMLReader> parser(XMLReaderFactory::createXMLReader());

//Create and set callback handler with the given callback functions
Handler handler(startElement,endElement,characters);
parser->setContentHandler(&handler);
parser->setErrorHandler(&handler);

//Parse the file with the given callback handler
parser->parse(filename.c_str());

Some websites contain JavaScript. Inside the script tags, the JavaScript uses the && operator for logical AND.

Xerces-C interprets this as the start of an entity reference (for example &nbsp;) and throws an exception, because && is not a valid entity reference.

Is there a way to read this correctly as text?
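
One idea I am considering is to escape the stray ampersands before the parser sees the text. The following is only a minimal sketch using the standard library, not anything provided by Xerces-C: the helper names escapeBareAmpersands and looksLikeReference are made up for illustration. It assumes the page has already been loaded into a std::string, and it leaves anything shaped like &name; or &#123; untouched. It would not help with a bare < inside a script, with script code that happens to look like a reference (for example a&b;), or with HTML entities such as &nbsp; that are not declared for an XML parser.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Returns true if the '&' at text[pos] starts something shaped like an XML
// reference: &name; or &#123; or &#x1F; (shape check only, no validation).
static bool looksLikeReference(const std::string& text, std::size_t pos) {
    std::size_t i = pos + 1;
    if (i < text.size() && text[i] == '#') {
        ++i;
        if (i < text.size() && (text[i] == 'x' || text[i] == 'X')) ++i;
    }
    const std::size_t start = i;
    while (i < text.size() &&
           std::isalnum(static_cast<unsigned char>(text[i]))) {
        ++i;
    }
    // Need at least one name/digit character followed by ';'.
    return i > start && i < text.size() && text[i] == ';';
}

// Hypothetical helper: replaces every '&' that does not start a
// reference-shaped sequence with "&amp;".
std::string escapeBareAmpersands(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == '&' && !looksLikeReference(text, i)) {
            out += "&amp;";
        } else {
            out += text[i];
        }
    }
    return out;
}
```

With this, the characters callback would receive the script body with && intact, since the parser expands each &amp; back to a single &.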

If not, is there a way to simply ignore all characters inside script tags? I don't need them anyway; I just want to parse the meta tags.
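
To make the second option concrete, here is a rough sketch of what I mean by ignoring the script content: cut every script element out of the page text before parsing. This is plain standard-library C++ and an assumption on my side, not a Xerces-C feature; stripScriptElements and findNoCase are hypothetical names. It assumes every script element has an explicit closing tag, and it matches on the prefix "<script", so a tag name that merely starts with "script" would be removed as well.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Case-insensitive find; `needle` must already be lowercase and
// `from` must not exceed text.size().
static std::size_t findNoCase(const std::string& text,
                              const std::string& needle,
                              std::size_t from) {
    auto it = std::search(
        text.begin() + static_cast<std::ptrdiff_t>(from), text.end(),
        needle.begin(), needle.end(),
        [](char a, char b) {
            return std::tolower(static_cast<unsigned char>(a)) == b;
        });
    return it == text.end()
               ? std::string::npos
               : static_cast<std::size_t>(it - text.begin());
}

// Hypothetical helper: returns a copy of `html` with every
// <script ...> ... </script> element removed.
std::string stripScriptElements(const std::string& html) {
    const std::string open = "<script";
    const std::string close = "</script";
    std::string out;
    std::size_t pos = 0;
    while (pos < html.size()) {
        const std::size_t start = findNoCase(html, open, pos);
        if (start == std::string::npos) {
            out.append(html, pos, std::string::npos);  // no more scripts
            break;
        }
        out.append(html, pos, start - pos);  // keep text before the script
        const std::size_t end = findNoCase(html, close, start);
        if (end == std::string::npos) break;  // unterminated: drop the rest
        const std::size_t gt = html.find('>', end);
        if (gt == std::string::npos) break;   // malformed closing tag
        pos = gt + 1;                          // resume after </script>
    }
    return out;
}
```

The cleaned string would then have to be parsed from memory rather than by filename; as far as I can tell, Xerces-C offers MemBufInputSource for that. The rest of the page would still need to be well-formed XML for the parser to accept it.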
