Isn't this a suitable case for an HTML parser?
I have to deal with malformed HTML, including HTML tags inside HTML attributes:
<p class="<sometag attr="something"></sometag>">
<a href="<someothertag></someothertag">Link</a>
</p>
I tried using HtmlAgilityPack to parse out the content, but when you load the above markup into an HtmlDocument, OuterHtml outputs:
<p class="<sometag attr=" something"="">">
<a href="<someothertag></someothertag">Link</a>
</p>
The p tag becomes malformed, and the someothertag inside the href attribute of the a tag is not recognized as a node (although it's really text inside an attribute, I would like it to be recognized as a tag).
Is there something else I can use to help me parse bad Html like this?
Comments (2)
It's not valid HTML, so I don't think you can rely on an HTML parser to parse it.
You may be asking a lot of a parser since this is probably a rare case. You may need to solve this on your own.
The major problem I see is that there are sets of double quotes within the attribute value. Is it guaranteed that the markup will always have a matching closing character for every opening? In other words, for every < will there be a > and for every opening " or ', a matching closing mark?
If that's the case, my suggestion would be to take the source for an HTML parser such as Html Agility Pack and add some functionality to the attribute parsing. Use a stack: push the character that opens the attribute value, then read until you find another opening or closing character. If it's an opening character, push it; if it's a closing character, pop it. The attribute value ends when the stack is empty.
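The stack idea can be sketched outside any particular parser. The snippet below is only an illustration in Python, not Html Agility Pack code; the function name read_attribute_value and its quoting rules are assumptions made for this sketch, and it relies on the balance guarantee asked about above.

```python
QUOTES = {'"', "'"}


def read_attribute_value(markup: str, start: int) -> tuple[str, int]:
    """Read a quoted attribute value whose opening quote is at markup[start].

    Returns (value, index_just_after_the_closing_quote).
    Hypothetical helper: assumes every opening character is matched.
    """
    if markup[start] not in QUOTES:
        raise ValueError("start must point at an opening quote")

    stack = [markup[start]]  # the quote that opens the attribute value
    i = start + 1
    while i < len(markup):
        ch = markup[i]
        top = stack[-1]
        if ch in QUOTES:
            if ch == top:
                stack.pop()        # closes the quote on top of the stack
            elif top == "<":
                stack.append(ch)   # opens a quoted value inside a nested tag
            # otherwise: a different quote inside a quoted string is literal
        elif ch == "<":
            stack.append(ch)       # opens a nested tag
        elif ch == ">" and top == "<":
            stack.pop()            # closes the nested tag
        if not stack:
            return markup[start + 1:i], i + 1
        i += 1
    raise ValueError("unbalanced markup: attribute value never closes")


markup = '<p class="<sometag attr="something"></sometag>">'
value, end = read_attribute_value(markup, markup.index('"'))
print(value)  # <sometag attr="something"></sometag>
```

The second line of the sample markup in the question (the a tag) has an inner closing tag that is missing its >, so it breaks the balance assumption and this sketch raises ValueError for it rather than guessing.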
Alternatively, you could add detection for the less-than and greater-than characters in the attribute value and not recognize the end of the attribute value until all the contained tags are closed.
One other possible solution is to modify the source markup before passing it to the parser, changing the illegal characters in the attribute values to character entities (the ampersand-semicolon form, such as &lt; and &gt;). Unfortunately, this would require doing some preliminary parsing on your part.
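A rough sketch of that preliminary pass follows, written in Python purely for illustration. The function names find_value_end and escape_attribute_values are made up for this example. It handles double-quoted values only and finds the end of a value by counting angle brackets, so it assumes the nested tags are balanced.

```python
def find_value_end(markup: str, start: int) -> int:
    """Return the index of the quote that closes the value starting at
    `start`, or -1 if none is found. The closing quote is taken to be
    the first double quote seen while no nested tag is open."""
    depth = 0
    for j in range(start, len(markup)):
        ch = markup[j]
        if ch == "<":
            depth += 1
        elif ch == ">" and depth > 0:
            depth -= 1
        elif ch == '"' and depth == 0:
            return j
    return -1


def escape_attribute_values(markup: str) -> str:
    """Replace <, > and " inside double-quoted attribute values with
    entities. Ampersands are left alone to avoid double-escaping."""
    out = []
    i = 0
    in_tag = False
    while i < len(markup):
        if in_tag and markup.startswith('="', i):
            end = find_value_end(markup, i + 2)
            if end == -1:
                out.append(markup[i:])  # unbalanced: copy the rest untouched
                break
            value = markup[i + 2:end]
            value = (value.replace("<", "&lt;")
                          .replace(">", "&gt;")
                          .replace('"', "&quot;"))
            out.append('="' + value + '"')
            i = end + 1
            continue
        ch = markup[i]
        if ch == "<":
            in_tag = True
        elif ch == ">":
            in_tag = False
        out.append(ch)
        i += 1
    return "".join(out)


fixed = escape_attribute_values('<p class="<sometag attr="something"></sometag>">')
print(fixed)
# <p class="&lt;sometag attr=&quot;something&quot;&gt;&lt;/sometag&gt;">
```

The cleaned string can then be handed to the HTML parser, which should see an ordinary attribute value. Getting the inner markup recognized as nodes would still take a second step: decode the attribute value and parse it on its own.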