HTML Agility Pack removes the close of break tags
I am creating an HTML document using HTML Agility Pack. I load a template file, then append content to it. All of this works, but when I view the output file, it has removed the closing slash from my <br/> tags so that they look like this: <br>. What is causing this?
Dim doc As New HtmlDocument()
doc.Load(Server.MapPath("Template.htm"))
Dim title As HtmlNode = doc.DocumentNode.SelectSingleNode("//title")
title.InnerHtml = title.InnerHtml & "CEU Classes"
Dim topContent As HtmlAgilityPack.HtmlNode = doc.GetElementbyId("topContent")
topContent.InnerHtml = html.ToString
doc.OptionWriteEmptyNodes = True
doc.Save(outputFileName, Encoding.UTF8)
More info:
It was also removing the close of my image tags; after I added doc.OptionWriteEmptyNodes = True, it quit doing that.
Update
This is my code as it stands now; it still removes the closing of the BR tag:
Dim html As String = "Words<br/>more words"
Dim doc As New HtmlDocument()
Dim title As HtmlNode
Dim topContent As HtmlNode
HtmlNode.ElementsFlags("br") = HtmlElementFlag.Empty
doc.Load(Server.MapPath("Template.htm"))
title = doc.DocumentNode.SelectSingleNode("//title")
title.InnerHtml = title.InnerHtml & "CEU Classes"
topContent = doc.GetElementbyId("topContent")
topContent.InnerHtml = html.ToString
doc.OptionWriteEmptyNodes = True
doc.Save(outputFileName, Encoding.UTF8)
Update 2
I ended up just reading my template file in as a plain string, inserting the content, and then loading the HTML like this:
Dim TemplateHTML As String = File.ReadAllText(Server.MapPath("Template.htm"))
TemplateHTML = TemplateHTML.Insert(TemplateHTML.IndexOf("<div id=""topContent"">") + "<div id=""topContent"">".Length, _
html.ToString)
doc.LoadHtml(TemplateHTML)
Comments (4)
It happens because the Html Agility Pack handles BR in a special way. It still supports the old HTML 3.2 syntax (still found on the web today) in which BR could be declared with no closing tag at all (browsers also still handle it gracefully, by the way).
To change this default behavior, you need to modify the static HtmlNode.ElementsFlags collection before loading the document, flagging BR with HtmlElementFlag.Empty, and set OptionWriteEmptyNodes so that the tag is written in its closed form.
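The snippet that accompanied this answer is not preserved here, so the following is only a minimal C# sketch of the approach it describes, not the answerer's exact code. It assumes the HtmlAgilityPack NuGet package is referenced; the class name `BrFlagDemo` and the sample markup are made up for illustration.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack package is referenced

class BrFlagDemo
{
    static void Main()
    {
        // ElementsFlags is static, so this changes how BR is treated for every
        // document in the process. Set it before loading the HTML.
        HtmlNode.ElementsFlags["br"] = HtmlElementFlag.Empty;

        var doc = new HtmlDocument();
        doc.LoadHtml("Words<br/>more words");

        // Ask the writer to emit empty elements in closed form.
        doc.OptionWriteEmptyNodes = true;

        // Serialize the whole document to the console to inspect the BR tag.
        doc.Save(Console.Out);
    }
}
```

Because the flag is global, it is probably best set once at application start-up rather than around individual documents.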
As per @Simon Mourier's answer, the same approach works from C# in version 1.4: with the BR flag and OptionWriteEmptyNodes set, the serialized string (postParsed) keeps the closed BR tag.
This seems to be the default behavior of Html Agility Pack. By default, its output does not conform to XHTML, and many tags are left unclosed.
There are two ways to change this. At the document level, you can set an option that turns on closing for ALL tags (this is my preferred method).
However, this may not be desirable, so there is also a way to do it at the node level, for individual element names.
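The two snippets from this answer are not preserved here. As one plausible reading, and purely as an assumption, the document-level switch is taken to be `OptionWriteEmptyNodes` and the node-level one the static `HtmlNode.ElementsFlags` entry for a single element name. The sketch below prints both results so they can be compared; the class name, helper, and sample markup are hypothetical.

```csharp
using System;
using System.IO;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack package is referenced

class ClosingTagOptions
{
    // Hypothetical helper: serialize a document to a string via Save(TextWriter).
    static string Serialize(HtmlDocument doc)
    {
        using (var writer = new StringWriter())
        {
            doc.Save(writer);
            return writer.ToString();
        }
    }

    static void Main()
    {
        const string html = "<p>Words<br/>more words<img src=\"a.png\"/></p>";

        // Document level (assumed to be what the answer meant):
        // one option that applies to every empty element in this document.
        var docLevel = new HtmlDocument();
        docLevel.OptionWriteEmptyNodes = true;
        docLevel.LoadHtml(html);
        Console.WriteLine(Serialize(docLevel));

        // Node level (also an assumption): change the flags for one element
        // name. The collection is static, so it affects documents loaded later.
        HtmlNode.ElementsFlags["br"] = HtmlElementFlag.Empty;
        var nodeLevel = new HtmlDocument();
        nodeLevel.OptionWriteEmptyNodes = true;
        nodeLevel.LoadHtml(html);
        Console.WriteLine(Serialize(nodeLevel));
    }
}
```

Comparing the two printed lines shows whether the per-element flag is needed on top of the document option for the version in use.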
I have encountered the same kind of problem, and I solved it by manually re-parsing the HTML chunk with a new HtmlDocument object that has the correct settings.
The problem, as I see it, is that HtmlDocument has all those nice settings to let you close tags and so on, but when you select a node or do some other sort of operation with nodes and then use their OuterHtml or InnerHtml, some of those closing tags are lost (probably because those properties do not use the same settings as the document itself, or maybe there is some other reason). So when you get an incorrect HTML string from InnerHtml or OuterHtml, you can just re-parse it with a new HtmlDocument and use document.DocumentNode.InnerHtml to get the correct HTML string.
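A minimal sketch of that re-parsing workaround follows. The helper name `Reserialize` is hypothetical, and the sketch swaps in `HtmlDocument.Save` with a `StringWriter` instead of reading `InnerHtml`, on the assumption that going through the document writer is the surest way to have the document's options applied.

```csharp
using System;
using System.IO;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack package is referenced

static class HtmlFragments
{
    // Hypothetical helper: parse a fragment into a fresh document that has the
    // desired output settings, then write it back out as a string.
    public static string Reserialize(string fragment)
    {
        var doc = new HtmlDocument();
        doc.OptionWriteEmptyNodes = true; // write empty elements in closed form
        doc.LoadHtml(fragment);

        using (var writer = new StringWriter())
        {
            doc.Save(writer); // serializes the whole document with its options
            return writer.ToString();
        }
    }
}

class Program
{
    static void Main()
    {
        // e.g. a string previously obtained from some node's InnerHtml
        Console.WriteLine(HtmlFragments.Reserialize("Words<br>more words"));
    }
}
```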