HTML Agility Pack removes break tag close

Posted 2024-10-30 20:48:00

I am creating an HTML document using the HTML Agility Pack. I load a template file, then append content to it. All of this works, but when I view the output file, it has removed the closing slash from my <br/> tags so that they look like this: <br>. What is causing this?

Dim doc As New HtmlDocument()
doc.Load(Server.MapPath("Template.htm"))

Dim title As HtmlNode = doc.DocumentNode.SelectSingleNode("//title")

title.InnerHtml = title.InnerHtml & "CEU Classes"
Dim topContent As HtmlAgilityPack.HtmlNode = doc.GetElementbyId("topContent")

topContent.InnerHtml = html.ToString
doc.OptionWriteEmptyNodes = True
doc.Save(outputFileName, Encoding.UTF8)

More info:

It was also removing my closing image tags; after I added doc.OptionWriteEmptyNodes = True, it quit doing that.

Update

This is my code as it stands now, which still removes the closing BR tag:

Dim html As String = "Words<br/>more words"
Dim doc As New HtmlDocument()
Dim title As HtmlNode
Dim topContent As HtmlNode

HtmlNode.ElementsFlags("br") = HtmlElementFlag.Empty
doc.Load(Server.MapPath("Template.htm"))

title = doc.DocumentNode.SelectSingleNode("//title")
title.InnerHtml = title.InnerHtml & "CEU Classes"

topContent = doc.GetElementbyId("topContent")
topContent.InnerHtml = html.ToString

doc.OptionWriteEmptyNodes = True
doc.Save(outputFileName, Encoding.UTF8)

Update 2

I ended up just reading my template file in as a standard string, inserting the content, and then loading the HTML like this:

Dim TemplateHTML As String = File.ReadAllText(Server.MapPath("Template.htm"))

TemplateHTML = TemplateHTML.Insert(TemplateHTML.IndexOf("<div id=""topContent"">") + "<div id=""topContent"">".Length, _
                                   html.ToString)

doc.LoadHtml(TemplateHTML)


Comments (4)

垂暮老矣 2024-11-06 20:48:00

It happens because the Html Agility Pack handles the BR in a special way. It still supports old (but existing on the web today) HTML 3.2 syntax where the BR could be declared without a closing tag at all (browsers also still handle it gracefully by the way...).

To change this default behavior, you need to modify the HtmlNode.ElementsFlags property, like this:

Dim doc As New HtmlDocument()
HtmlNode.ElementsFlags("br") = HtmlElementFlag.Empty
doc.LoadHtml("<test>before<br/>after</test>")
doc.OptionWriteEmptyNodes = True   
doc.Save(Console.Out)

which will display:

<test>before<br />after</test>

作死小能手 2024-11-06 20:48:00

As per @Simon Mourier, the following C# code works in version 1.4:

var doc = new HtmlDocument();
HtmlNode.ElementsFlags["br"] = HtmlElementFlag.Empty;
doc.OptionWriteEmptyNodes = true;
doc.LoadHtml("Lorem ipsum dolor sit<br/>Lorem ipsum dolor sit");

var postParsed = doc.DocumentNode.WriteTo();

which gives the following string value for postParsed:

"Lorem ipsum dolor sit<br />Lorem ipsum dolor sit"

无所谓啦 2024-11-06 20:48:00

Seems this is a standard setting in Html Agility Pack. By default, it does not conform to XHTML and many tags are not closed.

There are two ways to change this. At the document level, you can do the following, which will turn on ALL closing tags. (This is my preferred method.)

HtmlDocument doc = new HtmlDocument();
doc.OptionWriteEmptyNodes = true;
doc.LoadHtml(content);

However, this may not be desirable. There is another way to do it at the node level.

if (HtmlNode.ElementsFlags.ContainsKey("img"))
{
    HtmlNode.ElementsFlags["img"] = HtmlElementFlag.Closed;
}
else
{
    HtmlNode.ElementsFlags.Add("img", HtmlElementFlag.Closed);
}

不念旧人 2024-11-06 20:48:00

I have encountered the same kind of problem, and I solved it by manually re-parsing the HTML chunk using a new HtmlDocument object with the correct settings.

The problem, as I see it, is that HtmlDocument has all those nice settings to let you close tags, etc., but when you select a node or do some other sort of operation with nodes and use their OuterHtml or InnerHtml, some of those closing tags are lost (probably because those properties do not use the same settings as the document itself, or maybe there is some other reason). So when you get that incorrect HTML string from InnerHtml or OuterHtml, you can just re-parse it with HtmlDocument again and use document.DocumentNode.InnerHtml to get the correct HTML string.
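
For what it's worth, here is a minimal sketch of that re-parsing step in C#. It assumes the HtmlAgilityPack package is referenced and reuses the ElementsFlags and OptionWriteEmptyNodes settings from the other answers. The class and method names (FragmentNormalizer, NormalizeFragment) and the sample markup are made up for illustration and are not part of the library, and the serialization goes through WriteTo() as in the earlier answer rather than InnerHtml. The exact output may vary between library versions, so treat it as a starting point rather than a verified fix.

```csharp
using System;
using HtmlAgilityPack;

public static class FragmentNormalizer
{
    // Hypothetical helper (not part of Html Agility Pack): re-parses an
    // HTML fragment with a fresh document that has the desired settings.
    public static string NormalizeFragment(string fragment)
    {
        // ElementsFlags is a static dictionary on HtmlNode, so this
        // setting applies to every document parsed in the process.
        HtmlNode.ElementsFlags["br"] = HtmlElementFlag.Empty;

        var doc = new HtmlDocument();
        doc.OptionWriteEmptyNodes = true; // ask for self-closed empty nodes
        doc.LoadHtml(fragment);

        // Serialize the whole re-parsed fragment back to a string.
        return doc.DocumentNode.WriteTo();
    }

    public static void Main()
    {
        // Sample markup, assumed for the example only.
        var source = new HtmlDocument();
        source.LoadHtml("<div id=\"topContent\">Words<br/>more words</div>");

        // Take the InnerHtml of a selected node and run it through the helper.
        HtmlNode node = source.GetElementbyId("topContent");
        Console.WriteLine(NormalizeFragment(node.InnerHtml));
    }
}
```

Because the flag is global, setting it once at application start-up is probably cleaner than setting it on every call; it is kept inside the helper here only so the sketch is self-contained.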
