HTML Agility Pack: error parsing and returning XElement
I can parse the document and generate an output. However, the output cannot be parsed into an XElement because of one p tag; everything else within the string is parsed correctly.
My input:
    var input = "<p> Not sure why is is null for some wierd reason!<br><br>I have implemented the auto save feature, but does it really work after 100s?<br></p> <p> <i>Autosave?? </i> </p> <p>we are talking...</p><p></p><hr><p><br class=\"GENTICS_ephemera\"></p>";
My code:
    public static XElement CleanupHtml(string input)
    {
        HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
        htmlDoc.OptionOutputAsXml = true;
        //htmlDoc.OptionWriteEmptyNodes = true;
        //htmlDoc.OptionAutoCloseOnEnd = true;
        htmlDoc.OptionFixNestedTags = true;
        htmlDoc.LoadHtml(input);

        // ParseErrors is an ArrayList containing any errors from the Load statement
        if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Count() > 0)
        {
        }
        else
        {
            if (htmlDoc.DocumentNode != null)
            {
                var ndoc = new HtmlDocument(); // HTML doc instance
                HtmlNode p = ndoc.CreateElement("body");
                p.InnerHtml = htmlDoc.DocumentNode.InnerHtml;
                var result = p.OuterHtml.Replace("<br>", "<br/>");
                result = result.Replace("<br class=\"special_class\">", "<br/>");
                result = result.Replace("<hr>", "<hr/>");
                return XElement.Parse(result, LoadOptions.PreserveWhitespace);
            }
        }
        return new XElement("body");
    }
My output:
    <body>
    <p> Not sure why is is null for some wierd reason chappy!
    <br/>
    <br/>I have implemented the auto save feature, but does it really work after 100s?
    <br/>
    </p>
    <p>
    <i>Autosave?? </i>
    </p>
    <p>we are talking...</p>
    **<p>**
    <hr/>
    <p>
    <br/>
    </p>
    </body>
The bold p tag is the one that did not output correctly... Is there a way around this? Am I doing something wrong with the code?
What you are trying to do is basically transform an HTML input into an XML output.
The Html Agility Pack can do that when you use the OptionOutputAsXml option, but in this case you should not use the InnerHtml property. Instead, let the Html Agility Pack do the groundwork for you with one of HtmlDocument's Save methods. A generic function that converts an HTML text to an XElement instance only needs to load the HTML, save the document to a writer, and load the written text as an XElement.
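A minimal sketch of such a conversion function, reconstructed from the description above rather than quoted from the answer. The class name `HtmlXmlConverter` and the method name `HtmlToXElement` are hypothetical. It assumes the HtmlAgilityPack package is referenced.

```csharp
using System;
using System.IO;
using System.Xml.Linq;
using HtmlAgilityPack;

public static class HtmlXmlConverter // hypothetical class name
{
    // Converts an HTML fragment to an XElement by letting the
    // Html Agility Pack serialize the parsed document as XML.
    public static XElement HtmlToXElement(string html)
    {
        if (html == null)
            throw new ArgumentNullException(nameof(html));

        var doc = new HtmlDocument();
        doc.OptionOutputAsXml = true; // write XML instead of HTML on Save
        doc.LoadHtml(html);

        using (var writer = new StringWriter())
        {
            // Save emits the whole document, including an XML declaration.
            doc.Save(writer);
            return XElement.Parse(writer.ToString());
        }
    }
}
```

For example, `HtmlXmlConverter.HtmlToXElement(input)` could replace the string manipulation in the question. No `Replace` calls on `<br>` or `<hr>` should be needed, since the serializer writes those elements in XML form itself.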
As you can see, you don't have to do much work yourself! Note that since your original input text has no root element, the Html Agility Pack automatically adds an enclosing SPAN to ensure the output is valid XML. In your case, you also want to process some tags, which can be done on the DOM before saving the document.
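A sketch of how the extra processing might look for the example in the question, again reconstructed rather than quoted. The class name `HtmlCleaner` is hypothetical, and removing the `class` attribute from every `br` element is an assumption about what the two `Replace` calls in the question were meant to achieve.

```csharp
using System;
using System.IO;
using System.Xml.Linq;
using HtmlAgilityPack;

public static class HtmlCleaner // hypothetical class name
{
    public static XElement CleanupHtml(string input)
    {
        if (input == null)
            throw new ArgumentNullException(nameof(input));

        var doc = new HtmlDocument();
        doc.OptionOutputAsXml = true;
        doc.LoadHtml(input);

        // Extra processing through the DOM instead of string replacement:
        // strip the class attribute from every <br> that has one.
        // SelectNodes returns null when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//br[@class]");
        if (nodes != null)
        {
            foreach (HtmlNode node in nodes)
            {
                node.Attributes.Remove("class");
            }
        }

        using (var writer = new StringWriter())
        {
            doc.Save(writer);
            return XElement.Parse(writer.ToString());
        }
    }
}
```

Working on the node collection keeps the cleanup independent of how the tag happens to be written in the input (attribute order, quoting, whitespace), which a literal `Replace` cannot guarantee.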
As you can see, you should not use raw string functions, but instead use the Html Agility Pack DOM functions (SelectNodes, Add, Remove, etc.).
If you check the documentation comments for OptionFixNestedTags, you will see that the option defines whether LI, TR, TH and TD tags must be partially fixed when nesting errors are detected. So I don't think it will help you with unclosed HTML p tags. According to an old SO question, "C# library to clean up html", HTML Tidy might work for this purpose, though.