当存在未关闭的 html 标签时,是否可以解决 HtmlAgilityPack 中的问题?
嗯,我有以下问题。
我的 html 格式错误,在这种情况下,我在使用 html 敏捷包选择节点时遇到问题。
代码如下:
string strHtml = @"
<html>
<div>
<p><strong>Elem_A</strong>String_A1_2 String_A1_2</p>
<p><strong>Elem_B</strong>String_B1_2 String_B1_2</p>
</div>
<div>
<p><strong>Elem_A</strong>String_A2_2 <String_A2_2> asdas</p>
<p><strong>Elem_B</strong>String_B2_2 String_B2_2</p>
</div>
</html>";
HtmlAgilityPack.HtmlDocument objHtmlDocument = new HtmlAgilityPack.HtmlDocument();
objHtmlDocument.LoadHtml(strHtml);
HtmlAgilityPack.HtmlNodeCollection colnodePs = objHtmlDocument.DocumentNode.SelectNodes("//p");
List<string> lststrText = new List<string>();
foreach (HtmlAgilityPack.HtmlNode nodeP in colnodePs)
{
lststrText.Add(nodeP.InnerHtml);
}
问题是 String_A2_2 括在括号中。
因此 htmlagility pack 返回 5 个字符串,而不是 lststrText 中的 4 个。
那么是否可以让 htmlagility pack 返回元素 3 作为 “Elem_AString_A2_2
?
或者也许我可以做一些预处理来关闭元素?
lststrText 的当前内容是
lststrText[0] = "<strong>Elem_A</strong>String_A1_2 String_A1_2"
lststrText[1] = "<strong>Elem_B</strong>String_B1_2 String_B1_2"
lststrText[2] = ""
lststrText[3] = ""
lststrText[4] = "<strong>Elem_B</strong>String_B2_2 String_B2_2"
well i have the following problem.
the html i have is malformed and i have problems with selecting nodes using html agility pack when this is the case.
the code is below:
string strHtml = @"
<html>
<div>
<p><strong>Elem_A</strong>String_A1_2 String_A1_2</p>
<p><strong>Elem_B</strong>String_B1_2 String_B1_2</p>
</div>
<div>
<p><strong>Elem_A</strong>String_A2_2 <String_A2_2> asdas</p>
<p><strong>Elem_B</strong>String_B2_2 String_B2_2</p>
</div>
</html>";
HtmlAgilityPack.HtmlDocument objHtmlDocument = new HtmlAgilityPack.HtmlDocument();
objHtmlDocument.LoadHtml(strHtml);
HtmlAgilityPack.HtmlNodeCollection colnodePs = objHtmlDocument.DocumentNode.SelectNodes("//p");
List<string> lststrText = new List<string>();
foreach (HtmlAgilityPack.HtmlNode nodeP in colnodePs)
{
lststrText.Add(nodeP.InnerHtml);
}
the problem is that String_A2_2 is enclosed in brackets.
so htmlagility pack returns 5 strings instead of 4 in the lststrText.
so is it possible to let htmlagility pack return element 3 as"<strong>Elem_A</strong>String_A2_2 <String_A2_2> asdas"
?
or maybe i can do some preprocessing to close the element?
the current content of lststrText is
lststrText[0] = "<strong>Elem_A</strong>String_A1_2 String_A1_2"
lststrText[1] = "<strong>Elem_B</strong>String_B1_2 String_B1_2"
lststrText[2] = ""
lststrText[3] = ""
lststrText[4] = "<strong>Elem_B</strong>String_B2_2 String_B2_2"
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
大多数 html 解析器尝试构建一个有效的 DOM,这意味着不接受悬挂标签。它们将被转换,或以某种方式关闭。
如果仅选择节点对您来说很重要,并且速度和大量数据不是问题,那么您可以获取所有
改为使用正则表达式的标签:
此正则表达式假定
标签结构良好且封闭。
如果您要在程序中多次运行此正则表达式,您应该将其声明为:
[编辑:敏捷包更改]
如果您想使用 HtmlAgility 包,您可以修改 HtmlDocument.cs 中的 PushNodeEnd 函数:
其中AllowedTags 将是以下列表所有已知标签:b、p、br、span、div 等。
输出不是 100% 你想要的,但也许足够接近?
Most html parsers try to build a working DOM, meaning dangling tags are not accepted. They will be converted, or closed in some way.
If only selecting the nodes is of importance to you, and speed and huge amounts of data is not an issue, you could grab all your <p> tags with a regular expression instead:
This regular expression assumes the <p> tags are well formed and closed.
If you are to run this Regex a lot in your program you should declare it as:
[Edit: Agility pack change]
If you want to use HtmlAgility pack you can modify the PushNodeEnd function in HtmlDocument.cs:
where AllowedTags would be a list of all known tags: b, p, br, span, div, etc.
the output is not 100% what you want, but maybe close enough?
您可以使用 TidyNet 进行预处理/后处理你提到。您可以编辑您的答案来解释为什么这不适用于您的情况吗?
You could use TidyNet to do the pre/postprocessing you allude to. Can you edit your answer to explain why that wouldnt be applicable in your case?