Is it possible to work around problems in HtmlAgilityPack when there are unclosed HTML tags?

Posted 2024-08-15 03:22:44

I have the following problem.
The HTML I have is malformed, and I have trouble selecting nodes with HtmlAgilityPack when that is the case.
The code is below:

string strHtml = @"
<html>
  <div>
    <p><strong>Elem_A</strong>String_A1_2 String_A1_2</p>
    <p><strong>Elem_B</strong>String_B1_2 String_B1_2</p>
  </div>
  <div>
    <p><strong>Elem_A</strong>String_A2_2 <String_A2_2> asdas</p>
    <p><strong>Elem_B</strong>String_B2_2 String_B2_2</p>
  </div>
</html>";
HtmlAgilityPack.HtmlDocument objHtmlDocument = new HtmlAgilityPack.HtmlDocument();
objHtmlDocument.LoadHtml(strHtml);
HtmlAgilityPack.HtmlNodeCollection colnodePs = objHtmlDocument.DocumentNode.SelectNodes("//p");
List<string> lststrText = new List<string>();
foreach (HtmlAgilityPack.HtmlNode nodeP in colnodePs)
{
  lststrText.Add(nodeP.InnerHtml);
}

The problem is that String_A2_2 is enclosed in angle brackets, so the parser treats it as a tag.
As a result, HtmlAgilityPack returns 5 strings in lststrText instead of 4.
Is it possible to make HtmlAgilityPack return element 3 as
"<strong>Elem_A</strong>String_A2_2 <String_A2_2> asdas"?
Or can I do some preprocessing to close the element?
The current content of lststrText is:

lststrText[0] = "<strong>Elem_A</strong>String_A1_2 String_A1_2"  
lststrText[1] = "<strong>Elem_B</strong>String_B1_2 String_B1_2"  
lststrText[2] = ""  
lststrText[3] = ""  
lststrText[4] = "<strong>Elem_B</strong>String_B2_2 String_B2_2"

Comments (2)

农村范ル 2024-08-22 03:22:44

Most HTML parsers try to build a working DOM, which means dangling tags are not accepted: they are converted or closed in some way.

If selecting the nodes is all that matters to you, and speed and large amounts of data are not an issue, you could grab all your <p> tags with a regular expression instead:

Regex reMatchP = new Regex(@"<(p)>.*?</\1>");
foreach (Match m in reMatchP.Matches(strHtml))
{
   Console.WriteLine(m.Value);
}

This regular expression assumes the <p> tags are well formed and closed.

If you run this regex many times in your program, declare it once as a compiled static field:

static Regex reMatchP = new Regex(@"<(p)>.*?</\1>", RegexOptions.Compiled);
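
As a rough sketch of the regex approach above (not a general HTML parser), the pattern can capture only the inner markup of each <p>, which is what the question collects through InnerHtml. The class and method names below are made up for illustration, and the pattern assumes <p> tags without attributes that are properly closed and not nested:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ParagraphExtractor
{
    // Compiled once and reused. Singleline lets '.' span line breaks inside a <p>.
    // Assumption: <p> has no attributes and is always closed.
    private static readonly Regex ReMatchP = new Regex(
        @"<p>(?<inner>.*?)</p>",
        RegexOptions.Compiled | RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the text between each <p> and its closing </p>, untouched.
    public static List<string> InnerHtmlOfParagraphs(string html)
    {
        var result = new List<string>();
        foreach (Match m in ReMatchP.Matches(html))
        {
            result.Add(m.Groups["inner"].Value);
        }
        return result;
    }
}

static class Program
{
    static void Main()
    {
        string strHtml = @"
<html>
  <div>
    <p><strong>Elem_A</strong>String_A1_2 String_A1_2</p>
    <p><strong>Elem_B</strong>String_B1_2 String_B1_2</p>
  </div>
  <div>
    <p><strong>Elem_A</strong>String_A2_2 <String_A2_2> asdas</p>
    <p><strong>Elem_B</strong>String_B2_2 String_B2_2</p>
  </div>
</html>";

        foreach (string s in ParagraphExtractor.InnerHtmlOfParagraphs(strHtml))
        {
            Console.WriteLine(s);
        }
    }
}
```

On the sample from the question this should yield four strings, with the stray <String_A2_2> left in place inside the third one, because no DOM is built and nothing tries to close it.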

[Edit: Agility pack change]

If you want to keep using HtmlAgilityPack, you can modify the PushNodeEnd function in HtmlDocument.cs in the library source:

if (HtmlNode.IsCDataElement(CurrentNodeName()))
{
   _state = ParseState.PcData;
   return true;
}

// new code start
if ( !AllowedTags.Contains(_currentnode.Name) )
{
    close = true;
}
// new code end

where AllowedTags would be a list of all known tags: b, p, br, span, div, etc.

The output is not 100% what you want, but maybe close enough?

<strong>Elem_A</strong>String_A1_2 String_A1_2
<strong>Elem_B</strong>String_B1_2 String_B1_2
<strong>Elem_A</strong>String_A2_2 <ignorestring_a2_2></ignorestring_a2_2> asdas
<strong>Elem_B</strong>String_B2_2 String_B2_2
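
If patching the library is not an option, the same AllowedTags idea could be applied as a preprocessing step before LoadHtml, which is what the question asks about. The following is only a sketch under assumptions: UnknownTagEscaper, its tag list, and the tag pattern are hypothetical and not part of HtmlAgilityPack, and the regex is not a full HTML tokenizer (comments, scripts, and attribute values containing angle brackets are not handled).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class UnknownTagEscaper
{
    // Hypothetical whitelist; extend it with every tag your documents really use.
    private static readonly HashSet<string> AllowedTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "html", "head", "body", "div", "p", "span", "strong",
            "b", "i", "em", "br", "a", "ul", "ol", "li"
        };

    // Matches an opening or closing tag and captures its name.
    private static readonly Regex ReTag = new Regex(
        @"</?(?<name>[A-Za-z][A-Za-z0-9_:\-]*)[^<>]*>",
        RegexOptions.Compiled);

    // Known tags pass through unchanged; anything else is turned into
    // plain text by escaping its angle brackets.
    public static string EscapeUnknownTags(string html)
    {
        return ReTag.Replace(html, m =>
            AllowedTags.Contains(m.Groups["name"].Value)
                ? m.Value
                : m.Value.Replace("<", "&lt;").Replace(">", "&gt;"));
    }
}

static class Program
{
    static void Main()
    {
        string line = "<p><strong>Elem_A</strong>String_A2_2 <String_A2_2> asdas</p>";

        // Prints: <p><strong>Elem_A</strong>String_A2_2 &lt;String_A2_2&gt; asdas</p>
        Console.WriteLine(UnknownTagEscaper.EscapeUnknownTags(line));

        // The escaped string can then be passed to objHtmlDocument.LoadHtml(...)
        // exactly as in the question.
    }
}
```

Since the stray tag is now ordinary text, the parser has nothing to close. The trade-off is that the text comes back with &lt; and &gt; entities, so if the literal brackets are needed it would have to be decoded afterwards, for example with System.Net.WebUtility.HtmlDecode.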

瀟灑尐姊 2024-08-22 03:22:44

You could use TidyNet to do the pre/postprocessing you allude to. Can you edit your answer to explain why that wouldn't be applicable in your case?
