Parsing nodes in a context-sensitive way with Html Agility Pack

Posted 2024-10-31 12:46:31

<div class="mvb"><b>Date 1</b></div>
<div class="mxb"><b>Header 1</b></div>
<div>
   inner hmtl 1
</div>

<div class="mvb"><b>Date 2</b></div>
<div class="mxb"><b>Header 2</b></div>
<div>
inner html 2
</div>

I would like to parse the inner html between the tags in such a way that I can

    * associate the inner html 1 with header 1 and date 1
    * associate the inner html 2 with header 2 and date 2

In other words, at the time I parse inner html 1, I would like to know that the HTML nodes containing "Date 1" and "Header 1" have already been parsed (but the nodes containing "Date 2" and "Header 2" have not).

If I were doing this via regular text parsing, I would read one line at a time and record the last "Date" and "Header" that I had parsed. Then, when it came time to parse inner html 1, I could refer to the last parsed "Date" and "Header" objects to associate them with it.

Comments (2)

牛↙奶布丁 2024-11-07 12:46:31

Using the Html Agility Pack, you can leverage the power of XPath and forget about that verbose XLinq crap :-). The XPath position() function is context sensitive: it is evaluated relative to the node you call it from. Here is some sample code:

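    HtmlDocument doc = new HtmlDocument();
    doc.Load("your html file");

    // Select all DIVs without a CLASS attribute defined (the content DIVs).
    // Note: SelectNodes returns null, not an empty collection, when nothing matches.
    foreach (HtmlNode div in doc.DocumentNode.SelectNodes("//div[not(@class)]"))
    {
        Console.WriteLine("div=" + div.InnerText.Trim());
        // On the preceding-sibling axis, position() counts backwards from the current DIV:
        // 1 is the nearest preceding DIV (the header), 2 is the one before it (the date).
        Console.WriteLine("  header=" + div.SelectSingleNode("preceding-sibling::div[position()=1]/b").InnerText);
        Console.WriteLine("  date=" + div.SelectSingleNode("preceding-sibling::div[position()=2]/b").InnerText);
    }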

That will print this with your sample:

div=inner hmtl 1
  header=Header 1
  date=Date 1
div=inner html 2
  header=Header 2
  date=Date 2
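
For readers who want to try this without a file on disk, here is a minimal self-contained sketch of the same idea. It assumes the sample markup from the question is inlined as a string, uses `[1]` and `[2]` as shorthand for `[position()=1]` and `[position()=2]`, and collects the results into tuples rather than printing them directly. The `entries` list and its field names are illustrative choices, not part of Html Agility Pack.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// The sample markup from the question, inlined so the sketch needs no file.
const string html = @"
<div class=""mvb""><b>Date 1</b></div>
<div class=""mxb""><b>Header 1</b></div>
<div>
   inner hmtl 1
</div>

<div class=""mvb""><b>Date 2</b></div>
<div class=""mxb""><b>Header 2</b></div>
<div>
inner html 2
</div>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// Illustrative container: one (date, header, body) triple per content DIV.
var entries = new List<(string Date, string Header, string Body)>();

// SelectNodes returns null when nothing matches, so guard before iterating.
HtmlNodeCollection contentDivs = doc.DocumentNode.SelectNodes("//div[not(@class)]");
if (contentDivs != null)
{
    foreach (HtmlNode div in contentDivs)
    {
        // On preceding-sibling, [1] is the nearest earlier DIV and [2] the one before it.
        HtmlNode header = div.SelectSingleNode("preceding-sibling::div[1]/b");
        HtmlNode date = div.SelectSingleNode("preceding-sibling::div[2]/b");

        // SelectSingleNode also returns null on no match; fall back to an empty string.
        entries.Add((date?.InnerText ?? "", header?.InnerText ?? "", div.InnerText.Trim()));
    }
}

foreach (var e in entries)
    Console.WriteLine($"{e.Date} | {e.Header} | {e.Body}");
```

The null checks matter mainly for real pages, where a content DIV may appear without the two DIVs this layout expects in front of it.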

国粹 2024-11-07 12:46:31

Well, you can do this in several ways...

For example, if the HTML you want to parse is the one you wrote in your question, an easy way could be:

  1. Store all dates in a HtmlNodeCollection
  2. Store all headers in a HtmlNodeCollection
  3. Store all inner texts in another HtmlNodeCollection

If everything is okay and the HTML has that layout, you will have the same number of elements in all three collections.

Then you can easily do:

    for (int i = 0; i < innerTexts.Count; i++) {
        // Get Date, Header and Inner Text at position i
    }

The following should work:

    using System.Linq;
    using HtmlAgilityPack;

    var document = new HtmlWeb().Load("http://www.url.com"); // Or load it from a Stream, local file, etc.

    // The <b> elements holding each date and each header
    var dateNodes = document.DocumentNode.SelectNodes("//div[@class='mvb']/b");
    var headerNodes = document.DocumentNode.SelectNodes("//div[@class='mxb']/b");

    // Content DIVs: those whose nearest preceding element sibling is a header DIV.
    // PreviousSibling is avoided here because it can return the whitespace text node
    // between two tags (or null for a first child) instead of the previous element.
    var innerTextNodes = (from node in document.DocumentNode.SelectNodes("//div")
                          let previous = node.SelectSingleNode("preceding-sibling::*[1]")
                          where previous != null && previous.Name == "div" && previous.GetAttributeValue("class", "") == "mxb"
                          select node).ToList();

    // Check here that the three collections have the same number of elements

    for (int i = 0; i < dateNodes.Count; i++) {
        var date = dateNodes[i].InnerText;
        var header = headerNodes[i].InnerText;
        var innerText = innerTextNodes[i].InnerText;

        // Now you have the set you want: the Date, Header and Inner Text
    }

This is one way of doing it.
Of course, you should guard against exceptions (check that the .SelectNodes(..) calls are not returning null), check for errors in the LINQ expression when building innerTextNodes, and refactor the for (...) loop, maybe into a method that receives an HtmlNode and returns its InnerText property.
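
As a rough illustration of that refactoring, the sketch below wraps the three lookups in one method and fails loudly when the counts disagree. `ExtractEntries` and `Nodes` are hypothetical helper names, not Html Agility Pack APIs, and the content DIVs are selected with `following-sibling::div[1]` in place of the LINQ query, on the assumption that each content DIV is the first DIV sibling after its header.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper: pairs the i-th date, header and content DIV,
// and throws if the page does not have the assumed layout.
static List<(string Date, string Header, string Body)> ExtractEntries(HtmlDocument document)
{
    HtmlNode root = document.DocumentNode;

    // SelectNodes returns null when nothing matches; normalise that to an empty list.
    List<HtmlNode> Nodes(string xpath) =>
        root.SelectNodes(xpath)?.ToList() ?? new List<HtmlNode>();

    var dates = Nodes("//div[@class='mvb']/b");
    var headers = Nodes("//div[@class='mxb']/b");
    // Assumption: the content DIV is the first DIV sibling after each header DIV.
    var bodies = Nodes("//div[@class='mxb']/following-sibling::div[1]");

    if (dates.Count != headers.Count || headers.Count != bodies.Count)
        throw new InvalidOperationException(
            $"Layout mismatch: {dates.Count} dates, {headers.Count} headers, {bodies.Count} bodies.");

    return dates
        .Select((d, i) => (Date: d.InnerText.Trim(),
                           Header: headers[i].InnerText.Trim(),
                           Body: bodies[i].InnerText.Trim()))
        .ToList();
}

// Usage with a one-entry fragment in the same layout as the question's sample.
var sample = new HtmlDocument();
sample.LoadHtml("<div class='mvb'><b>Date 1</b></div><div class='mxb'><b>Header 1</b></div><div>inner html 1</div>");

foreach (var entry in ExtractEntries(sample))
    Console.WriteLine($"{entry.Date} | {entry.Header} | {entry.Body}");
```

Throwing on a count mismatch is one possible policy; skipping incomplete groups would be another, depending on how messy the real pages are.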

Keep in mind that, in the HTML you posted, the only way to know which <div> tag contains the inner text is to assume it is the one right after the <div> tag that contains the header. That's why I used the LINQ expression.

Another way of identifying it would be if the <div> had some particular attribute (like class="___") or similar, or if it contained some tags and not just text. There is no magic when parsing HTML :)
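
Classifying each <div> by its class attribute also allows a single pass that mirrors the "remember the last Date and Header" approach described in the question. The following is only a sketch under stated assumptions: the DIVs are flat (no DIV nested inside a content DIV), the class names are exactly `mvb` and `mxb`, and the variable names are illustrative.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

var page = new HtmlDocument();
page.LoadHtml(
    "<div class=\"mvb\"><b>Date 1</b></div>\n" +
    "<div class=\"mxb\"><b>Header 1</b></div>\n" +
    "<div>\n   inner html 1\n</div>\n" +
    "<div class=\"mvb\"><b>Date 2</b></div>\n" +
    "<div class=\"mxb\"><b>Header 2</b></div>\n" +
    "<div>\ninner html 2\n</div>");

// State carried along while walking the DIVs in document order.
string lastDate = "";
string lastHeader = "";
var grouped = new List<(string Date, string Header, string Body)>();

// SelectNodes yields nodes in document order, or null when nothing matches.
HtmlNodeCollection allDivs = page.DocumentNode.SelectNodes("//div");
if (allDivs != null)
{
    foreach (HtmlNode div in allDivs)
    {
        switch (div.GetAttributeValue("class", ""))
        {
            case "mvb": // date DIV: remember it for the content that follows
                lastDate = div.InnerText.Trim();
                break;
            case "mxb": // header DIV: remember it as well
                lastHeader = div.InnerText.Trim();
                break;
            default:    // any other DIV is treated as content for the last date and header seen
                grouped.Add((lastDate, lastHeader, div.InnerText.Trim()));
                break;
        }
    }
}

foreach (var g in grouped)
    Console.WriteLine($"{g.Date} | {g.Header} | {g.Body}");
```

At the moment a content DIV is handled, only the date and header that precede it have been seen, which is the ordering guarantee the question asks for.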

Edit:
I have not tested this code. Test it yourself and let me know if it works.
