HtmlAgilityPack algorithm question

Posted on 2024-11-19 07:07:16

I’m using HtmlAgilityPack to obtain some Html from a web site.

Here is the received Html:

<table class="table">
<tr>
    <td>
        <table class="innertable">...</table>
    </td>
</tr>
<tr>
    <td colspan="2"><strong>Contact</strong></td>
</tr>
<tr>
    <td colspan="2">John Doe</td>
</tr>
<tr>
    <td colspan="2">Jane Doe</td>
</tr>
<tr>
    <td colspan="2">&nbsp;</td>
</tr>
<tr>
    <td><strong>Units</strong></td>
    <td>32</td>
</tr>
<tr>
    <td><strong>Year</strong></td>
    <td>1998</td>
</tr>
</table>

The Context:

I’m using the following code to get the first table:

var table = document.DocumentNode.SelectNodes("//table[@class='table']").FirstOrDefault();

I’m using the following code to get the inner table:

var innerTable = table.SelectNodes(".//table[@class='innertable']").FirstOrDefault();

So far so good!

I need to get some information from the first table and some from the inner table. Since I begin with the information from the first table, I need to skip the first row (which holds the inner table), so I do the following:

var tableCells = table.SelectNodes("tr[position() > 1]/td");

Since I now have all the cells from the first table excluding the inner table, I start doing the following:

string contact1 = HttpUtility.HtmlDecode(tableCells[1].InnerHtml);
string contact2 = HttpUtility.HtmlDecode(tableCells[2].InnerHtml);

string units = HttpUtility.HtmlDecode(tableCells[5].InnerHtml);
string years = HttpUtility.HtmlDecode(tableCells[7].InnerHtml);

The problem:

I get the values I want by hardcoding the indexes into tableCells[], which assumes the layout never changes. Unfortunately, it does change.

In some cases there is no “Jane Doe” row (the Html above shows the case with both rows), so I may have either one or two contacts.

Because of this, I can’t hardcode the indexes since I might end up having the wrong data in the wrong variables.

So I need to change my approach...

Does anyone know how I could improve my algorithm so that it handles either one or two contacts, ideally without hardcoded indexes?

Thanks in advance!

vlince

感受沵的脚步 2024-11-26 07:07:16

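There is never one unique solution to this kind of problem. Here is an XPath expression that seems to do the job, though:

        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.Load(yourHtmlFile);

        // Dump the parsed document so the markup can be checked by eye.
        doc.Save(Console.Out);

        // Select the direct text of each cell in the rows that follow the "Contact" header row;
        // the <strong> labels are skipped because their text is not a direct child of the cell.
        // HtmlAgilityPack leaves entities undecoded in text nodes, so the spacer cell is matched as the literal "&nbsp;".
        foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//tr[td/strong/text() = 'Contact']/following-sibling::tr/td/text()[. != '&nbsp;']"))
        {
            Console.WriteLine(node.OuterHtml);
        }

It will display this: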

John Doe
Jane Doe
32
1998
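
The XPath above returns a flat list, so the values still have to be told apart. One possible way to avoid positional indexes altogether is to walk the rows and let the `<strong>` labels decide where each value goes. The following is only a sketch under assumptions: `OuterTableData` and `OuterTableParser` are hypothetical names invented for this example (not part of HtmlAgilityPack), and it assumes the layout shown in the question, where a lone `<strong>` cell is a section header, a `<strong>` cell followed by a second cell is a label/value pair, and the `&nbsp;` row ends the contact list.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical container for the parsed values (not an HtmlAgilityPack type).
public sealed class OuterTableData
{
    public List<string> Contacts { get; } = new List<string>();
    public Dictionary<string, string> Fields { get; } = new Dictionary<string, string>();
}

// Hypothetical helper that reads the outer table by label instead of by index.
public static class OuterTableParser
{
    public static OuterTableData Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var data = new OuterTableData();

        // Skip the first row, which only wraps the inner table.
        var rows = doc.DocumentNode.SelectNodes("//table[@class='table']/tr[position() > 1]");
        if (rows == null) return data; // SelectNodes returns null when nothing matches

        bool inContacts = false;
        foreach (HtmlNode row in rows)
        {
            var cells = row.SelectNodes("td");
            if (cells == null) continue;

            HtmlNode label = cells[0].SelectSingleNode("strong");
            // Decode entities such as &nbsp;, then trim (Trim also removes U+00A0).
            string firstText = HtmlEntity.DeEntitize(cells[0].InnerText).Trim();

            if (label != null && cells.Count == 1)
            {
                // A lone <strong> cell is a section header such as "Contact".
                inContacts = firstText == "Contact";
            }
            else if (label != null)
            {
                // Label/value row such as Units | 32.
                inContacts = false;
                data.Fields[firstText] = HtmlEntity.DeEntitize(cells[1].InnerText).Trim();
            }
            else if (firstText.Length == 0)
            {
                // The &nbsp; spacer row ends the contact list.
                inContacts = false;
            }
            else if (inContacts)
            {
                // Any plain row under the "Contact" header is one contact.
                data.Contacts.Add(firstText);
            }
        }
        return data;
    }
}
```

With this shape, `OuterTableParser.Parse(html).Contacts` holds one or two names depending on the page, and `Fields["Units"]` and `Fields["Year"]` are looked up by label, so a missing “Jane Doe” row no longer shifts anything.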
