How can I extract the innermost table from an HTML file with Html Agility Pack?

Posted on 2024-08-27 08:39:06

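I am parsing tabular data from an HTML file with Html Agility Pack, and for simple cases this already works.

The problem arises when the table I want to extract is the innermost one, or when I don't know how deeply it is nested. There can be any number of nested tables, and from them I want to extract the data of the table whose column names are NAME and ADDRESS.

For example: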

<table>
    <tr><td>PHONE NO.</td><td>OTHER INFO.</td></tr>
    <tr><td>
        <table>
            <tr><td>AMOUNT</td></tr>
            <tr><td>50000</td></tr>
            <tr><td>80000</td></tr>
        </table>
    </td></tr>
    <tr><td>
        <table>
            <tr><td>
                <table>
                    <tr><td>
                        <table>
                            <tr><td> NAME </td><td>ADDRESS</td></tr>
                            <tr><td> ABC  </td><td> kfks   </td></tr>
                            <tr><td> BCD  </td><td> fdsa   </td></tr>
                        </table>
                    </td></tr>
                </table>
            </td></tr>
        </table>
    </td></tr>
</table>

There are many tables, but I only want to extract the one whose column names are NAME and ADDRESS. How should I do that?



冷情妓 2024-09-03 08:39:06

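Load the document into an HtmlDocument, then use an XPath query to find a table that contains no other tables and whose first row has a td with the text "NAME". XPath string comparison is case-sensitive, so the literal must match the header exactly as it appears in the file.

The XPath implementation is the standard .NET one from System.Xml.XPath, so any documentation about using XPath with XmlDocument is applicable.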

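// using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load("file.html");
// not(descendant::table) keeps only innermost tables; the result is null if nothing matches.
HtmlNode el = doc.DocumentNode.SelectSingleNode("//table[not(descendant::table) and tr[1]/td['NAME' = normalize-space()]]");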
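Since the XPath engine is the standard .NET one, the same query can be tried without Html Agility Pack against a well-formed copy of the sample. The following is only a sketch under that assumption: it swaps in XmlDocument from the base class library, and the class name XPathWithXmlDocument, the SampleXml constant and the ReadRows method are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XPathWithXmlDocument
{
    // Well-formed reduction of the sample markup from the question (illustrative input).
    public const string SampleXml =
        "<table><tr><td>PHONE NO.</td><td>OTHER INFO.</td></tr>" +
        "<tr><td><table><tr><td>AMOUNT</td></tr><tr><td>50000</td></tr></table></td></tr>" +
        "<tr><td><table><tr><td><table>" +
        "<tr><td> NAME </td><td>ADDRESS</td></tr>" +
        "<tr><td> ABC  </td><td> kfks   </td></tr>" +
        "<tr><td> BCD  </td><td> fdsa   </td></tr>" +
        "</table></td></tr></table></td></tr></table>";

    // Same query as in the answer: innermost table whose first row has a NAME cell.
    public const string Query =
        "//table[not(descendant::table) and tr[1]/td['NAME' = normalize-space()]]";

    // Returns the trimmed cell texts of every row of the matching table,
    // or an empty list when no table matches.
    public static List<string[]> ReadRows(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        var rows = new List<string[]>();
        XmlNode table = doc.SelectSingleNode(Query);
        if (table == null) return rows;

        foreach (XmlNode tr in table.SelectNodes("tr"))
        {
            var cells = new List<string>();
            foreach (XmlNode td in tr.SelectNodes("td"))
                cells.Add(td.InnerText.Trim());
            rows.Add(cells.ToArray());
        }
        return rows;
    }
}
```

For the sample input, ReadRows(SampleXml) yields three rows: NAME/ADDRESS, ABC/kfks and BCD/fdsa. The AMOUNT table is also innermost, but it fails the header test and is skipped.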

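If the position of the NAME column is fixed, you can test that cell directly, for example 'NAME' = normalize-space(tr[1]/td[1]) for the sample above, where NAME is the first column.

To find a table by several column names, without the innermost-table condition:

HtmlNode el = doc.DocumentNode.SelectSingleNode("//table[tr[1]/td['NAME' = normalize-space()] and tr[1]/td['ADDRESS' = normalize-space()]]");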
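The two conditions can also be combined, so that only an innermost table with both headers matches, and the rows can then be read from the node that is returned. A minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the class InnermostTableReader and its ReadRows method are hypothetical helpers, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

public static class InnermostTableReader
{
    // Innermost table whose first row contains both a NAME and an ADDRESS cell.
    private const string Query =
        "//table[not(descendant::table)" +
        " and tr[1]/td[normalize-space()='NAME']" +
        " and tr[1]/td[normalize-space()='ADDRESS']]";

    // Returns the trimmed cell texts of each row, or an empty list if no table matches.
    public static List<string[]> ReadRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<string[]>();
        HtmlNode table = doc.DocumentNode.SelectSingleNode(Query);
        if (table == null) return rows;

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection trs = table.SelectNodes("tr");
        if (trs == null) return rows;

        foreach (HtmlNode tr in trs)
        {
            HtmlNodeCollection tds = tr.SelectNodes("td");
            if (tds == null) continue;

            var cells = new List<string>();
            foreach (HtmlNode td in tds)
                cells.Add(HtmlEntity.DeEntitize(td.InnerText).Trim());
            rows.Add(cells.ToArray());
        }
        return rows;
    }
}
```

The first array in the result holds the header cells, so callers that only need the data can skip it.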


ま柒月 2024-09-03 08:39:06

var table = doc.DocumentNode.SelectSingleNode("//table [not(descendant::table) and tr[1]/td[normalize-space()='ADDRESS'] ]");
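The question writes the column names in lower case while the sample headers are upper case, and XPath 1.0 comparisons are case-sensitive. One possible workaround is to upper-case the cell text with translate() before comparing. The sketch below uses XmlDocument from the base class library so that it runs without extra packages; the class CaseInsensitiveHeaderQuery and its BuildQuery and Find methods are hypothetical, and it assumes ASCII column names that contain no quote characters.

```csharp
using System;
using System.Text;
using System.Xml;

public static class CaseInsensitiveHeaderQuery
{
    private const string Lower = "abcdefghijklmnopqrstuvwxyz";
    private const string Upper = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Builds an XPath 1.0 query for an innermost table whose first row
    // contains every given column name, ignoring ASCII letter case.
    public static string BuildQuery(params string[] columns)
    {
        var sb = new StringBuilder("//table[not(descendant::table)");
        foreach (string column in columns)
        {
            sb.Append(" and tr[1]/td[translate(normalize-space(), '")
              .Append(Lower).Append("', '").Append(Upper).Append("')='")
              .Append(column.Trim().ToUpperInvariant()).Append("']");
        }
        return sb.Append("]").ToString();
    }

    // Evaluates the generated query against well-formed markup;
    // returns null when no table matches.
    public static XmlNode Find(string xml, params string[] columns)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.SelectSingleNode(BuildQuery(columns));
    }
}
```

Because Html Agility Pack relies on the same .NET XPath engine, the string from BuildQuery("name", "address") should also be usable with doc.DocumentNode.SelectSingleNode, though that has not been verified here.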
