Parsing an HTML document with HtmlAgilityPack

Posted on 2024-12-09 20:18:37

I'm trying to parse the following html snippet via HtmlAgilityPack:

<td bgcolor="silver" width="50%" valign="top">
    <table bgcolor="silver" style="font-size: 90%" border="0" cellpadding="2" cellspacing="0"
           width="100%">
        <tr bgcolor="#003366">
            <td>
                <font color="white">Info
            </td>
            <td>
                <font color="white">
                <center>Price
            </td>
            <td align="right">
                <font color="white">Hourly
            </td>
        </tr>
        <tr>
            <td>
                <a href='test1.cgi?type=1'>Bookbags</a>
            </td>
            <td>
                $156.42
            </td>
            <td align="right">
                <font color="green">0.11%</font>
            </td>
        </tr>
        <tr>
            <td>
                <a href='test2.cgi?type=2'>Jeans</a>
            </td>
            <td>
                $235.92
            </td>
            <td align="right">
                <font color="red">100%</font>
            </td>
        </tr>
    </table>
</td>

My code looks something like this:

private void ParseHtml(HtmlDocument htmlDoc)
{
    var ItemsAndPrices = new Dictionary<string, int>();
    var findItemPrices = from links in htmlDoc.DocumentNode.Descendants()
                         // Compare attribute values; GetAttributeValue returns "" when the attribute is missing.
                         where links.Name.Equals("table") &&
                               links.GetAttributeValue("width", "") == "100%" &&
                               links.GetAttributeValue("bgcolor", "") == "silver"
                         select new
                         {
                             // select item and price
                         };
}

In this instance, I would like to select the items, Jeans and Bookbags, along with their associated prices, and store them in a dictionary.

E.g., Jeans at price $235.92

Does anyone know how to do this properly with HtmlAgilityPack and LINQ?

Comments (3)

腻橙味 2024-12-16 20:18:37

Try this:
Regex solution:

    // requires: using System.Collections.Generic; using System.Text.RegularExpressions;
    static Dictionary<string, string> GetProduct(string name, string html)
    {
        var output = new Dictionary<string, string>();
        // Optional line breaks followed by the rest of one line; used twice to skip the </td> and <td> lines.
        string clfr = @"[\r\n]*[^\r\n]+";
        // Group 1: the link target. Group 2: the line after the next <td>, i.e. the price.
        string pattern = string.Format(
            @"href='([^']+)'>{0}</a>.*{1}{1}[\r\n]*([^\$][^\r\n]+)",
            Regex.Escape(name), clfr); // escape the name so it is matched literally
        Match products = Regex.Match(html, pattern);
        if (products.Success)
        {
            output.Add("Name", name);
            output.Add("Link", products.Groups[1].Value);
            output.Add("Price", products.Groups[2].Value.Trim());
        }
        return output; // empty when the product was not found
    }

Then:

    // html holds the raw markup of the page as a string.
    var productNames = new[] { "Jeans", "Bookbags" };
    foreach (var productName in productNames)
    {
        var product = GetProduct(productName, html);
        if (product.Count != 0)
        {
            Console.WriteLine("{0} at price {1}", product["Name"], product["Price"]);
        }
    }

Output:

Jeans at price $235.92
Bookbags at price $156.42

Note:
The dictionary value can't be an int, because $235.92 and $156.42 are not valid ints. To get a valid int, you can remove the dollar sign and the decimal point and call int.Parse(), which gives the price in cents (e.g. 23592).

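The note above leaves the conversion itself open. A minimal sketch of one way to do it, assuming prices always look like `$235.92` (a dollar sign, a `.` decimal separator, optional `,` thousands separators). The names `PriceParsing`, `ParsePrice`, and `ToCents` are hypothetical helpers, not part of the original answer or of any library:

```csharp
using System;
using System.Globalization;

public static class PriceParsing
{
    // Turns text such as " $235.92 " into a decimal.
    // Assumption: the only currency symbol is a leading '$'.
    public static decimal ParsePrice(string text)
    {
        string digits = text.Trim().TrimStart('$');
        // InvariantCulture uses '.' as the decimal separator regardless of
        // the machine's regional settings.
        return decimal.Parse(digits, NumberStyles.Number, CultureInfo.InvariantCulture);
    }

    // Converts a price to whole cents if an int really is required.
    public static int ToCents(decimal price)
    {
        return (int)decimal.Round(price * 100m);
    }
}
```

With these helpers, `PriceParsing.ToCents(PriceParsing.ParsePrice("$235.92"))` evaluates to 23592. Keeping the price as a `decimal` is usually the simpler choice, since it preserves the fractional part without rescaling.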
独自←快乐 2024-12-16 20:18:37

Here's what I came up with:

        var ItemsAndPrices = new Dictionary<string, string>();

        // Every <tr> in the document except the first one (the header row).
        var dataRows = htmlDoc.DocumentNode.Descendants("tr").Skip(1);

        foreach (var row in dataRows)
        {
            // Trimmed text of each cell: [0] = item, [1] = price, [2] = hourly change.
            var values = row.Descendants("td")
                            .Select(td => td.InnerText.Trim())
                            .ToList();

            ItemsAndPrices.Add(values[0], values[1]);
        }

The only thing I changed was your Dictionary<string, int>, which is now a Dictionary<string, string>, because $156.42 isn't an int.

小帐篷 2024-12-16 20:18:37

Assuming that there could be other rows and you don't specifically want only Bookbags and Jeans, I'd do it like this:

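// requires: using System.Globalization; using System.Linq; using HtmlAgilityPack;
var table = htmlDoc.DocumentNode
    .SelectSingleNode("//table[@bgcolor='silver' and @width='100%']");
// Assumption: prices are formatted as US dollars, so parse them with the en-US culture
// rather than whatever culture the current thread happens to use.
var usCulture = CultureInfo.GetCultureInfo("en-US");
var query =
    from row in table.Elements("tr").Skip(1) // skip the header row
    let columns = row.Elements("td").Take(2) // take only the first two columns
        .Select(col => col.InnerText.Trim())
        .ToList()
    select new
    {
        Info = columns[0],
        Price = decimal.Parse(columns[1], NumberStyles.Currency, usCulture),
    };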
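
The query above stops short of the dictionary the question asks for. A rough sketch of how the pieces could fit together, assuming the HtmlAgilityPack NuGet package is referenced. `PriceTable`, `Parse`, and `SampleHtml` are hypothetical names, and the sample markup is a simplified, well-formed stand-in for the table in the question rather than the original snippet. Unlike the query above, it strips the `$` by hand and parses with the invariant culture, so it does not depend on which cultures are installed:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

public static class PriceTable
{
    // Hypothetical, simplified sample with the same shape as the question's table.
    public const string SampleHtml = @"
<table bgcolor='silver' width='100%'>
  <tr><td>Info</td><td>Price</td><td>Hourly</td></tr>
  <tr><td><a href='test1.cgi?type=1'>Bookbags</a></td><td>$156.42</td><td>0.11%</td></tr>
  <tr><td><a href='test2.cgi?type=2'>Jeans</a></td><td>$235.92</td><td>100%</td></tr>
</table>";

    // Returns item name -> price for every data row of the silver, full-width table.
    public static Dictionary<string, decimal> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new Dictionary<string, decimal>();
        var table = doc.DocumentNode
            .SelectSingleNode("//table[@bgcolor='silver' and @width='100%']");
        if (table == null)
        {
            return result; // no matching table in this document
        }

        foreach (var row in table.Elements("tr").Skip(1)) // skip the header row
        {
            var cells = row.Elements("td").Take(2)
                .Select(td => td.InnerText.Trim())
                .ToList();
            if (cells.Count < 2)
            {
                continue; // ignore rows without both an item and a price cell
            }

            // Assumption: prices carry a leading '$' and use '.' as the decimal separator.
            if (decimal.TryParse(cells[1].TrimStart('$'), NumberStyles.Number,
                                 CultureInfo.InvariantCulture, out var price))
            {
                result[cells[0]] = price; // a repeated item name overwrites the earlier entry
            }
        }

        return result;
    }
}
```

Calling `PriceTable.Parse(PriceTable.SampleHtml)` returns a dictionary keyed by item name. Using the indexer instead of `Add` means a repeated item name overwrites the earlier entry rather than throwing, and `TryParse` skips rows whose price cell is not a number. Whether those behaviours are what you want depends on the real page.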