C# asp.net screen-scraping html-agility-pack

使用 HtmlAgilityPack 解析 dl

发布于 2024-12-28 02:33:19 字数 1931 浏览 1 评论 0原文

这是我尝试使用 ASP.Net (C#) 中的 Html Agility Pack 解析的示例 HTML。

<div class="content-div">
    <dl>
        <dt>
            <b><a href="1.html" title="1">1</a></b>
        </dt>
        <dd> First Entry</dd>
        <dt>
            <b><a href="2.html" title="2">2</a></b>
        </dt>
        <dd> Second Entry</dd>
        <dt>
            <b><a href="3.html" title="3">3</a></b>
        </dt>
        <dd> Third Entry</dd>
    </dl>
</div>

我想要的值是：

超链接 -> 1.html
锚文本 ->1
内部文本 od dd ->第一个条目

（我在这里举了第一个条目的示例，但我想要列表中所有条目的这些元素的值）

这是我当前使用的代码，

var webGet = new HtmlWeb();
            var document = webGet.Load(url2);
var parsedValues=
   from info in document.DocumentNode.SelectNodes("//div[@class='content-div']")
   from content in info.SelectNodes("dl//dd")
   from link in info.SelectNodes("dl//dt/b/a")
       .Where(x => x.Attributes.Contains("href"))
   select new 
   {
       Text = content.InnerText,
       Url = link.Attributes["href"].Value,
       AnchorText = link.InnerText,
   };

GridView1.DataSource = parsedValues;
GridView1.DataBind();

问题是我获取了链接和锚文本正确，但对于其内部文本，仅采用第一个条目的值，并为所有其他条目填充相同的值，以达到该元素出现的总次数，然后从第二个条目开始。我的解释可能不太清楚，所以这里是我用这段代码得到的示例输出：

First Entry     1.html  1
First Entry     2.html  2
First Entry     3.html  3
Second Entry    1.html  1
Second Entry    2.html  2
Second Entry    3.html  3
Third Entry     1.html  1
Third Entry     2.html  2
Third Entry     3.html  3

虽然我试图了解

First Entry      1.html     1
Second Entry     2.html     2
Third Entry      3.html     3

我对 HAP 很陌生，并且对 xpath 知之甚少，所以我确信我做错了什么在这里，但即使花了几个小时我也无法让它工作。任何帮助将不胜感激。

原文

This is the sample HTML I am trying to parse with Html Agility Pack in ASP.Net (C#).

<div class="content-div">
    <dl>
        <dt>
            <b><a href="1.html" title="1">1</a></b>
        </dt>
        <dd> First Entry</dd>
        <dt>
            <b><a href="2.html" title="2">2</a></b>
        </dt>
        <dd> Second Entry</dd>
        <dt>
            <b><a href="3.html" title="3">3</a></b>
        </dt>
        <dd> Third Entry</dd>
    </dl>
</div>

The Values I want are :

The hyperlink -> 1.html
The Anchor Text ->1
Inner Text od dd -> First Entry

(I have taken examples of the first entry here but I want the values for these elements for all the entries in the list )

This is the code I am using currently,

var webGet = new HtmlWeb();
            var document = webGet.Load(url2);
var parsedValues=
   from info in document.DocumentNode.SelectNodes("//div[@class='content-div']")
   from content in info.SelectNodes("dl//dd")
   from link in info.SelectNodes("dl//dt/b/a")
       .Where(x => x.Attributes.Contains("href"))
   select new 
   {
       Text = content.InnerText,
       Url = link.Attributes["href"].Value,
       AnchorText = link.InnerText,
   };

GridView1.DataSource = parsedValues;
GridView1.DataBind();

The problem is that I get the values for the link and the anchor text correctly but for the inner text of it just takes the value of the first entry and fills the same value for all other entries for the total number of times the element occurs and then it starts over with the second one. I may not be so clear in my explanation so here's a sample output I am getting with this code:

First Entry     1.html  1
First Entry     2.html  2
First Entry     3.html  3
Second Entry    1.html  1
Second Entry    2.html  2
Second Entry    3.html  3
Third Entry     1.html  1
Third Entry     2.html  2
Third Entry     3.html  3

Whereas I am trying to get

First Entry      1.html     1
Second Entry     2.html     2
Third Entry      3.html     3

I am pretty new to HAP and have very little knoweledge on xpath, so I am sure I am doing something wrong here, but I couldn't make it work even after spending hours on it. Any help would be much appreciated.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

半仙 2025-01-04 02:33:19

解决方案 1

我定义了一个函数，给定 dt 节点将返回其后的下一个 dd 节点：

private static HtmlNode GetNextDDSibling(HtmlNode dtElement)
{
    var currentNode = dtElement;

    while (currentNode != null)
    {
        currentNode = currentNode.NextSibling;

        if(currentNode.NodeType == HtmlNodeType.Element && currentNode.Name =="dd")
            return currentNode;
    }

    return null;
}

现在可以转换 LINQ 代码到：

var parsedValues =
    from info in document.DocumentNode.SelectNodes("//div[@class='content-div']")
    from dtElement in info.SelectNodes("dl/dt")
    let link = dtElement.SelectSingleNode("b/a[@href]")
    let ddElement = GetNextDDSibling(dtElement)
    where link != null && ddElement != null
    select new
    {
        Text = ddElement.InnerHtml,
        Url = link.GetAttributeValue("href", ""),
        AnchorText = link.InnerText
    };

解决方案2

没有附加功能：

var infoNode = 
        document.DocumentNode.SelectSingleNode("//div[@class='content-div']");

var dts = infoNode.SelectNodes("dl/dt");
var dds = infoNode.SelectNodes("dl/dd");

var parsedValues = dts.Zip(dds,
    (dt, dd) => new
    {
        Text = dd.InnerHtml,
        Url = dt.SelectSingleNode("b/a[@href]").GetAttributeValue("href", ""),
        AnchorText = dt.SelectSingleNode("b/a[@href]").InnerText
    });

Solution 1

I have defined a function that given a dt node will return the next dd node after it:

private static HtmlNode GetNextDDSibling(HtmlNode dtElement)
{
    var currentNode = dtElement;

    while (currentNode != null)
    {
        currentNode = currentNode.NextSibling;

        if(currentNode.NodeType == HtmlNodeType.Element && currentNode.Name =="dd")
            return currentNode;
    }

    return null;
}

and now the LINQ code can be transformed to:

var parsedValues =
    from info in document.DocumentNode.SelectNodes("//div[@class='content-div']")
    from dtElement in info.SelectNodes("dl/dt")
    let link = dtElement.SelectSingleNode("b/a[@href]")
    let ddElement = GetNextDDSibling(dtElement)
    where link != null && ddElement != null
    select new
    {
        Text = ddElement.InnerHtml,
        Url = link.GetAttributeValue("href", ""),
        AnchorText = link.InnerText
    };

Solution 2

Without additional functions:

var infoNode = 
        document.DocumentNode.SelectSingleNode("//div[@class='content-div']");

var dts = infoNode.SelectNodes("dl/dt");
var dds = infoNode.SelectNodes("dl/dd");

var parsedValues = dts.Zip(dds,
    (dt, dd) => new
    {
        Text = dd.InnerHtml,
        Url = dt.SelectSingleNode("b/a[@href]").GetAttributeValue("href", ""),
        AnchorText = dt.SelectSingleNode("b/a[@href]").InnerText
    });

回复收藏 0 原文

杀手六號 2025-01-04 02:33:19

只是一个如何使用 Html Agility Pack 解析某些元素的示例

public string ParseHtml()
{
    string output = null;
    HtmlDocument htmldocument = new HtmlDocument();
    htmldocument.LoadHtml(YourHTML);

    HtmlNode node = htmldocument.DocumentNode;    

    HtmlNodeCollection dds = node.SelectNodes("//dd"); //Select all dd tags
    HtmlNodeCollection anchors = node.SelectNodes("//b/a[@href]"); //Select all 'a' tags that contais href attribute

    for (int i = 0; i < dds.Count; i++)
    {
        string atributteValue = null.
        Text = dds[i].InnerText;
        Url = anchors[i].GetAttributeValue("href", atributteValue);
        AnchorText = anchors[i].InnerText;

        //Your code...
    }
    return output;
}

Just a e.g. of how can you parse some elements using Html Agility Pack

public string ParseHtml()
{
    string output = null;
    HtmlDocument htmldocument = new HtmlDocument();
    htmldocument.LoadHtml(YourHTML);

    HtmlNode node = htmldocument.DocumentNode;    

    HtmlNodeCollection dds = node.SelectNodes("//dd"); //Select all dd tags
    HtmlNodeCollection anchors = node.SelectNodes("//b/a[@href]"); //Select all 'a' tags that contais href attribute

    for (int i = 0; i < dds.Count; i++)
    {
        string atributteValue = null.
        Text = dds[i].InnerText;
        Url = anchors[i].GetAttributeValue("href", atributteValue);
        AnchorText = anchors[i].InnerText;

        //Your code...
    }
    return output;
}

回复收藏 0 原文

~没有更多了~