How can I access multiple tags in a single method with HtmlAgilityPack in C#?

Posted on 2024-12-25 00:16:55

I am designing a program to crawl thenextweb.com for its posts (link, post content, image, date, author, etc.).

The HTML for one post looks like this:

    <div class="media-data">
        <h4><a href="http://thenextweb.com/mobile/2012/01/05/nokia-reportedly-to-appoint-f-secure-founder-risto-siilasmaa-as-new-chairman/">Nokia to Name Risto Siilasmaa as New Chairman</a></h4>
        <p class="article-meta"><a href="http://thenextweb.com/mobile/">TNW Mobile</a> • <a href="http://thenextweb.com/author/matt/" title="Posts by Matt Brian" rel="author">Matt Brian</a>  • <span class="date" title="1325748846">January 5, 2012</span></a></p>
        <p>Nokia is reportedly planning to nominate and name Risto Siilasmaa, founder of Finnish anti-virus and computer security F-Secure, as its new chairman by the end of the month, Finland’s Helsingin Sanomat reports…</p>
    </div>

The next 15 posts on the home page use the same HTML structure.
To access the link and title of each post, I used:

    var webGet = new HtmlWeb();
    var document = webGet.Load(url);
    var infos = from info in document.DocumentNode.SelectNodes("//div[@class='media-data']//h4//a")
                select new
                {
                    LinkURL = info.Attributes["href"].Value,
                    Text = info.InnerText
                };

    lvLinks.DataSource = infos;
    lvLinks.DataBind();

To access the author, date, and so on, I used:

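    // In the HTML above, rel="author" is on the <a> element inside the <p>,
    // so the XPath must target a[@rel='author']; p[@rel='author'] matches nothing.
    var infos = from info in document.DocumentNode.SelectNodes("//div[@class='media-data']//a[@rel='author']")
                select new
                {
                    Author = info.InnerText
                };

    lvLinks.DataSource = infos;
    lvLinks.DataBind();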

I use a ListView control to show the data on the ASP.NET page, with this item template:

    <li> <%# Eval("Text") %> - <%# Eval("LinkUrl") %> </li>

However, I want a way to access all of these fields in one go, without writing one query for the links and content and another for the author, date, and so on.

Is there a way to retrieve the information for any node I want under <div class="media-data">...</div> and store it together?

Please advise, because it is very important to keep the author and date information attached to the post link itself, and I have not been able to do that.

Thanks


Answer by A君, 2025-01-01 00:16:55

You could select the <div class="media-data"> nodes first, and then select all the necessary sub-nodes inside each one:

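    // Requires: using System; using System.Globalization; using System.Linq; using HtmlAgilityPack;
    // Note: SelectNodes returns null when nothing matches, so check for null in production code.
    var nodes = doc.DocumentNode.SelectNodes("//div[@class='media-data']")
        .Select(div => new
        {
            // Paths are relative to the current div. A leading "//" would search the
            // whole document and return the first post's author and date for every div.
            Link = div.SelectSingleNode("h4/a"),
            Author = div.SelectSingleNode(".//*[@rel='author']"),
            Date = div.SelectSingleNode(".//*[@class='date']"),
            Content = div.SelectSingleNode("p[last()]")
        }).Select(data => new
        {
            LinkURL = data.Link.Attributes["href"].Value,
            Text = data.Link.InnerText,
            Author = data.Author.InnerText,
            // The site prints dates such as "January 5, 2012"; parse them culture-independently.
            Date = DateTime.Parse(data.Date.InnerText, CultureInfo.InvariantCulture),
            Content = data.Content.InnerHtml
        });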
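
If the results need to leave the method that builds them (for example to be bound to `lvLinks`), a named type can be easier to work with than an anonymous one. The following is only a sketch of that idea, not part of the original answer: the `Post` class, the `PostScraper.ExtractPosts` method, and the helper names are hypothetical, and reading the date from the `title` attribute assumes that value is a Unix timestamp in seconds, which the sample HTML suggests but does not state. Missing sub-nodes yield `null` fields instead of throwing.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical container type (not from the original post), so that the
// extracted fields of one post stay together and can be data-bound.
public sealed class Post
{
    public string LinkURL { get; set; }
    public string Text { get; set; }
    public string Author { get; set; }
    public DateTime? Date { get; set; }
    public string Content { get; set; }
}

public static class PostScraper
{
    // Extracts one Post per <div class="media-data"> in a single pass.
    public static List<Post> ExtractPosts(HtmlDocument doc)
    {
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var divs = doc.DocumentNode.SelectNodes("//div[@class='media-data']");
        if (divs == null) return new List<Post>();

        return divs.Select(div =>
        {
            // All paths are relative to the current div.
            var link = div.SelectSingleNode("h4/a");
            var author = div.SelectSingleNode(".//*[@rel='author']");
            var date = div.SelectSingleNode(".//*[@class='date']");
            var content = div.SelectSingleNode("p[last()]");

            return new Post
            {
                LinkURL = link?.GetAttributeValue("href", ""),
                Text = Decode(link),
                Author = Decode(author),
                // Assumption: the title attribute holds seconds since the Unix epoch.
                Date = ParseUnixSeconds(date?.GetAttributeValue("title", "")),
                Content = content?.InnerHtml
            };
        }).ToList();
    }

    // Returns the node's text with HTML entities decoded, or null if the node is missing.
    private static string Decode(HtmlNode node) =>
        node == null ? null : HtmlEntity.DeEntitize(node.InnerText).Trim();

    // Returns the UTC time for a Unix-seconds string, or null if it is not a plain integer.
    private static DateTime? ParseUnixSeconds(string value) =>
        long.TryParse(value, NumberStyles.None, CultureInfo.InvariantCulture, out var seconds)
            ? DateTimeOffset.FromUnixTimeSeconds(seconds).UtcDateTime
            : (DateTime?)null;
}
```

With the question's setup, usage would look roughly like `lvLinks.DataSource = PostScraper.ExtractPosts(new HtmlWeb().Load(url)); lvLinks.DataBind();`, after which the item template can use `Eval("Author")` and `Eval("Date")` next to `Eval("Text")`. `DateTimeOffset.FromUnixTimeSeconds` requires .NET Framework 4.6 or later.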
    