C# asp.net web-crawler html-agility-pack

How can I extract links and posts from a web page using Html Agility Pack in C#?

Posted on 2024-12-25 11:44:39


I am designing a website that scrapes top technology sites such as thenextweb.com, mashable.com, and readwriteweb.com.

One way to scrape with the Html Agility Pack is to take one site, say thenextweb.com, and fetch its article links and content according to its tags, i.e. select <div class="article-listing"> ..... </div> and fetch the links through that.
I would then design an algorithm in the same manner for each website, since the tags differ from site to site.

Here's what I used to get links from the thenextweb.com home page:

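// HtmlWeb comes from the HtmlAgilityPack namespace; url holds the page address.
var webGet = new HtmlWeb();
var document = webGet.Load(url);

// Select every article-listing <div> and expose its inner HTML for data binding.
// Note: SelectNodes returns null when nothing matches the XPath expression.
var infos = from info in document.DocumentNode.SelectNodes("//div[@class='article-listing']")
            select new
            {
                Contr = info.InnerHtml
            };

lvLinks.DataSource = infos;
lvLinks.DataBind();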

Is there a simpler way to extract the links and content (the post, its images, date, etc.)?


如梦 2025-01-01 11:44:39


All of these sites should have RSS feeds, which are the best way to get the data. For example, The Next Web has these tags (you don't really need the tags, just the URLs):

<link rel="alternate" type="application/rss+xml" title="TNW Network All Stories RSS Feed" href="http://feeds2.feedburner.com/thenextweb" />
<link rel="alternate" type="application/rss+xml" title="TNW Network Top Stories RSS Feed" href="http://feeds2.feedburner.com/thenextwebtopstories" />

http://feeds2.feedburner.com/thenextwebtopstories

The feeds should all be in the same format (or at least a similar one), which is much easier to parse than raw HTML and isn't likely to change. You shouldn't have any trouble finding a .NET RSS parser.
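
As a rough illustration of the feed approach, here is a minimal sketch that reads the items of an RSS 2.0 document with System.Xml.Linq. The class name FeedSketch, the method ReadItems, and the inlined sample feed are hypothetical and only there to keep the example self-contained; a real feed may carry extra elements or namespaces that this sketch ignores.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class FeedSketch
{
    // Parses RSS 2.0 XML text and returns (title, link, pubDate) for each <item>.
    // Missing child elements are returned as empty strings.
    public static List<(string Title, string Link, string Published)> ReadItems(string rssXml)
    {
        var doc = XDocument.Parse(rssXml);
        return doc.Descendants("item")
                  .Select(item => (
                      Title: (string)item.Element("title") ?? "",
                      Link: (string)item.Element("link") ?? "",
                      Published: (string)item.Element("pubDate") ?? ""))
                  .ToList();
    }

    public static void Main()
    {
        // Hypothetical sample feed, inlined so the sketch runs without network access.
        const string sample = @"<rss version=""2.0""><channel>
  <title>Sample feed</title>
  <item><title>First post</title><link>http://example.com/1</link><pubDate>Mon, 06 Sep 2010 00:01:00 GMT</pubDate></item>
  <item><title>Second post</title><link>http://example.com/2</link></item>
</channel></rss>";

        foreach (var entry in ReadItems(sample))
            Console.WriteLine($"{entry.Title} -> {entry.Link} ({entry.Published})");
    }
}
```

To read a live feed instead of the sample string, XDocument.Load accepts a URI, so the feed URL above could be passed to it directly. The SyndicationFeed class in System.ServiceModel.Syndication is another option if you would rather not walk the XML yourself.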


公布 2025-01-01 11:44:39


I found a way to extract the links by using more "from" clauses in the LINQ query:

// The second "from" clause flattens the anchors found under each listing's <h4>.
var infos = from info in document.DocumentNode.SelectNodes("//div[@class='article-listing']")
            from link in info.SelectNodes("h4//a").Where(x => x.Attributes.Contains("href"))
            select new
            {
                LinkURL = link.Attributes["href"].Value
            };

Links and images can both be fetched this way.

Thanks, it works now.
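
For anyone reusing the query, here is a hedged sketch of the same idea with null guards, since SelectNodes returns null when an XPath expression matches nothing. The class LinkSketch, the method ExtractAttributes, and the sample HTML are hypothetical names made up for illustration; the XPath strings assume the article-listing markup described in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class LinkSketch
{
    // Returns the given attribute from every node matched by relativeXPath
    // inside each <div class="article-listing">. Returns an empty list when
    // nothing matches, instead of throwing on a null node collection.
    public static List<string> ExtractAttributes(string html, string relativeXPath, string attributeName)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var listings = document.DocumentNode.SelectNodes("//div[@class='article-listing']");
        if (listings == null)
            return new List<string>();

        return listings
            .SelectMany(div => div.SelectNodes(relativeXPath) ?? Enumerable.Empty<HtmlNode>())
            .Select(node => node.GetAttributeValue(attributeName, ""))
            .Where(value => value.Length > 0)
            .ToList();
    }

    public static void Main()
    {
        // Hypothetical markup standing in for a downloaded home page.
        const string html = @"<div class='article-listing'>
  <h4><a href='http://example.com/post-1'>Post 1</a></h4>
  <img src='http://example.com/post-1.png' />
</div>";

        foreach (var link in ExtractAttributes(html, "h4//a[@href]", "href"))
            Console.WriteLine("link:  " + link);
        foreach (var image in ExtractAttributes(html, ".//img[@src]", "src"))
            Console.WriteLine("image: " + image);
    }
}
```

Passing the XPath and attribute name as parameters means each site only needs its own pair of strings rather than its own query.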
