How to extract links and posts from web pages using HtmlAgilityPack in C#?
I am designing a website that scrapes top technology websites such as thenextweb.com, mashable.com and readwriteweb.com.
One way to scrape with the Html Agility Pack is to take one website, say thenextweb.com, and fetch its article links and content according to its tags,
i.e. using <div class="article-listing"> ..... </div>
and fetching the links through that.
In the same manner, I would design an algorithm for each website (as the tags are different for each website).
Here's what I used for getting links from the website thenextweb.com's home page:
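// Requires: using System.Linq; using HtmlAgilityPack;
var webGet = new HtmlWeb();
var document = webGet.Load(url);
// SelectNodes returns null when nothing matches, so guard before querying.
var nodes = document.DocumentNode.SelectNodes("//div[@class='article-listing']");
var infos = from info in nodes ?? Enumerable.Empty<HtmlNode>()
            select new
            {
                Contr = info.InnerHtml
            };
// Materialize the query once, then bind it to the ListView.
lvLinks.DataSource = infos.ToList();
lvLinks.DataBind();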
Is there any other simple way to extract the links and content (the post and its images, date, etc.)?
Comments (2)
All of these sites should have RSS feeds, which are the best way to get the data. For example, The Next Web has this feed (you don't really need the tags, just the URL):
http://feeds2.feedburner.com/thenextwebtopstories
The feeds should be in the same format (or at least a similar format), which is much easier to understand than raw HTML and isn't likely to change. You shouldn't have any trouble finding a .NET RSS parser.
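To illustrate the feed approach, here is a minimal sketch that reads titles, links and dates from RSS 2.0 XML. It uses plain LINQ to XML (System.Xml.Linq) rather than a dedicated RSS parser, and it assumes the feed follows the usual RSS 2.0 layout of item elements with title, link and pubDate children. FeedSketch, ReadItems and the sample XML are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class FeedSketch
{
    // Hypothetical helper: turns RSS 2.0 XML into (title, link, date) entries.
    // Missing child elements become empty strings instead of nulls.
    public static List<(string Title, string Link, string Published)> ReadItems(string rssXml)
    {
        XDocument doc = XDocument.Parse(rssXml);
        return doc.Descendants("item")
                  .Select(item => (
                      Title: (string)item.Element("title") ?? "",
                      Link: (string)item.Element("link") ?? "",
                      Published: (string)item.Element("pubDate") ?? ""))
                  .ToList();
    }

    public static void Main()
    {
        // Made-up sample data standing in for a downloaded feed.
        const string sample = @"<rss version=""2.0""><channel>
  <title>Sample feed</title>
  <item><title>First post</title><link>http://example.com/1</link>
    <pubDate>Mon, 01 Jan 2024 10:00:00 GMT</pubDate></item>
  <item><title>Second post</title><link>http://example.com/2</link></item>
</channel></rss>";

        foreach (var entry in ReadItems(sample))
            Console.WriteLine($"{entry.Title} -> {entry.Link} ({entry.Published})");
    }
}
```

For a live feed, XDocument.Load accepts a URI, so the same query could run against the feed URL directly. Feeds in Atom format use different element names and an XML namespace, so they would need a separate query.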
I have found a way to extract the links: using more "from" clauses in the LINQ query.
In this way, links and images can be fetched.
Thanks... no issue now.
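The code from this comment did not survive, so the following is only a guess at what "more from clauses" might look like, not the poster's actual query. It is a sketch that assumes the same article-listing div from the question; ListingSketch, ExtractLinksAndImages and the sample HTML are hypothetical. The second from clause walks the a or img descendants of each listing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ListingSketch
{
    // Hypothetical helper: collects link and image URLs found inside
    // every <div class="article-listing"> of the given HTML.
    public static List<(string Kind, string Url)> ExtractLinksAndImages(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so fall back to an empty sequence.
        IEnumerable<HtmlNode> listings =
            document.DocumentNode.SelectNodes("//div[@class='article-listing']")
            ?? Enumerable.Empty<HtmlNode>();

        // Second "from": every <a> below each listing that has an href.
        var links = from listing in listings
                    from a in listing.Descendants("a")
                    let href = a.GetAttributeValue("href", "")
                    where href != ""
                    select (Kind: "link", Url: href);

        // Same pattern for <img> elements and their src attribute.
        var images = from listing in listings
                     from img in listing.Descendants("img")
                     let src = img.GetAttributeValue("src", "")
                     where src != ""
                     select (Kind: "image", Url: src);

        return links.Concat(images).ToList();
    }

    public static void Main()
    {
        // Made-up markup standing in for a downloaded page.
        const string sample = @"<html><body>
<div class='article-listing'>
  <a href='/post/1'><img src='/img/1.png' /></a>
  <a href='/post/2'>Second post</a>
</div>
<div class='sidebar'><a href='/ignored'>Not an article</a></div>
</body></html>";

        foreach (var entry in ExtractLinksAndImages(sample))
            Console.WriteLine($"{entry.Kind}: {entry.Url}");
    }
}
```

With a live page, the document would come from HtmlWeb.Load(url) as in the question, and the XPath would need adjusting for each site.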