We don’t allow questions seeking recommendations for software libraries, tutorials, tools, books, or other off-site resources.
Closed 9 years ago.
Building robots isn't that hard, and there are a number of books that describe the general algorithm for doing so (a simple Google search will turn up a number of algorithms).
The gist of it from a .NET perspective is to recursively:
- Download pages - This is done through the `HttpWebRequest`/`HttpWebResponse` or `WebClient` classes. You can also use the new WCF Web API from CodePlex, a vast improvement over the above that is meant specifically for producing/consuming REST content; it works wonderfully for spidering purposes (mainly because of its extensibility).
- Parse the downloaded content - I highly recommend the Html Agility Pack as well as the fizzler extension for the Html Agility Pack. The Html Agility Pack will handle malformed HTML and lets you query HTML elements using XPath (or a subset of it). Additionally, fizzler lets you use CSS selectors if you are familiar with using them in jQuery.
- Once you have the HTML in a structured format, scan the structure for the content that is relevant to you and process it.
- Scan the structured format for external links and place them in the queue to be processed (subject to whatever constraints you want for your app - you aren't indexing the entire web, are you?).
- Get the next item in the queue and repeat the process.
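The steps above can be sketched as a single crawl loop. This is a minimal illustration, not a production crawler: the seed URL, the `MaxPages` cap, and the `ProcessContent` placeholder are assumptions of mine, and real spiders also need robots.txt handling, politeness delays, and better error handling.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using HtmlAgilityPack; // NuGet: HtmlAgilityPack

class Spider
{
    const int MaxPages = 100; // a constraint so we don't try to index the entire web

    static void Main()
    {
        var queue = new Queue<Uri>();
        var visited = new HashSet<Uri>();
        queue.Enqueue(new Uri("http://example.com/")); // hypothetical seed URL

        using (var client = new WebClient())
        {
            while (queue.Count > 0 && visited.Count < MaxPages)
            {
                Uri current = queue.Dequeue();
                if (!visited.Add(current))
                    continue; // already processed

                string html;
                try { html = client.DownloadString(current); } // step 1: download
                catch (WebException) { continue; }             // skip pages that fail

                var doc = new HtmlDocument();
                doc.LoadHtml(html); // step 2: parse (tolerates malformed HTML)

                ProcessContent(doc); // step 3: scan for content relevant to you

                // Step 4: scan for links and place them in the queue.
                var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
                if (anchors == null) continue;
                foreach (var a in anchors)
                {
                    Uri link;
                    if (Uri.TryCreate(current, a.GetAttributeValue("href", ""), out link)
                        && !visited.Contains(link))
                        queue.Enqueue(link); // step 5: next iteration picks it up
                }
            }
        }
    }

    static void ProcessContent(HtmlDocument doc)
    {
        // Placeholder: extract whatever is relevant to your application.
        var title = doc.DocumentNode.SelectSingleNode("//title");
        if (title != null) Console.WriteLine(title.InnerText);
    }
}
```

The `HashSet` guards against re-crawling the same URL, and `Uri.TryCreate` resolves relative links against the current page before queueing them.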
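For the fizzler extension mentioned above, here is a small sketch of what the CSS-selector querying looks like; the markup and selector are illustrative only.

```csharp
using System;
using Fizzler.Systems.HtmlAgilityPack; // NuGet: Fizzler.Systems.HtmlAgilityPack
using HtmlAgilityPack;

class FizzlerDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<ul><li class='item'>one</li><li class='item'>two</li></ul>");

        // QuerySelectorAll accepts the same jQuery-style CSS selector syntax
        // the answer refers to, instead of an XPath expression.
        foreach (var node in doc.DocumentNode.QuerySelectorAll("li.item"))
            Console.WriteLine(node.InnerText);
    }
}
```

If you already know jQuery selectors, this is usually quicker to write than the equivalent XPath.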