C# web-based crawler
I have several questions concerning crawlers.
Can I create a crawler that works purely on the web? I mean, a crawler that can be launched or stopped from the admin page of a web project.
What is the most convenient language to write a crawler in? I was planning to write it in C#.
The most important one: how do crawlers work? I know that you create them using HttpWebRequest and HttpWebResponse, and I guess that after each page visit the crawler comes back, the code evaluates the result, and a queue is built to send the crawler on to other websites. So, assuming that is roughly right and that I build the crawler as part of a web project, do I have to keep the page open the whole time, and how big a burden will the crawler place on the server? Will it slow the server down, or is it relatively little work?
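For example, I imagine the basic fetch step looks roughly like this (the URL is just a placeholder, and the link-extraction step is only a comment):

```csharp
using System;
using System.IO;
using System.Net;

class FetchExample
{
    static void Main()
    {
        string url = "http://example.com/"; // placeholder start page

        // Request the page and read the response HTML.
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            string html = reader.ReadToEnd();
            Console.WriteLine(html.Length + " characters downloaded");
            // Next step: extract the links from 'html' and add them to a queue.
        }
    }
}
```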
I know there are a lot of questions here, and I will really appreciate the answers :)
1 Answer
1) Absolutely, a crawler can work purely on the web. Your crawler could be an ASP.NET application, or your administration page could start or stop a task (the web crawler) on the server; see the first sketch after this list.
2) Either VB.NET or C# will work. They both have extensive libraries for working with the web.
3) I'd imagine what you're looking for is a recursive function. First, choose a page on the internet to start with (one that contains a lot of links). For each link within the page, run the crawler's main method again, and keep doing this over and over. You'll probably want to limit how "deep" the crawl goes, and I'd imagine you'll want to do some work within each page as well; the second sketch below shows the idea.
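For point 1, here is a minimal sketch of the start/stop idea, assuming the admin page calls into a long-running task controlled by a CancellationTokenSource (the class and method names are hypothetical):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical controller the admin page's start/stop handlers would call.
public static class CrawlerController
{
    private static CancellationTokenSource _cts;
    private static Task _crawlTask;

    // Wired to the admin page's "Start" button.
    public static void Start()
    {
        if (_crawlTask != null && !_crawlTask.IsCompleted) return; // already running
        _cts = new CancellationTokenSource();
        _crawlTask = Task.Run(() => CrawlLoop(_cts.Token));
    }

    // Wired to the admin page's "Stop" button.
    public static void Stop()
    {
        if (_cts != null) _cts.Cancel();
    }

    private static void CrawlLoop(CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            // ... take the next URL from the queue and crawl it ...
            Thread.Sleep(1000); // placeholder for the real crawling work
        }
    }
}
```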
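And for point 3, a minimal sketch of the recursive approach; it extracts links with a naive regular expression (a real crawler would use an HTML parser and respect robots.txt), and the start URL and depth limit are placeholders:

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

class RecursiveCrawler
{
    // Remember visited pages so the same URL isn't crawled twice.
    private static readonly HashSet<string> Visited = new HashSet<string>();

    static void Main()
    {
        Crawl("http://example.com/", depth: 2); // placeholder start page and depth limit
    }

    static void Crawl(string url, int depth)
    {
        if (depth < 0 || !Visited.Add(url)) return;

        string html;
        try
        {
            using (var client = new WebClient())
                html = client.DownloadString(url); // fetch the page
        }
        catch (WebException)
        {
            return; // skip pages that fail to load
        }

        Console.WriteLine("Crawled: " + url);
        // ... do whatever per-page work you need here (indexing, scraping, etc.) ...

        // Naive link extraction; a real crawler would use an HTML parser.
        foreach (Match m in Regex.Matches(html, "href=\"(http[^\"]+)\""))
        {
            Crawl(m.Groups[1].Value, depth - 1); // recurse into each link
        }
    }
}
```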