Web spider/crawler in C# Windows.Forms
I have created a web crawler in VC#. The crawler indexes certain information from .nl sites by brute-forcing all of the possible .nl addresses, from http://aa.nl up to (theoretically) http://zzzzzzzzzzzzzzzzzzzz.nl.
It works all right, except that it takes an incredibly long time just to get through the two-letter domains - aa, ab ... zz. I calculated how long it would take to go through all of the domains in this fashion and got about a thousand years.
I tried to accelerate this with threading, but with 1300 threads running at the same time WebClient just kept failing, making the resulting data file too inaccurate to be usable.
I do not have access to anything other than a 5 Mb/s internet connection, an E6300 Core 2 Duo and 2 GB of 533@667 MHz RAM on Win7.
Does anybody have an idea how to make this work? Any idea will do.
Thank you
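For context, the brute-force loop described above presumably looks something like the following minimal sketch (my own reconstruction, not the asker's code; the two-letter-only range and the synchronous WebClient.DownloadString call are assumptions):

    using System;
    using System.Net;

    class BruteForceCrawler
    {
        static void Main()
        {
            // Sketch of the brute-force enumeration described in the question:
            // try every two-letter .nl host name and report the ones that respond.
            using (var client = new WebClient())
            {
                for (char a = 'a'; a <= 'z'; a++)
                for (char b = 'a'; b <= 'z'; b++)
                {
                    string uri = $"http://{a}{b}.nl";
                    try
                    {
                        // Blocks until the page is downloaded; throws if the
                        // host does not resolve or the request fails.
                        string html = client.DownloadString(uri);
                        Console.WriteLine($"{uri}: {html.Length} bytes");
                    }
                    catch (WebException)
                    {
                        // Unregistered or unreachable domain - skip it.
                    }
                }
            }
        }
    }

Each candidate name costs at least one DNS lookup (and often a full timeout), which is what makes the sequential version so slow.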
Comments (3)
The combinatorial explosion makes this impossible to do (unless you can wait several months at the very least). What I would try instead is to contact SIDN, which is the authority for the .nl TLD, and ask them for the list.
IMO such an implementation of a web crawler is not appropriate.
Total time estimate: 3*10^4 * 10^29 ms ~ 3*10^23 years. Please correct me if I am wrong.
If you want to take advantage of threading, you need a dedicated core per thread. Each thread will also take at least 1 MB of your memory.
Threading will not help you here; hypothetically you would only be able to reduce the time to ~ 3*10^20 years.
The exceptions that you get are likely the result of thread synchronization issues.
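The arithmetic behind that estimate can be reproduced with a few lines of C# (a back-of-envelope sketch; the ~30-second per-request cost and the 20-letter maximum label length are assumptions):

    using System;

    class TimeEstimate
    {
        static void Main()
        {
            const double msPerRequest = 3e4;   // assume ~30 s per (mostly timed-out) request

            // Candidate host names: every label of 2..20 lowercase letters.
            double addresses = 0;
            for (int len = 2; len <= 20; len++)
                addresses += Math.Pow(26, len);

            double years = addresses * msPerRequest / 1000 / 3600 / 24 / 365;

            // Prints roughly 2E+28 addresses and 2E+22 years - the same
            // astronomical order of magnitude as the estimate above.
            Console.WriteLine($"{addresses:E1} addresses, {years:E1} years sequentially");
        }
    }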
The HTTP support in .NET has a maximum concurrent connections limit of around 8 by default, I think (somewhere around that figure, anyway).
If you create more HTTP requests than that, many of them will be forced to wait for an available connection and as a result will time out long before they ever get one, causing valid URIs to appear invalid.
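If you keep the WebClient approach, this is roughly how the connection limit can be raised and the number of in-flight requests capped (a minimal sketch; the limit of 100 and the semaphore size of 50 are assumed values, not recommendations):

    using System;
    using System.Net;
    using System.Threading;
    using System.Threading.Tasks;

    class ThrottledProbe
    {
        // Cap the number of simultaneous requests instead of starting 1300 threads.
        static readonly SemaphoreSlim Gate = new SemaphoreSlim(50);

        static async Task<bool> ExistsAsync(string uri)
        {
            await Gate.WaitAsync();
            try
            {
                using (var client = new WebClient())
                {
                    // Throws on DNS failure or HTTP error.
                    await client.DownloadStringTaskAsync(uri);
                    return true;
                }
            }
            catch (WebException)
            {
                return false;
            }
            finally
            {
                Gate.Release();
            }
        }

        static void Main()
        {
            // Allow more simultaneous outgoing connections than the conservative default.
            ServicePointManager.DefaultConnectionLimit = 100;

            Console.WriteLine(ExistsAsync("http://aa.nl").Result ? "reachable" : "unreachable");
        }
    }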