Checking a lot of URLs to see if they return 200. What is the smartest way?
I need to check a lot (~10 million) of URLs to see if they exist (return 200). I've written the following code to do this per-URL, but to do all of the URLs will take approximately forever.
from urlparse import urlparse   # Python 2; in Python 3 this is urllib.parse
import httplib                  # Python 2; renamed http.client in Python 3

def is_200(url):
    try:
        parsed = urlparse(url)
        conn = httplib.HTTPConnection(parsed.netloc)
        conn.request("HEAD", parsed.path)  # HEAD avoids downloading the body
        res = conn.getresponse()
        return res.status == 200
    except KeyboardInterrupt, e:
        raise e
    except:
        return False
The URLs are spread across about a dozen hosts, so it seems like I should be able to take advantage of this to pipeline my requests and reduce connection overhead. How would you build this? I'm open to any programming/scripting language.
2 Answers
Have a look at urllib3. It supports per-host connection re-using.
Additionally, using multiple processes/threads or async I/O would be a good idea.
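For instance, a rough sketch of the pooling approach with urllib3 might look like this (the pool sizes and this rewritten is_200() are only illustrative, not part of the answer):

import urllib3

# A single PoolManager keeps per-host keep-alive connections around
# and re-uses them across requests. The sizes below are arbitrary.
http = urllib3.PoolManager(num_pools=20, maxsize=10)

def is_200(url):
    try:
        # HEAD keeps the response body off the wire.
        resp = http.request("HEAD", url, redirect=False, retries=False)
        return resp.status == 200
    except urllib3.exceptions.HTTPError:
        return False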
All of this is in Python, version 3.x.
I would create worker threads that check for 200. I'll give an example. The threadpool (put in threadpool.py):
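A minimal sketch of what such a threadpool could look like, assuming a Queue-based design (the ThreadPool class and its method names are illustrative, not necessarily the answer's original code):

# threadpool.py -- illustrative sketch of a worker pool.
# Workers pull URLs from a shared queue, run the check, and
# collect the URLs that returned 200.
from queue import Queue
from threading import Thread

class ThreadPool:
    def __init__(self, num_workers, check):
        self.tasks = Queue()
        self.results = []
        self.check = check                 # e.g. is_200 from the question
        for _ in range(num_workers):
            t = Thread(target=self._worker)
            t.daemon = True                # don't keep the process alive at exit
            t.start()

    def _worker(self):
        while True:
            url = self.tasks.get()
            try:
                if self.check(url):
                    self.results.append(url)   # list.append is thread-safe in CPython
            finally:
                self.tasks.task_done()

    def add_task(self, url):
        self.tasks.put(url)

    def wait_completion(self):
        self.tasks.join()                  # blocks until every queued URL is processed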
Now, if urllist contains your urls, then your main file should be along the lines of this:
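A sketch of such a main file, again assuming the illustrative ThreadPool above and the question's is_200() (imported here from a hypothetical checker module; the file name and worker count are placeholders):

# main.py -- illustrative sketch only.
from threadpool import ThreadPool
from checker import is_200          # hypothetical module holding the question's is_200()

# urllist is assumed to hold the ~10 million URLs, e.g. read from a file.
with open("urls.txt") as f:
    urllist = [line.strip() for line in f]

pool = ThreadPool(num_workers=50, check=is_200)   # 50 workers is an arbitrary choice

for url in urllist:
    pool.add_task(url)

pool.wait_completion()
print(len(pool.results), "of", len(urllist), "URLs returned 200")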
Note that this program scales with the other suggestions posted here; it only depends on is_200().