Checking a lot of URLs to see if they return 200. What's the smartest way?



I need to check a lot (~10 million) of URLs to see if they exist (return 200). I've written the following code to do this per-URL, but to do all of the URLs will take approximately forever.

# Python 2 (httplib and urlparse were merged into http.client and
# urllib.parse in Python 3)
from urlparse import urlparse
import httplib

def is_200(url):
    try:
        parsed = urlparse(url)
        conn = httplib.HTTPConnection(parsed.netloc)
        conn.request("HEAD", parsed.path)  # HEAD: fetch status only, no body
        res = conn.getresponse()
        return res.status == 200
    except KeyboardInterrupt, e:
        raise e
    except:
        return False

The URLs are spread across about a dozen hosts, so it seems like I should be able to take advantage of this to pipeline my requests and reduce connection overhead. How would you build this? I'm open to any programming/scripting language.


Comments (2)

放血 2024-11-09 19:17:32


Have a look at urllib3. It supports per-host connection re-use.
Additionally, using multiple processes/threads or async I/O would be a good idea.
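
A minimal sketch of that suggestion, combining urllib3's per-host connection pooling with a standard-library thread pool; the pool sizes and timeout values below are illustrative assumptions, not tuned numbers:

from concurrent.futures import ThreadPoolExecutor
import urllib3

# One shared PoolManager: it keeps a connection pool per host, so each of
# the ~dozen hosts gets reusable keep-alive connections.
http = urllib3.PoolManager(
    num_pools=20,   # a few more pools than distinct hosts (assumption)
    maxsize=10,     # connections kept alive per host (assumption)
    timeout=urllib3.Timeout(connect=2.0, read=5.0),
)

def is_200(url):
    try:
        # HEAD fetches only the status line and headers, not the body;
        # redirect=False so a 301/302 isn't silently followed to a 200.
        return http.request("HEAD", url, redirect=False).status == 200
    except urllib3.exceptions.HTTPError:
        return False

def check_all(urls, workers=40):
    # Threads suit this workload: it is almost entirely network I/O.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(is_200, urls))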

森林散布 2024-11-09 19:17:32


All of this is in Python 3.x.

I would create worker threads that check for 200. Here's an example. First, the thread pool (put it in threadpool.py):

# Based on http://code.activestate.com/recipes/577187-python-thread-pool/

from queue import Queue
from threading import Thread

class Worker(Thread):
    """Thread that consumes tasks from a shared queue until the program exits."""
    def __init__(self, tasks):
        Thread.__init__(self)
        self.tasks = tasks
        self.daemon = True  # don't block interpreter exit
        self.start()

    def run(self):
        while True:
            func, args, kargs = self.tasks.get()
            try:
                func(*args, **kargs)
            except Exception as exception:
                print(exception)
            self.tasks.task_done()

class ThreadPool:
    """Pool of worker threads consuming tasks from a bounded queue."""
    def __init__(self, num_threads):
        # Bounded queue: add_task() blocks once num_threads tasks are already
        # waiting, which gives natural backpressure on very long URL lists.
        self.tasks = Queue(num_threads)
        for _ in range(num_threads):
            Worker(self.tasks)

    def add_task(self, func, *args, **kargs):
        self.tasks.put((func, args, kargs))

    def wait_completion(self):
        # Blocks until every queued task has been marked done.
        self.tasks.join()

Now, if urllist contains your URLs, your main file should look something like this:

import threadpool  # the module defined above

numconns = 40
workers = threadpool.ThreadPool(numconns)
results = [None] * len(urllist)

def check200(url, index):
    # is_200() as defined in the question (a Python 3 port is given below)
    results[index] = is_200(url)

try:
    for index, url in enumerate(urllist):
        workers.add_task(check200, url, index)
except KeyboardInterrupt:
    print("Shutting down application, hang on...")

# Block until every queued check has finished before reading results.
workers.wait_completion()

Note that this program can be combined with the other suggestions posted here; it depends only on is_200().
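
The question's is_200() targets Python 2's httplib, while the pool above is Python 3. A sketch of a direct port using the standard library's http.client (the 5-second timeout is an added assumption, so a dead host can't stall a worker forever):

from urllib.parse import urlparse
import http.client

def is_200(url):
    try:
        parsed = urlparse(url)
        # timeout is an addition, not in the original code
        conn = http.client.HTTPConnection(parsed.netloc, timeout=5)
        conn.request("HEAD", parsed.path or "/")
        return conn.getresponse().status == 200
    except Exception:
        # KeyboardInterrupt subclasses BaseException in Python 3,
        # so it still propagates past this handler.
        return False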
