Queue(maxsize=) not working?
I've implemented some threading in a project I've been working on (discussed in another question), but the comments and questions there have grown way off topic from the original post, so I figured the best thing to do was to ask a new question. The problem is this: I want my program to stop iterating over a while loop after a number of iterations specified on the command line. I'm passing Queue.Queue(maxsize=10) in the following segments of code:
import sys
import threading
import Queue

THREAD_NUMBER = 5

def main():
    queue = Queue.Queue(maxsize=sys.argv[2])
    mal_urls = set(make_mal_list())
    for i in xrange(THREAD_NUMBER):
        crawler = Crawler(queue, mal_urls)
        crawler.start()

    queue.put(sys.argv[1])
    queue.join()
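For what it's worth, here's my understanding of how maxsize behaves on its own (a minimal standalone sketch, separate from my crawler code): once the queue holds maxsize items, put() blocks until a slot frees up, and put_nowait() raises Queue.Full immediately.

import Queue

q = Queue.Queue(maxsize=2)  # the queue holds at most 2 items
q.put("a")
q.put("b")
try:
    q.put_nowait("c")  # the queue is full, so this raises instead of blocking
except Queue.Full:
    print("queue reached maxsize")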
And here is the run function:
class Crawler(threading.Thread):
    def __init__(self, queue, mal_urls):
        self.queue = queue
        self.mal_list = mal_urls
        self.crawled_links = []
        threading.Thread.__init__(self)

    def run(self):
        while True:
            self.crawled = set(self.crawled_links)
            url = self.queue.get()
            if url not in self.mal_list:
                self.crawl(url)
            else:
                print("Malicious Link Found: {0}".format(url))
            self.queue.task_done()
self.crawl is a function that does some lxml.html parsing and then calls another function, which does some string handling on the links parsed with lxml and then calls self.queue.put(link), like so:
def queue_links(self, link, url):
    if link.startswith('/'):
        link = "http://" + url.netloc + link
    elif link.startswith("#"):
        return
    elif not link.startswith("http"):
        link = "http://" + url.netloc + "/" + link

    # Add urls extracted from the HTML text to the queue to fetch them
    if link not in self.crawled:
        self.queue.put(link)
    else:
        return
Does anyone spot where I might have messed up that would be causing the program to never stop running, and why links that have already been crawled are not being recognized as such?
1 Answer
You're not actually passing the integer 10 as the maxsize; you're passing sys.argv[2]. sys.argv is a list of strings, so at best you're passing "10" as the maxsize argument. And unfortunately, in Python 2.x, any integer is less than any string. You probably want to use int(sys.argv[2]) instead.
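A minimal sketch of the Python 2.x quirk and the one-line fix (reusing the question's sys.argv layout):

import sys
import Queue

# In Python 2.x, ints and strs compare without error, but every int
# sorts before every str, regardless of numeric value:
print(5 < "10")    # True
print(9999 < "0")  # True

# The queue's integer size therefore never compares equal to a str
# maxsize, so the queue never reports itself as full. The fix:
queue = Queue.Queue(maxsize=int(sys.argv[2]))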