Queue(maxsize=) not working?
I've implemented some threading in a project I've been working on (discussed in another question), but the comments and questions there have grown way off topic from the original post, so I figured the best thing to do was to ask a new question. The problem is this: I want my program to stop iterating over a while loop after a number of iterations specified on the command line. I'm passing Queue.Queue(maxsize=10) in the following segments of code:
import sys
import threading
import Queue

THREAD_NUMBER = 5

def main():
    queue = Queue.Queue(maxsize=sys.argv[2])
    mal_urls = set(make_mal_list())
    for i in xrange(THREAD_NUMBER):
        crawler = Crawler(queue, mal_urls)
        crawler.start()

    queue.put(sys.argv[1])
    queue.join()
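For what it's worth, here's my understanding of how maxsize behaves on its own (a minimal standalone sketch, separate from my crawler code): once the queue holds maxsize items, put() blocks until a slot frees up, and put_nowait() raises Queue.Full immediately.

import Queue

q = Queue.Queue(maxsize=2)  # the queue holds at most 2 items
q.put("a")
q.put("b")
try:
    q.put_nowait("c")  # the queue is full, so this raises instead of blocking
except Queue.Full:
    print("queue reached maxsize")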
And here is the run function:
class Crawler(threading.Thread):
    def __init__(self, queue, mal_urls):
        self.queue = queue
        self.mal_list = mal_urls
        self.crawled_links = []
        threading.Thread.__init__(self)

    def run(self):
        while True:
            self.crawled = set(self.crawled_links)
            url = self.queue.get()
            if url not in self.mal_list:
                self.crawl(url)
            else:
                print("Malicious Link Found: {0}".format(url))
            self.queue.task_done()
self.crawl is a function that does some lxml.html parsing and then calls another function, which does some string handling on the links parsed with lxml and then calls self.queue.put(link), like so:
def queue_links(self, link, url):
    if link.startswith('/'):
        link = "http://" + url.netloc + link
    elif link.startswith("#"):
        return
    elif not link.startswith("http"):
        link = "http://" + url.netloc + "/" + link

    # Add urls extracted from the HTML text to the queue to fetch them
    if link not in self.crawled:
        self.queue.put(link)
    else:
        return
Does anyone spot where I might have messed up that would be causing the program to never stop running, and why links that have already been crawled are not being recognized as such?
1 Answer
You're not actually passing the integer 10 as the maxsize; you're passing sys.argv[2]. sys.argv is a list of strings, so at best you're passing "10" as the maxsize argument. And unfortunately, in Python 2.x, any integer is less than any string. You probably want to use int(sys.argv[2]) instead.
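A minimal sketch of the Python 2.x quirk and the one-line fix (reusing the question's sys.argv layout):

import sys
import Queue

# In Python 2.x, ints and strs compare without error, but every int
# sorts before every str, regardless of numeric value:
print(5 < "10")    # True
print(9999 < "0")  # True

# The queue's integer size therefore never compares equal to a str
# maxsize, so the queue never reports itself as full. The fix:
queue = Queue.Queue(maxsize=int(sys.argv[2]))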