Why do my data miner threads collect some IDs multiple times, and others not at all?
I'm writing a data miner in Python with urllib2 and BeautifulSoup to parse some websites. In attempting to divide its work across a few threads, I get output like the following:
Successfully scraped ID 301
Successfully scraped ID 301
Empty result at ID 301
"Successful" means I got the data I needed. "Empty" means the page doesn't have what I need. "ID" is an integer affixed to the URL, like site.com/blog/post/.
First off, each thread should be parsing different URLs, not the same URLs many times. Second, I shouldn't be getting different results for the same URL.
I'm threading the processes in the following way: I instantiate some threads, pass each of them shares of a list of URLs to parse, and send them on their merry way. Here's the code:
def constructURLs(settings, idList):
    assert type(settings) is dict
    url = settings['url']
    return [url.replace('<id>', str(ID)) for ID in idList]

def miner(urls, results):
    for url in urls:
        data = spider.parse(url)
        appendData(data, results)

def mine(settings, results):
    (...)
    urls = constructURLs(settings, idList)
    threads = 3  # number of threads
    urlList = [urls[i::threads] for i in xrange(threads)]
    for urls in urlList:
        t = threading.Thread(target=miner, args=(urls, results))
        t.start()
So why are my threads parsing the same results many times, when they should all have unique lists? Why do they return different results, even on the same ID? If you'd like to see more of the code, just ask and I will happily provide. Thank you for whatever insight you can provide!
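For what it's worth, the stride slicing itself does seem to partition the list correctly. Here is a standalone sanity check I ran (the URLs are made-up placeholders, and I used `range` instead of `xrange` so it runs on Python 3 as well), confirming the shares are disjoint and cover every URL exactly once:

```python
# Hypothetical URL list standing in for the real constructURLs() output.
urls = ['site.com/blog/post/%d' % i for i in range(300, 310)]
threads = 3

# Same stride slicing as in mine(): thread i gets elements i, i+threads, ...
urlList = [urls[i::threads] for i in range(threads)]

# Flatten the shares and check: nothing duplicated, nothing lost.
flat = [u for share in urlList for u in share]
assert sorted(flat) == sorted(urls)      # every URL appears exactly once
assert len(flat) == len(set(flat))       # no URL is in two shares
```

So on paper each thread really should receive a unique sublist, which makes the duplicated and inconsistent results above even more puzzling.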