Hash table for a large natural-language word set
I'm writing a program in Python to do a unigram (and eventually bigram, etc.) analysis of movie reviews. The goal is to create feature vectors to feed into libsvm. I have 50,000-odd unique words in my feature vector (which seems rather large to me, but I am relatively sure I'm right about that).
I'm using the Python dictionary implementation as a hash table to keep track of new words as I meet them, but I'm noticing an enormous slowdown after the first 1,000-odd documents are processed. Would I have better efficiency (given the distribution of natural language) if I used several smaller hash tables/dictionaries, or would it be the same/worse?
More info:
The data is split into 1500 or so documents, 500-ish words each. There are between 100 and 300 unique words (with respect to all previous documents) in each document.
My current code:
#processes each individual file, tok == filename, v == predefined class
def processtok(tok, v):
    #n is the number of unique words so far,
    #reference is the mapping reference in case I want to add new data later
    #hash is the hashtable
    #statlist is the massive feature vector I'm trying to build
    global n
    global reference
    global hash
    global statlist

    cin = open(tok, 'r')
    statlist = [0]*43990
    statlist[0] = v
    lines = cin.readlines()
    for l in lines:
        line = l.split(" ")
        for word in line:
            if word in hash.keys():
                if statlist[hash[word]] == 0:
                    statlist[hash[word]] = 1
                else:
                    hash[word] = n
                    n += 1
                    ref.write('[' + str(word) + ',' + str(n) + ']' + '\n')
                    statlist[hash[word]] = 1
    cin.close()
    return statlist
Also keep in mind that my input data is about 6mb and my output data is about 300mb. I'm simply startled at how long this takes, and I feel that it shouldn't be slowing down so dramatically as it's running.
Slowing down: the first 50 documents take about 5 seconds, the last 50 take about 5 minutes.
3 Answers
@ThatGuy has made the fix, but hasn't actually told you this:
The major cause of your slowdown is the line

if word in hash.keys():

which laboriously makes a list of all the keys so far, then laboriously searches that list for 'word'. The time taken is proportional to the number of keys, i.e. the number of unique words found so far. That's why it starts fast and becomes slower and slower.

All you need is

if word in hash:

which in 99.9999999% of cases takes time independent of the number of keys -- one of the major reasons for having a dict.
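A rough, self-contained illustration of the difference on a 50,000-key dict (under Python 2, hash.keys() built a real list, which is what list(d) stands in for here; the exact numbers will vary):

# Rough illustration: a linear scan over a list of keys (what Python 2's
# `word in hash.keys()` effectively did) versus a direct dict lookup.
import timeit

d = {'word%d' % i: i for i in range(50000)}   # ~50,000 unique words

scan = timeit.timeit("'zzz' in list(d)", globals={'d': d}, number=100)
look = timeit.timeit("'zzz' in d", globals={'d': d}, number=100)
print('list scan: %.4f s   dict lookup: %.6f s' % (scan, look))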
The faffing about with statlist[hash[word]] doesn't help, either. By the way, the fixed size in statlist=[0]*43990 needs explanation.
More problems

Problem A: Either (1) your code suffered from indentation distortion when you published it, or (2) hash will never be updated by that function. Quite simply, if word is not in hash, i.e. it's the first time you've seen it, absolutely nothing happens. The hash[word] = n statement (the ONLY code that updates hash) is NOT executed. So no word will ever be in hash.

It looks like this block of code needs to be shifted left 4 columns, so that it's aligned with the outer if:
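That is, roughly (a sketch of the question's inner loop with the else moved out to the outer if; it also uses the plain "in hash" test recommended above):

for word in line:
    if word in hash:                 # dict membership test, not hash.keys()
        if statlist[hash[word]] == 0:
            statlist[hash[word]] = 1
    else:                            # first time this word has been seen
        hash[word] = n
        n += 1
        ref.write('[' + str(word) + ',' + str(n) + ']' + '\n')
        statlist[hash[word]] = 1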
Problem B: There is no code at all to update n (allegedly the number of unique words so far).

I strongly suggest that you take as many of the suggestions that @ThatGuy and I have made as you care to, rip out all the global stuff, fix up your code, chuck in a few print statements at salient points, and run it over say 2 documents, each of 3 lines with about 4 words in each. Ensure that it is working properly. THEN run it on your big data set (with the prints suppressed). In any case you may want to put out stats (like number of documents, lines, words, unique words, elapsed time, etc.) at regular intervals.

Another problem
Problem C: I mentioned this in a comment on @ThatGuy's answer, and he agreed with me, but you haven't mentioned taking it up:
Your use of .split(" ") will lead to spurious "words" and distort your statistics, including the number of unique words that you have. You may well find the need to change that hard-coded magic number.
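For instance, with a typical review line, .split(" ") keeps empty strings, punctuation, and the trailing newline as separate "words" (a small illustration; the regex at the end is just one possible tokeniser):

import re

line = "Great movie,  great acting.\n"

print(line.split(" "))
# ['Great', 'movie,', '', 'great', 'acting.\n']   <- empty string and '\n' become "words"

print(line.split())
# ['Great', 'movie,', 'great', 'acting.']         <- whitespace handled, punctuation still attached

print(re.findall(r"[a-z']+", line.lower()))
# ['great', 'movie', 'great', 'acting']           <- one way to get clean, lower-cased tokens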
I say again: There is no code that updates n in the function. Doing hash[word] = n seems very strange, even if n is updated for each document.
I don't think Python's dictionary has anything to do with your slowdown here, especially when you are saying that the entries are around 100. I am hoping that you are referring to insertion and retrieval, which are both O(1) in a dictionary. The problem could be that you are not using iterators (or loading key/value pairs one at a time) when creating the dictionary, and are instead loading all the words into memory. In that case, the slowdown is due to memory consumption.
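A sketch of what that looks like (illustrative only; iterating over the file object yields one line at a time instead of building the whole list that readlines() returns):

# Illustrative sketch: count words while reading the file lazily,
# one line at a time, instead of loading every line with readlines().
def count_words(path):
    counts = {}
    with open(path, 'r') as fin:
        for line in fin:                 # the file object is an iterator over lines
            for word in line.split():
                counts[word] = counts.get(word, 0) + 1
    return counts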
I think you've got a few problems going on here. Mostly, I am unsure of what you are trying to accomplish with statlist. It seems to me like it is serving as a poor duplicate of your dictionary. Create it after you have found all of your words.
Here is my guess as to what you want:
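Roughly (a sketch with illustrative names such as build_index and make_vector; the unique words are collected in a first pass over all documents, and the fixed-length feature vectors are built afterwards from the finished mapping):

# Illustrative sketch: build the word -> index dictionary first,
# then build each feature vector from the finished mapping.

def tokenize(path):
    with open(path, 'r') as fin:
        for line in fin:
            for word in line.split():
                yield word

def build_index(filenames):
    index = {}                        # word -> position in the feature vector
    for name in filenames:
        for word in tokenize(name):
            if word not in index:
                index[word] = len(index) + 1   # slot 0 is reserved for the class
    return index

def make_vector(path, v, index):
    vector = [0] * (len(index) + 1)
    vector[0] = v                     # predefined class, as in the question
    for word in tokenize(path):
        vector[index[word]] = 1
    return vector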
Note that this means you no longer need an "n", as you can discover the count by taking the len() of the dictionary.