Algorithm for generating a "top list" using word frequency

Published 2024-07-27 20:14:25

I have a big collection of human generated content. I want to find the words or phrases that occur most often. What is an efficient way to do this?

Comments (6)

死开点丶别碍眼 2024-08-03 20:14:25

Don't reinvent the wheel. Use a full text search engine such as Lucene.

醉态萌生 2024-08-03 20:14:25

The simple/naive way is to use a hashtable. Walk through the words and increment the count as you go.

At the end of the process, sort the key/value pairs by count.
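
A minimal Python sketch of that approach (the words list here is just a stand-in for the real token stream):

words = ["the", "cat", "sat", "on", "the", "mat", "the"]   # stand-in token stream

counts = {}
for word in words:
    # Increment the running count for each word as we walk through the stream.
    counts[word] = counts.get(word, 0) + 1

# Sort the (word, count) pairs by count, highest first.
top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(top[:2])   # [('the', 3), ('cat', 1)]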

辞慾 2024-08-03 20:14:25

The basic idea is simple -- in executable pseudocode,

from collections import defaultdict

def process(words):
    # Tally how many times each word appears in the iterable.
    d = defaultdict(int)
    for w in words:
        d[w] += 1
    return d

Of course, the devil is in the details -- how do you turn the big collection into an iterator yielding words? Is it big enough that you can't process it on a single machine, but rather need a MapReduce approach, e.g. via Hadoop? Etc, etc. NLTK can help with the linguistic aspects (isolating words in languages that don't separate them cleanly).
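
As a rough sketch of the "iterator yielding words" part, assuming plain text files and a naive regex tokenizer (the helper name and file paths are made up for illustration; NLTK's tokenizers would do a better job on real text):

import re

def words_from_files(paths):
    # Lazily yield lowercased word tokens from each text file, so the
    # whole collection never has to sit in memory at once.
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                for w in re.findall(r"\w+", line.lower()):
                    yield w

# Hypothetical usage: counts = process(words_from_files(["a.txt", "b.txt"]))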

On a single-machine execution (that is, leaving MapReduce aside), one issue that can arise is that the simple idea gives you far too many singletons or near-singletons (words occurring once or just a few times), which fill up memory. A probabilistic answer to that is to do two passes: one with random sampling (take only one word in ten, or one in a hundred) to build a set of words that are candidates for the top ranks, then a second pass that skips words not in the candidate set. Depending on how many words you sample and how many you want in the result, you can compute an upper bound on the probability of missing an important word this way (and for reasonable numbers, and any natural language, I assure you that you'll be just fine).
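
A sketch of that two-pass idea, assuming make_words is any callable that returns a fresh iterator over the corpus (the function name and the 1% sampling rate are illustrative, not prescriptive):

import random
from collections import defaultdict

def two_pass_counts(make_words, sample_rate=0.01):
    # Pass 1: randomly sample roughly 1% of the words to build the candidate set.
    candidates = set()
    for w in make_words():
        if random.random() < sample_rate:
            candidates.add(w)

    # Pass 2: exact counts, but only for candidate words, so the long tail
    # of one-off words never fills memory.
    counts = defaultdict(int)
    for w in make_words():
        if w in candidates:
            counts[w] += 1
    return counts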

Once you have your dictionary mapping words to numbers of occurrences, you just need to pick the top N words by occurrence count -- a heap queue will help there if the dictionary is too large to sort in its entirety (in my favorite executable pseudocode, heapq.nlargest will do it).
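
For instance, with a tiny stand-in dictionary in place of the one process would return:

import heapq

d = {"the": 120, "cat": 7, "sat": 3}   # stand-in for the dict built by process()

# The two most frequent (word, count) pairs, without sorting the whole dict.
print(heapq.nlargest(2, d.items(), key=lambda kv: kv[1]))   # [('the', 120), ('cat', 7)]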

玩心态 2024-08-03 20:14:25

Look into the Apriori algorithm. It can be used to find frequent items and/or frequent sets of items.

As the Wikipedia article states, there are more efficient algorithms that do the same thing, but this could be a good starting point to see whether the approach applies to your situation.
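
For orientation only, here is a bare-bones sketch of the Apriori idea applied to word "transactions" (say, one set of words per sentence); real implementations prune candidates far more cleverly:

def apriori(transactions, min_support):
    # transactions: iterable of word sets; returns every itemset that
    # appears in at least min_support transactions.
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        # Join frequent (k-1)-itemsets to form size-k candidates, then re-count.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: n for s, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
    return result

sentences = [{"top", "list", "words"}, {"top", "words"}, {"word", "count"}]
print(apriori(sentences, min_support=2))   # includes frozenset({'top', 'words'}): 2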

凉栀 2024-08-03 20:14:25

Why not use a simple map with the word as the key and a counter as the value?
It will give you the most-used words by taking the entries with the highest counts.
Building the counts is just an O(N) operation.
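
In Python this map-plus-pick-the-top approach is essentially what collections.Counter provides; a tiny sketch with a stand-in word list:

from collections import Counter

words = "the cat sat on the mat near the dog".split()   # stand-in token stream

counts = Counter(words)          # word -> occurrence count, built in O(N)
print(counts.most_common(2))     # [('the', 3), ('cat', 1)]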
