跟踪/计算词频

发布于 2024-09-02 02:06:33 字数 440 浏览 13 评论 0原文

我希望就能够存储和查询词频计数的良好设计获得一些社区共识。我正在构建一个应用程序，我必须在其中解析文本输入并存储单词出现的次数（随着时间的推移）。因此，给出以下输入：

“杀死一只知更鸟”
“嘲笑钢琴演奏者”

将存储以下值：

Word    Count
-------------
To      1
Kill    1
A       2
Mocking 2
Bird    1
Piano   1
Player  1

稍后能够快速查询给定任意单词的计数值。

我当前的计划是简单地将单词和计数存储在数据库中，并依赖于缓存单词计数值......但我怀疑我不会获得足够的缓存命中来使这成为长期可行的解决方案。

任何人都可以提出算法、数据结构或任何其他可能使其成为性能良好的解决方案的想法吗？

原文

I'd like to get some community consensus on a good design to be able to store and query word frequency counts. I'm building an application in which I have to parse text inputs and store how many times a word has appeared (over time). So given the following inputs:

"To Kill a Mocking Bird"
"Mocking a piano player"

Would store the following values:

Word    Count
-------------
To      1
Kill    1
A       2
Mocking 2
Bird    1
Piano   1
Player  1

And later be able to quickly query for the count value of a given arbitrary word.

My current plan is to simply store the words and counts in a database, and rely on caching word count values ... But I suspect that I won't get enough cache hits to make this a viable solution long term.

Can anyone suggest algorithms, or data structures, or any other idea that might make this a well-performing solution?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

苍风燃霜 2024-09-09 02:06:33

字数统计是 MapReduce 程序的典型示例（来自维基百科的伪代码）：

void map(String name, String document):
  for each word w in document:
     EmitIntermediate(w, "1");

void reduce(String word, Iterator partialCounts):
  int result = 0;
  for each pc in partialCounts:
    result += ParseInt(pc);
  Emit(AsString(result));

我并不是说这是实现这一目标的方法，但如果您需要在不同单词的数量超出单台机器上可用内存时能够很好地扩展的东西，那么它绝对是一个选择。只要您能够保持在内存限制以下，更新哈希表的简单循环就可以解决问题。

Word counting is the canonical example of a MapReduce program (pseudo code from Wikipedia):

void map(String name, String document):
  for each word w in document:
     EmitIntermediate(w, "1");

void reduce(String word, Iterator partialCounts):
  int result = 0;
  for each pc in partialCounts:
    result += ParseInt(pc);
  Emit(AsString(result));

I am not saying that this is the way to do it, but it is definitely an option if you need something that scales well when the number of distinct words outsizes the memory available on a single machine. As long as you are able to stay below the memory limit, a simple loop updating a hash table should do the trick.

回复收藏 0 原文