Tracking / counting word frequencies
I'd like to get some community consensus on a good design to be able to store and query word frequency counts. I'm building an application in which I have to parse text inputs and store how many times a word has appeared (over time). So given the following inputs:
- "To Kill a Mocking Bird"
- "Mocking a piano player"
Would store the following values:
Word Count
-------------
To 1
Kill 1
A 2
Mocking 2
Bird 1
Piano 1
Player 1
And later be able to quickly query for the count value of a given arbitrary word.
My current plan is to simply store the words and counts in a database, and rely on caching word count values ... But I suspect that I won't get enough cache hits to make this a viable solution long term.
Can anyone suggest algorithms, or data structures, or any other idea that might make this a well-performing solution?
5 Answers
Word counting is the canonical example of a MapReduce program (see the pseudocode in Wikipedia's MapReduce article).
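The same map/emit/reduce shape looks roughly like this in Python (a minimal single-process sketch, not the actual Wikipedia pseudocode; the function names and the whitespace tokenization are my own assumptions):

```python
from collections import defaultdict

def map_doc(name, document):
    # map: emit (word, 1) for every word in the document
    for word in document.split():
        yield word, 1

def reduce_word(word, partial_counts):
    # reduce: sum all partial counts emitted for the same word
    return word, sum(partial_counts)

def word_count(documents):
    grouped = defaultdict(list)
    for name, doc in documents.items():
        for word, one in map_doc(name, doc):
            grouped[word].append(one)   # the shuffle / group-by-key step
    return dict(reduce_word(w, counts) for w, counts in grouped.items())

print(word_count({"doc1": "To Kill a Mocking Bird",
                  "doc2": "Mocking a piano player"}))
```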
I am not saying that this is the way to do it, but it is definitely an option if you need something that scales well when the number of distinct words outsizes the memory available on a single machine. As long as you are able to stay below the memory limit, a simple loop updating a hash table should do the trick.
I don't understand why you feel a database would not be a suitable solution. You will probably only have about 100000 rows and the small size of the table will mean that it can be stored entirely in memory. Make the word the primary key and lookups will be very fast.
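For what it's worth, a minimal sketch of that approach using SQLite from Python (the table and column names are made up, and the UPSERT syntax needs a reasonably recent SQLite; any relational database would do):

```python
import sqlite3

conn = sqlite3.connect("word_counts.db")
conn.execute("CREATE TABLE IF NOT EXISTS word_counts ("
             "word TEXT PRIMARY KEY, count INTEGER NOT NULL)")

def add_words(words):
    # Insert each word with count 1, or bump the existing row (UPSERT on the primary key).
    conn.executemany(
        "INSERT INTO word_counts (word, count) VALUES (?, 1) "
        "ON CONFLICT(word) DO UPDATE SET count = count + 1",
        [(w,) for w in words])
    conn.commit()

def count_of(word):
    # Primary-key lookup, so this is a single index probe.
    row = conn.execute("SELECT count FROM word_counts WHERE word = ?",
                       (word,)).fetchone()
    return row[0] if row else 0

add_words("To Kill a Mocking Bird".split())
add_words("Mocking a piano player".split())
print(count_of("Mocking"))   # -> 2
```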
If performance is your main goal, you could use a hash-based or trie-based structure in RAM only. Assuming that you do some useful filtering anyway (to not count terms with non-word characters), the maximum number of words in your table will be in the range of 10⁶ to 10⁷ (even if multiple languages are involved), so this will easily fit into the memory of a current PC (and completely avoid all the database handling).
On the other hand, if you have to implement the hash table details yourself, there is just more code that you can get wrong (while the database guys have hopefully tweaked their code to the maximum). So even minor details in your own implementation might lead to performance loss again.
So this dilemma clearly shows us the first and second rules of optimization:
1. Don't optimize prematurely.
2. Measure, before you optimize.
:)
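For reference, the hash-based in-RAM version is short enough that measuring it costs almost nothing; a sketch (the regex filtering and the lowercasing are assumptions about what counts as a word here):

```python
import re
from collections import Counter

counts = Counter()

def add_text(text):
    # Keep only runs of letters, i.e. filter out terms with non-word characters.
    counts.update(re.findall(r"[A-Za-z]+", text.lower()))

add_text("To Kill a Mocking Bird")
add_text("Mocking a piano player")
print(counts["mocking"])   # -> 2
```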
Use a hash table.
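In Python terms that is just a dict keyed by the word; a sketch (whitespace tokenization assumed):

```python
counts = {}

for text in ["To Kill a Mocking Bird", "Mocking a piano player"]:
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1   # hash lookup + update per word

print(counts.get("Mocking", 0))   # -> 2
```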
Your solution sounds fine. If the cache is based on recent usage count, then it will hold the word counts for the most frequent words (word distribution is such that something like the first 100 words cover 90% of word instances), so you don't need a very large cache.
If you want to improve performance and drop the db, you can encode the words as a trie, and store usage counts in the leaf nodes. In essence, that's what the database is doing if you index on word text, so you are really only avoiding the db latency. If that is the goal, then there are other ways of avoiding db latency, such as using parallel lookups.
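A minimal trie sketch along those lines, with the count stored on the node where a word ends (purely illustrative; the class and function names are made up):

```python
class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.count = 0       # usage count if a word ends at this node

root = TrieNode()

def add_word(word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.count += 1

def count_of(word):
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.count

for text in ["To Kill a Mocking Bird", "Mocking a piano player"]:
    for w in text.split():
        add_word(w)

print(count_of("Mocking"))   # -> 2
```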