Optimizing word counting
(This is rather hypothetical in nature as of right now, so I don't have too many details to offer.)
I have a flat file of random (English) words, one on each line. I need to write an efficient program to count the number of occurrences of each word. The file is big (perhaps about 1GB), but I have plenty of RAM for everything. It's stored on permanent media, so read speeds are slow, which means I need to read through it just once, linearly.
My two off-the-top-of-my-head ideas were to use a hash with words => no. of occurrences, or a trie with the no. of occurrences at the end node. I have enough RAM for a hash array, but I'm thinking that a trie would have as fast or faster lookups.
What approach would be best?
9 Answers
I think a trie with the count as the leaves could be faster.
Any decent hash table implementation will require reading the word fully, processing it using a hash function, and finally, a look-up in the table.
A trie can be implemented such that the search occurs as you are reading the word. This way, rather than doing a full look-up of the word, you could often find yourself skipping characters once you've established the unique word prefix.
For example, if you've read the characters: "torto", a trie would know that the only possible word that starts this way is tortoise.
If you can perform this inline search on a word faster than the hashing algorithm can hash it, you should be able to come out faster.
However, this is total overkill. I rambled on since you said it was purely hypothetical, I figured you'd like a hypothetical-type of answer. Go with the most maintainable solution that performs the task in a reasonable amount of time. Micro-optimizations typically waste more time in man-hours than they save in CPU-hours.
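
For concreteness, here is a minimal sketch of the trie-with-counts-at-the-leaves idea in Python. The node layout, the names, and the words.txt filename are illustrative assumptions on my part, not part of the answer above:

    class TrieNode:
        def __init__(self):
            self.children = {}  # maps a character to a child TrieNode
            self.count = 0      # occurrences of the word ending at this node

    def add_word(root, word):
        # Walk (and build) the path for each character, then bump the count.
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    root = TrieNode()
    with open("words.txt") as f:   # one word per line, read linearly once
        for line in f:
            word = line.strip()
            if word:
                add_word(root, word)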
I'd use a Dictionary object where the key is the word converted to lower case and the value is the count. If the dictionary doesn't contain the word, add it with a value of 1. If it does contain the word, increment the value.
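
In Python terms, a minimal sketch of this approach might look as follows (words.txt is a placeholder filename; collections.Counter would do the same job in one line):

    counts = {}
    with open("words.txt") as f:
        for line in f:
            word = line.strip().lower()
            if not word:
                continue
            # Add the word with a value of 1, or increment the existing count.
            counts[word] = counts.get(word, 0) + 1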
Given slow reading, it's probably not going to make any noticeable difference. The overall time will be completely dominated by the time to read the data anyway, so that's what you should work at optimizing. For the algorithm (mostly data structure, really) in memory, just use whatever happens to be most convenient in the language you find most comfortable.
A hash table is (if done right, and you said you had lots of RAM) O(1) to count a particular word, while a trie is going to be O(n) where n is the length of the word.
With a sufficiently large hash space, you'll get much better performance from a hash table than from a trie.
I think that a trie is overkill for your use case. A hash of word => # of occurrences is exactly what I would use. Even using a slow interpreted language like Perl, you can munge a 1GB file this way in just a few minutes. (I've done this before.)
How many times will this code be run? If you're just doing it once, I'd say optimize for your time rather than your CPU's time, and just do whatever's fastest to implement (within reason). If you have a standard library function that implements a key-value interface, just use that.
If you're doing it many times, then grab a subset (or several subsets) of the data file, and benchmark your options. Without knowing more about your data set, it'd be dubious to recommend one over another.
Use Python!
Add each word to a set data type as you go line by line, before consulting the dictionary at all. Once you see that a word is already in the set, add it to the dictionary with a value of 2, since it was already added to the set once before (and increment from there on later occurrences).
This takes some of the memory and computation away from querying the dictionary on every single line, and handles uniquely occurring words better: at the end of the run, just dump every word that is in the set but not in the dictionary into the results with a value of 1 (the difference of the two collections with respect to the set).
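
Here is one way to read that scheme, as a hedged Python sketch (the file name and the exact bookkeeping are my own interpretation of the answer above):

    seen = set()    # every word encountered at least once
    counts = {}     # only words encountered two or more times

    with open("words.txt") as f:
        for line in f:
            word = line.strip()
            if not word:
                continue
            if word not in seen:
                seen.add(word)        # first sighting: touch only the set
            elif word not in counts:
                counts[word] = 2      # second sighting: first one + this one
            else:
                counts[word] += 1     # third and later sightings

    # Anything still only in the set occurred exactly once.
    for word in seen.difference(counts):
        counts[word] = 1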
To a large extent, it depends on what you want to do with the data once you've captured it. See Why Use a Hash Table over a Trie (Prefix Tree)?
A simple Python script:
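
(The script itself is missing from this copy of the answer; below is a minimal sketch of what it presumably looked like, using collections.Counter, with words.txt as a placeholder file name.)

    import collections

    counter = collections.Counter()
    with open("words.txt") as f:
        for line in f:
            word = line.strip()
            if word:
                counter[word] += 1

    # Print words from most to least frequent.
    for word, count in counter.most_common():
        print(count, word)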