设计一个算法,找到书中最常用的单词
面试问题:
找出书中最常用的单词。
我的想法:
使用哈希表,遍历并标记哈希表。
如果已知书本的大小,如果发现有使用过的单词> 50%,则在接下来的遍历中跳过任何新单词,只计算旧单词。如果书籍尺寸未知怎么办?
时间和空间都是O(n)和O(n)。
还有更好的想法吗?
谢谢
An interview question:
Find the most frequently used word in a book.
My idea:
Use a hash table, traverse and mark the hash table.
If the book's size is known, if any word is found to be used > 50%, then skip any new words in the following traversal and only count old words. What if the book size is unknown?
It is O(n) and O(n) time and space.
Any better ideas?
Thanks
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
为了确定复杂性,我认为您需要考虑两个变量,n = 单词总数,m = 唯一单词数。我想最好的情况复杂度将接近 O(n log(m)) 的速度和 O(m) 的存储,假设每次迭代 n 个单词中的每一个,并基于哈希表构建和搜索或最终包含 m 个元素的其他此类结构。
To determine complexity I think you need to consider two variables, n = total number of words, m = number of unique words. I imagine the best case complexity will come out close to O(n log(m)) for speed, and O(m) for storage, assuming each time you iterate over each of n words, and build and search based on a hash table or other such structure which eventually contains m elements.
这实际上是map reduce的经典示例。
维基百科页面中的示例将为您提供每个唯一单词的字数,但是您可以轻松地在减少步骤中添加一个步骤来跟踪当前最常见的单词(使用某种互斥体来处理并发问题)。
如果您有分布式机器集群或高度并行化的计算机,这将比使用哈希表运行得快得多。
This is actually a classic example of map reduce.
The example in the wikipedia page will give you the word count of each unique word, but you can easily add a step in the reduce step that keeps track of the current most common word(with some kind of mutex to deal with concurrency issues).
If you have a distributed cluster of machines or a highly parallelized computer this will run much faster than using the hash table.
通常,当我们必须确定诸如最/最少使用。
甚至 Python;s Counter.nlargest 这是用于这些目的是通过堆数据结构实现的。
二叉堆数据结构具有以下复杂性
我对哈希(使用Python中的默认字典)和堆(使用Python中的Collections.Counter.nlargest)进行了比较,哈希比堆稍好一些。
Usually Heap is the data-structure which suits well when we have to determine something like most/least used.
Even Python;s Counter.nlargest which is used for these purposes is implemented through the Heap Data-structure.
A Binary Heap Data-structure has the following Complexity
I ran a comparition on Hash (using default dictionary in Python) and Heap (using Collections.Counter.nlargest in python) and the Hash is fairing slightly better than Heap.
您的优化有一个概括 - 如果书籍大小已知并且您看到的任何单词的计数 >剩余单词数 + 次高计数,当前计数最高的单词就是答案。
There is a generalization of your optimization- if the book size is known and any word you have seen has a count > the remaining number of words + the next-highest count, your current highest-counted word is the answer.
从实际的角度来看,您的解决方案是正确的、快速的,并且可能是最好/最简单的。
其他发帖者的解决方案的时间复杂度比您的解决方案更差。对于您正在使用的哈希,时间复杂度确实是 O(n)。每次插入的时间复杂度为 O(1),并且有 n 个单词,因此插入阶段的成本为 O(n)。迭代并找到最大值则为 O(n)。正如你提到的,空间也是 O(n) 。
请注意,您将无法使用 Chris 的解决方案提前终止算法,因为搜索哈希表的成本很高,并且您无法在每次插入后在 O(1) 时间内执行此操作。
堆会花费更多的时间,因为您需要在每次插入期间维护堆。堆插入的时间复杂度为 O(log(n)),因此插入的总成本将为 O(nlog(n))。
Your solution is correct, fast, and probably the best/easiest from a practical standpoint.
The other poster's solutions have worse time complexities than your solution. For a hash, as you are using, the time complexity is indeed O(n). Each insertion is O(1) and there are n words, so the insertion phase costs O(n). Iterating through and finding the max is then O(n). The space is also O(n) as you mentioned.
Note that you will not be able to terminate your algorithm early using Chris's solution because searching your hash table is costly and there is no way for you to perform this in O(1) time after each insertion.
A heap will cost more in time because you need to maintain the heap during each insertion. A heap insertion is O(log(n)) and thus the total cost for insertion will be O(nlog(n)).
如果您正在看一本书,那么您就知道其中的词汇和大概的词频。即使您没有预先获得此信息,您也可以通过扫描随机样本来获得良好的估计。
对于确切的答案,我将使用 k 个最常见单词的完美哈希函数。完美的哈希函数需要 O(k) 内存并保证最坏情况下的快速 O(1) 查找。
对于不常见的单词,我会使用作为堆或自平衡树实现的优先级队列。常规哈希表也可能是一个不错的选择。
If you are dealing with a book, then you know the vocabulary and the approximate word frequencies. Even if you are not given this information up front, you can get a good estimate by scanning a random sample.
For the exact answer, I would use a perfect hash function of the k most common words. A perfect hash function requires O(k) memory and guarantees fast worst-case O(1) lookup.
For the uncommon words, I would use a priority queue implemented as a heap or a self-balancing tree. A regular hash table might also be a good choice.