当前位置：文江博客话题详情

设计一个算法，找到书中最常用的单词

发布于 2024-12-25 02:21:07 字数 200 浏览 0 评论 0原文

面试问题：

找出书中最常用的单词。

我的想法：

使用哈希表，遍历并标记哈希表。

如果已知书本的大小，如果发现有使用过的单词> 50%，则在接下来的遍历中跳过任何新单词，只计算旧单词。如果书籍尺寸未知怎么办？

时间和空间都是O(n)和O(n)。

还有更好的想法吗？

谢谢

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

独夜无伴 2025-01-01 02:21:07

为了确定复杂性，我认为您需要考虑两个变量，n = 单词总数，m = 唯一单词数。我想最好的情况复杂度将接近 O(n log(m)) 的速度和 O(m) 的存储，假设每次迭代 n 个单词中的每一个，并基于哈希表构建和搜索或最终包含 m 个元素的其他此类结构。

回复收藏 0 原文

沫尐诺 2025-01-01 02:21:07

这实际上是map reduce的经典示例。

维基百科页面中的示例将为您提供每个唯一单词的字数，但是您可以轻松地在减少步骤中添加一个步骤来跟踪当前最常见的单词（使用某种互斥体来处理并发问题）。

如果您有分布式机器集群或高度并行化的计算机，这将比使用哈希表运行得快得多。

回复收藏 0 原文

月朦胧 2025-01-01 02:21:07

通常，当我们必须确定诸如最/最少使用。

甚至 Python;s Counter.nlargest 这是用于这些目的是通过堆数据结构实现的。

二叉堆数据结构具有以下复杂性

CreateHeap - O(1)
FindMin - O(1)
deleteMin - O(logn)
Insert - O(logn)

我对哈希（使用Python中的默认字典）和堆（使用Python中的Collections.Counter.nlargest）进行了比较，哈希比堆稍好一些。

>>> stmt1="""
import collections, random
somedata=[random.randint(1,1000) for i in xrange(1,10000)]
somehash=collections.defaultdict(int)
for d in somedata:
    somehash[d]+=1
maxkey=0
for k,v in somehash.items():
    if somehash[maxkey] > v:
        maxkey=k
"""
>>> stmt2="""
import collections,random
somedata=[random.randint(1,1000) for i in xrange(1,10000)]
collections.Counter(somedata).most_common(1)
"""
>>> t1=timeit.Timer(stmt=stmt1)
>>> t2=timeit.Timer(stmt=stmt2)
>>> print "%.2f usec/pass" % (1000000 * t2.timeit(number=10)/10)
38168.96 usec/pass
>>> print "%.2f usec/pass" % (1000000 * t1.timeit(number=10)/10)
33600.80 usec/pass

Usually Heap is the data-structure which suits well when we have to determine something like most/least used.

Even Python;s Counter.nlargest which is used for these purposes is implemented through the Heap Data-structure.

A Binary Heap Data-structure has the following Complexity

CreateHeap - O(1)
FindMin - O(1)
deleteMin - O(logn)
Insert - O(logn)

I ran a comparition on Hash (using default dictionary in Python) and Heap (using Collections.Counter.nlargest in python) and the Hash is fairing slightly better than Heap.

>>> stmt1="""
import collections, random
somedata=[random.randint(1,1000) for i in xrange(1,10000)]
somehash=collections.defaultdict(int)
for d in somedata:
    somehash[d]+=1
maxkey=0
for k,v in somehash.items():
    if somehash[maxkey] > v:
        maxkey=k
"""
>>> stmt2="""
import collections,random
somedata=[random.randint(1,1000) for i in xrange(1,10000)]
collections.Counter(somedata).most_common(1)
"""
>>> t1=timeit.Timer(stmt=stmt1)
>>> t2=timeit.Timer(stmt=stmt2)
>>> print "%.2f usec/pass" % (1000000 * t2.timeit(number=10)/10)
38168.96 usec/pass
>>> print "%.2f usec/pass" % (1000000 * t1.timeit(number=10)/10)
33600.80 usec/pass

回复收藏 0 原文