How do I find the most frequent words in a book when memory is limited?

Published 2024-07-17 03:11:23

Recently in a technical interview, I was asked to write a program to find the high-frequency words (the words that appear the greatest number of times) in a textbook. The program should be designed so that it can process the entire textbook with minimal memory. Performance is not a concern. I was able to write a program that finds the frequency of words, but it consumed a lot of memory.

How do you make this operation less memory intensive? Any strategies/solutions?

-Snehal

淡淡の花香 2024-07-24 03:11:24

Like many good interview questions, the question is phrased a little ambiguously/imprecisely, to force the interviewee to ask clarifying questions and state assumptions. I think a number of the other answers here are good, as they poke at these assumptions and demonstrate big-picture understanding.

I'm assuming the text is stored 'offline' somewhere, but there is a way to iterate over each word in the text without loading the whole text into memory.

Then the F# code below finds the top N words. Its only data structure is a mapping of key-value pairs (word, frequency), and it only keeps the top N of those, so the memory use is O(N), which is small. The runtime is O(numWordsInText^2), which is poor, but acceptable given the problem constraints. The gist of the algorithm is straightforward: for each word in the text, count how many times it occurs, and if it is among the running best N, add it to the map and remove the previous minimum entry.

Note that the actual program below loads the entire text into memory, merely for convenience of exposition.

#light
// some boilerplate to grab a big piece of text off the web for testing
open System.IO 
open System.Net 
let HttpGet (url: string) = 
    let req = System.Net.WebRequest.Create(url) 
    let resp = req.GetResponse() 
    let stream = resp.GetResponseStream() 
    let reader = new StreamReader(stream) 
    let data = reader.ReadToEnd() 
    resp.Close() 
    data 
let text = HttpGet "http://www-static.cc.gatech.edu/classes/cs2360_98_summer/hw1"
let words = text.Split([|' ';'\r';'\n'|], System.StringSplitOptions.RemoveEmptyEntries)
// perhaps 'words' isn't actually stored in memory, but so long as we can 
// 'foreach' over all the words in the text we're good
let N = 5  // how many 'top frequency' words we want to find
let FindMin (map: Map<string,int>) =
    // key-value pair with minimum value in a map
    map |> Map.toSeq |> Seq.minBy snd
let Main() =
    let mutable freqCounts = Map.ofList [ ("",0) ]
    for word in words do
        let mutable count = 0
        for x in words do
            if x = word then
                count <- count + 1
        let minStr,minCount = FindMin freqCounts
        if count >= minCount then
            freqCounts <- Map.add word count freqCounts
        if Seq.length freqCounts > N then
            freqCounts <- Map.remove minStr freqCounts
    freqCounts 
    |> Seq.sortBy (fun (KeyValue(k,v)) -> -v)
    |> Seq.iter (printfn "%A")
Main()

Output:

[the, 75]
[to, 41]
[in, 34]
[a, 32]
[of, 29]
弱骨蛰伏 2024-07-24 03:11:24

You could use a combination of an external merge sort and a priority queue. The merge sort will make sure that your memory limits are honored, and the priority queue will maintain your top K searches. Obviously, the priority queue has to be small enough to fit into memory.

  • First, divide the input words into chunks, sort each chunk, and store it in secondary storage (external sorting) - O(n log n)
  • Read each chunk and, within the chunk, calculate the frequency of each word; at the end of this step each chunk is reduced to (unique word - frequency count) pairs within the chunk. O(n)
  • Start reading elements across the chunks and aggregate the counts for each word. Since the chunks are sorted, you can do this in O(n).
  • Now, maintain a min priority heap (the top of the heap is the minimum element in the heap) of K elements. Populate the heap with the first K elements; then, for each subsequent (unique word - final count) pair, if its count is greater than the top element of the heap, pop the top and push the current word. O(n log k) (see the sketch below)

So your final time complexity is O(n(log k + log n))
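
For the final step, a minimal F# sketch of the heap selection, assuming the earlier external-sort and per-chunk counting stages already yield a merged stream of (unique word, total count) pairs; the names topK and mergedCounts, and the use of .NET 6's PriorityQueue, are illustrative choices rather than anything specified in the answer above.

open System.Collections.Generic

// select the K most frequent words from a stream of (word, totalCount) pairs
let topK (k: int) (mergedCounts: seq<string * int>) =
    // min-heap keyed on the count: the smallest of the current top K sits at the root
    let heap = PriorityQueue<string * int, int>()
    for (word, count) in mergedCounts do
        if heap.Count < k then
            heap.Enqueue((word, count), count)
        else
            let _, minCount = heap.Peek()
            if count > minCount then
                heap.Dequeue() |> ignore              // evict the current minimum
                heap.Enqueue((word, count), count)
    [ while heap.Count > 0 do yield heap.Dequeue() ]  // results in ascending order of count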

风铃鹿 2024-07-24 03:11:24

Well, if you want absolutely terrible performance...

Take the first word in the book, and count how many times it occurs. Take the second word in the book, and count how many times it occurs. If it occurs more often than the previous best, discard the previous best. And so forth... You'll end up counting the same words multiple times unless you keep a list of them somewhere, but if you really want to minimize memory, this should only require a few ints. It should run in O(n^2) time, where n is the number of words in the book.
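
As a rough illustration, here is what that scan could look like in F#, assuming a hypothetical readWords function that re-streams the book's words from disk on every call, so only a few counters ever live in memory:

// O(n^2) time, O(1) extra memory: re-scan the whole book for every candidate word
let bruteForceTop (readWords: unit -> seq<string>) =
    let mutable bestWord = ""
    let mutable bestCount = 0
    for candidate in readWords () do
        let mutable count = 0
        for w in readWords () do                 // fresh pass over the book
            if w = candidate then count <- count + 1
        if count > bestCount then
            bestWord <- candidate
            bestCount <- count
    bestWord, bestCount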

烟火散人牵绊 2024-07-24 03:11:24

How about creating a binary tree of word keys (as you keep reading the words from the file)? This lets you find already-seen words in O(log n), so you end up with O(n log n) overall for the top-word search.

Basic algo would be

for each word in a file:

  1. Create a unique key for the given word (a weighted ASCII sum, e.g. "bat" could be 1*'b' + 2*'a' + 3*'t').
  2. Add this word to the tree. If the word already exists, increment its count.
  3. Feed the word and the current count to maintainTop5(word, count). maintainTop5() maintains a dynamic list of the top 5 counts and their associated words.

At the end of the file, you have the top 5 words.
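
A condensed F# sketch of steps 2 and 3, with the hand-rolled binary tree approximated by F#'s built-in Map (itself a balanced tree, keyed directly on the word, which also sidesteps the collisions a weighted ASCII sum would cause); updateTop5 and topWords are illustrative names:

// keep a running count per word plus a small list of the best 5 seen so far
let updateTop5 (top5: (string * int) list) (word: string) (count: int) =
    (word, count) :: List.filter (fun (w, _) -> w <> word) top5
    |> List.sortByDescending snd
    |> List.truncate 5

let topWords (words: seq<string>) =
    let mutable counts = Map.empty            // word -> running count (a balanced tree)
    let mutable top5 = []                     // dynamic list of the current top 5
    for word in words do
        let c = 1 + (Map.tryFind word counts |> Option.defaultValue 0)
        counts <- Map.add word c counts
        top5 <- updateTop5 top5 word c
    top5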

甜心 2024-07-24 03:11:23

You probably used hash tables, which are memory-intensive but have constant lookup time - so the performance/memory trade-off is obvious. By the time you reach the end of the book you will know your answer. Also, incrementing the counter for each word is fast (because of the quick hashtable lookups).

The other end of the spectrum is to look at the first word, then go through the entire book to see how many times that word occurs. This requires minimal memory. Then you do the same for the next word and go through the entire book. If that word occurs more times, you add that as the top word (or top N words). Of course, this is extremely inefficient--if the first and third word are the same you'll end up going through the whole book again even though you just did the same thing for the first word.
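
For contrast, a small F# sketch of the hash-table end of that spectrum, assuming the book's words are exposed as a words sequence (countWithHashTable is an illustrative name): one dictionary entry per distinct word, constant-time lookups, memory proportional to the vocabulary.

open System.Collections.Generic

let countWithHashTable (words: seq<string>) =
    let counts = Dictionary<string, int>()
    for w in words do
        match counts.TryGetValue w with
        | true, c  -> counts.[w] <- c + 1
        | false, _ -> counts.[w] <- 1
    counts |> Seq.maxBy (fun (KeyValue(_, c)) -> c)   // the single most frequent word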

谎言 2024-07-24 03:11:23

OK, if you're only interested in the highest n occurring words, one way to do it is in two passes, with the first pass based on a modified Bloom Filter. Instead of using a bit map to track hash occurrences, use an integer array instead - either byte, 16 bit, 32 bit or even 64 bit depending on your input size. Where a Bloom filter simply sets the bit corresponding to each of the hash values of a word, you'll increment the count at the hash index in the array.

The problem with this approach is that two words will probably give the same hash values. So you need to do a second pass where you ignore words unless their hash totals are above a certain threshold, thus reducing the amount of memory you need to allocate to do accurate counting.

So just create a bit map with bits set for the highest occurring hash values. Then in the second pass of the words, if a word has "hits" in the bitmap for its hashes, look it up or add it to a hash table and increment its count. This minimises memory usage by creating a hash table of only the highest occurring words.
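
A rough F# sketch of the two-pass idea, assuming a re-streamable readWords function; for brevity it uses a single hash function and, in the second pass, checks the counter array directly instead of building a separate bit map. The array size, threshold parameter and all names are illustrative.

open System.Collections.Generic

let topWordsTwoPass (readWords: unit -> seq<string>) (threshold: int) =
    let slots = 1 <<< 20                                    // size of the counter array
    let index (w: string) = (w.GetHashCode() &&& 0x7fffffff) % slots

    // pass 1: bump a counter per hash slot instead of setting a single bit
    let hashCounts = Array.zeroCreate<int> slots
    for w in readWords () do
        hashCounts.[index w] <- hashCounts.[index w] + 1

    // pass 2: exact counts only for words whose hash slot cleared the threshold
    let exact = Dictionary<string, int>()
    for w in readWords () do
        if hashCounts.[index w] >= threshold then
            exact.[w] <- (match exact.TryGetValue w with | true, c -> c + 1 | _ -> 1)
    exact |> Seq.sortByDescending (fun (KeyValue(_, c)) -> c)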

哆兒滾 2024-07-24 03:11:23

I'm a physicist, so my favourite approach is to approximate. You don't need to go through the entire text to get the most frequent words. Instead:

  • parse a chunk small enough to allow for your memory limitations,
  • skip a random amount of text,
  • repeat, combining accumulated results.
  • Stop when the list has satisfactorily converged.

If you use a memory-efficient algorithm for the smaller chunks (e.g. sorting) then you can get far faster performance than even the most efficient algorithm that reads every word.

Note: This does make the assumption that the most frequent words occur frequently throughout the text, not just at one place in it. For English text, this assumption holds, because of the frequency of words like 'the' etc. throughout. If you're worried about this requirement, require the algorithm to complete at least one pass over the entire text.
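
A sketch of the sampling loop in F#, assuming the book is a plain-text file on disk; the chunk size, the fixed number of samples standing in for the convergence test, and the plain dictionary used for the accumulated counts are all illustrative simplifications.

open System
open System.IO
open System.Collections.Generic

let approximateTopWords (path: string) (chunkSize: int) (nSamples: int) =
    let rng = Random()
    let counts = Dictionary<string, int>()
    use fs = File.OpenRead path
    use reader = new StreamReader(fs)
    let buffer = Array.zeroCreate<char> chunkSize
    for _ in 1 .. nSamples do
        // jump to a random offset, then read and count one small chunk
        fs.Position <- int64 (rng.Next(int fs.Length))
        reader.DiscardBufferedData()
        let read = reader.Read(buffer, 0, chunkSize)
        let text = System.String(buffer, 0, read)
        for w in text.Split([|' '; '\r'; '\n'|], StringSplitOptions.RemoveEmptyEntries) do
            counts.[w] <- (match counts.TryGetValue w with | true, c -> c + 1 | _ -> 1)
    counts |> Seq.sortByDescending (fun (KeyValue(_, c)) -> c) |> Seq.truncate 5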

不知所踪 2024-07-24 03:11:23

I'll probably get down-voted for this...

If the text is English and you just want to find the top 5 most frequent words, here is your program:

print "1. the\n";
print "2. of\n";
print "3. and\n";
print "4. a\n";
print "5. to\n";

Runs fast and consumes minimal memory!

知你几分 2024-07-24 03:11:23

If performance is really of no concern, you could just go through each word in turn, check whether it's in your "top N" and, if it isn't, count all its occurrences. This way you're only storing N values. Of course, you'd be counting the same words many times, but, as you said, performance isn't an issue - and the code would be trivial (which is generally preferable - all other things being equal).

蓝眼睛不忧郁 2024-07-24 03:11:23

A possible solution is to use a trie data structure to store all the words together with their number of occurrences.

Other solutions may be found in answers to this related question: Space-Efficient Data Structure for Storing a Word List?
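
A minimal trie sketch in F#, assuming the words are streamed in as strings; each node holds its children and a count of how many times a word ends there (TrieNode, insert and mostFrequent are illustrative names, not from the linked question):

open System.Collections.Generic

type TrieNode() =
    member val Children = Dictionary<char, TrieNode>()
    member val Count = 0 with get, set

// walk/extend the path for 'word', bumping the count at its final node
let insert (root: TrieNode) (word: string) =
    let mutable node = root
    for ch in word do
        match node.Children.TryGetValue ch with
        | true, child -> node <- child
        | false, _ ->
            let child = TrieNode()
            node.Children.[ch] <- child
            node <- child
    node.Count <- node.Count + 1

// depth-first walk that reports the single most frequent word in the trie
let rec mostFrequent (prefix: string) (node: TrieNode) =
    node.Children
    |> Seq.map (fun (KeyValue(ch, child)) -> mostFrequent (prefix + string ch) child)
    |> Seq.fold (fun (bw, bc) (w, c) -> if c > bc then (w, c) else (bw, bc)) (prefix, node.Count)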

三五鸿雁 2024-07-24 03:11:23

One way would be to sort the list first.

We can sort the words in place without using a lot of memory (at the cost of slow performance).

Then we can have a simple counting loop that finds the word with the maximum frequency without having to keep everything in memory, since the words are in sorted order.
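
A sketch of that counting pass in F#, assuming sortedWords is the word stream after the sort, so equal words arrive consecutively (topOfSorted is an illustrative name):

let topOfSorted (sortedWords: seq<string>) =
    let mutable best = ("", 0)        // most frequent word seen so far
    let mutable current = ("", 0)     // the run of equal words we are currently in
    for w in sortedWords do
        current <- if fst current = w then (w, snd current + 1) else (w, 1)
        if snd current > snd best then best <- current
    best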

﹎☆浅夏丿初晴 2024-07-24 03:11:23

Do you mean a lot of process memory? If so, one way would be to use the disk as virtual memory (aka write a filesystem wrapper).
