在大单词序列中查找前 K 个常用单词的最有效方法

发布于 2024-07-07 04:14:53 字数 760 浏览 14 评论 0 原文

输入:一个正整数K和一个大文本。 文本实际上可以被视为单词序列。 所以我们不必担心如何将其分解为单词序列。
输出:文本中出现频率最高的 K 个单词。

我的想法是这样的。

  1. 遍历整个单词序列时,使用哈希表记录所有单词的频率。 在这个阶段,键是“词”,值是“词频”。 这需要 O(n) 时间。

  2. 对(单词,词频)对进行排序; 关键是“词频”。 使用普通排序算法,这需要 O(n*lg(n)) 时间。

  3. 排序后,我们只取前K个单词。 这需要 O(K) 时间。

总结一下,总时间是 O(n+nlg(n)+K),由于 K 肯定小于 N,所以实际上是 O(nlg(n))。

我们可以改进这一点。 实际上,我们只想要前 K 个词。 其他词的出现频率与我们无关。 所以,我们可以使用“部分堆排序”。 对于步骤 2) 和 3),我们不只是进行排序。 相反,我们将其更改为

2') 构建一个以“词频”为键的(词,词频)对堆。 构建堆需要O(n)时间;

3') 从堆中提取前 K 个单词。 每次提取的时间复杂度为 O(lg(n))。 因此,总时间为 O(k*lg(n))。

总而言之,该解决方案花费的时间为 O(n+k*lg(n))。

这只是我的想法。 我还没有找到改进步骤 1) 的方法。
我希望一些信息检索专家能够进一步阐明这个问题。

Input: A positive integer K and a big text. The text can actually be viewed as word sequence. So we don't have to worry about how to break down it into word sequence.
Output: The most frequent K words in the text.

My thinking is like this.

  1. use a Hash table to record all words' frequency while traverse the whole word sequence. In this phase, the key is "word" and the value is "word-frequency". This takes O(n) time.

  2. sort the (word, word-frequency) pair; and the key is "word-frequency". This takes O(n*lg(n)) time with normal sorting algorithm.

  3. After sorting, we just take the first K words. This takes O(K) time.

To summarize, the total time is O(n+nlg(n)+K), Since K is surely smaller than N, so it is actually O(nlg(n)).

We can improve this. Actually, we just want top K words. Other words' frequency is not concern for us. So, we can use "partial Heap sorting". For step 2) and 3), we don't just do sorting. Instead, we change it to be

2') build a heap of (word, word-frequency) pair with "word-frequency" as key. It takes O(n) time to build a heap;

3') extract top K words from the heap. Each extraction is O(lg(n)). So, total time is O(k*lg(n)).

To summarize, this solution cost time O(n+k*lg(n)).

This is just my thought. I haven't find out way to improve step 1).
I Hope some Information Retrieval experts can shed more light on this question.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(19

阳光下的泡沫是彩色的 2024-07-14 04:14:53

这可以在 O(n) 时间内完成

解决方案 1:

步骤:

  1. 计算单词并对其进行散列,最终将得到这样的结构

    var 哈希 = { 
        “我”:13, 
        “喜欢”:3, 
        “喵”:3, 
        “极客”:3, 
        “汉堡”:2, 
        “猫”:1, 
        “富”:100, 
        ... 
        ... 
      
  2. 遍历散列并找到最常用的单词(在本例中为“foo”100),然后创建该大小的数组

  3. 然后我们可以再次遍历哈希并使用该数字单词出现次数作为数组索引,如果索引中没有任何内容,则创建一个数组,否则将其附加到数组中。 然后我们得到一个像这样的数组:

    <前><代码> 0 1 2 3 100
    [[ ]、[猫]、[汉堡]、[像、喵、极客]、[]...[foo]]

  4. 然后从末尾遍历数组,并收集 k 个单词。

解决方案2:

步骤:

  1. 同上
  2. 使用最小堆并将最小堆的大小保持为k,对于哈希中的每个单词,我们将单词的出现次数与最小值进行比较,1)如果它更大大于 min 值,则删除 min(如果 min 堆的大小等于 k)并将该数字插入到 min 堆中。 2)休息条件简单。
  3. 遍历完数组后,我们只需将最小堆转换为数组并返回数组即可。

This can be done in O(n) time

Solution 1:

Steps:

  1. Count words and hash it, which will end up in the structure like this

    var hash = {
      "I" : 13,
      "like" : 3,
      "meow" : 3,
      "geek" : 3,
      "burger" : 2,
      "cat" : 1,
      "foo" : 100,
      ...
      ...
    
  2. Traverse through the hash and find the most frequently used word (in this case "foo" 100), then create the array of that size

  3. Then we can traverse the hash again and use the number of occurrences of words as array index, if there is nothing in the index, create an array else append it in the array. Then we end up with an array like:

      0   1      2            3                  100
    [[ ],[cat],[burger],[like, meow, geek],[]...[foo]]
    
  4. Then just traverse the array from the end, and collect the k words.

Solution 2:

Steps:

  1. Same as above
  2. Use min heap and keep the size of min heap to k, and for each word in the hash we compare the occurrences of words with the min, 1) if it's greater than the min value, remove the min (if the size of the min heap is equal to k) and insert the number in the min heap. 2) rest simple conditions.
  3. After traversing through the array, we just convert the min heap to array and return the array.
旧瑾黎汐 2024-07-14 04:14:53

一般来说,您不会获得比您所描述的解决方案更好的运行时间。 您必须至少执行 O(n) 工作来评估所有单词,然后执行 O(k) 额外工作来查找前 k 个术语。

如果您的问题集确实很大,您可以使用分布式解决方案,例如map/reduce。 让 n 个 Map Worker 计算每个文本的 1/n 的频率,并且对于每个单词,将其发送到基于单词的哈希计算的 m 个 Reducer Worker 之一。 然后,reducer 对计数求和。 对减速器输出进行合并排序将为您提供按流行程度排列的最流行的单词。

You're not going to get generally better runtime than the solution you've described. You have to do at least O(n) work to evaluate all the words, and then O(k) extra work to find the top k terms.

If your problem set is really big, you can use a distributed solution such as map/reduce. Have n map workers count frequencies on 1/nth of the text each, and for each word, send it to one of m reducer workers calculated based on the hash of the word. The reducers then sum the counts. Merge sort over the reducers' outputs will give you the most popular words in order of popularity.

旧人九事 2024-07-14 04:14:53

如果我们不关心对前 K 进行排名,则解决方案的微小变化会产生 O(n) 算法,并且会产生 O(n+k*lg(k))O(n+k*lg(k))强> 如果我们这样做的话。 我相信这两个界限在常数因子内都是最佳的。

当我们遍历列表并插入哈希表后,这里的优化再次出现。 我们可以使用中位数中位数算法来选择第K大的列表中的元素。 该算法可证明是 O(n)。

选择第 K 个最小元素后,我们围绕该元素对列表进行分区,就像快速排序一样。 这显然也是O(n)。 枢轴“左侧”一侧的所有内容都在我们的 K 元素组中,因此我们完成了(我们可以在继续过程中简单地丢弃其他所有内容)。

所以这个策略是:

  1. 遍历每个单词并将其插入到哈希表中: O(n)
  2. 选择第 K 个最小元素: O(n)
  3. 围绕该元素进行分区: O(n)

如果要对 K 个元素进行排名,只需在 O(k * lg(k)) 时间内使用任何有效的比较排序对它们进行排序,总运行时间为 O(n+k * lg(k))。

O(n) 时间限制在常数因子内是最佳的,因为我们必须至少检查每个单词一次。

O(n + k * lg(k)) 时间限制也是最佳的,因为没有基于比较的方法可以在小于 k * lg(k) 的时间内对 k 个元素进行排序。

A small variation on your solution yields an O(n) algorithm if we don't care about ranking the top K, and a O(n+k*lg(k)) solution if we do. I believe both of these bounds are optimal within a constant factor.

The optimization here comes again after we run through the list, inserting into the hash table. We can use the median of medians algorithm to select the Kth largest element in the list. This algorithm is provably O(n).

After selecting the Kth smallest element, we partition the list around that element just as in quicksort. This is obviously also O(n). Anything on the "left" side of the pivot is in our group of K elements, so we're done (we can simply throw away everything else as we go along).

So this strategy is:

  1. Go through each word and insert it into a hash table: O(n)
  2. Select the Kth smallest element: O(n)
  3. Partition around that element: O(n)

If you want to rank the K elements, simply sort them with any efficient comparison sort in O(k * lg(k)) time, yielding a total run time of O(n+k * lg(k)).

The O(n) time bound is optimal within a constant factor because we must examine each word at least once.

The O(n + k * lg(k)) time bound is also optimal because there is no comparison-based way to sort k elements in less than k * lg(k) time.

若有似无的小暗淡 2024-07-14 04:14:53

如果您的“大单词列表”足够大,您可以简单地进行采样并获得估计。 否则,我喜欢哈希聚合。

编辑

通过示例,我的意思是选择一些页面子集并计算这些页面中最常见的单词。 如果您以合理的方式选择页面并选择具有统计意义的样本,那么您对最常见单词的估计应该是合理的。

只有当您有如此多的数据以至于处理所有数据有点愚蠢时,这种方法才真正合理。 如果您只有几兆,您应该能够不费吹灰之力地分析数据并计算出准确的答案,而不是费心计算估计值。

If your "big word list" is big enough, you can simply sample and get estimates. Otherwise, I like hash aggregation.

Edit:

By sample I mean choose some subset of pages and calculate the most frequent word in those pages. Provided you select the pages in a reasonable way and select a statistically significant sample, your estimates of the most frequent words should be reasonable.

This approach is really only reasonable if you have so much data that processing it all is just kind of silly. If you only have a few megs, you should be able to tear through the data and calculate an exact answer without breaking a sweat rather than bothering to calculate an estimate.

能怎样 2024-07-14 04:14:53

您可以通过使用单词的第一个字母进行分区,然后使用下一个字符对最大的多单词集进行分区,直到拥有 k 个单单词集,从而进一步减少时间。 您可以使用一种 256 路树,其叶子上有部分/完整单词列表。 您需要非常小心,不要造成到处都是字符串复制。

该算法的复杂度为 O(m),其中 m 是字符数。 它避免了对 k 的依赖,这对于大 k 来说非常好[顺便说一句,你发布的运行时间是错误的,它应该是 O(n*lg(k)),我不确定这是什么米]。

如果你并行运行这两种算法,你会得到我很确定是渐近最优的 O(min(m, n*lg(k))) 算法,但我的平均应该更快,因为它不涉及散列或排序。

You can cut down the time further by partitioning using the first letter of words, then partitioning the largest multi-word set using the next character until you have k single-word sets. You would use a sortof 256-way tree with lists of partial/complete words at the leafs. You would need to be very careful to not cause string copies everywhere.

This algorithm is O(m), where m is the number of characters. It avoids that dependence on k, which is very nice for large k [by the way your posted running time is wrong, it should be O(n*lg(k)), and I'm not sure what that is in terms of m].

If you run both algorithms side by side you will get what I'm pretty sure is an asymptotically optimal O(min(m, n*lg(k))) algorithm, but mine should be faster on average because it doesn't involve hashing or sorting.

望她远 2024-07-14 04:14:53

您的描述中有一个错误:计数需要 O(n) 时间,但排序需要 O(m*lg(m)),其中 m 是唯一单词的数量。 这通常比单词总数小得多,因此可能应该优化哈希的构建方式。

You have a bug in your description: Counting takes O(n) time, but sorting takes O(m*lg(m)), where m is the number of unique words. This is usually much smaller than the total number of words, so probably should just optimize how the hash is built.

那一片橙海, 2024-07-14 04:14:53

你的问题和这个一样-
http://www.geeksforgeeks.org/ find-the-k-most-frequent-words-from-a-file/

使用 Trie 和最小堆来有效解决它。

Your problem is same as this-
http://www.geeksforgeeks.org/find-the-k-most-frequent-words-from-a-file/

Use Trie and min heap to efficieinty solve it.

赤濁 2024-07-14 04:14:53

如果您想要的是文本中任何实用 k 和任何自然语言中最常见的 k 个单词的列表,那么算法的复杂性并不相关。

比如说,只需从文本中采样几百万个单词,使用任何算法在几秒钟内处理它,最常见的计数就会非常准确。

顺便说一句,虚拟算法的复杂度(1. 全部计数 2. 对计数进行排序 3. 取最好的)是 O(n+m*log(m)),其中 m 是你的单词中不同单词的数量。文本。 log(m) 比 (n/m) 小得多,因此仍然是 O(n)。

实际上,长步正在计数。

If what you're after is the list of k most frequent words in your text for any practical k and for any natural langage, then the complexity of your algorithm is not relevant.

Just sample, say, a few million words from your text, process that with any algorithm in a matter of seconds, and the most frequent counts will be very accurate.

As a side note, the complexity of the dummy algorithm (1. count all 2. sort the counts 3. take the best) is O(n+m*log(m)), where m is the number of different words in your text. log(m) is much smaller than (n/m), so it remains O(n).

Practically, the long step is counting.

丢了幸福的猪 2024-07-14 04:14:53
  1. 利用内存高效的数据结构来存储单词
  2. 使用MaxHeap,找到前K个频繁单词。

这是代码

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

import com.nadeem.app.dsa.adt.Trie;
import com.nadeem.app.dsa.adt.Trie.TrieEntry;
import com.nadeem.app.dsa.adt.impl.TrieImpl;

public class TopKFrequentItems {

private int maxSize;

private Trie trie = new TrieImpl();
private PriorityQueue<TrieEntry> maxHeap;

public TopKFrequentItems(int k) {
    this.maxSize = k;
    this.maxHeap = new PriorityQueue<TrieEntry>(k, maxHeapComparator());
}

private Comparator<TrieEntry> maxHeapComparator() {
    return new Comparator<TrieEntry>() {
        @Override
        public int compare(TrieEntry o1, TrieEntry o2) {
            return o1.frequency - o2.frequency;
        }           
    };
}

public void add(String word) {
    this.trie.insert(word);
}

public List<TopK> getItems() {

    for (TrieEntry trieEntry : this.trie.getAll()) {
        if (this.maxHeap.size() < this.maxSize) {
            this.maxHeap.add(trieEntry);
        } else if (this.maxHeap.peek().frequency < trieEntry.frequency) {
            this.maxHeap.remove();
            this.maxHeap.add(trieEntry);
        }
    }
    List<TopK> result = new ArrayList<TopK>();
    for (TrieEntry entry : this.maxHeap) {
        result.add(new TopK(entry));
    }       
    return result;
}

public static class TopK {
    public String item;
    public int frequency;

    public TopK(String item, int frequency) {
        this.item = item;
        this.frequency = frequency;
    }
    public TopK(TrieEntry entry) {
        this(entry.word, entry.frequency);
    }
    @Override
    public String toString() {
        return String.format("TopK [item=%s, frequency=%s]", item, frequency);
    }
    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + frequency;
        result = prime * result + ((item == null) ? 0 : item.hashCode());
        return result;
    }
    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        TopK other = (TopK) obj;
        if (frequency != other.frequency)
            return false;
        if (item == null) {
            if (other.item != null)
                return false;
        } else if (!item.equals(other.item))
            return false;
        return true;
    }

}   

}

这是单元测试

@Test
public void test() {
    TopKFrequentItems stream = new TopKFrequentItems(2);

    stream.add("hell");
    stream.add("hello");
    stream.add("hello");
    stream.add("hello");
    stream.add("hello");
    stream.add("hello");
    stream.add("hero");
    stream.add("hero");
    stream.add("hero");
    stream.add("hello");
    stream.add("hello");
    stream.add("hello");
    stream.add("home");
    stream.add("go");
    stream.add("go");
    assertThat(stream.getItems()).hasSize(2).contains(new TopK("hero", 3), new TopK("hello", 8));
}

有关更多详细信息,请参阅 此测试用例

  1. Utilize memory efficient data structure to store the words
  2. Use MaxHeap, to find the top K frequent words.

Here is the code

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

import com.nadeem.app.dsa.adt.Trie;
import com.nadeem.app.dsa.adt.Trie.TrieEntry;
import com.nadeem.app.dsa.adt.impl.TrieImpl;

public class TopKFrequentItems {

private int maxSize;

private Trie trie = new TrieImpl();
private PriorityQueue<TrieEntry> maxHeap;

public TopKFrequentItems(int k) {
    this.maxSize = k;
    this.maxHeap = new PriorityQueue<TrieEntry>(k, maxHeapComparator());
}

private Comparator<TrieEntry> maxHeapComparator() {
    return new Comparator<TrieEntry>() {
        @Override
        public int compare(TrieEntry o1, TrieEntry o2) {
            return o1.frequency - o2.frequency;
        }           
    };
}

public void add(String word) {
    this.trie.insert(word);
}

public List<TopK> getItems() {

    for (TrieEntry trieEntry : this.trie.getAll()) {
        if (this.maxHeap.size() < this.maxSize) {
            this.maxHeap.add(trieEntry);
        } else if (this.maxHeap.peek().frequency < trieEntry.frequency) {
            this.maxHeap.remove();
            this.maxHeap.add(trieEntry);
        }
    }
    List<TopK> result = new ArrayList<TopK>();
    for (TrieEntry entry : this.maxHeap) {
        result.add(new TopK(entry));
    }       
    return result;
}

public static class TopK {
    public String item;
    public int frequency;

    public TopK(String item, int frequency) {
        this.item = item;
        this.frequency = frequency;
    }
    public TopK(TrieEntry entry) {
        this(entry.word, entry.frequency);
    }
    @Override
    public String toString() {
        return String.format("TopK [item=%s, frequency=%s]", item, frequency);
    }
    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + frequency;
        result = prime * result + ((item == null) ? 0 : item.hashCode());
        return result;
    }
    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        TopK other = (TopK) obj;
        if (frequency != other.frequency)
            return false;
        if (item == null) {
            if (other.item != null)
                return false;
        } else if (!item.equals(other.item))
            return false;
        return true;
    }

}   

}

Here is the unit tests

@Test
public void test() {
    TopKFrequentItems stream = new TopKFrequentItems(2);

    stream.add("hell");
    stream.add("hello");
    stream.add("hello");
    stream.add("hello");
    stream.add("hello");
    stream.add("hello");
    stream.add("hero");
    stream.add("hero");
    stream.add("hero");
    stream.add("hello");
    stream.add("hello");
    stream.add("hello");
    stream.add("home");
    stream.add("go");
    stream.add("go");
    assertThat(stream.getItems()).hasSize(2).contains(new TopK("hero", 3), new TopK("hello", 8));
}

For more details refer this test case

执笔绘流年 2024-07-14 04:14:53
  1. 遍历整个单词序列时,使用哈希表记录所有单词的频率。 在这个阶段,键是“词”,值是“词频”。 这需要 O(n) 时间。这与上面解释的每个相同

  2. 在 hashmap 中插入本身时,保持大小为 10(k=10) 的 Treeset(特定于 java,每种语言都有实现)以保持最常用的 10 个单词。 直到大小小于10,继续添加。 如果大小等于 10,如果插入的元素大于最小元素(即第一个元素)。 如果是,则删除它并插入新元素

要限制树集的大小,请参阅 此链接

  1. use a Hash table to record all words' frequency while traverse the whole word sequence. In this phase, the key is "word" and the value is "word-frequency". This takes O(n) time.This is same as every one explained above

  2. While insertion itself in hashmap , keep the Treeset(specific to java, there are implementations in every language) of size 10(k=10) to keep the top 10 frequent words. Till size is less than 10, keep adding it. If size equal to 10, if inserted element is greater than minimum element i.e. first element. If yes remove it and insert new element

To restrict the size of treeset see this link

喵星人汪星人 2024-07-14 04:14:53

假设我们有一个单词序列“ad”“ad”“boy”“big”“bad”“com”“come”“cold”。 并且K=2。
正如您提到的“使用单词的第一个字母进行分区”,我们得到
(“广告”,“广告”)(“男孩”,“大”,“坏”)(“com”“来”“冷”)
“然后使用下一个字符划分最大的多词集,直到拥有 k 个单词集。”
它将分区(“boy”,“big”,“bad”)(“com”“come”“cold”),第一个分区(“ad”,“ad”)被错过,而“ad”实际上是最常见的词。

也许我误解了你的观点。 您能详细说明一下您的分区过程吗?

Suppose we have a word sequence "ad" "ad" "boy" "big" "bad" "com" "come" "cold". And K=2.
as you mentioned "partitioning using the first letter of words", we got
("ad", "ad") ("boy", "big", "bad") ("com" "come" "cold")
"then partitioning the largest multi-word set using the next character until you have k single-word sets."
it will partition ("boy", "big", "bad") ("com" "come" "cold"), the first partition ("ad", "ad") is missed, while "ad" is actually the most frequent word.

Perhaps I misunderstand your point. Can you please detail your process about partition?

装纯掩盖桑 2024-07-14 04:14:53

我相信这个问题可以通过 O(n) 算法来解决。 我们可以即时进行排序。 换句话说,这种情况下的排序是传统排序问题的子问题,因为每次访问哈希表时只有一个计数器加一。 最初,列表已排序,因为所有计数器均为零。 当我们不断增加哈希表中的计数器时,我们会记录另一个按频率排序的哈希值数组,如下所示。 每次我们递增计数器时,我们都会检查其在排序数组中的索引,并检查其计数是否超过列表中的前一个。 如果是这样,我们交换这两个元素。 这样我们就得到了一个至多 O(n) 的解,其中 n 是原始文本中的单词数。

I believe this problem can be solved by an O(n) algorithm. We could make the sorting on the fly. In other words, the sorting in that case is a sub-problem of the traditional sorting problem since only one counter gets incremented by one every time we access the hash table. Initially, the list is sorted since all counters are zero. As we keep incrementing counters in the hash table, we bookkeep another array of hash values ordered by frequency as follows. Every time we increment a counter, we check its index in the ranked array and check if its count exceeds its predecessor in the list. If so, we swap these two elements. As such we obtain a solution that is at most O(n) where n is the number of words in the original text.

南城旧梦 2024-07-14 04:14:53

我也曾为此苦苦挣扎,并受到@aly 的启发。 我们可以维护一个预先排序的单词列表 (List>),而不是事后排序,并且该单词将位于集合中的位置 X,其中 X 是单词的当前计数。单词。 一般来说,它的工作原理如下:

  1. 对于每个单词,将其存储为其出现位置映射的一部分:Map
  2. 然后根据计数,将其从之前的计数集中移除,并添加到新的计数集中。

这样做的缺点是列表可能很大 - 可以通过使用 TreeMap> 进行优化 - 但这会增加一些开销。 最终我们可以混合使用 HashMap 或我们自己的数据结构。

代码

public class WordFrequencyCounter {
    private static final int WORD_SEPARATOR_MAX = 32; // UNICODE 0000-001F: control chars
    Map<String, MutableCounter> counters = new HashMap<String, MutableCounter>();
    List<Set<String>> reverseCounters = new ArrayList<Set<String>>();

    private static class MutableCounter {
        int i = 1;
    }

    public List<String> countMostFrequentWords(String text, int max) {
        int lastPosition = 0;
        int length = text.length();
        for (int i = 0; i < length; i++) {
            char c = text.charAt(i);
            if (c <= WORD_SEPARATOR_MAX) {
                if (i != lastPosition) {
                    String word = text.substring(lastPosition, i);
                    MutableCounter counter = counters.get(word);
                    if (counter == null) {
                        counter = new MutableCounter();
                        counters.put(word, counter);
                    } else {
                        Set<String> strings = reverseCounters.get(counter.i);
                        strings.remove(word);
                        counter.i ++;
                    }
                    addToReverseLookup(counter.i, word);
                }
                lastPosition = i + 1;
            }
        }

        List<String> ret = new ArrayList<String>();
        int count = 0;
        for (int i = reverseCounters.size() - 1; i >= 0; i--) {
            Set<String> strings = reverseCounters.get(i);
            for (String s : strings) {
                ret.add(s);
                System.out.print(s + ":" + i);
                count++;
                if (count == max) break;
            }
            if (count == max) break;
        }
        return ret;
    }

    private void addToReverseLookup(int count, String word) {
        while (count >= reverseCounters.size()) {
            reverseCounters.add(new HashSet<String>());
        }
        Set<String> strings = reverseCounters.get(count);
        strings.add(word);
    }

}

I was struggling with this as well and get inspired by @aly. Instead of sorting afterwards, we can just maintain a presorted list of words (List<Set<String>>) and the word will be in the set at position X where X is the current count of the word. In generally, here's how it works:

  1. for each word, store it as part of map of it's occurrence: Map<String, Integer>.
  2. then, based on the count, remove it from the previous count set, and add it into the new count set.

The drawback of this is the list maybe big - can be optimized by using a TreeMap<Integer, Set<String>> - but this will add some overhead. Ultimately we can use a mix of HashMap or our own data structure.

The code

public class WordFrequencyCounter {
    private static final int WORD_SEPARATOR_MAX = 32; // UNICODE 0000-001F: control chars
    Map<String, MutableCounter> counters = new HashMap<String, MutableCounter>();
    List<Set<String>> reverseCounters = new ArrayList<Set<String>>();

    private static class MutableCounter {
        int i = 1;
    }

    public List<String> countMostFrequentWords(String text, int max) {
        int lastPosition = 0;
        int length = text.length();
        for (int i = 0; i < length; i++) {
            char c = text.charAt(i);
            if (c <= WORD_SEPARATOR_MAX) {
                if (i != lastPosition) {
                    String word = text.substring(lastPosition, i);
                    MutableCounter counter = counters.get(word);
                    if (counter == null) {
                        counter = new MutableCounter();
                        counters.put(word, counter);
                    } else {
                        Set<String> strings = reverseCounters.get(counter.i);
                        strings.remove(word);
                        counter.i ++;
                    }
                    addToReverseLookup(counter.i, word);
                }
                lastPosition = i + 1;
            }
        }

        List<String> ret = new ArrayList<String>();
        int count = 0;
        for (int i = reverseCounters.size() - 1; i >= 0; i--) {
            Set<String> strings = reverseCounters.get(i);
            for (String s : strings) {
                ret.add(s);
                System.out.print(s + ":" + i);
                count++;
                if (count == max) break;
            }
            if (count == max) break;
        }
        return ret;
    }

    private void addToReverseLookup(int count, String word) {
        while (count >= reverseCounters.size()) {
            reverseCounters.add(new HashSet<String>());
        }
        Set<String> strings = reverseCounters.get(count);
        strings.add(word);
    }

}
素手挽清风 2024-07-14 04:14:53

我刚刚找到这个问题的其他解决方案。 但我不确定这是正确的。
解决方案:

  1. 使用哈希表记录所有单词的频率 T(n) = O(n)
  2. 选择哈希表的前 k 个元素,并将它们存储在一个缓冲区中(其空间 = k)。 T(n) = O(k)
  3. 每次,我们首先需要找到缓冲区当前的最小元素,并将缓冲区的最小元素与哈希表的 (n - k) 个元素一一比较。 如果哈希表的元素大于缓冲区的最小元素,则删除当前缓冲区的最小值,并添加哈希表的元素。 所以每次我们在缓冲区中找到最小的一个需要 T(n) = O(k),遍历整个哈希表需要 T(n) = O(n - k)。 所以这个过程的整个时间复杂度是 T(n) = O((nk) * k)。
  4. 遍历整个哈希表后,结果就在这个缓冲区中。
  5. 整个时间复杂度:T(n) = O(n) + O(k) + O(kn - k^2) = O(kn + n - k^2 + k)。 因为,一般来说,k确实小于n。 因此对于这个解决方案,时间复杂度为T(n) = O(kn)。 当 k 非常小时,这就是线性时间。 这样对吗? 我真的不确定。

I just find out the other solution for this problem. But I am not sure it is right.
Solution:

  1. Use a Hash table to record all words' frequency T(n) = O(n)
  2. Choose first k elements of hash table, and restore them in one buffer (whose space = k). T(n) = O(k)
  3. Each time, firstly we need find the current min element of the buffer, and just compare the min element of the buffer with the (n - k) elements of hash table one by one. If the element of hash table is greater than this min element of buffer, then drop the current buffer's min, and add the element of the hash table. So each time we find the min one in the buffer need T(n) = O(k), and traverse the whole hash table need T(n) = O(n - k). So the whole time complexity for this process is T(n) = O((n-k) * k).
  4. After traverse the whole hash table, the result is in this buffer.
  5. The whole time complexity: T(n) = O(n) + O(k) + O(kn - k^2) = O(kn + n - k^2 + k). Since, k is really smaller than n in general. So for this solution, the time complexity is T(n) = O(kn). That is linear time, when k is really small. Is it right? I am really not sure.
陈甜 2024-07-14 04:14:53

尝试考虑特殊的数据结构来解决此类问题。 在这种情况下,像 trie 这样的特殊树以特定方式存储字符串,非常有效。 或者构建自己的解决方案的第二种方法,例如计算单词数。 我猜这 TB 数据是英文的,那么我们一般有大约 600,000 个单词,因此可以只存储这些单词并计算哪些字符串会重复 + 这个解决方案将需要正则表达式来消除一些特殊字符。 我很确定第一个解决方案会更快。

http://en.wikipedia.org/wiki/Trie

Try to think of special data structure to approach this kind of problems. In this case special kind of tree like trie to store strings in specific way, very efficient. Or second way to build your own solution like counting words. I guess this TB of data would be in English then we do have around 600,000 words in general so it'll be possible to store only those words and counting which strings would be repeated + this solution will need regex to eliminate some special characters. First solution will be faster, I'm pretty sure.

http://en.wikipedia.org/wiki/Trie

懷念過去 2024-07-14 04:14:53

这是一个有趣的搜索想法,我可以找到与 Top-K 相关的这篇论文 https://icmi.cs.ucsb.edu/research/tech_reports/reports/2005-23.pdf

还有一个它的实现 此处

This is an interesting idea to search and I could find this paper related to Top-K https://icmi.cs.ucsb.edu/research/tech_reports/reports/2005-23.pdf

Also there is an implementation of it here.

鲜血染红嫁衣 2024-07-14 04:14:53

获取最常用单词出现次数的最简单代码。

 function strOccurence(str){
    var arr = str.split(" ");
    var length = arr.length,temp = {},max; 
    while(length--){
    if(temp[arr[length]] == undefined && arr[length].trim().length > 0)
    {
        temp[arr[length]] = 1;
    }
    else if(arr[length].trim().length > 0)
    {
        temp[arr[length]] = temp[arr[length]] + 1;

    }
}
    console.log(temp);
    var max = [];
    for(i in temp)
    {
        max[temp[i]] = i;
    }
    console.log(max[max.length])
   //if you want second highest
   console.log(max[max.length - 2])
}

Simplest code to get the occurrence of most frequently used word.

 function strOccurence(str){
    var arr = str.split(" ");
    var length = arr.length,temp = {},max; 
    while(length--){
    if(temp[arr[length]] == undefined && arr[length].trim().length > 0)
    {
        temp[arr[length]] = 1;
    }
    else if(arr[length].trim().length > 0)
    {
        temp[arr[length]] = temp[arr[length]] + 1;

    }
}
    console.log(temp);
    var max = [];
    for(i in temp)
    {
        max[temp[i]] = i;
    }
    console.log(max[max.length])
   //if you want second highest
   console.log(max[max.length - 2])
}
画▽骨i 2024-07-14 04:14:53

在这些情况下,我建议使用 Java 内置功能。 因为,它们已经经过充分测试并且稳定。 在这个问题中,我使用HashMap数据结构来查找单词的重复。 然后,我将结果推送到对象数组中。 我通过 Arrays.sort() 对对象进行排序并打印前 k 个单词及其重复次数。

import java.io.*;
import java.lang.reflect.Array;
import java.util.*;

public class TopKWordsTextFile {

    static class SortObject implements Comparable<SortObject>{

        private String key;
        private int value;

        public SortObject(String key, int value) {
            super();
            this.key = key;
            this.value = value;
        }

        @Override
        public int compareTo(SortObject o) {
            //descending order
            return o.value - this.value;
        }
    }


    public static void main(String[] args) {
        HashMap<String,Integer> hm = new HashMap<>();
        int k = 1;
        try {
            BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("words.in")));

            String line;
            while ((line = br.readLine()) != null) {
                // process the line.
                //System.out.println(line);
                String[] tokens = line.split(" ");
                for(int i=0; i<tokens.length; i++){
                    if(hm.containsKey(tokens[i])){
                        //If the key already exists
                        Integer prev = hm.get(tokens[i]);
                        hm.put(tokens[i],prev+1);
                    }else{
                        //If the key doesn't exist
                        hm.put(tokens[i],1);
                    }
                }
            }
            //Close the input
            br.close();
            //Print all words with their repetitions. You can use 3 for printing top 3 words.
            k = hm.size();
            // Get a set of the entries
            Set set = hm.entrySet();
            // Get an iterator
            Iterator i = set.iterator();
            int index = 0;
            // Display elements
            SortObject[] objects = new SortObject[hm.size()];
            while(i.hasNext()) {
                Map.Entry e = (Map.Entry)i.next();
                //System.out.print("Key: "+e.getKey() + ": ");
                //System.out.println(" Value: "+e.getValue());
                String tempS = (String) e.getKey();
                int tempI = (int) e.getValue();
                objects[index] = new SortObject(tempS,tempI);
                index++;
            }
            System.out.println();
            //Sort the array
            Arrays.sort(objects);
            //Print top k
            for(int j=0; j<k; j++){
                System.out.println(objects[j].key+":"+objects[j].value);
            }


        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}

有关更多信息,请访问 https://github.com /m-vahidalizadeh/foundations/blob/master/src/algorithms/TopKWordsTextFile.java。 我希望它有帮助。

In these situations, I recommend to use Java built-in features. Since, they are already well tested and stable. In this problem, I find the repetitions of the words by using HashMap data structure. Then, I push the results to an array of objects. I sort the object by Arrays.sort() and print the top k words and their repetitions.

import java.io.*;
import java.lang.reflect.Array;
import java.util.*;

public class TopKWordsTextFile {

    static class SortObject implements Comparable<SortObject>{

        private String key;
        private int value;

        public SortObject(String key, int value) {
            super();
            this.key = key;
            this.value = value;
        }

        @Override
        public int compareTo(SortObject o) {
            //descending order
            return o.value - this.value;
        }
    }


    public static void main(String[] args) {
        HashMap<String,Integer> hm = new HashMap<>();
        int k = 1;
        try {
            BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("words.in")));

            String line;
            while ((line = br.readLine()) != null) {
                // process the line.
                //System.out.println(line);
                String[] tokens = line.split(" ");
                for(int i=0; i<tokens.length; i++){
                    if(hm.containsKey(tokens[i])){
                        //If the key already exists
                        Integer prev = hm.get(tokens[i]);
                        hm.put(tokens[i],prev+1);
                    }else{
                        //If the key doesn't exist
                        hm.put(tokens[i],1);
                    }
                }
            }
            //Close the input
            br.close();
            //Print all words with their repetitions. You can use 3 for printing top 3 words.
            k = hm.size();
            // Get a set of the entries
            Set set = hm.entrySet();
            // Get an iterator
            Iterator i = set.iterator();
            int index = 0;
            // Display elements
            SortObject[] objects = new SortObject[hm.size()];
            while(i.hasNext()) {
                Map.Entry e = (Map.Entry)i.next();
                //System.out.print("Key: "+e.getKey() + ": ");
                //System.out.println(" Value: "+e.getValue());
                String tempS = (String) e.getKey();
                int tempI = (int) e.getValue();
                objects[index] = new SortObject(tempS,tempI);
                index++;
            }
            System.out.println();
            //Sort the array
            Arrays.sort(objects);
            //Print top k
            for(int j=0; j<k; j++){
                System.out.println(objects[j].key+":"+objects[j].value);
            }


        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}

For more information, please visit https://github.com/m-vahidalizadeh/foundations/blob/master/src/algorithms/TopKWordsTextFile.java. I hope it helps.

远昼 2024-07-14 04:14:53
**

C++11上述思想的实现

**

class Solution {
public:
vector<int> topKFrequent(vector<int>& nums, int k) {

    unordered_map<int,int> map;
    for(int num : nums){
        map[num]++;
    }

    vector<int> res;
    // we use the priority queue, like the max-heap , we will keep (size-k) smallest elements in the queue
    // pair<first, second>: first is frequency,  second is number 
    priority_queue<pair<int,int>> pq; 
    for(auto it = map.begin(); it != map.end(); it++){
        pq.push(make_pair(it->second, it->first));

        // onece the size bigger than size-k, we will pop the value, which is the top k frequent element value 

        if(pq.size() > (int)map.size() - k){
            res.push_back(pq.top().second);
            pq.pop();
        }
    }
    return res;

}

};

**

C++11 Implementation of the above thought

**

class Solution {
public:
vector<int> topKFrequent(vector<int>& nums, int k) {

    unordered_map<int,int> map;
    for(int num : nums){
        map[num]++;
    }

    vector<int> res;
    // we use the priority queue, like the max-heap , we will keep (size-k) smallest elements in the queue
    // pair<first, second>: first is frequency,  second is number 
    priority_queue<pair<int,int>> pq; 
    for(auto it = map.begin(); it != map.end(); it++){
        pq.push(make_pair(it->second, it->first));

        // onece the size bigger than size-k, we will pop the value, which is the top k frequent element value 

        if(pq.size() > (int)map.size() - k){
            res.push_back(pq.top().second);
            pq.pop();
        }
    }
    return res;

}

};

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文