Most efficient way to find the top K frequent words in a big word sequence

Posted 2024-07-07 04:14:53

Input: A positive integer K and a big text. The text can actually be viewed as a word sequence, so we don't have to worry about how to break it down into words.
Output: The most frequent K words in the text.

My thinking is like this.

  1. Use a hash table to record all words' frequencies while traversing the whole word sequence. In this phase, the key is "word" and the value is "word-frequency". This takes O(n) time.

  2. Sort the (word, word-frequency) pairs, with "word-frequency" as the key. This takes O(n*lg(n)) time with a normal sorting algorithm.

  3. After sorting, we just take the first K words. This takes O(K) time.

To summarize, the total time is O(n + n*lg(n) + K). Since K is surely smaller than N, this is actually O(n*lg(n)).

We can improve this. Actually, we only want the top K words; the other words' frequencies are of no concern to us. So we can use a "partial heap sort". For steps 2) and 3), instead of a full sort, we change them to:

2') Build a heap of (word, word-frequency) pairs with "word-frequency" as the key. Building the heap takes O(n) time;

3') Extract the top K words from the heap. Each extraction is O(lg(n)), so the total time is O(k*lg(n)).

To summarize, this solution costs O(n + k*lg(n)) time.

This is just my thought; I haven't found a way to improve step 1).
I hope some information retrieval experts can shed more light on this question.
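
Here is a minimal Java sketch of the approach above (hash counting, then heap extraction; the names are illustrative). One caveat: Java's PriorityQueue inserts element by element, so building it costs O(m*lg(m)) over the m unique words rather than the O(n) bottom-up heapify assumed in step 2').

import java.util.*;

public class TopKWords {
    public static List<String> topK(List<String> words, int k) {
        // Step 1: count frequencies in O(n).
        Map<String, Integer> freq = new HashMap<>();
        for (String w : words) {
            freq.merge(w, 1, Integer::sum);
        }
        // Step 2': heap of (word, frequency) entries, highest frequency on top.
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(
                (a, b) -> Integer.compare(b.getValue(), a.getValue()));
        heap.addAll(freq.entrySet());
        // Step 3': extract the top K words, O(k*lg(m)).
        List<String> result = new ArrayList<>();
        for (int i = 0; i < k && !heap.isEmpty(); i++) {
            result.add(heap.poll().getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(topK(Arrays.asList("a", "b", "a", "c", "a", "b"), 2)); // [a, b]
    }
}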

Comments (19)

阳光下的泡沫是彩色的 2024-07-14 04:14:53

This can be done in O(n) time

Solution 1:

Steps:

  1. Count words and hash them, which will end up in a structure like this:

    var hash = {
      "I" : 13,
      "like" : 3,
      "meow" : 3,
      "geek" : 3,
      "burger" : 2,
      "cat" : 1,
      "foo" : 100,
      ...
      ...
    }
  2. Traverse the hash and find the most frequently used word (in this case "foo", 100), then create an array of that size

  3. Then we can traverse the hash again and use each word's occurrence count as an array index; if there is nothing at that index yet, create an array there, else append the word to it. Then we end up with an array like:

      0   1      2            3                  100
    [[ ],[cat],[burger],[like, meow, geek],[]...[foo]]
    
  4. Then just traverse the array from the end, and collect the k words.
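
A rough Java sketch of Solution 1 (illustrative names; it assumes the maximum count is small enough that an array of that size fits in memory):

import java.util.*;

public class BucketTopK {
    public static List<String> topK(List<String> words, int k) {
        Map<String, Integer> freq = new HashMap<>();
        int maxCount = 0;
        for (String w : words) {
            int c = freq.merge(w, 1, Integer::sum);
            maxCount = Math.max(maxCount, c);
        }
        // buckets.get(c) holds every word that occurs exactly c times
        List<List<String>> buckets = new ArrayList<>();
        for (int c = 0; c <= maxCount; c++) buckets.add(new ArrayList<>());
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            buckets.get(e.getValue()).add(e.getKey());
        }
        // walk the buckets from the highest count down, collecting k words
        List<String> result = new ArrayList<>();
        for (int c = maxCount; c >= 1 && result.size() < k; c--) {
            for (String w : buckets.get(c)) {
                if (result.size() == k) break;
                result.add(w);
            }
        }
        return result;
    }
}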

Solution 2:

Steps:

  1. Same as above
  2. Use a min heap and keep its size at k. For each word in the hash, compare its occurrence count with the min: 1) if it's greater than the min value, remove the min (if the size of the min heap is equal to k) and insert the new (word, count) entry into the min heap; 2) otherwise, do nothing.
  3. After traversing the hash, we just convert the min heap to an array and return it.

旧瑾黎汐 2024-07-14 04:14:53

You're generally not going to get a better runtime than the solution you've described. You have to do at least O(n) work to evaluate all the words, and then O(k) extra work to find the top k terms.

If your problem set is really big, you can use a distributed solution such as map/reduce. Have n map workers count frequencies on 1/nth of the text each, and for each word, send it to one of m reducer workers calculated based on the hash of the word. The reducers then sum the counts. Merge sort over the reducers' outputs will give you the most popular words in order of popularity.
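
As a toy, single-process illustration of the partitioning step (not a real map/reduce job; the names are illustrative): hashing routes every occurrence of a word to the same reducer, so the per-reducer counts can be summed independently and merged at the end.

import java.util.*;

public class HashPartitionCount {
    // Route each word to one of m "reducers" by its hash; in a real system
    // each returned map would live on (and be summed by) a separate worker.
    public static List<Map<String, Integer>> countByPartition(List<String> words, int m) {
        List<Map<String, Integer>> reducers = new ArrayList<>();
        for (int i = 0; i < m; i++) reducers.add(new HashMap<>());
        for (String w : words) {
            int r = Math.floorMod(w.hashCode(), m);
            reducers.get(r).merge(w, 1, Integer::sum);
        }
        return reducers;
    }
}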

旧人九事 2024-07-14 04:14:53

A small variation on your solution yields an O(n) algorithm if we don't care about ranking the top K, and an O(n + k*lg(k)) solution if we do. I believe both of these bounds are optimal within a constant factor.

The optimization here comes again after we run through the list, inserting into the hash table. We can use the median of medians algorithm to select the Kth largest element in the list. This algorithm is provably O(n).

After selecting the Kth largest element, we partition the list around that element just as in quicksort. This is obviously also O(n). Anything on the "left" side of the pivot is in our group of K elements, so we're done (we can simply throw away everything else as we go along).

So this strategy is:

  1. Go through each word and insert it into a hash table: O(n)
  2. Select the Kth largest element: O(n)
  3. Partition around that element: O(n)

If you want to rank the K elements, simply sort them with any efficient comparison sort in O(k * lg(k)) time, yielding a total run time of O(n+k * lg(k)).

The O(n) time bound is optimal within a constant factor because we must examine each word at least once.

The O(n + k * lg(k)) time bound is also optimal because there is no comparison-based way to sort k elements in less than k * lg(k) time.
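
A Java sketch of the selection step, using a randomized pivot for brevity (expected O(n)); the median-of-medians pivot rule described above is what makes the bound worst-case O(n). For simplicity it selects over a plain array of counts; in practice you would carry the (word, count) pairs through the partition.

import java.util.*;

public class SelectTopK {
    private static final Random RNG = new Random();

    // Rearranges counts so the k largest occupy counts[0..k-1], unordered.
    public static void selectTopK(int[] counts, int k) {
        int lo = 0, hi = counts.length - 1;
        while (lo < hi) {
            int p = partitionDesc(counts, lo, hi, lo + RNG.nextInt(hi - lo + 1));
            if (p == k - 1) return;      // exactly the k largest sit to the left
            if (p < k - 1) lo = p + 1; else hi = p - 1;
        }
    }

    // Lomuto partition in descending order; returns the pivot's final index.
    private static int partitionDesc(int[] a, int lo, int hi, int pivotIdx) {
        int pivot = a[pivotIdx];
        swap(a, pivotIdx, hi);
        int store = lo;
        for (int i = lo; i < hi; i++) {
            if (a[i] > pivot) swap(a, i, store++);
        }
        swap(a, store, hi);
        return store;
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}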

若有似无的小暗淡 2024-07-14 04:14:53

If your "big word list" is big enough, you can simply sample and get estimates. Otherwise, I like hash aggregation.

Edit:

By sample I mean choose some subset of pages and calculate the most frequent word in those pages. Provided you select the pages in a reasonable way and select a statistically significant sample, your estimates of the most frequent words should be reasonable.

This approach is really only reasonable if you have so much data that processing it all is just kind of silly. If you only have a few megs, you should be able to tear through the data and calculate an exact answer without breaking a sweat rather than bothering to calculate an estimate.

能怎样 2024-07-14 04:14:53

You can cut down the time further by partitioning using the first letter of words, then partitioning the largest multi-word set using the next character until you have k single-word sets. You would use a sort of 256-way tree with lists of partial/complete words at the leaves. You would need to be very careful not to cause string copies everywhere.

This algorithm is O(m), where m is the number of characters. It avoids that dependence on k, which is very nice for large k [by the way your posted running time is wrong, it should be O(n*lg(k)), and I'm not sure what that is in terms of m].

If you run both algorithms side by side you will get what I'm pretty sure is an asymptotically optimal O(min(m, n*lg(k))) algorithm, but mine should be faster on average because it doesn't involve hashing or sorting.

望她远 2024-07-14 04:14:53

You have a bug in your description: Counting takes O(n) time, but sorting takes O(m*lg(m)), where m is the number of unique words. This is usually much smaller than the total number of words, so probably should just optimize how the hash is built.

那一片橙海, 2024-07-14 04:14:53

Your problem is the same as this one:
http://www.geeksforgeeks.org/find-the-k-most-frequent-words-from-a-file/

Use a trie and a min heap to solve it efficiently.

赤濁 2024-07-14 04:14:53

If what you're after is the list of the k most frequent words in your text for any practical k and for any natural language, then the complexity of your algorithm is not relevant.

Just sample, say, a few million words from your text, process that with any algorithm in a matter of seconds, and the most frequent counts will be very accurate.

As a side note, the complexity of the dummy algorithm (1. count all 2. sort the counts 3. take the best) is O(n+m*log(m)), where m is the number of different words in your text. log(m) is much smaller than (n/m), so it remains O(n).

In practice, the long step is the counting.

丢了幸福的猪 2024-07-14 04:14:53

  1. Utilize a memory-efficient data structure (a trie) to store the words
  2. Use a heap of size K to find the top K frequent words.

Here is the code

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

import com.nadeem.app.dsa.adt.Trie;
import com.nadeem.app.dsa.adt.Trie.TrieEntry;
import com.nadeem.app.dsa.adt.impl.TrieImpl;

public class TopKFrequentItems {

private int maxSize;

private Trie trie = new TrieImpl();
private PriorityQueue<TrieEntry> maxHeap;

public TopKFrequentItems(int k) {
    this.maxSize = k;
    this.maxHeap = new PriorityQueue<TrieEntry>(k, maxHeapComparator());
}

private Comparator<TrieEntry> maxHeapComparator() {
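    // Note: despite the method name, this comparator orders ascending by
    // frequency, so the PriorityQueue behaves as a min-heap; the least
    // frequent of the current top K sits at the head and is evicted first.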
    return new Comparator<TrieEntry>() {
        @Override
        public int compare(TrieEntry o1, TrieEntry o2) {
            return o1.frequency - o2.frequency;
        }           
    };
}

public void add(String word) {
    this.trie.insert(word);
}

public List<TopK> getItems() {

    for (TrieEntry trieEntry : this.trie.getAll()) {
        if (this.maxHeap.size() < this.maxSize) {
            this.maxHeap.add(trieEntry);
        } else if (this.maxHeap.peek().frequency < trieEntry.frequency) {
            this.maxHeap.remove();
            this.maxHeap.add(trieEntry);
        }
    }
    List<TopK> result = new ArrayList<TopK>();
    for (TrieEntry entry : this.maxHeap) {
        result.add(new TopK(entry));
    }       
    return result;
}

public static class TopK {
    public String item;
    public int frequency;

    public TopK(String item, int frequency) {
        this.item = item;
        this.frequency = frequency;
    }
    public TopK(TrieEntry entry) {
        this(entry.word, entry.frequency);
    }
    @Override
    public String toString() {
        return String.format("TopK [item=%s, frequency=%s]", item, frequency);
    }
    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + frequency;
        result = prime * result + ((item == null) ? 0 : item.hashCode());
        return result;
    }
    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        TopK other = (TopK) obj;
        if (frequency != other.frequency)
            return false;
        if (item == null) {
            if (other.item != null)
                return false;
        } else if (!item.equals(other.item))
            return false;
        return true;
    }

}   

}

Here is a unit test

@Test
public void test() {
    TopKFrequentItems stream = new TopKFrequentItems(2);

    stream.add("hell");
    stream.add("hello");
    stream.add("hello");
    stream.add("hello");
    stream.add("hello");
    stream.add("hello");
    stream.add("hero");
    stream.add("hero");
    stream.add("hero");
    stream.add("hello");
    stream.add("hello");
    stream.add("hello");
    stream.add("home");
    stream.add("go");
    stream.add("go");
    assertThat(stream.getItems()).hasSize(2).contains(new TopK("hero", 3), new TopK("hello", 8));
}

For more details, refer to this test case.

执笔绘流年 2024-07-14 04:14:53

  1. Use a hash table to record all words' frequencies while traversing the whole word sequence. In this phase, the key is "word" and the value is "word-frequency". This takes O(n) time. This is the same as everyone explained above.

  2. While inserting into the hashmap, keep a TreeSet (specific to Java; there are implementations in every language) of size 10 (k=10) holding the top 10 most frequent words. While its size is less than 10, keep adding. Once the size equals 10, if the inserted element is greater than the minimum element (i.e. the first element), remove the minimum and insert the new element.

To restrict the size of the TreeSet, see this link.
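
A minimal Java sketch of this idea (illustrative names). For simplicity it runs over the finished hash map rather than during insertion, since updating a word's count mid-stream would require removing its old entry from the set first; ties are broken by the word itself so that distinct words with equal counts don't collide in the TreeSet.

import java.util.*;

public class TreeSetTopK {
    public static List<String> topK(Map<String, Integer> freq, int k) {
        // order by count, then by word, so the least frequent entry is first()
        TreeSet<Map.Entry<String, Integer>> top = new TreeSet<>(
                Comparator.<Map.Entry<String, Integer>>comparingInt(Map.Entry::getValue)
                          .thenComparing(Map.Entry::getKey));
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            top.add(e);
            if (top.size() > k) {
                top.pollFirst();   // evict the current minimum
            }
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : top.descendingSet()) {
            result.add(e.getKey());   // most frequent first
        }
        return result;
    }
}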

喵星人汪星人 2024-07-14 04:14:53

Suppose we have the word sequence "ad" "ad" "boy" "big" "bad" "com" "come" "cold", and K = 2.
As you mentioned "partitioning using the first letter of words", we get
("ad", "ad") ("boy", "big", "bad") ("com", "come", "cold").
"Then partitioning the largest multi-word set using the next character until you have k single-word sets"
will partition ("boy", "big", "bad") and ("com", "come", "cold"), but the first partition ("ad", "ad") is missed, even though "ad" is actually the most frequent word.

Perhaps I misunderstand your point. Can you please detail your partitioning process?

装纯掩盖桑 2024-07-14 04:14:53

I believe this problem can be solved by an O(n) algorithm. We could do the sorting on the fly. In other words, the sorting in that case is a sub-problem of the traditional sorting problem, since only one counter gets incremented by one every time we access the hash table. Initially, the list is sorted, since all counters are zero. As we keep incrementing counters in the hash table, we bookkeep another array of hash values ordered by frequency, as follows. Every time we increment a counter, we check its index in the ranked array and check whether its count now exceeds its predecessor's in the list. If so, we swap these two elements. As such we obtain a solution that is at most O(n), where n is the number of words in the original text.
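
A rough Java sketch of this bookkeeping (illustrative names). One subtlety the paragraph glosses over: when there are ties, the incremented word must swap with the leftmost word holding its old count, not merely its immediate predecessor, or the array stops being sorted; tracking where each run of equal counts starts keeps every increment O(1).

import java.util.*;

public class RankedCounter {
    private final List<String> ranked = new ArrayList<>();          // words, sorted by count descending
    private final Map<String, Integer> pos = new HashMap<>();       // word -> index in ranked
    private final Map<String, Integer> count = new HashMap<>();     // word -> count
    private final Map<Integer, Integer> runStart = new HashMap<>(); // count -> leftmost index of its run

    public void increment(String w) {
        if (!count.containsKey(w)) {             // first occurrence: append with count 0, then bump
            count.put(w, 0);
            pos.put(w, ranked.size());
            ranked.add(w);
            runStart.putIfAbsent(0, ranked.size() - 1);
        }
        int c = count.get(w);
        int i = pos.get(w);
        int j = runStart.getOrDefault(c, i);     // leftmost index whose word has count c
        Collections.swap(ranked, i, j);          // move w to the front of its old run
        pos.put(ranked.get(i), i);
        pos.put(ranked.get(j), j);
        if (j + 1 < ranked.size() && count.get(ranked.get(j + 1)) == c) {
            runStart.put(c, j + 1);              // the old run now starts one slot later
        } else {
            runStart.remove(c);                  // w was the last word with count c
        }
        count.put(w, c + 1);
        runStart.putIfAbsent(c + 1, j);          // w starts (or extends the tail of) the c+1 run
    }

    public List<String> topK(int k) {
        return new ArrayList<>(ranked.subList(0, Math.min(k, ranked.size())));
    }
}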

南城旧梦 2024-07-14 04:14:53

I was struggling with this as well and got inspired by @aly. Instead of sorting afterwards, we can just maintain a presorted list of words (List<Set<String>>), where a word sits in the set at position X, X being the word's current count. In general, here's how it works:

  1. for each word, store it as part of a map of its occurrences: Map<String, Integer>.
  2. then, based on the count, remove it from the previous count set, and add it into the new count set.

The drawback of this is that the list may be big - this can be optimized by using a TreeMap<Integer, Set<String>> - but that will add some overhead. Ultimately we can use a mix of a HashMap or our own data structure.

The code

public class WordFrequencyCounter {
    private static final int WORD_SEPARATOR_MAX = 32; // UNICODE 0000-001F: control chars
    Map<String, MutableCounter> counters = new HashMap<String, MutableCounter>();
    List<Set<String>> reverseCounters = new ArrayList<Set<String>>();

    private static class MutableCounter {
        int i = 1;
    }

    public List<String> countMostFrequentWords(String text, int max) {
        int lastPosition = 0;
        int length = text.length();
        for (int i = 0; i < length; i++) {
            char c = text.charAt(i);
            if (c <= WORD_SEPARATOR_MAX) {
                if (i != lastPosition) {
                    String word = text.substring(lastPosition, i);
                    MutableCounter counter = counters.get(word);
                    if (counter == null) {
                        counter = new MutableCounter();
                        counters.put(word, counter);
                    } else {
                        Set<String> strings = reverseCounters.get(counter.i);
                        strings.remove(word);
                        counter.i ++;
                    }
                    addToReverseLookup(counter.i, word);
                }
                lastPosition = i + 1;
            }
        }
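        // Note: a trailing word not followed by a separator character is never
        // flushed; appending one separator to the input avoids this.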

        List<String> ret = new ArrayList<String>();
        int count = 0;
        for (int i = reverseCounters.size() - 1; i >= 0; i--) {
            Set<String> strings = reverseCounters.get(i);
            for (String s : strings) {
                ret.add(s);
                System.out.print(s + ":" + i);
                count++;
                if (count == max) break;
            }
            if (count == max) break;
        }
        return ret;
    }

    private void addToReverseLookup(int count, String word) {
        while (count >= reverseCounters.size()) {
            reverseCounters.add(new HashSet<String>());
        }
        Set<String> strings = reverseCounters.get(count);
        strings.add(word);
    }

}
素手挽清风 2024-07-14 04:14:53

I just found another solution for this problem, but I am not sure it is right.
Solution:

  1. Use a hash table to record all words' frequencies: T(n) = O(n)
  2. Choose the first k elements of the hash table and store them in one buffer (whose space = k). T(n) = O(k)
  3. Each time, first find the current min element of the buffer, and compare it with the remaining (n - k) elements of the hash table one by one. If an element of the hash table is greater than the buffer's min, drop the buffer's current min and add the hash table's element. Each time, finding the min in the buffer needs T(n) = O(k), and traversing the whole hash table needs T(n) = O(n - k), so the whole time complexity for this process is T(n) = O((n-k) * k).
  4. After traversing the whole hash table, the result is in this buffer.
  5. The whole time complexity: T(n) = O(n) + O(k) + O(kn - k^2) = O(kn + n - k^2 + k). Since k is generally much smaller than n, this solution's time complexity is T(n) = O(kn). That is linear time when k is really small. Is that right? I am really not sure.

陈甜 2024-07-14 04:14:53

Try to think of a special data structure to approach this kind of problem. In this case a special kind of tree, such as a trie, stores strings in a specific way and is very efficient. Or there is a second way: build your own solution, like counting words. I guess this TB of data would be in English, and we have around 600,000 words in general, so it would be possible to store only those words and count which strings are repeated; this solution will need a regex to eliminate some special characters. The first solution will be faster, I'm pretty sure.

http://en.wikipedia.org/wiki/Trie

懷念過去 2024-07-14 04:14:53

This is an interesting idea to search, and I found this paper related to Top-K: https://icmi.cs.ucsb.edu/research/tech_reports/reports/2005-23.pdf

Also there is an implementation of it here.

鲜血染红嫁衣 2024-07-14 04:14:53

Simplest code to get the occurrence of most frequently used word.

function strOccurence(str){
    var arr = str.split(" ");
    var length = arr.length, temp = {};
    while(length--){
        // skip empty tokens produced by repeated spaces
        if(arr[length].trim().length > 0)
        {
            temp[arr[length]] = (temp[arr[length]] || 0) + 1;
        }
    }
    console.log(temp);
    // index words by their count; words sharing a count overwrite each other
    var max = [];
    for(var i in temp)
    {
        max[temp[i]] = i;
    }
    console.log(max[max.length - 1]);   // the most frequent word
    // the runner-up, but only if its count is exactly one less than the max:
    console.log(max[max.length - 2]);
}

画▽骨i 2024-07-14 04:14:53

In these situations, I recommend using Java's built-in features, since they are already well tested and stable. In this problem, I find the repetitions of the words by using the HashMap data structure. Then I push the results into an array of objects, sort the objects with Arrays.sort(), and print the top k words and their repetitions.

import java.io.*;
import java.util.*;

public class TopKWordsTextFile {

    static class SortObject implements Comparable<SortObject>{

        private String key;
        private int value;

        public SortObject(String key, int value) {
            super();
            this.key = key;
            this.value = value;
        }

        @Override
        public int compareTo(SortObject o) {
            //descending order
            return o.value - this.value;
        }
    }


    public static void main(String[] args) {
        HashMap<String,Integer> hm = new HashMap<>();
        int k = 1;
        try {
            BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("words.in")));

            String line;
            while ((line = br.readLine()) != null) {
                // process the line.
                //System.out.println(line);
                String[] tokens = line.split(" ");
                for(int i=0; i<tokens.length; i++){
                    if(hm.containsKey(tokens[i])){
                        //If the key already exists
                        Integer prev = hm.get(tokens[i]);
                        hm.put(tokens[i],prev+1);
                    }else{
                        //If the key doesn't exist
                        hm.put(tokens[i],1);
                    }
                }
            }
            //Close the input
            br.close();
            //Print all words with their repetitions. You can use 3 for printing top 3 words.
            k = hm.size();
            // Copy the entries into an array of sortable objects
            SortObject[] objects = new SortObject[hm.size()];
            int index = 0;
            for (Map.Entry<String, Integer> entry : hm.entrySet()) {
                objects[index++] = new SortObject(entry.getKey(), entry.getValue());
            }
            System.out.println();
            //Sort the array
            Arrays.sort(objects);
            //Print top k
            for(int j=0; j<k; j++){
                System.out.println(objects[j].key+":"+objects[j].value);
            }


        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}

For more information, please visit https://github.com/m-vahidalizadeh/foundations/blob/master/src/algorithms/TopKWordsTextFile.java. I hope it helps.

远昼 2024-07-14 04:14:53

C++11 implementation of the above idea:

class Solution {
public:
vector<int> topKFrequent(vector<int>& nums, int k) {

    unordered_map<int,int> map;
    for(int num : nums){
        map[num]++;
    }

    vector<int> res;
    // max-heap priority queue: the (map.size() - k) least frequent elements
    // stay in the queue, while the k most frequent get popped into res
    // pair<first, second>: first is frequency, second is the number
    priority_queue<pair<int,int>> pq; 
    for(auto it = map.begin(); it != map.end(); it++){
        pq.push(make_pair(it->second, it->first));

        // once the size exceeds map.size() - k, pop the max; at most k-1 unseen
        // elements can still beat it, so every popped value is among the top k

        if(pq.size() > (int)map.size() - k){
            res.push_back(pq.top().second);
            pq.pop();
        }
    }
    return res;

}

};
