Is there a better way to count the frequency of every symbol in a file?

Posted 12-08 07:26


Okay, so, say I have a text file (not necessarily containing every possible symbol) and I'd like to calculate the frequency of each symbol. After calculating the frequencies, I need to access each symbol and its frequency, from most frequent to least frequent. The symbols are not necessarily ASCII characters; they could be arbitrary byte sequences, albeit all of the same length.

I was considering doing something like this (in pseudocode):

function add_to_heap (symbol)
    entry = heap.find(symbol)
    if (entry.exists? == true)
        entry.frequency++
    else
        symbol.frequency = 1
        heap.insert(symbol)

MaxBinaryHeap heap
while somefile != EOF
    symbol = read_byte(somefile)
    add_to_heap(symbol)
heap.sort_by_frequency()

while heap != empty
    root = heap.extract_root()
    do_stuff(root)

I was wondering: is there a better, simpler way to calculate and store how many times each symbol occurs in a file?


Answers (2)

自由如风 2024-12-15 07:26:55


You can always use a HashMap instead of the heap. That way you'll be performing O(1) operations for each symbol found instead of O(log n), where n is the number of items currently in the heap.

However, if the number of distinct symbols is bounded by a reasonable number (1 byte is ideal, 2 bytes should still be fine), you can just use an array of that size and again have O(1), but with a significantly lower constant cost.
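In Python terms, both variants this answer describes might look like the following sketch (function names are hypothetical, not from the answer):

```python
from collections import Counter

def count_symbols(path, symbol_size=1):
    """Hash-map counting: O(1) expected per symbol, one sort at the end."""
    counts = Counter()
    with open(path, "rb") as f:
        while True:
            symbol = f.read(symbol_size)
            if len(symbol) < symbol_size:
                break  # EOF (or a trailing partial symbol)
            counts[symbol] += 1
    # (symbol, frequency) pairs, most frequent first.
    return counts.most_common()

def count_bytes_array(path):
    """Array counting for 1-byte symbols: 256 slots, O(1) with a tiny constant."""
    counts = [0] * 256
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            for b in chunk:  # iterating over bytes yields ints 0..255
                counts[b] += 1
    return counts
```

The array variant avoids hashing entirely, which is the "significantly lower constant cost" the answer mentions; it only works when the symbol width is small enough that every possible symbol fits in a table.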

青衫负雪 2024-12-15 07:26:55


If you're looking for a "best" solution based on running time, here's what I'd suggest:

While you're reading the file, you should have your symbols sorted (or hashed) by the value of the symbols themselves, not their frequencies. This'll let you find the current symbol in your list of already-seen symbols quickly, rather than having to search through the entire list. You should also have that initial structure support fast inserts - I'd recommend a binary tree or a hash.

Once you've read all your symbols, switch your ordering to be based on the frequency counts. I'd read everything into an array and then perform an in-place sort, but there are a bunch of equivalent ways to do this.
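A minimal Python sketch of this two-phase idea (hash by symbol value while reading, then a single sort by frequency); the function name and parameters are hypothetical:

```python
def frequencies_sorted(data, symbol_size=2):
    # Phase 1: dictionary keyed by symbol value -> fast find/insert per symbol.
    counts = {}
    for i in range(0, len(data) - symbol_size + 1, symbol_size):
        symbol = data[i:i + symbol_size]
        counts[symbol] = counts.get(symbol, 0) + 1
    # Phase 2: dump everything into a list and sort in place by count, descending.
    pairs = list(counts.items())
    pairs.sort(key=lambda kv: kv[1], reverse=True)
    return pairs
```

The key point is that each structure is ordered by what that phase queries: symbol value while counting, frequency afterwards.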

Hope this helps!
