Is there a better way to calculate the frequency of all symbols in a file?
Okay, so, say I have a text file (not necessarily containing every possible symbol) and I'd like to calculate the frequency of each symbol and, after calculating the frequency, I then need to access each symbol and its frequency from most frequent to least frequent. The symbols are not necessarily ASCII characters, they could be arbitrary byte sequences, albeit all of the same length.
I was considering doing something like this (in pseudocode):
function add_to_heap (symbol)
    node = heap.find(symbol)
    if node exists
        node.frequency++
    else
        symbol.frequency = 1
        heap.insert(symbol)
MaxBinaryHeap heap
while somefile != EOF
    symbol = read_byte(somefile)
    add_to_heap(symbol)
heap.sort_by_frequency()
while heap.root != empty
    root = heap.extract_root()
    do_stuff(root)
I was wondering: is there a better, simpler way to calculate and store how many times each symbol occurs in a file?
You can always use a HashMap instead of the Heap. That way you'll be performing operations that are O(1) for each symbol found instead of O(log n), where n is the number of items currently on the heap.
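A minimal sketch of the hash-map approach in Python, assuming fixed-length symbols read from a binary file (the function name and `symbol_len` parameter are illustrative, not from the original post). Counting is O(1) per symbol; the single sort at the end gives you the symbols from most to least frequent:

```python
from collections import Counter

def count_symbols(path, symbol_len=1):
    """Count fixed-length byte symbols in a file using a hash map."""
    counts = Counter()
    with open(path, "rb") as f:
        while True:
            symbol = f.read(symbol_len)
            if len(symbol) < symbol_len:
                break  # EOF (or a trailing partial symbol)
            counts[symbol] += 1  # O(1) per symbol
    # One sort at the end, most frequent first.
    return counts.most_common()
```

This avoids the repeated heap lookups entirely: you pay O(n) to count and one O(k log k) sort over the k distinct symbols.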
However, if the number of distinct symbols is bounded by a reasonable number (1 byte is ideal, 2 bytes should still be fine), you can just use an array of that size and again have O(1), but with a significantly lower constant cost.
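For the 1-byte case, the array variant can be sketched like this (again a hypothetical helper, not code from the post): a flat 256-entry array replaces the hash map, so each increment is a plain indexed store with no hashing:

```python
def count_bytes(path):
    """Count single-byte symbols with a flat 256-entry array."""
    counts = [0] * 256
    with open(path, "rb") as f:
        # Read in chunks; iterating a bytes object yields ints 0..255.
        for chunk in iter(lambda: f.read(65536), b""):
            for b in chunk:
                counts[b] += 1  # direct index, no hashing
    # Keep only symbols that actually occur, most frequent first.
    pairs = [(bytes([b]), n) for b, n in enumerate(counts) if n > 0]
    pairs.sort(key=lambda p: p[1], reverse=True)
    return pairs
```

The same idea extends to 2-byte symbols with a 65536-entry array; beyond that the array becomes too sparse to be worthwhile and the hash map wins.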