如何计算起始步停止编码方案的最佳参数?

发布于 2024-07-15 05:33:53 字数 555 浏览 1 评论 0原文

起始步终止码是一种数据压缩技术,用于压缩相对较小的数字。

代码的工作原理如下:它有三个参数,start、step 和 stop。 Start 确定用于计算前几个数字的位数。 Step 确定当我们用完时要添加多少位到编码中,stop 确定用于对数字进行编码的最大位数。

因此,编码的长度由 l = start + step * i 给出。

特定代码的“i”值使用一元编码。 也就是说,许多 1 位后跟一个终止 0 位。 如果我们已经到达停止位置,那么我们可以删除终止 0 位。 如果 i 为零,我们只写出 0 位。

因此 (1, 2, 5) 开始-步骤-停止代码的工作方式如下:

值 0,编码为:0 0
值1,编码为:0 1
值 2,编码为:10 000
值 9,编码为:10 111
值 10,编码为:11 00000
值 41,编码为:11 11111

那么,给定一个包含多个数字的文件,我们如何计算该文件的最佳开始-步骤-停止代码? 最佳参数被定义为那些将产生最大压缩比的参数。

A start-step-stop code is a data compression technique that is used to compress number that are relatively small.

The code works as follows: It has three parameters, start, step and stop. Start determines the amount of bits used to compute the first few numbers. Step determines how many bits to add to the encoding when we run out and stop determines the maximum amount of bits used to encode a number.

So the length of an encoding is given by l = start + step * i.

The "i" value of a particular code is encoded using unary. That is, a number of 1 bits followed by a terminating 0 bit. If we have reached stop then we can drop the terminating 0 bit. If i is zero we only write out the 0 bit.

So a (1, 2, 5) start-step-stop code would work as follows:

Value 0, encoded as: 0 0
Value 1, encoded as: 0 1
Value 2, encoded as: 10 000
Value 9, encoded as: 10 111
Value 10, encoded as: 11 00000
Value 41, encoded as: 11 11111

So, given a file containing several numbers, how can we compute the optimal start-step-stop codes for that file? The optimal parameters are defined as those that will result in the greatest compression ratio.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

南街女流氓 2024-07-22 05:33:53

这些“start-step-stop”代码看起来像是调用霍夫曼代码的不同方式。 有关计算它们的伪代码的概述,请参阅基本技术

本质上,这就是算法的作用:

在开始霍夫曼编码之前,您需要收集要压缩的每个符号的统计信息(它们在要压缩的文件中的总频率)。

之后,您可以使用该信息创建一个二叉树,以便最常用的符号是位于树的顶部(因此使用较少的位),并且没有编码具有 前缀代码。 因为如果编码具有公共前缀,则解压缩时可能会出现歧义。

在霍夫曼编码结束时,你的起始值将是最浅叶节点的深度,你的步长将始终为1(逻辑上这是有道理的,为什么你会强制比你需要的更多的位,只需一次添加一个,)并且您的停止值将是最深叶节点的深度。

如果频率统计数据未排序,则需要 O(nlog n) 完成,如果按频率排序,则可以在 O(n) 内完成。

对于这种类型的编码,霍夫曼码保证具有最佳的平均压缩率:

霍夫曼能够设计出最
这种高效的压缩方法
类型:没有其他个体映射
源符号到唯一的字符串
位将产生较小的平均值
实际符号时的输出大小
频率与习惯一致
创建代码。

这应该可以帮助您实现问题的理想解决方案。

编辑:虽然相似,但这不是OP想要的。

这些代码的创建者的这篇学术论文描述了起始步骤的概括- 停止代码、起止代码。 然而,作者在第 2 节末尾简要描述了如何获得最佳的开始-步骤-停止。它涉及使用统计随机变量,或强力资助最佳组合。 在没有任何文件先验知识的情况下,算法的复杂度为 O((log n)^3)。

希望这可以帮助。

These "start-step-stop" codes looks like a different way of calling Huffman codes. See the basic technique for an outline of the pseudo-code for calculating them.

Essentially this is what the algorithm does:

Before you start the Huffman encoding you need to gather the statistics of each symbol you'll be compressing (Their total frequency in the file to compress).

After you have that you create a binary tree using that info such that the most frequently used symbols are at the top of the tree (and thus use less bits) and such that no encoding has a prefix code. Since if an encoding has a common prefix there could be ambiguities decompressing.

At the end of the Huffman encoding your start value will be depth of the shallowest leaf node, your step will always be 1 (logically this makes sense, why would you force more bits than you need, just add one at a time,) and your stop value will be the depth of the deepest leaf node.

If the frequency stats aren't sorted it will take O(nlog n) to do, if they are sorted by frequency it can be done in O(n).

Huffman codes are guaranteed to have the best average compression for this type of encoding:

Huffman was able to design the most
efficient compression method of this
type: no other mapping of individual
source symbols to unique strings of
bits will produce a smaller average
output size when the actual symbol
frequencies agree with those used to
create the code.

This should help you implement the ideal solution to your problem.

Edit: Though similar, this isn't what the OP was looking for.

This academic paper by the creator of these codes describes a generalization of start-step-stop codes, start-stop codes. However, the author briefly describes how to get optimal start-step-stop near the end of section 2. It involves using a statistical random variable, or brute-force funding the best combination. Without any prior knowledge of the file the algorithm is O((log n)^3).

Hope this helps.

寄意 2024-07-22 05:33:53

我使用的方法是一个简单的暴力解决方案。 该算法遵循以下基本步骤:

  1. 计算文件中每个数字的频率。 在同一遍中,计算文件中数字的总数,并将最大的数字确定为 maxNumber。

  2. 计算每个数字的概率,即其频率除以文件中的数字总数。

    计算每个数字的概率,

  3. 确定“optimalStop”等于 log2(maxNumber)。 这是香农信息论中应该用来表示 maxNumber 的理想位数,因此是对特定数字编码中使用的最佳最大位数的合理估计。

  4. 对于从 1 到“optimalStop”的每个“start”值,重复步骤 5 - 7:

  5. 对于每个“步骤” " 值从 1 到 ("optimalStop" - "start") / 2,重复步骤 6 & 7:

  6. 对于某个整数 i,计算最接近“optimalStop”的“stop”值,满足 stop = start + step * i。

    计算

  7. 计算此编码将使用的平均位数。 这可以计算为每个数字的概率乘以给定编码中的位长度。

  8. 选择平均位数最低的编码。

The approach I used was a simple brute force solution. The algorithm followed these basic steps:

  1. Count the frequency of each number in the file. In the same pass, compute the total amount of numbers in the file and determine the greatest number as maxNumber.

  2. Compute the probability of each number as its frequency divided by the total amount of numbers in the file.

  3. Determine "optimalStop" as equal to log2(maxNumber). This is the ideal number of bits that should be used to represent maxNumber as in Shannon information theory and therefore a reasonable estimate of the optimal maximum amount of bits used in the encoding of a particular number.

  4. For every "start" value from 1 to "optimalStop" repeat step 5 - 7:

  5. For every "step" value from 1 to ("optimalStop" - "start") / 2, repeat step 6 & 7:

  6. Calculate the "stop" value closest to "optimalStop" that satisfies stop = start + step * i for some integer i.

  7. Compute the average number of bits that would be used by this encoding. This can be calculated as each number's probability multiplied by its bit length in the given encoding.

  8. Pick the encoding with the lowest average number of bits.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文