仅适用于数字的压缩算法

发布于 2024-07-19 11:00:50 字数 207 浏览 4 评论 0原文

我要压缩位置数据(纬度,经度,日期,时间)。 所有数字均采用固定格式。 其中 2 个(纬度、经度)采用十进制格式。 另外2个是整数。

现在这些数字是固定格式的字符串。

以固定格式压缩数字的算法有哪些? 仅数字压缩(如果有的话)比字符串压缩更好吗? 我应该直接压缩字符串而不将其转换为数字然后再压缩吗?

提前致谢。

I am to compress location data (latitude,longitude, date,time). All the numbers are in fixed format. 2 of them (latitude,longitude) are with decimal format. Other 2 are integers.

Now these numbers are in fixed format string.

What are the algorithms for compressing numbers in fixed format?
Is number only compressions (if there any) better than string compression?
Should I directly compress string without converting it to numbers and then compress?

Thanks in advance.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

三生池水覆流年 2024-07-26 11:00:50

这是需要一点理论帮助的地方之一。 您需要考虑几件事:

  • 测量的分辨率是多少:0.1° 还是 0.001°? 1秒还是1微秒?
  • 测量结果是相关联的并且按某种顺序排列,还是随机地组合在一起?

举例来说,分辨率为 0.01°。 您知道您的值范围从 -180° 到 +180°,或 35900 个不同的值。 Lg(35900) ≈ 16 所以你需要 16 位; -90°–+90° 为 14 位。 显然,如果您将这种值存储为浮点型,则可以立即将数据压缩一半。

与日期时间类似,范围是多少; 你必须有多少位?

现在,如果数据按某种顺序排列(例如,在单艘船上顺序采集的样本),那么您需要的只是一个起始值和一个增量; 这可以产生很大的变化。 当船舶以 30 节的速度行驶时,位置变化不能超过每小时约 0.03 度或每秒约 0.0000083 度。 这些增量将是非常小的值,因此您可以将它们存储在很少的位中。

关键是,您可以做很多事情,但您必须比我们更了解数据才能提出建议。


更新:哦,等等,定点字符串?!

好的,这(相对)容易。 首先,是的,您想要将字符串转换为某种二进制表示形式。 只是组成一个数据项,您可能可以

040.00105.0020090518212100Z

将其转换为

|  4000            | short int, 16 bits |  
| 10500            | short int, 16 bits |  
| 20090518212100Z  | 64 bits            |

96 位,12 字节与 26 字节。

This is one of these places where a little theory is helpful. You need to think about several things:

  • what is the resolution of your measurements: 0.1° or 0.001°? 1 second or one microsecond?
  • are the measurements associated and in some order, or tossed together randomly?

Let's say, just for example, that the resolution is 0.01°. Them you know that your values range from -180° to +180°, or 35900 different values. Lg(35900) ≈ 16 so you need 16 bits; 14 bits for -90°–+90°. Clearly, if you're storing this kind of value as floating-point, you can compress the data by half immediately.

Similarly with date time, what's the range; how many bits must you have?

Now, if the data is in some order (like, samples taken sequentially aboard a single ship) then all you need is a start value and a delta; that can make a big difference. With a ship traveling at 30 knots, the position can't change any more that about 0.03 degrees an hour or about 0.0000083 degrees a second. Those deltas are going to be very small values, so you can store them in a very few bits.

The point is that there are a number of things you can do, but you have to know more about the data than we do to make a recommendation.


Update: Oh, wait, fixed point strings?!

Okay, this is (relatively) easy. Just to start with, yes, you want to convert your strings into some binary representation. Just making up a data item, you might have

040.00105.0020090518212100Z

which you could convert to

|  4000            | short int, 16 bits |  
| 10500            | short int, 16 bits |  
| 20090518212100Z  | 64 bits            |

So that's 96 bits, 12 bytes versus 26 bytes.

秋意浓 2024-07-26 11:00:50

压缩通常作用于字节流。 当流的字节值(例如文本或存储为文本的数字)不均匀分布时,您可以实现的压缩比会更高,因为使用更少的位来存储更频繁出现的字节(在霍夫曼压缩)。

通常,您所讨论的数据将简单地存储为二进制数(而不是文本),这通常具有空间和检索效率。

我建议您查看数据压缩书籍

Compression typically works on a byte stream. When a stream has a non-uniform distribution of byte values (for instance text, or numbers stored as text), the compression ratio you can achieve will be higher, since fewer bits are used to store the bytes which appear more frequently (in Huffman compression).

Typically, the data you are talking about will simply be stored as binary numbers (not text), and that's usually space and retrieval efficient.

I recommend you have a look at The Data Compression Book

尬尬 2024-07-26 11:00:50

您要压缩什么类型的数据? 它是如何分布的? 是否以任何方式订购? 所有这些因素都会影响它的压缩效果,并且可能允许您将数据转换为更容易压缩的数据,或者直接转换为更小的数据。

数据压缩对于“随机”数据效果不佳。 如果您的数据在较小范围内,您很可能能够利用它。

事实上,您应该简单地尝试运行任何常见算法并查看数据是否“压缩得足够”。 如果不是,并且您对数据的了解超出了压缩算法“直觉”的范围,那么您应该利用该信息。

一个例子是,您的数据不仅仅是纬度和经度,而且还假设它们彼此“接近”。 然后你可以存储“原点”纬度和经度,其余的可以是微分的。 也许这些差异足够小,可以容纳在一个有符号的字节中。

这只是一个简单的例子,说明了你可以利用数据知识来完成一些事情,而不是一些通用算法可能无法弄清楚的事情。

What kind of data are you compressing? How is it distributed? Is it ordered in any way? All of these things can affect how well it compresses, and perhaps allow you to convert the data in to something more easily compressed, or simply smaller right out the gate.

Data compress works poorly on "random" data. If your data is within a smaller range, you may well be able to leverage that.

In truth, you should simply try running any of the common algorithms and see if the data is "compressed enough". If not, and you know more about the data than can be "intuited" by the compression algorithms, you should leverage that information.

An example is say that your data is not just Lat's and Long's, but they're assumed to be "close" to each other. Then you could probably store an "origin" Lat and Long, and the rest can be differential. Perhaps these differences are small enough to fit in to a single, signed byte.

That's just a simple example of things you can do with knowledge of the data vs what some generic algorithm may not be able to figure out.

你列表最软的妹 2024-07-26 11:00:50

这取决于您要如何处理数据以及您需要多少精度。

纬度/经度传统上以度、分和秒为单位,其中 60 秒为分,60 分钟为度,1 度纬度名义上等于 60 海里 (nmi)。 1 分钟相当于 1 海里,1 秒相当于 100 英尺多一点。

纬度从 -90 度到 +90 度。 将纬度表示为整数秒,范围为 -324000..+324000,或大约 20 位。 经度从 -180 到 +180,因此以同样的方式表示经度还需要 1 位。

因此,您可以用 41 位表示完整的纬度/经度位置,精确到 +/- 50 英尺。

显然,如果您不需要那么高的精度,您可以减少位数。

请注意,传统的单精度 32 位浮点数使用大约 24 位尾数,因此如果您只是将纬度/经度(以秒为单位)转换为浮点数,则结果会下降到大约 +/- 6 英尺。 对于这种事情,很难击败两个单精度浮点数。

It depends on what you are going to do with the data, and how much precision you need.

Lat/long is traditionally given in degrees, minutes, and seconds, with 60 seconds to the minute, 60 minutes to the degree,and 1 degree of latitude nominally equal to 60 nautical miles (nmi). 1 minute is then 1 nmi, and 1 second is just over 100 ft.

Latitude goes from -90 to +90 degrees. Representing latitude as integer seconds gives you a range of -324000..+324000, or about 20 bits. Longitude goes -180 to +180, so representing longitude the same way requires 1 more bit.

So you can represent a complete lat/long position, to +/- 50 ft, in 41 bits.

Obviously, if you don't need that much precision, you can back down your bit count.

Observe that a traditional single-precision 32-bit float uses about 24 bits of mantissa, so you are down to about +/- 6 feet if you just convert your lat/long in seconds to float. It is kind of hard to beat two single-precision floats for this kind of thing.

紅太極 2024-07-26 11:00:50

根据可用的角色,你可以很容易地制作一些东西。

例如,如果输入仅为数字 (0..9),这里有一个在 Kotlin 中对它们进行编码和解码的解决方案(Java 上的类似方法):

fun encodeDigitsOnlyString(stringWithDigitsOnly: String): ByteArray {
    //we couple each 2 digits together into a single byte.
    //For the last digit, if it has no digit to pair with, it's paired with something that's not a digit
    val result = ArrayList<Byte>()
    val length = stringWithDigitsOnly.length
    var lastDigit: Byte? = null
    for (i in 0 until length) {
        val char = stringWithDigitsOnly[i]
        val digitAsByte = char.toString().toInt().toByte()
        if (lastDigit == null) {
            if (i == length - 1) {
                //last digit
                val newByte = (digitAsByte + 0xf0).toByte()
                result.add(newByte)
            } else {
                //more to go
                lastDigit = digitAsByte
            }
        } else {
            val newByte = (digitAsByte + lastDigit.toInt().shl(4)).toByte()
            result.add(newByte)
            lastDigit = null
        }
    }
    return result.toByteArray()
}

fun decodeByteArrayToDigitsOnlyString(encodedDigitsOnlyByteArray: ByteArray): String {
    val sb = StringBuilder(encodedDigitsOnlyByteArray.size * 2)
    for (byte in encodedDigitsOnlyByteArray) {
        val hex = Integer.toHexString(byte.toInt()).takeLast(2).padStart(2, '0')
        if (hex[0].isLetter())
            sb.append(hex.last())
        else
            sb.append(hex)
    }
    return sb.toString()
}

示例用法:

val inputString="12345"
val byteArray=encodeDigitsOnlyString(inputString) //produces a byte array of size 3
val outputString=decodeByteArrayToDigitsOnlyString(byteArray) //should be the same as the input

Depending on the available characters, you could make something quite easily.

For example, if the input is only digits (0..9), here's a solution that will encode and decode them, in Kotlin (similar thing on Java) :

fun encodeDigitsOnlyString(stringWithDigitsOnly: String): ByteArray {
    //we couple each 2 digits together into a single byte.
    //For the last digit, if it has no digit to pair with, it's paired with something that's not a digit
    val result = ArrayList<Byte>()
    val length = stringWithDigitsOnly.length
    var lastDigit: Byte? = null
    for (i in 0 until length) {
        val char = stringWithDigitsOnly[i]
        val digitAsByte = char.toString().toInt().toByte()
        if (lastDigit == null) {
            if (i == length - 1) {
                //last digit
                val newByte = (digitAsByte + 0xf0).toByte()
                result.add(newByte)
            } else {
                //more to go
                lastDigit = digitAsByte
            }
        } else {
            val newByte = (digitAsByte + lastDigit.toInt().shl(4)).toByte()
            result.add(newByte)
            lastDigit = null
        }
    }
    return result.toByteArray()
}

fun decodeByteArrayToDigitsOnlyString(encodedDigitsOnlyByteArray: ByteArray): String {
    val sb = StringBuilder(encodedDigitsOnlyByteArray.size * 2)
    for (byte in encodedDigitsOnlyByteArray) {
        val hex = Integer.toHexString(byte.toInt()).takeLast(2).padStart(2, '0')
        if (hex[0].isLetter())
            sb.append(hex.last())
        else
            sb.append(hex)
    }
    return sb.toString()
}

Example usage:

val inputString="12345"
val byteArray=encodeDigitsOnlyString(inputString) //produces a byte array of size 3
val outputString=decodeByteArrayToDigitsOnlyString(byteArray) //should be the same as the input
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文