Really simple short string compression
Is there a really simple compression technique for strings up to about 255 characters in length (yes, I'm compressing URLs)?
I am not concerned with the strength of compression - I am looking for something that performs very well and is quick to implement. I would like something simpler than SharpZipLib: something that can be implemented with a couple of short methods.
9 Answers
I think the key question here is "Why do you want to compress URLs?"
Trying to shorten long URLs for the address bar?
You're better off storing the original URL somewhere (database, text file ...) alongside a hashcode of the non-domain part (MD5 is fine). You can then have a simple page (or some HTTPModule if you're feeling flashy) to read the MD5 and look up the real URL. This is how TinyURL and others work.
For example, a URL such as http://www.example.com/articles/some-very-long-descriptive-article-title (hypothetical) could be shortened to something like http://www.example.com/a/9b3f2c1d.
Using a compression library for this will not work. The string will be compressed into a shorter binary representation, but converting this back to a string which needs to be valid as part of a URL (e.g. Base64) will negate any benefit you gained from the compression.
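A minimal sketch of that hash-and-lookup approach in C# (the in-memory store, key length, and redirect host are hypothetical; a real implementation would persist the mapping in a database):

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

static class UrlShortener
{
    // Hypothetical in-memory store; a real implementation would use a database or text file.
    static readonly Dictionary<string, string> Store = new Dictionary<string, string>();

    public static string Shorten(string longUrl)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(longUrl));

            // Use the first few bytes of the MD5 as a short key (hex-encoded).
            string key = BitConverter.ToString(hash, 0, 6).Replace("-", "").ToLowerInvariant();

            Store[key] = longUrl;
            return "http://example.com/" + key;   // hypothetical redirect host
        }
    }

    // The redirect page (or HTTPModule) reads the key and looks up the real URL.
    public static string Expand(string key)
    {
        string longUrl;
        return Store.TryGetValue(key, out longUrl) ? longUrl : null;
    }
}
```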
Storing lots of URLs in memory or on disk?
Use the built-in compression library within System.IO.Compression or the ZLib library, which is simple and incredibly good. Since you will be storing binary data, the compressed output will be fine as-is. You'll need to decompress it to use it as a URL.
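A hedged sketch of that approach (compress the URL text to bytes for storage, decompress it again before use):

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

static class UrlStorage
{
    // Compress a URL to bytes; store the result as-is (it is binary, not text).
    public static byte[] Compress(string url)
    {
        byte[] raw = Encoding.UTF8.GetBytes(url);
        using (var output = new MemoryStream())
        {
            using (var deflate = new DeflateStream(output, CompressionMode.Compress))
                deflate.Write(raw, 0, raw.Length);
            return output.ToArray();
        }
    }

    // Decompress the stored bytes back into the original URL string.
    public static string Decompress(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            deflate.CopyTo(output);
            return Encoding.UTF8.GetString(output.ToArray());
        }
    }
}
```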
As suggested in the accepted answer, using data compression does not work to shorten URL paths that are already fairly short.
DotNetZip has a DeflateStream class that exposes a static (Shared in VB) CompressString method. It's a one-line way to compress a string using DEFLATE (RFC 1951). The DEFLATE implementation is fully compatible with System.IO.Compression.DeflateStream, but DotNetZip compresses better. Here's how you might use it:
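The original sample isn't reproduced here, but a sketch along those lines (assuming DotNetZip's Ionic.Zlib namespace and a hypothetical test URL) might look like this:

```csharp
using System;
using Ionic.Zlib;   // DotNetZip

class Program
{
    static void Main()
    {
        // Hypothetical test URL.
        string url = "http://www.example.com/some/fairly/long/path?with=query&parameters=true";

        // One-line DEFLATE (RFC 1951) compression via DotNetZip's static helper.
        byte[] compressed = DeflateStream.CompressString(url);

        // Render the compressed bytes as hex so they can be compared with the original text.
        string hex = BitConverter.ToString(compressed).Replace("-", "");

        Console.WriteLine("original  : {0} chars", url.Length);
        Console.WriteLine("compressed: {0} bytes = {1} hex chars", compressed.Length, hex.Length);
    }
}
```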
In my tests with that code, the "compressed" byte array, when represented in hex, comes out longer than the original, about 2x as long. The reason is that one byte rendered as hex takes 2 ASCII chars.
You could compensate somewhat for that by using base-62 instead of base-16 (hex) to represent the number. In that case a-z and A-Z are also digits, giving you 0-9 (10) + a-z (+26) + A-Z (+26) = 62 total digits. That would shorten the output significantly. I haven't tried that yet.
EDIT
OK, I tested the base-62 encoder. It shortens the hex string by about half. I figured it would cut it to 25% (62/16 =~ 4), but I think I am losing something with the discretization. In my tests, the resulting base-62 encoded string is about the same length as the original URL. So, no, using compression and then base-62 encoding is still not a good approach; you really want a hash value.
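For illustration only, a minimal base-62 encoder over a byte array (treating the bytes as one big unsigned integer) could look like this sketch; it is not the encoder the author tested, and it does not preserve leading zero bytes:

```csharp
using System;
using System.Numerics;
using System.Text;

static class Base62
{
    const string Alphabet = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    public static string Encode(byte[] data)
    {
        // BigInteger(byte[]) is little-endian and signed, so append a zero byte
        // to force a non-negative value.
        byte[] unsigned = new byte[data.Length + 1];
        Array.Copy(data, unsigned, data.Length);
        BigInteger value = new BigInteger(unsigned);

        if (value.IsZero) return "0";

        var sb = new StringBuilder();
        while (value > 0)
        {
            BigInteger remainder;
            value = BigInteger.DivRem(value, 62, out remainder);
            sb.Insert(0, Alphabet[(int)remainder]);
        }
        return sb.ToString();
    }
}
```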
I'd suggest looking in the System.IO.Compression Namespace. There's an article on CodeProject that may help.
I have just created a compression scheme that targets URLs and achieves around 50% compression (compared to base64 representation of the original URL text).
See http://blog.alivate.com.au/packed-url/
It would be great if someone from a big tech company built this out properly and published it for all to use. Google championed Protocol buffers. This tool can save a lot of disk space for someone like Google, while still being scannable. Or perhaps the great captain himself? https://twitter.com/capnproto
Technically, I would call this a binary (bitwise) serialisation scheme for the data that underlies a URL. Treat the URL as a text representation of conceptual data, then serialise that conceptual data model with a specialised serialiser. The outcome is, of course, a more compressed version of the original. This is very different from how a general-purpose compression algorithm works.
What's your goal?
You can use the deflate algorithm directly, without any headers, checksums, or footers, as described in this question: Python: Inflate and Deflate implementations
In my test, this cut a 4100-character URL down to 1270 base64 characters, allowing it to fit inside IE's 2000-character URL limit.
And here's an example of a 4000-character URL, which can't be solved with a hashtable since the applet can exist on any server.
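In .NET terms, System.IO.Compression.DeflateStream already emits raw DEFLATE with no zlib header or checksum, so a hedged sketch of this approach (the helper name is hypothetical, and '+', '/' and '=' in the base64 output still need URL-escaping) could be:

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static class UrlDeflate
{
    // Compress a long URL and return base64 text that can be embedded in another URL.
    public static string CompressToBase64(string url)
    {
        byte[] raw = Encoding.UTF8.GetBytes(url);
        using (var output = new MemoryStream())
        {
            // Raw DEFLATE: no header, checksum, or footer.
            using (var deflate = new DeflateStream(output, CompressionMode.Compress))
                deflate.Write(raw, 0, raw.Length);

            // Escape the result (e.g. Uri.EscapeDataString) before putting it in a query string.
            return Convert.ToBase64String(output.ToArray());
        }
    }
}
```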
I would start with trying one of the existing (free or open source) zip libraries, e.g. http://www.icsharpcode.net/OpenSource/SharpZipLib/
Zip should work well for text strings, and I am not sure if it is worth implementing a compression algorithm yourself...
Have you tried just using gzip?
No idea if it would work effectively with such short strings, but I'd say it's probably your best bet.
The open source library SharpZipLib is easy to use and will provide you with compression tools.