Really simple short string compression
Is there a really simple compression technique for strings up to about 255 characters in length (yes, I'm compressing URLs)?
I am not concerned with the strength of compression - I am looking for something that performs very well and is quick to implement. I would like something simpler than SharpZipLib: something that can be implemented with a couple of short methods.
9 Answers
I think the key question here is "Why do you want to compress URLs?"
Trying to shorten long URLs for the address bar?
You're better off storing the original URL somewhere (database, text file ...) alongside a hashcode of the non-domain part (MD5 is fine). You can then have a simple page (or some HTTPModule if you're feeling flashy) to read the MD5 and look up the real URL. This is how TinyURL and others work.
For example, a URL such as http://www.example.com/articles/some-very-long-descriptive-article-title (hypothetical) could be shortened to something like http://www.example.com/a/9b3f2c1d.
Using a compression library for this will not work. The string will be compressed into a shorter binary representation, but converting this back to a string which needs to be valid as part of a URL (e.g. Base64) will negate any benefit you gained from the compression.
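A minimal sketch of that hash-and-lookup approach in C# (the in-memory store, key length, and redirect host are hypothetical; a real implementation would persist the mapping in a database):

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

static class UrlShortener
{
    // Hypothetical in-memory store; a real implementation would use a database or text file.
    static readonly Dictionary<string, string> Store = new Dictionary<string, string>();

    public static string Shorten(string longUrl)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(longUrl));

            // Use the first few bytes of the MD5 as a short key (hex-encoded).
            string key = BitConverter.ToString(hash, 0, 6).Replace("-", "").ToLowerInvariant();

            Store[key] = longUrl;
            return "http://example.com/" + key;   // hypothetical redirect host
        }
    }

    // The redirect page (or HTTPModule) reads the key and looks up the real URL.
    public static string Expand(string key)
    {
        string longUrl;
        return Store.TryGetValue(key, out longUrl) ? longUrl : null;
    }
}
```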
Storing lots of URLs in memory or on disk?
Use the built-in compression library within System.IO.Compression or the ZLib library, which is simple and incredibly good. Since you will be storing binary data, the compressed output will be fine as-is. You'll need to decompress it to use it as a URL.
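A hedged sketch of that approach (compress the URL text to bytes for storage, decompress it again before use):

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

static class UrlStorage
{
    // Compress a URL to bytes; store the result as-is (it is binary, not text).
    public static byte[] Compress(string url)
    {
        byte[] raw = Encoding.UTF8.GetBytes(url);
        using (var output = new MemoryStream())
        {
            using (var deflate = new DeflateStream(output, CompressionMode.Compress))
                deflate.Write(raw, 0, raw.Length);
            return output.ToArray();
        }
    }

    // Decompress the stored bytes back into the original URL string.
    public static string Decompress(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            deflate.CopyTo(output);
            return Encoding.UTF8.GetString(output.ToArray());
        }
    }
}
```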
As suggested in the accepted answer, using data compression does not work to shorten URL paths that are already fairly short.
DotNetZip has a DeflateStream class that exposes a static (Shared in VB) CompressString method. It's a one-line way to compress a string using DEFLATE (RFC 1951). The DEFLATE implementation is fully compatible with System.IO.Compression.DeflateStream, but DotNetZip compresses better. Here's how you might use it:
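The original sample isn't reproduced here, but a sketch along those lines (assuming DotNetZip's Ionic.Zlib namespace and a hypothetical test URL) might look like this:

```csharp
using System;
using Ionic.Zlib;   // DotNetZip

class Program
{
    static void Main()
    {
        // Hypothetical test URL.
        string url = "http://www.example.com/some/fairly/long/path?with=query&parameters=true";

        // One-line DEFLATE (RFC 1951) compression via DotNetZip's static helper.
        byte[] compressed = DeflateStream.CompressString(url);

        // Render the compressed bytes as hex so they can be compared with the original text.
        string hex = BitConverter.ToString(compressed).Replace("-", "");

        Console.WriteLine("original  : {0} chars", url.Length);
        Console.WriteLine("compressed: {0} bytes = {1} hex chars", compressed.Length, hex.Length);
    }
}
```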
In my tests with that code, the "compressed" byte array, when represented in hex, comes out longer than the original, about 2x as long. The reason is that one byte rendered as hex takes 2 ASCII chars.
You could compensate somewhat for that by using base-62 instead of base-16 (hex) to represent the number. In that case a-z and A-Z are also digits, giving you 0-9 (10) + a-z (+26) + A-Z (+26) = 62 total digits. That would shorten the output significantly. I haven't tried that yet.
EDIT
OK, I tested the base-62 encoder. It shortens the hex string by about half. I figured it would cut it to 25% (62/16 =~ 4), but I think I am losing something with the discretization. In my tests, the resulting base-62 encoded string is about the same length as the original URL. So, no, using compression and then base-62 encoding is still not a good approach; you really want a hash value.
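For illustration only, a minimal base-62 encoder over a byte array (treating the bytes as one big unsigned integer) could look like this sketch; it is not the encoder the author tested, and it does not preserve leading zero bytes:

```csharp
using System;
using System.Numerics;
using System.Text;

static class Base62
{
    const string Alphabet = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    public static string Encode(byte[] data)
    {
        // BigInteger(byte[]) is little-endian and signed, so append a zero byte
        // to force a non-negative value.
        byte[] unsigned = new byte[data.Length + 1];
        Array.Copy(data, unsigned, data.Length);
        BigInteger value = new BigInteger(unsigned);

        if (value.IsZero) return "0";

        var sb = new StringBuilder();
        while (value > 0)
        {
            BigInteger remainder;
            value = BigInteger.DivRem(value, 62, out remainder);
            sb.Insert(0, Alphabet[(int)remainder]);
        }
        return sb.ToString();
    }
}
```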
I'd suggest looking in the System.IO.Compression Namespace. There's an article on CodeProject that may help.
I have just created a compression scheme that targets URLs and achieves around 50% compression (compared to base64 representation of the original URL text).
See http://blog.alivate.com.au/packed-url/
It would be great if someone from a big tech company built this out properly and published it for all to use. Google championed Protocol buffers. This tool can save a lot of disk space for someone like Google, while still being scannable. Or perhaps the great captain himself? https://twitter.com/capnproto
Technically, I would call this a binary (bitwise) serialisation scheme for the data that underlies a URL. Treat the URL as a text representation of conceptual data, then serialise that conceptual data model with a specialised serialiser. The outcome is, of course, a more compressed version of the original. This is very different from how a general-purpose compression algorithm works.
What's your goal?
You can use the deflate algorithm directly, without any headers, checksums, or footers, as described in this question: Python: Inflate and Deflate implementations
In my test, this cut a 4100-character URL down to 1270 base64 characters, allowing it to fit inside IE's 2000-character URL limit.
And here's an example of a 4000-character URL, which can't be solved with a hashtable since the applet can exist on any server.
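In .NET terms, System.IO.Compression.DeflateStream already emits raw DEFLATE with no zlib header or checksum, so a hedged sketch of this approach (the helper name is hypothetical, and '+', '/' and '=' in the base64 output still need URL-escaping) could be:

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static class UrlDeflate
{
    // Compress a long URL and return base64 text that can be embedded in another URL.
    public static string CompressToBase64(string url)
    {
        byte[] raw = Encoding.UTF8.GetBytes(url);
        using (var output = new MemoryStream())
        {
            // Raw DEFLATE: no header, checksum, or footer.
            using (var deflate = new DeflateStream(output, CompressionMode.Compress))
                deflate.Write(raw, 0, raw.Length);

            // Escape the result (e.g. Uri.EscapeDataString) before putting it in a query string.
            return Convert.ToBase64String(output.ToArray());
        }
    }
}
```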
I would start with trying one of the existing (free or open source) zip libraries, e.g. http://www.icsharpcode.net/OpenSource/SharpZipLib/
Zip should work well for text strings, and I am not sure if it is worth implementing a compression algorithm yourself...
Have you tried just using gzip?
No idea if it would work effectively with such short strings, but I'd say it's probably your best bet.
The open source library SharpZipLib is easy to use and will provide you with compression tools.