如何压缩小字符串

发布于 2024-07-12 21:42:27 字数 210 浏览 5 评论 0原文

我有一个充满大量 URL 的 sqlite 数据库，它占用了大量的磁盘空间，并且访问它会导致许多磁盘寻道并且速度很慢。平均 URL 路径长度为 97 字节（主机名重复很多，因此我将它们移动到外键表中）。有什么好的方法可以压缩它们吗？大多数压缩算法都能很好地处理大文档，而不是平均小于 100 字节的“文档”，但即使减少 20% 也非常有用。有什么可行的压缩算法吗？不必是任何标准的东西。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

水水月牙 2024-07-19 21:42:27

使用压缩算法但使用共享字典。

我之前做过类似的事情，使用 LZC/LZW 算法，如 Unix 压缩命令所使用的那样。

使用短字符串获得良好压缩的技巧是使用由要压缩的 URL 的标准样本组成的字典。

你应该很容易得到20%。

编辑：LZC 是 LZW 的变体。您只需要 LZW，因为您只需要一个静态字典。 LZC 添加了对字典/表满时重置的支持。

回复收藏 0 原文

呆橘 2024-07-19 21:42:27

我已经尝试使用以下策略。它使用共享字典，但解决 python 的 zlib 不允许您访问字典本身的方式。

首先，通过运行一堆训练字符串来初始化预训练的压缩器和解压缩器。丢弃输出字符串。

然后，使用经过训练的压缩器的副本来压缩每个小字符串，并使用解压缩器的副本来解压缩它们。

这是我的 python 代码（对丑陋的测试表示歉意）：

    import zlib
    class Trained_short_string_compressor(object):
        def __init__(self,
                     training_set, 
                     bits = -zlib.MAX_WBITS,
                     compression = zlib.Z_DEFAULT_COMPRESSION,
                     scheme = zlib.DEFLATED):
            # Use a negative number of bits, so the checksum is not included.
            compressor = zlib.compressobj(compression,scheme,bits)
            decompressor = zlib.decompressobj(bits)
            junk_offset = 0
            for line in training_set:
                junk_offset += len(line)
                # run the training line through the compressor and decompressor
                junk_offset -= len(decompressor.decompress(compressor.compress(line)))
            
            # use Z_SYNC_FLUSH. A full flush seems to detrain the compressor, and 
            # not flushing wastes space.
            junk_offset -= len(decompressor.decompress(compressor.flush(zlib.Z_SYNC_FLUSH)))
            
            self.junk_offset = junk_offset
            self.compressor = compressor
            self.decompressor = decompressor
            
        def compress(self,s):
            compressor = self.compressor.copy()
            return compressor.compress(s)+compressor.flush()
            
        def decompress(self,s):
            decompressor = self.decompressor.copy()
            return (decompressor.decompress(s)+decompressor.flush())[self.junk_offset:]

测试它，我在一堆 10,000 个短（50 -> 300 个字符）unicode 字符串上节省了 30% 以上。压缩和解压缩它们也花费了大约 6 秒（相比之下，使用简单的 zlib 压缩/解压缩大约需要 2 秒）。另一方面，简单的 zlib 压缩节省了大约 5%，而不是 30%。

def test_compress_small_strings():
    lines =[l for l in gzip.open(fname)]
    compressor=Trained_short_string_compressor(lines[:500])

    import time
    t = time.time()
    s = 0.0
    sc = 0.
    for i in range(10000):
        line = lines[1000+i] # use an offset, so you don't cheat and compress the training set
        cl = compressor.compress(line)
        ucl = compressor.decompress(cl)
        s += len(line)
        sc+=len(cl)
        assert line == ucl
        
    print 'compressed',i,'small strings in',time.time()-t,'with a ratio of',s0/s
    print 'now, compare it ot a naive compression '
    t = time.time()
    for i in range(10000):
        line = lines[1000+i]
        cr = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION,zlib.DEFLATED,-zlib.MAX_WBITS)
        cl=cr.compress(line)+cr.flush()
        ucl = zlib.decompress(cl,-zlib.MAX_WBITS)
        sc += len(cl)
        assert line == ucl
        
        
    print 'naive zlib compressed',i,'small strings in',time.time()-t, 'with a ratio of ',sc/s

请注意，如果删除它，它就不会持久。如果你想要坚持，你就必须记住训练集。

I've tried this using the following strategy. It's using a shared dictionary, but working around the way python's zlib doesn't give you access to the dictionary itself.

First, initialize a pre-trained compressor and decompressor by running a bunch of training strings through them. Throw away the output strings.

Then, use copies of the trained compressor to compress every small string, and use copies of the decompressor to decompress them.

Here my the python code (apologies for the ugly testing):

    import zlib
    class Trained_short_string_compressor(object):
        def __init__(self,
                     training_set, 
                     bits = -zlib.MAX_WBITS,
                     compression = zlib.Z_DEFAULT_COMPRESSION,
                     scheme = zlib.DEFLATED):
            # Use a negative number of bits, so the checksum is not included.
            compressor = zlib.compressobj(compression,scheme,bits)
            decompressor = zlib.decompressobj(bits)
            junk_offset = 0
            for line in training_set:
                junk_offset += len(line)
                # run the training line through the compressor and decompressor
                junk_offset -= len(decompressor.decompress(compressor.compress(line)))
            
            # use Z_SYNC_FLUSH. A full flush seems to detrain the compressor, and 
            # not flushing wastes space.
            junk_offset -= len(decompressor.decompress(compressor.flush(zlib.Z_SYNC_FLUSH)))
            
            self.junk_offset = junk_offset
            self.compressor = compressor
            self.decompressor = decompressor
            
        def compress(self,s):
            compressor = self.compressor.copy()
            return compressor.compress(s)+compressor.flush()
            
        def decompress(self,s):
            decompressor = self.decompressor.copy()
            return (decompressor.decompress(s)+decompressor.flush())[self.junk_offset:]

Testing it, I saved over 30% on a bunch of 10,000 shortish (50 -> 300 char) unicode strings. It also took about 6 seconds to compress and decompress them (compared to about 2 seconds using simple zlib compression / decompression). On the other hand, the simple zlib compression saved about 5%, not 30%.

def test_compress_small_strings():
    lines =[l for l in gzip.open(fname)]
    compressor=Trained_short_string_compressor(lines[:500])

    import time
    t = time.time()
    s = 0.0
    sc = 0.
    for i in range(10000):
        line = lines[1000+i] # use an offset, so you don't cheat and compress the training set
        cl = compressor.compress(line)
        ucl = compressor.decompress(cl)
        s += len(line)
        sc+=len(cl)
        assert line == ucl
        
    print 'compressed',i,'small strings in',time.time()-t,'with a ratio of',s0/s
    print 'now, compare it ot a naive compression '
    t = time.time()
    for i in range(10000):
        line = lines[1000+i]
        cr = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION,zlib.DEFLATED,-zlib.MAX_WBITS)
        cl=cr.compress(line)+cr.flush()
        ucl = zlib.decompress(cl,-zlib.MAX_WBITS)
        sc += len(cl)
        assert line == ucl
        
        
    print 'naive zlib compressed',i,'small strings in',time.time()-t, 'with a ratio of ',sc/s

Note, it's not persistent if you delete it. If you wanted persistence, you would have to remember the training set.

回复收藏 0 原文

像你 2024-07-19 21:42:27

您是否考虑过使用静态霍夫曼编码？

您可以使用现有的 URL 正文根据 URL 中出现的所有字节的频率计算霍夫曼代码。然后，您可以存储该组代码一次，并使用它对所有 URL 进行编码。我认为它应该提供适当的压缩。

回复收藏 0 原文

握住你手 2024-07-19 21:42:27

你们的网址格式是什么？

如果任何 URL 共享一个或多个域，并且您拥有大约 20 亿个域名就足够了，那么您可以创建一个域名池。如果您共享相对路径，则可以将它们集中到一起。

对于数据库中的每个 URL，将每个 URL 分为三个部分。方案和域，例如 http://mydomain.com 实际 url /my/path/ ，然后是其余的 mypage。 html?id=4（如果您有查询字符串参数）

这样您应该将每个域和相对路径的开销减少到大约 8 个字节。如果您想查找部分 URL，那应该更好、更快。

注意：“http”方案字符串本身就是 4 个字节，您将在每个域条目上保存除此之外的任何内容。如果每个 URL 都以“http://www”开头。您将保存“://www”。每次7字节。

尝试一下如何分割和构造 URL，我敢打赌这就是您会发现压缩的地方。现在，剩余的字符串不是公共域或相对路径，您可以用它做什么？

压缩 URL

此类方法的通用压缩源自算术编码。信息论之父香农在 60 年代写了一篇关于这一点的论文。我从事压缩工作已经有一段时间了，我一直发现的一件事是通用压缩永远无法解决实际问题。

您很幸运，因为 URL 具有结构，您应该利用该结构来更好地存储 URL。

如果您想应用压缩算法（我认为应该更改主题以反映 URL 压缩，因为它是特定于域的），您必须检查数据的熵。因为它会告诉您有关存储产量的信息。 URL 是 ASCII 字符，任何不在 ASCII 范围 0x20-0x7E 内的字符都不会出现，并且不区分大小写，只剩下 63 种不同的状态。 !"#%&'()*+,-./0123456789:;<=>?@abcdefghijklmnopqrstuvwxyz{|}~ 包括空格。

您可以创建剩余字符的频率表并执行算术编码。你知道你最多需要 6 位，这意味着对于 URL 数据库中的每个字符，你现在浪费了 2 位，如果你只是将内容转移到位并使用查找表，你会得到你的20% 压缩。就像那样；）

因为数据非常具体，所以仅使用通用方法进行压缩确实不是一个好主意，最好将信息结构化并将其拆分为可以更有效地存储的数据块。有关该领域的很多知识，请使用这些知识来压缩您的数据。

What's the format of your URLs?

If any URL share one or more domain and you're sufficient with about 2 billion domain names you can create a pool for domain names. And if you have shared relative paths you can pool them to.

For every URL in your database, split each URL into three parts. the scheme and domain e.g. http://mydomain.com the realtive url /my/path/ and then the rest mypage.html?id=4 (if you have query string parameters)

This way you should reduce the overhead of every domain and relative path to just about 8 bytes. That should be better, and fast if you wanna lookup parts of URLs.

Note: just the "http" scheme string itself is 4 bytes, you'll save anything beyond that on every domain entry. If every URL starts with "http://www." you'll save "://www." 7 bytes each time.

Experiment a bit on how to split and structure URLs, I'm betting this is were you'll find your compression. Now, the remaining string that is not a common domain or relative path, what could you do with that?

Compressing URLs

General purpose compression such methods are derived from arithmetic encoding. Shannon the father of information theory wrote a paper about this in the 60's. I've been working with compression for a while and the one thing I've always found, is that general purpose compression never solves the actual problem.

You're in luck, because URLs have structure and that structure you should utilize to better store your URLs.

If you wanna apply a compression algorithm (I think the topic should be changed to reflect URL compression, because it's domain specific) you'll have to examine the entropy of your data. Because it will tell you something about the yield of storage. URLs are ASCII characters, any character not within the ASCII range 0x20-0x7E won't be occurring and throwing away case sensitivity, you're down to a mere 63 distinct states. !"#%&'()*+,-./0123456789:;<=>?@abcdefghijklmnopqrstuvwxyz{|}~ including white-space.

You could create a frequency table of the remaining characters and perform arithmetic encoding. You know that you'll need at most 6-bits, which means for every character in your URL database you're wasting 2 bits right now, and if you just shifted things into place and used a lookup table, you'd get your 20% compression. Just like that ;)

Because the data is so specific it's really not a good idea to just compress with general purpose methods. It's better to structure the information and split that into pieces of data you can store more efficiently. You know a lot about the domain, use that knowledge to compress your data.

回复收藏 0 原文

め七分饶幸 2024-07-19 21:42:27

摘要：

大型搜索引擎和网络蜘蛛的一个常见问题是如何处理大量遇到的URL。传统的搜索引擎和网络蜘蛛使用硬盘来存储 URL，而不进行任何压缩。这会导致性能下降和需要更多空间。本文描述了一种简单的 URL 压缩算法，可以实现高效的压缩和解压缩。压缩算法基于增量编码方案来提取共享公共前缀的 URL 和 AVL 树以获得高效的搜索速度。我们的结果表明尺寸减小了 50%。 1.——

Kasom Koht-arsa 计算机工程系。

资源

回复收藏 0 原文