如何压缩小字符串
我有一个充满大量 URL 的 sqlite 数据库,它占用了大量的磁盘空间,并且访问它会导致许多磁盘寻道并且速度很慢。 平均 URL 路径长度为 97 字节(主机名重复很多,因此我将它们移动到外键表中)。 有什么好的方法可以压缩它们吗? 大多数压缩算法都能很好地处理大文档,而不是平均小于 100 字节的“文档”,但即使减少 20% 也非常有用。 有什么可行的压缩算法吗? 不必是任何标准的东西。
I have an sqlite database full of huge number of URLs and it's taking huge amount of diskspace, and accessing it causes many disk seeks and is slow. Average URL path length is 97 bytes (host names repeat a lot so I moved them to a foreign-keyed table). Is there any good way of compressing them? Most compression algorithms work well with big documents, not "documents" of less that 100 bytes on average, but even 20% reduction would be very useful. Any compression algorithms that would work? Doesn't have to be anything standard.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(7)
使用压缩算法但使用共享字典。
我之前做过类似的事情,使用 LZC/LZW 算法,如 Unix 压缩命令所使用的那样。
使用短字符串获得良好压缩的技巧是使用由要压缩的 URL 的标准样本组成的字典。
你应该很容易得到20%。
编辑:LZC 是 LZW 的变体。 您只需要 LZW,因为您只需要一个静态字典。 LZC 添加了对字典/表满时重置的支持。
Use the compress algorithm but use a shared dictionary.
I've done something like this before where I used the LZC/LZW algorithm, as used by the Unix compress command.
The trick to get good compression with short strings is to use a dictionary made up of a standard sample of the URLs you are compressing.
You should easily get 20%.
Edit: LZC is a variant of LZW. You only require LZW as you only need a static dictionary. LZC adds support for resetting the dictionary/table when it gets full.
我已经尝试使用以下策略。 它使用共享字典,但解决 python 的 zlib 不允许您访问字典本身的方式。
首先,通过运行一堆训练字符串来初始化预训练的压缩器和解压缩器。 丢弃输出字符串。
然后,使用经过训练的压缩器的副本来压缩每个小字符串,并使用解压缩器的副本来解压缩它们。
这是我的 python 代码(对丑陋的测试表示歉意):
测试它,我在一堆 10,000 个短(50 -> 300 个字符)unicode 字符串上节省了 30% 以上。 压缩和解压缩它们也花费了大约 6 秒(相比之下,使用简单的 zlib 压缩/解压缩大约需要 2 秒)。 另一方面,简单的 zlib 压缩节省了大约 5%,而不是 30%。
请注意,如果删除它,它就不会持久。 如果你想要坚持,你就必须记住训练集。
I've tried this using the following strategy. It's using a shared dictionary, but working around the way python's zlib doesn't give you access to the dictionary itself.
First, initialize a pre-trained compressor and decompressor by running a bunch of training strings through them. Throw away the output strings.
Then, use copies of the trained compressor to compress every small string, and use copies of the decompressor to decompress them.
Here my the python code (apologies for the ugly testing):
Testing it, I saved over 30% on a bunch of 10,000 shortish (50 -> 300 char) unicode strings. It also took about 6 seconds to compress and decompress them (compared to about 2 seconds using simple zlib compression / decompression). On the other hand, the simple zlib compression saved about 5%, not 30%.
Note, it's not persistent if you delete it. If you wanted persistence, you would have to remember the training set.
您是否考虑过使用静态霍夫曼编码?
您可以使用现有的 URL 正文根据 URL 中出现的所有字节的频率计算霍夫曼代码。 然后,您可以存储该组代码一次,并使用它对所有 URL 进行编码。 我认为它应该提供适当的压缩。
Have you considered using static Huffman coding?
You could use your existing body of URLs to compute Huffmann codes over all bytes occuring in URLs according to their frequency. You can then store that set of codes once, and encode all URLs using it. I think it should give decent compression.
你们的网址格式是什么?
如果任何 URL 共享一个或多个域,并且您拥有大约 20 亿个域名就足够了,那么您可以创建一个域名池。 如果您共享相对路径,则可以将它们集中到一起。
对于数据库中的每个 URL,将每个 URL 分为三个部分。 方案和域,例如 http://mydomain.com 实际 url /my/path/ ,然后是其余的 mypage。 html?id=4(如果您有查询字符串参数)
这样您应该将每个域和相对路径的开销减少到大约 8 个字节。 如果您想查找部分 URL,那应该更好、更快。
注意:“http”方案字符串本身就是 4 个字节,您将在每个域条目上保存除此之外的任何内容。 如果每个 URL 都以“http://www”开头。 您将保存“://www”。 每次7字节。
尝试一下如何分割和构造 URL,我敢打赌这就是您会发现压缩的地方。 现在,剩余的字符串不是公共域或相对路径,您可以用它做什么?
压缩 URL
此类方法的通用压缩源自算术编码。 信息论之父香农在 60 年代写了一篇关于这一点的论文。 我从事压缩工作已经有一段时间了,我一直发现的一件事是通用压缩永远无法解决实际问题。
您很幸运,因为 URL 具有结构,您应该利用该结构来更好地存储 URL。
如果您想应用压缩算法(我认为应该更改主题以反映 URL 压缩,因为它是特定于域的),您必须检查数据的熵。 因为它会告诉您有关存储产量的信息。 URL 是 ASCII 字符,任何不在 ASCII 范围 0x20-0x7E 内的字符都不会出现,并且不区分大小写,只剩下 63 种不同的状态。 !"#%&'()*+,-./0123456789:;<=>?@abcdefghijklmnopqrstuvwxyz{|}~ 包括空格。
您可以创建剩余字符的频率表并执行算术编码。你知道你最多需要 6 位,这意味着对于 URL 数据库中的每个字符,你现在浪费了 2 位,如果你只是将内容转移到位并使用查找表,你会得到你的20% 压缩。就像那样;)
因为数据非常具体,所以仅使用通用方法进行压缩确实不是一个好主意,最好将信息结构化并将其拆分为可以更有效地存储的数据块。有关该领域的很多知识,请使用这些知识来压缩您的数据。
What's the format of your URLs?
If any URL share one or more domain and you're sufficient with about 2 billion domain names you can create a pool for domain names. And if you have shared relative paths you can pool them to.
For every URL in your database, split each URL into three parts. the scheme and domain e.g. http://mydomain.com the realtive url /my/path/ and then the rest mypage.html?id=4 (if you have query string parameters)
This way you should reduce the overhead of every domain and relative path to just about 8 bytes. That should be better, and fast if you wanna lookup parts of URLs.
Note: just the "http" scheme string itself is 4 bytes, you'll save anything beyond that on every domain entry. If every URL starts with "http://www." you'll save "://www." 7 bytes each time.
Experiment a bit on how to split and structure URLs, I'm betting this is were you'll find your compression. Now, the remaining string that is not a common domain or relative path, what could you do with that?
Compressing URLs
General purpose compression such methods are derived from arithmetic encoding. Shannon the father of information theory wrote a paper about this in the 60's. I've been working with compression for a while and the one thing I've always found, is that general purpose compression never solves the actual problem.
You're in luck, because URLs have structure and that structure you should utilize to better store your URLs.
If you wanna apply a compression algorithm (I think the topic should be changed to reflect URL compression, because it's domain specific) you'll have to examine the entropy of your data. Because it will tell you something about the yield of storage. URLs are ASCII characters, any character not within the ASCII range 0x20-0x7E won't be occurring and throwing away case sensitivity, you're down to a mere 63 distinct states. !"#%&'()*+,-./0123456789:;<=>?@abcdefghijklmnopqrstuvwxyz{|}~ including white-space.
You could create a frequency table of the remaining characters and perform arithmetic encoding. You know that you'll need at most 6-bits, which means for every character in your URL database you're wasting 2 bits right now, and if you just shifted things into place and used a lookup table, you'd get your 20% compression. Just like that ;)
Because the data is so specific it's really not a good idea to just compress with general purpose methods. It's better to structure the information and split that into pieces of data you can store more efficiently. You know a lot about the domain, use that knowledge to compress your data.
摘要:
大型搜索引擎和网络蜘蛛的一个常见问题是如何处理大量遇到的URL。 传统的搜索引擎和网络蜘蛛使用硬盘来存储 URL,而不进行任何压缩。 这会导致性能下降和需要更多空间。 本文描述了一种简单的 URL 压缩算法,可以实现高效的压缩和解压缩。 压缩算法基于增量编码方案来提取共享公共前缀的 URL 和 AVL 树以获得高效的搜索速度。 我们的结果表明尺寸减小了 50%。 1.——
Kasom Koht-arsa 计算机工程系。
资源
Abstract:
A common problem of large scale search engines and web spiders is how to handle a huge number of encountered URLs. Traditional search engines and web spiders use hard disk to store URLs without any compression. This results in slow performance and more space requirement. This paper describes a simple URL compression algorithm allowing efficient compression and decompression. The compression algorithm is based on a delta encoding scheme to extract URLs sharing common prefixes and an AVL tree to get efficient search speed. Our results show that the 50 % of size reduction is achieved. 1.
-- Kasom Koht-arsa Department of Computer Engineering.
Resource
如何使用 URL 表?
您通常只进行“范围扫描”或唯一 ID 查找吗?
如果您不会执行类似
WHERE url like "/xxxx/question/%"
的操作。 您可以在 varchar() 上使用散列索引而不是 b 树索引来减少磁盘查找次数。How to you use the URL table?
Do you usually do a "range scan" or unique id lookup only?
If you won't do something like
WHERE url like "/xxxx/question/%"
. You could use a hashed index rather then a b-tree index on varchar() to reduce the number of disk seeks.是 97 个字节,还是 97 个 8 位 ASCII 字符,还是 97 个 16 位 Unicode 字符?
假设您的所有网址都是合法网址,符合 http://www.w3。 org/Addressing/URL/url-spec.txt,那么您应该只有 ASCII 字符。
如果 97 个 16 位 Unicode 字符,只需存储每个字符的低字节即可自动节省 50%。
如果有 97 个 8 位字符,请注意您只需要 7 位。 您只需一次将 7 位传递到您的比特流中,并将该比特流存储到您的数据库中即可; 使用一些较旧的 7 位传输协议; 或者想出您自己的临时方法,将每 8 个字符的位存储在前 7 个字符的高位中。
Is that 97 bytes, or 97 8-bit ASCII characters, or 97 16-bit Unicode characters?
Assuming that all your URLs are legal URLs as per http://www.w3.org/Addressing/URL/url-spec.txt, then you should have only ASCII characters.
If 97 16-bit Unicode characters simply storing the lower byte of each character will automatically give you 50% savings.
If 97 8-bit characters, notice that you only need 7-bits. You can simply pass in 7 bits at a time into your bitstream and store that bitstream into your database; use some older 7-bit transmission protocol; or come up with your own adhoc way of storing every 8th character's bits in the high bits of the previous 7 characters.