Compressing text with recursive N-Grams



I was just kicking around the idea of breaking up a large group of text into a single integer by using recursive 2-Gram storage until there is only one value left.

CREATE TABLE pair (
    id               INTEGER PRIMARY KEY,
    first_parent_id  INTEGER,   -- points to pair.id
    second_parent_id INTEGER    -- points to pair.id
);

For example, below I have an 11-word sentence (twelve tokens with the period). I could store each word pair in a database ("this" + "is" = ID #1), then store each set of two word pairs in the database (1 + 2 = ID #7), and repeat until only one set is left, which would be ID #12.

This is my group of words which I plan to compress.
---1---|--2-----|--3-----|-----4-|----5--|-------6-
-------7--------|--------8-------|-------9---------
----------------10---------------11----------------
------------------------12-------------------------

Then, using the number "12", we can work backwards (if we have the same dataset):

------------------------12-------------------------
----------------10---------------11----------------
-------7--------|--------8-------|-------9---------
---1---|--2-----|--3-----|-----4-|----5--|-------6-
This is my group of words which I plan to compress.
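
Here is one way this might look as a minimal sketch in Python, with an in-memory dict standing in for the pair table (the names intern_pair, compress_to_id, and expand_id are made up for illustration, and the grouping of an odd leftover item differs slightly from the diagram above):

# Minimal sketch of the recursive 2-gram idea; an in-memory dict stands in
# for the database "pair" table. Illustrative only.

pair_table = {}   # (left, right) -> id
pair_by_id = {}   # id -> (left, right), used when expanding

def intern_pair(left, right):
    """Store a pair once and return its id (a stand-in for the table row id)."""
    key = (left, right)
    if key not in pair_table:
        new_id = len(pair_table) + 1
        pair_table[key] = new_id
        pair_by_id[new_id] = key
    return pair_table[key]

def compress_to_id(tokens):
    """Pair up adjacent items level by level until a single id remains."""
    level = list(tokens)
    while len(level) > 1:
        nxt = [intern_pair(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:        # an odd leftover is carried up unpaired
            nxt.append(level[-1])
        level = nxt
    return level[0]

def expand_id(node):
    """Walk the pair table backwards to rebuild the original token list."""
    if node not in pair_by_id:         # a leaf, i.e. an original word
        return [node]
    left, right = pair_by_id[node]
    return expand_id(left) + expand_id(right)

words = "This is my group of words which I plan to compress .".split()
root = compress_to_id(words)           # a single integer for the whole sentence
assert expand_id(root) == words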

While this would take a tremendous amount of work to compress and decompress each string, it seems like it might have a use in archival work where the contents need to be stored but are almost never read, so in those rare cases a slow decompression process isn't a problem.

Am I thinking about this correctly? Would the possible number of word sequences just be too great to store like this? (Imagine a 500-word document.)

Comments (2)

情深缘浅 2024-12-31 19:59:42


Why do you need "digram words" to achieve compression? If that's not a strict requirement, there are various methods for compressing textual data in different scenarios. These are mostly called dictionary preprocessing. Here is a list of approaches that can be applied in your case:

  1. Count word occurrences and sort them by frequency in descending order. You can use the top N words with a custom encoding method, where N can be configurable by the user. You can even optimize N with dynamic programming or similar. During actual encoding, emit a flag to indicate whether the next symbol is a dictionary word or a directly encoded word.

  2. Build a histogram of digram or trigram character combinations (including spaces, punctuation, etc.). Then use unused byte values to encode the digrams or trigrams that occur most frequently. You can even apply this recursively, scanning over and over again to shrink the source file (see the sketch after this list).
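
As a rough illustration of that second approach, here is a sketch under assumed details: it works on raw bytes and treats byte values absent from the input as free codes, and digram_substitute / digram_expand are made-up names:

# Sketch of digram substitution: repeatedly replace the most frequent pair
# of adjacent bytes with a byte value that does not occur in the data.
from collections import Counter

def digram_substitute(data):
    data = bytearray(data)
    table = {}                                   # code byte -> digram it replaced
    free = [b for b in range(256) if b not in set(data)]
    while free:
        counts = Counter(zip(data, data[1:]))    # histogram of adjacent byte pairs
        if not counts:
            break
        pair, n = counts.most_common(1)[0]
        if n < 2:                                # nothing repeats; no gain left
            break
        code = free.pop()
        table[code] = bytes(pair)
        data = data.replace(bytes(pair), bytes([code]))
    return bytes(data), table

def digram_expand(data, table):
    out = bytes(data)
    for code, pair in reversed(list(table.items())):   # undo in reverse order
        out = out.replace(bytes([code]), pair)
    return out

text = b"this is my group of words which I plan to compress."
packed, table = digram_substitute(text)
assert digram_expand(packed, table) == text     # the table must be stored too

Note that the replacement table has to be stored alongside the output, which is the same dictionary cost discussed next.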

In your case, the scheme is inefficient if you compare it with the methods above, because you don't seem to have considered that you need a really large body of data (the whole pair dictionary) to decode your encoded data. To understand most compression ideas, it's better to write a very simple test program and analyze its output. You'll end up with a stronger, more stable algorithm eventually.

Here are some dictionary preprocessors that come to mind, just to give you a reference:

  1. XWRT: One of the state-of-the-art dictionary preprocessors.
  2. DICT: The preprocessor of the high-performing FreeArc archiver (it's open source). There is an article about it; unfortunately, it's in Russian.
  3. KWC: A simple test dictionary preprocessor which replaces 6-grams with dictionary codes. Look here for the discussion.
  4. bpe2 V3: It's based on n-gram substitution. Other versions: V1 and V2. There is also a discussion about it.

冷心人i 2024-12-31 19:59:42


In short, yes, the possible number of sequences would likely be too great to do this efficiently. The bigger problem is that those word mappings, and the n-grams behind each of those mappings, would need to be stored somewhere, which would greatly outweigh any savings from the actual "compression."
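
A rough back-of-the-envelope version of that point, assuming 8-byte integer ids and an average word length of about 6 characters, and ignoring any reuse of pairs (higher-level pairs essentially never repeat across documents):

# A 500-word document reduces through roughly 499 intermediate pairs
# (a tree with 500 leaves has about 499 internal nodes), and every pair
# is a row holding two integer ids that must live somewhere.
words      = 500
row_bytes  = 2 * 8                     # two 8-byte ids per pair row
table_cost = (words - 1) * row_bytes   # ~7,984 bytes of pair rows
text_cost  = words * 6                 # ~3,000 bytes of plain text
print(table_cost > text_cost)          # True: the mapping outweighs the "savings"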
