Compressing text with recursive N-Grams



I was just kicking around the idea of breaking up a large group of text into a single integer by using recursive 2-Gram storage until there is only one value left.

CREATE TABLE pair (
    id               INTEGER PRIMARY KEY,
    first_parent_id  INTEGER,   -- points to pair.id
    second_parent_id INTEGER    -- points to pair.id
);

For example, below I have an 11-word sentence (twelve tokens with the period). I could store each word pair in a database ("this" + "is" = ID #1), then store each set of two word pairs in the database (1 + 2 = ID #7), and repeat until only one set is left, which would be ID #12.

This is my group of words which I plan to compress.
---1---|--2-----|--3-----|-----4-|----5--|-------6-
-------7--------|--------8-------|-------9---------
----------------10---------------11----------------
------------------------12-------------------------

Then, using the number "12", we can work backwards (if we have the same dataset):

------------------------12-------------------------
----------------10---------------11----------------
-------7--------|--------8-------|-------9---------
---1---|--2-----|--3-----|-----4-|----5--|-------6-
This is my group of words which I plan to compress.
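
Here is one way this might look as a minimal sketch in Python, with an in-memory dict standing in for the pair table (the names intern_pair, compress_to_id, and expand_id are made up for illustration, and the grouping of an odd leftover item differs slightly from the diagram above):

# Minimal sketch of the recursive 2-gram idea; an in-memory dict stands in
# for the database "pair" table. Illustrative only.

pair_table = {}   # (left, right) -> id
pair_by_id = {}   # id -> (left, right), used when expanding

def intern_pair(left, right):
    """Store a pair once and return its id (a stand-in for the table row id)."""
    key = (left, right)
    if key not in pair_table:
        new_id = len(pair_table) + 1
        pair_table[key] = new_id
        pair_by_id[new_id] = key
    return pair_table[key]

def compress_to_id(tokens):
    """Pair up adjacent items level by level until a single id remains."""
    level = list(tokens)
    while len(level) > 1:
        nxt = [intern_pair(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:        # an odd leftover is carried up unpaired
            nxt.append(level[-1])
        level = nxt
    return level[0]

def expand_id(node):
    """Walk the pair table backwards to rebuild the original token list."""
    if node not in pair_by_id:         # a leaf, i.e. an original word
        return [node]
    left, right = pair_by_id[node]
    return expand_id(left) + expand_id(right)

words = "This is my group of words which I plan to compress .".split()
root = compress_to_id(words)           # a single integer for the whole sentence
assert expand_id(root) == words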

While this would take a tremendous amount of work to compress and decompress each string, it seems like it might have a use in archival work where the contents need to be stored but are almost never read, so in those rare cases a slow decompression process isn't a problem.

Am I thinking about this correctly? Would the possible number of word sequences just be too great to store like this? (Imagine a 500-word document.)

Comments (2)

情深缘浅 2024-12-31 19:59:42


Why do you need "digram words" to achieve compression? If that's not a strict requirement, there are various methods for compressing textual data in different scenarios. These are mostly called dictionary preprocessing. Here is a list of approaches that can be applied in your case:

  1. Count word occurrences and sort them by frequency in descending order. You can use the top N words with a custom encoding method, where N can be configurable by the user. You can even optimize N with dynamic programming or similar. During actual encoding, emit a flag to indicate whether the next symbol is a dictionary word or a directly encoded word.

  2. Build a histogram of digram or trigram character combinations (including spaces, punctuation, etc.). Then use unused byte values to encode the digrams or trigrams that occur most frequently. You can even apply this recursively, scanning over and over again to shrink the source file (see the sketch after this list).
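
As a rough illustration of that second approach, here is a sketch under assumed details: it works on raw bytes and treats byte values absent from the input as free codes, and digram_substitute / digram_expand are made-up names:

# Sketch of digram substitution: repeatedly replace the most frequent pair
# of adjacent bytes with a byte value that does not occur in the data.
from collections import Counter

def digram_substitute(data):
    data = bytearray(data)
    table = {}                                   # code byte -> digram it replaced
    free = [b for b in range(256) if b not in set(data)]
    while free:
        counts = Counter(zip(data, data[1:]))    # histogram of adjacent byte pairs
        if not counts:
            break
        pair, n = counts.most_common(1)[0]
        if n < 2:                                # nothing repeats; no gain left
            break
        code = free.pop()
        table[code] = bytes(pair)
        data = data.replace(bytes(pair), bytes([code]))
    return bytes(data), table

def digram_expand(data, table):
    out = bytes(data)
    for code, pair in reversed(list(table.items())):   # undo in reverse order
        out = out.replace(bytes([code]), pair)
    return out

text = b"this is my group of words which I plan to compress."
packed, table = digram_substitute(text)
assert digram_expand(packed, table) == text     # the table must be stored too

Note that the replacement table has to be stored alongside the output, which is the same dictionary cost discussed next.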

In your case, the scheme is inefficient if you compare it with the methods above, because you don't seem to have considered that you need a really large body of data (the whole pair dictionary) to decode your encoded data. To understand most compression ideas, it's better to write a very simple test program and analyze its output. You'll end up with a stronger, more stable algorithm eventually.

Here are some dictionary preprocessors that come to mind, just to give you a reference:

  1. XWRT: One of the state-of-the-art dictionary preprocessors.
  2. DICT: The preprocessor of the high-performing FreeArc archiver (it's open source). There is an article about it; unfortunately, it's in Russian.
  3. KWC: A simple test dictionary preprocessor which replaces 6-grams with dictionary codes. Look here for the discussion.
  4. bpe2 V3: It's based on n-gram substitution. Other versions: V1 and V2. There is also a discussion about it.

冷心人i 2024-12-31 19:59:42


In short, yes, the possible number of sequences would likely be too great to do this efficiently. The bigger problem is that those word mappings, and the n-grams behind each of those mappings, would need to be stored somewhere, which would greatly outweigh any savings from the actual "compression."
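
A rough back-of-the-envelope version of that point, assuming 8-byte integer ids and an average word length of about 6 characters, and ignoring any reuse of pairs (higher-level pairs essentially never repeat across documents):

# A 500-word document reduces through roughly 499 intermediate pairs
# (a tree with 500 leaves has about 499 internal nodes), and every pair
# is a row holding two integer ids that must live somewhere.
words      = 500
row_bytes  = 2 * 8                     # two 8-byte ids per pair row
table_cost = (words - 1) * row_bytes   # ~7,984 bytes of pair rows
text_cost  = words * 6                 # ~3,000 bytes of plain text
print(table_cost > text_cost)          # True: the mapping outweighs the "savings"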
