"Compressing" word pairs (infinitive and inflected forms)

Posted on 2025-01-08 00:28:27

I have a large database containing words and their inflected forms, e.g.:

BASIC_FORM ##### INFLECTED_FORM

talk ----- talk
talk ----- talking
talk ----- talked
talk ----- talks
paragraph ----- paragraph
paragraph ----- paragraphs
...

Naturally, this database requires a lot of disk space once it holds 1 million entries or more.

What is the best method to "compress" that set of data, i.e. to reduce the required disk space without losing any information?

My first idea was to create an extra column which holds the number of characters that can be copied from the beginning of the basic form. Then you just have to save the part of the inflected form that differs, e.g.:

BASIC_FORM ##### NUM_EQUAL ##### INFLECTED_FORM

talk ----- 4 ----- 
talk ----- 4 ----- ing
talk ----- 4 ----- ed
talk ----- 4 ----- s
try ----- 3 ----- 
try ----- 2 ----- ied
paragraph ----- 9 ----- 
paragraph ----- 9 ----- s
...

This should save some disk space: "NUM_EQUAL" can be stored as a TINYINT in MySQL (for example), so it requires only 1 byte, while in the "INFLECTED_FORM" string you usually save more than 1 character (i.e. more than 1 byte).
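A minimal Python sketch of the shared-prefix idea described above, for illustration only. The function names (`common_prefix_len`, `encode`, `decode`) are hypothetical, and the cap of 255 assumes NUM_EQUAL is stored in a single unsigned byte.

```python
def common_prefix_len(base: str, inflected: str) -> int:
    """Count how many leading characters the two words share."""
    n = 0
    for a, b in zip(base, inflected):
        if a != b:
            break
        n += 1
    return n


def encode(base: str, inflected: str) -> tuple[int, str]:
    """Return (NUM_EQUAL, remaining suffix) for one word pair."""
    # Assumption: NUM_EQUAL must fit into one unsigned byte, so cap it at 255.
    num_equal = min(common_prefix_len(base, inflected), 255)
    return num_equal, inflected[num_equal:]


def decode(base: str, num_equal: int, suffix: str) -> str:
    """Rebuild the full inflected form from the stored parts."""
    return base[:num_equal] + suffix


pairs = [
    ("talk", "talk"),
    ("talk", "talking"),
    ("try", "tried"),
    ("paragraph", "paragraphs"),
]
encoded = [(base, *encode(base, inflected)) for base, inflected in pairs]
```

Because `decode(base, *encode(base, inflected))` always returns the original inflected form, no information is lost. How much space is actually saved depends on the per-row overhead of the storage engine, which this sketch does not model.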

Do you have other suggestions to save disk space?

Comments (3)

深海夜未眠 2025-01-15 00:28:27

You should normalize the model. That means creating a separate table for the basic_form. I'm not sure how much space you will save that way, because that depends on the data (the longer the words and the more inflections you have, the more space you'll save). However, let's say you only have one word and one inflected word for each (I know that's not the case, but let's take it to that extreme): then having two tables would increase the storage needed.

Now, after applying the previous refactoring (which will also save you some headaches, as normalization always does!), you can also apply YOUR system to reduce the space needed to store the inflections.

夜未央樱花落 2025-01-15 00:28:27

Why not create two tables like:

BasicForm
  id
  word

InflectedForm
  id
  basicFormId
  inflectedWord

This way, you remove all of the duplication created by repeating the basic form of the word for each inflection.
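The two-table layout above could be tried out roughly as follows. This is only a sketch: it uses Python's built-in `sqlite3` module with an in-memory database instead of MySQL, the column types and constraints are assumptions, and the helpers `add_pair` and `inflections_of` are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE BasicForm (
    id   INTEGER PRIMARY KEY,
    word TEXT NOT NULL UNIQUE
);
CREATE TABLE InflectedForm (
    id            INTEGER PRIMARY KEY,
    basicFormId   INTEGER NOT NULL REFERENCES BasicForm(id),
    inflectedWord TEXT NOT NULL
);
""")


def add_pair(conn, base, inflected):
    """Store the basic form once and link the inflected form to it."""
    conn.execute("INSERT OR IGNORE INTO BasicForm (word) VALUES (?)", (base,))
    (base_id,) = conn.execute(
        "SELECT id FROM BasicForm WHERE word = ?", (base,)
    ).fetchone()
    conn.execute(
        "INSERT INTO InflectedForm (basicFormId, inflectedWord) VALUES (?, ?)",
        (base_id, inflected),
    )


def inflections_of(conn, base):
    """Return all inflected forms of a basic form, in insertion order."""
    rows = conn.execute(
        "SELECT i.inflectedWord FROM InflectedForm AS i "
        "JOIN BasicForm AS b ON b.id = i.basicFormId "
        "WHERE b.word = ? ORDER BY i.id",
        (base,),
    )
    return [row[0] for row in rows]


for base, inflected in [
    ("talk", "talk"), ("talk", "talking"), ("talk", "talked"), ("talk", "talks"),
    ("paragraph", "paragraph"), ("paragraph", "paragraphs"),
]:
    add_pair(conn, base, inflected)
```

Each basic form is stored once, and every inflection row carries only an integer reference to it. Whether that integer is smaller than the repeated word depends on the word lengths and the column type chosen.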

知足的幸福 2025-01-15 00:28:27

There are several frameworks out there that need inflection capabilities in order to create object names when building model files, and they work pretty well. The one I use is CakePHP's Inflector class.

Goes something like:

// Returns the plural form when $howMany is greater than 1, otherwise the singular form.
public static function inflect($rootString, $howMany)
{
    return ((int) $howMany > 1)
        ? Cake_Inflector::pluralize($rootString)
        : Cake_Inflector::singularize($rootString);
}

Chances are, if there is something out of the ordinary you need to inflect and it's not built in, you can add it by extending the class. Hopefully this shows you what's possible instead of bulking up a database.
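To stay lossless with a rule-based approach like this, one option is to store only the pairs that the rules get wrong. The following Python sketch illustrates that idea for English plurals only. The rules in `pluralize_by_rule` are a deliberately tiny, hypothetical rule set and not CakePHP's actual Inflector logic.

```python
def pluralize_by_rule(word: str) -> str:
    """Apply a few common English plural rules (illustrative, far from complete)."""
    if word.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"
    if word.endswith("y") and len(word) > 1 and word[-2] not in "aeiou":
        return word[:-1] + "ies"
    return word + "s"


def build_exceptions(pairs):
    """Keep only the pairs that the rules fail to reproduce."""
    return {
        base: plural
        for base, plural in pairs
        if pluralize_by_rule(base) != plural
    }


def plural(word: str, exceptions: dict) -> str:
    """Look up stored exceptions first, then fall back to the rules."""
    return exceptions.get(word, pluralize_by_rule(word))


known = [
    ("paragraph", "paragraphs"),
    ("box", "boxes"),
    ("city", "cities"),
    ("child", "children"),
    ("mouse", "mice"),
]
exceptions = build_exceptions(known)
```

With these five sample pairs, only `child` and `mouse` end up in `exceptions`; the other three are regenerated by the rules. The better the rule set matches the data, the fewer rows have to be stored.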
