Writing a synthetic English phrase containing 160 bits of recoverable information


I have 160 bits of random data.

Just for fun, I want to generate a pseudo-English phrase to "store" this information in. I want to be able to recover the information from the phrase.

Note: This is not a security question; I don't care whether someone else would be able to recover the information, or even to detect that it is there.

Criteria for better phrases, from most important to least important:

  • Short
  • Unique
  • Natural-looking

The current approach, suggested here:

Take three lists of 1024 nouns, verbs and adjectives each (picking the most popular ones). Generate a phrase by the following pattern, reading 10 bits for each word (16 words × 10 bits = 160 bits):

Noun verb adjective verb,
Noun verb adjective verb,
Noun verb adjective verb,
Noun verb adjective verb.
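A minimal sketch of this scheme in Python (the word lists here are placeholders; real lists would hold the 1024 most common words of each kind). Each 1024-entry list makes a word carry exactly 10 bits, so 16 slots cover the whole 160 bits:

import os

# Placeholder 1024-word lists; real lists would hold common English words.
NOUNS      = [f"noun{i}" for i in range(1024)]
VERBS      = [f"verb{i}" for i in range(1024)]
ADJECTIVES = [f"adj{i}" for i in range(1024)]

# Four noun-verb-adjective-verb sentences: 16 slots of 10 bits each.
PATTERN = [NOUNS, VERBS, ADJECTIVES, VERBS] * 4

def encode(data: bytes) -> str:
    assert len(data) == 20                   # 160 bits
    n = int.from_bytes(data, "big")
    words = []
    for word_list in PATTERN:
        words.append(word_list[n & 1023])    # consume 10 bits
        n >>= 10
    return " ".join(words)

def decode(phrase: str) -> bytes:
    n = 0
    for i, word in enumerate(phrase.split()):
        n |= PATTERN[i].index(word) << (10 * i)
    return n.to_bytes(20, "big")

data = os.urandom(20)
assert decode(encode(data)) == data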

Now, this seems to be a good approach, but the phrase is a bit too long and a bit too dull.

I have found a corpus of words here (Part of Speech Database).

After some ad-hoc filtering, I calculated that this corpus contains approximately:

  • 50690 usable adjectives
  • 123585 nouns
  • 15301 verbs
  • 13010 adverbs (not included in the pattern, but mentioned in answers)

This allows me to use up to

  • 15 bits per adjective
  • 16 bits per noun (actually 16.9, but I can't figure out how to use fractional bits; a mixed-radix trick for this is sketched below)
  • 13 bits per verb
  • 13 bits per adverb

For the noun-verb-adjective-verb pattern this gives 16 + 13 + 15 + 13 = 57 bits per "sentence" in the phrase. This means that, if I use all the words I can get from this corpus, I can generate three sentences instead of four (160 / 57 ≈ 2.8):

Noun verb adjective verb,
Noun verb adjective verb,
Noun verb adjective verb.

Still a bit too long and dull.
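On the fractional-bits point above: one standard trick is to treat the whole payload as a single big integer and peel off one word index per slot with divmod by the actual list size (a mixed-radix numeral system), so nothing is lost to rounding. With the corpus counts above, a noun-verb-adjective-verb sentence then carries log2(123585 × 15301 × 50690 × 15301) ≈ 60.3 bits instead of 57, and three sentences (≈ 181 bits) still hold 160 comfortably. A sketch, with the word lists again left out and only their sizes used:

import os

# Corpus counts from the question; real code would index into actual lists.
SIZES = [123585, 15301, 50690, 15301] * 3    # three noun-verb-adjective-verb sentences

def to_indices(data: bytes) -> list[int]:
    n = int.from_bytes(data, "big")
    indices = []
    for size in SIZES:
        n, idx = divmod(n, size)             # index into that slot's word list
        indices.append(idx)
    if n:
        raise ValueError("data too large for this pattern")
    return indices

def from_indices(indices: list[int]) -> bytes:
    n = 0
    for idx, size in zip(reversed(indices), reversed(SIZES)):
        n = n * size + idx
    return n.to_bytes(20, "big")

data = os.urandom(20)
assert from_indices(to_indices(data)) == data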

Any hints on how I can improve it?

What I see that I could try:

  • Try to compress my data somehow before encoding. But since the data is completely random, only some phrases would be shorter (and, I guess, not by much).

  • Improve the phrase pattern, so it would look better.

  • Use several patterns, using the first word of the phrase to somehow indicate, for later decoding, which pattern was used (for example, by its last letter or even its length). Pick the pattern according to the first bytes of the data. (One such scheme is sketched after this list.)

...My English isn't good enough to come up with better phrase patterns. Any suggestions?

  • Use more linguistics in the pattern: different tenses and so on.

...I guess I would need a much better word corpus than I have now for that. Any hints on where I can get a suitable one?
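Picking up the several-patterns bullet from the list above, here is one hypothetical way to do it: spend the first two bits of the data on the pattern choice, and give each pattern a distinct part of speech in its opening slot, so the decoder can identify the pattern from the first word alone (this requires the four word lists to be disjoint). The pattern shapes below are invented purely for illustration:

# The first 2 bits of the data select one of four patterns; each pattern
# opens with a different part of speech, so the first word of the phrase
# tells the decoder which pattern was used.
PATTERNS = {
    0b00: ("noun", "verb", "adjective", "noun"),
    0b01: ("adjective", "noun", "verb", "adverb"),
    0b10: ("verb", "adverb", "adjective", "noun"),
    0b11: ("adverb", "adjective", "noun", "verb"),
}

def pick_pattern(data: bytes):
    selector = data[0] >> 6                  # top 2 bits of the first byte
    return selector, PATTERNS[selector]

def identify_pattern(first_word: str, lists: dict) -> int:
    # lists maps part of speech -> set of words; the four sets must be
    # disjoint so the first word pins the pattern down unambiguously.
    for selector, pattern in PATTERNS.items():
        if first_word in lists[pattern[0]]:
            return selector
    raise ValueError("first word not found in any list")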


煞人兵器 2024-10-19 13:39:57


I would consider adding adverbs to your list. Here is a pattern I came up with:

<Adverb>, the
    <adverb> <adjective>, <adverb> <adjective> <noun> and the
    <adverb> <adjective>, <adverb> <adjective> <noun>
<verb> <adverb> over the <adverb> <adjective> <noun>.

This can encode 181 bits of data. I derived this figure using lists I made a while back from WordNet data (the counts are probably a bit off because I included compound words):

  • 12650 usable nouns (13.6 bits/noun, rounded down)
  • 5247 usable adjectives (12.3 bits/adjective)
  • 5009 usable verbs (12.2 bits/verb)
  • 1512 usable adverbs (10.5 bits/adverb)
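For reference, the 181-bit figure follows from the slot counts in the pattern (7 adverbs, 5 adjectives, 3 nouns, 1 verb) combined with the rounded-down per-word sizes above; a quick check:

import math

slots      = {"adverb": 7, "adjective": 5, "noun": 3, "verb": 1}
list_sizes = {"adverb": 1512, "adjective": 5247, "noun": 12650, "verb": 5009}

total = sum(n * math.floor(math.log2(list_sizes[pos])) for pos, n in slots.items())
print(total)   # 7*10 + 5*12 + 3*13 + 1*12 = 181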

Example sentence: "Soaking, the habitually goofy, socially speculative swatch and the fearlessly cataclysmic, somewhere reciprocal macrocosm foreclose angelically over the unavoidably intermittent comforter."
