
Storing n-grams in a database in < n tables

Posted on 2024-09-01 23:03:56

If I were writing a piece of software that attempted to predict what word a user was going to type next using the two previous words the user had typed, I would create two tables.

Like so:

== 1-gram table ==
Token | NextWord | Frequency
------+----------+-----------
"I"   | "like"   | 15
"I"   | "hate"   | 20

== 2-gram table ==
Token    | NextWord   | Frequency
---------+------------+-----------
"I like" | "apples"   | 8
"I like" | "tomatoes" | 12
"I hate" | "tomatoes" | 20
"I hate" | "apples"   | 2

Following this example implementation, the user types "I" and the software, using the above database, predicts that the next word the user is going to type is "hate". If the user does type "hate", the software then predicts that the next word is "tomatoes".
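
As a minimal sketch of that lookup, assuming the two tables are held in memory as Python dictionaries (the names `TABLES` and `predict` are illustrative and not part of the original post), prediction amounts to picking the `NextWord` with the highest `Frequency` for the current token:

```python
# Hypothetical in-memory stand-ins for the 1-gram and 2-gram tables above.
TABLES = {
    1: {"I": {"like": 15, "hate": 20}},
    2: {
        "I like": {"apples": 8, "tomatoes": 12},
        "I hate": {"tomatoes": 20, "apples": 2},
    },
}


def predict(words):
    """Return the most frequent next word for the typed words, or None."""
    # The table to consult is chosen by how many words have been typed.
    candidates = TABLES.get(len(words), {}).get(" ".join(words))
    if not candidates:
        return None
    # Pick the NextWord with the highest Frequency.
    return max(candidates, key=candidates.get)


print(predict(["I"]))          # hate
print(predict(["I", "hate"]))  # tomatoes
```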

However, this implementation would require a table for each additional n-gram order that I choose to take into account. If I decided to take the 5 or 6 preceding words into account when predicting the next word, I would need 5-6 tables, with an exponential increase in space per n-gram.

What would be the best way to represent this in only one or two tables, with no upper limit on the number of n-grams I can support?

海拔太高太耀眼 2024-09-08 23:03:57

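Actually, you could leave it as it is and just use a single table. A 2-gram can never be equal to a 1-gram, because the 2-gram contains a space. Likewise, no 3-gram can equal any 2-gram, because the 3-gram contains two spaces. And so on, ad infinitum.

So you can put all the 1-grams, 2-grams, etc. into the Token field and nothing will ever collide.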
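
A rough sketch of this single-table idea, using Python's built-in `sqlite3` module. The table name `ngrams`, the `predict` helper, and the back-off from longer to shorter contexts are assumptions added for illustration and are not part of the answer. It also assumes that individual words never contain a space, which is what keeps n-grams of different lengths from colliding.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE ngrams (
           Token     TEXT NOT NULL,
           NextWord  TEXT NOT NULL,
           Frequency INTEGER NOT NULL,
           PRIMARY KEY (Token, NextWord)
       )"""
)
# 1-grams and 2-grams share the Token column; the space count keeps them apart.
rows = [
    ("I", "like", 15), ("I", "hate", 20),
    ("I like", "apples", 8), ("I like", "tomatoes", 12),
    ("I hate", "tomatoes", 20), ("I hate", "apples", 2),
]
conn.executemany("INSERT INTO ngrams VALUES (?, ?, ?)", rows)


def predict(conn, words, max_n=6):
    """Try the longest available context first, then back off to shorter ones."""
    words = words[-max_n:]
    for start in range(len(words)):
        token = " ".join(words[start:])
        row = conn.execute(
            "SELECT NextWord FROM ngrams WHERE Token = ? "
            "ORDER BY Frequency DESC LIMIT 1",
            (token,),
        ).fetchone()
        if row:
            return row[0]
    return None  # no context of any length is known


print(predict(conn, ["I"]))                  # hate
print(predict(conn, ["Well", "I", "like"]))  # tomatoes (backs off to "I like")
```

Supporting longer contexts only means inserting longer `Token` strings; the schema itself does not change.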


拧巴小姐 2024-09-08 23:03:56

Try a two-column table:

phrase, frequency

One optimisation would be to "normalise" some words in the phrase, e.g. "isn't" to "is not".

A second optimisation would be to use an MD5, CRC32 or similar hash of the phrase as the key.
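
Both optimisations might look roughly like the sketch below, using the standard-library `hashlib` and `zlib` modules. The `CONTRACTIONS` map, the helper names, and the dictionary standing in for the two-column table are all hypothetical, chosen only to illustrate the idea.

```python
import hashlib
import zlib

# Hypothetical, deliberately tiny contraction map; a real one would be larger.
CONTRACTIONS = {"isn't": "is not", "don't": "do not", "can't": "cannot"}


def normalise(phrase):
    """Expand the contractions listed above, leaving other words untouched."""
    return " ".join(CONTRACTIONS.get(w.lower(), w) for w in phrase.split())


def md5_key(phrase):
    """MD5 digest of the normalised phrase, as 32 hex characters."""
    return hashlib.md5(normalise(phrase).encode("utf-8")).hexdigest()


def crc32_key(phrase):
    """32-bit CRC of the normalised phrase; smaller, but collisions are likelier."""
    return zlib.crc32(normalise(phrase).encode("utf-8"))


table = {}  # key -> [phrase, frequency], standing in for the two-column table


def record(phrase):
    """Increment the frequency stored under the phrase's hash key."""
    entry = table.setdefault(md5_key(phrase), [normalise(phrase), 0])
    entry[1] += 1


record("it isn't ripe")
record("it is not ripe")
print(table[md5_key("it is not ripe")])  # ['it is not ripe', 2]
```

A fixed-width key keeps the index compact however long the phrase gets. The trade-off is that a hash can collide, so storing the phrase alongside the key, as done here, lets a lookup verify the match.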


拥醉 2024-09-08 23:03:56

Why not just store them all in the one table?

Token    | NextWord   | Frequency
---------+------------+-----------
"I"      | "like"     | 15
"I"      | "hate"     | 20
"I like" | "apples"   | 8
"I like" | "tomatoes" | 12
"I hate" | "tomatoes" | 20
"I hate" | "apples"   | 2

It'd then be up to your software to decide what you pass in for 'Token', and also when you insert new values (i.e. don't insert a partially typed word). If you want to get tricky, you can have an extra column for the number of words, but I don't think that would actually be required (the number of spaces + 1 is the number of words).
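
One way the software side could look, sketched with Python's `sqlite3` module. The `learn` and `word_count` helpers, the `max_n` limit, and the rule that a word counts as finished only once whitespace follows it are assumptions made for this example, not something the answer specifies.

```python
import sqlite3


def open_db():
    """Create an in-memory database holding the single n-gram table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ngrams (Token TEXT, NextWord TEXT, Frequency INTEGER, "
        "PRIMARY KEY (Token, NextWord))"
    )
    return conn


def word_count(token):
    """Number of words in a Token: spaces + 1, so no extra column is needed."""
    return token.count(" ") + 1


def learn(conn, text, max_n=3):
    """Count every (Token, NextWord) pair for contexts of 1..max_n words."""
    words = text.split()
    # If the text does not end in whitespace, the last word is still being
    # typed, so it is skipped rather than inserted.
    if text and not text[-1].isspace():
        words = words[:-1]
    for i in range(1, len(words)):
        for n in range(1, min(max_n, i) + 1):
            token = " ".join(words[i - n:i])
            cur = conn.execute(
                "UPDATE ngrams SET Frequency = Frequency + 1 "
                "WHERE Token = ? AND NextWord = ?",
                (token, words[i]),
            )
            if cur.rowcount == 0:  # pair not seen before
                conn.execute(
                    "INSERT INTO ngrams VALUES (?, ?, 1)", (token, words[i])
                )


conn = open_db()
learn(conn, "I like apples and I like tom")  # "tom" is unfinished, so ignored
```

Raising `max_n` is the only change needed to consider more preceding words.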
