用于序列校正的 N-Gram

发布于 2024-07-25 05:13:02 字数 652 浏览 7 评论 0原文

抱歉问了这个困难的问题。

我有一大组序列需要通过添加数字或替换它们(不要删除任何内容)来纠正,如下所示:

  • 1,2,,3 => 1,7,4,3
  • 4,,5,6 => 4,4,5,6
  • 4,7,8,9 => 4,7,8,9,1
  • 4,7 => 4,8
  • 4,7,1 => 4,7,2

它以填充的原始序列和样本校正开始。

我希望能够通过计算被校正的不同 n-gram 的频率来自动校正序列,第一个样本将变为

  • 1=>1
  • 2=>7
  • 3=>3
  • 1,2 =>1,7
  • 2,3=>7,4,3
  • 1,2,3=>1,7,4,3

我会收集这些 n-gram 校正的频率,并且我正在寻找了解一种计算纠正样本数据中可能存在或不存在的新输入的最佳方法的方法。

这似乎与 SMT 类似。

Sorry for the difficult question.

I have a large set of sequences to be corrected by either/or adding digits or replacing them (never removing anything) that looks like this:

  • 1,2,,3 => 1,7,4,3
  • 4,,5,6 => 4,4,5,6
  • 4,7,8,9 => 4,7,8,9,1
  • 4,7 => 4,8
  • 4,7,1 => 4,7,2

It starts with a padded original sequence, and a sample correction.

I'd like to be able to work on correcting the sequences automatically by calculating the frequencies of the different n-grams being corrected, the first sample would become

  • 1=>1
  • 2=>7
  • 3=>3
  • 1,2=>1,7
  • 2,3=>7,4,3
  • 1,2,3=>1,7,4,3

I'd collect the frequency of these n-grams corrections, and I'm looking for a way to calculate the best way to correct a new input that may or may not be in the sample data.

This seems to be similar to SMT.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

守望孤独 2024-08-01 05:13:02

根据替换的长度和出现的次数,为已知的替换分配一个分数。 天真地,我建议使这个分数与长度的平方(在我能想到的大多数情况下,越长的匹配越罕见)和出现次数的平方根成正比,这样 4 项序列就有同样多的权重作为出现次数为 16 次的 2 项序列。 这个需要根据自己的实际情况进行调整。

给定长度为 M 的序列,有 N 个长度为 1 到 M 的子字符串,其中 N=M*(M+1)/2,因此如果字符串相当短,那么您可以迭代每个子字符串并查找可能的替换。 我认为,用这些子字符串组成整个字符串的方法数量也与 M^2 成正比。

对于原始字符串由子字符串组成的每种可能的组合,将每个子字符串的最佳(最高分数)替换的总分相加。

总分最高的作品将(可能,考虑到我对过程的假设)替换后的“最佳”结果。

Assign known replacements a score, based on the length of the replacement and the number of occurrences. Naively, I would suggest making this score proportional to the square of the length (longer matches being rarer, in most scenarios I can think of) and the square root of the number of occurrences, such that a 4-item sequence has as much weight as a 2-item sequence that occurs 16 times as often. This would need to be adjusted based on your actual situation.

Given a sequence of length M, there are N substrings of lengths 1 to M, where N=M*(M+1)/2, so if the strings are reasonably short then you could iterate over every substring and look up possible replacements. The number of ways to compose the whole string out of these substrings is also proportional to M^2, I think.

For every possible composition of the original string by substrings, add up the total score of the best (highest score) replacement for each substring.

The composition with the highest total score will be (potentially, given my assumptions about the process) the "best" post-replacement result.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文