An efficient algorithm for spell-checking an HTML document

Posted on 2024-08-16 17:21:52


I have an HTML document, a list of common spelling mistakes, and the correct spelling for each case.
The HTML document will be up to ~50 pages and there are ~30K spelling correction entries.

What is an efficient way to correct all spelling mistakes in this HTML document?

(Note: my implementation will be in Python, in case you know of any relevant libraries.)

I have thought of two possible approaches:

  • build hashtable of the spelling data
  • parse text from HTML
  • split text by whitespace into tokens
  • if token in spelling hashtable replace with correction
  • build new HTML document with updated text
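
A minimal sketch of that first approach, assuming BeautifulSoup for the HTML handling; the corrections dict and the sample HTML below are placeholders, not real data:

```python
# Sketch only: parse the HTML, rewrite each text node token-by-token,
# then serialise the tree back out. Whitespace inside a corrected node
# is collapsed to single spaces, which is fine for a rough first pass.
from bs4 import BeautifulSoup, NavigableString

corrections = {"teh": "the", "recieve": "receive"}  # stand-in for the ~30K entries

def correct_html(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for node in soup.find_all(string=True):        # every text node in the tree
        if node.parent.name in ("script", "style"):
            continue                               # leave scripts/styles untouched
        tokens = node.split()                      # naive whitespace tokenisation
        fixed = [corrections.get(t, t) for t in tokens]
        if fixed != tokens:
            node.replace_with(NavigableString(" ".join(fixed)))
    return str(soup)

print(correct_html("<p>I recieve teh mail.</p>"))  # <p>I receive the mail.</p>
```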

This approach will fail for multi-word spelling corrections, which will exist. The following is a simpler though seemingly less efficient approach that will work for multi-word corrections:

  • iterate spelling data
  • search for word in HTML document
  • if word exists replace with correction
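
And a rough sketch of the second approach with the same placeholder data: one whole-word regex substitution per correction entry, so roughly 30K passes over the document text, which is why it looks less efficient.

```python
# Sketch only: loop over every correction and replace whole-word matches.
# Multi-word entries work here because the search runs over the raw text.
import re

corrections = {"teh": "the", "alot of": "a lot of"}  # stand-in for the ~30K entries

def correct_text(text: str) -> str:
    for wrong, right in corrections.items():
        text = re.sub(r"\b" + re.escape(wrong) + r"\b", right, text)
    return text

print(correct_text("teh cat ate alot of food"))  # the cat ate a lot of food
```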


Comments (2)

时光病人 2024-08-23 17:21:52


You are correct that the first approach will be MUCH faster than the second (additionally, I would recommend looking into tries instead of a straight hash; the space savings will be quite dramatic for 30k words).

To still be able to handle the multi-word cases, you could either keep track of the previous token and thereby check your hash for a combined string such as "prev cur".

Or else you could leave the multi-word corrections out of the hash and combine your two approaches, first using the hash for single words and then doing a scan for the multi-word combos (or vice versa). This could still be relatively fast if the number of multi-word corrections is relatively small.

Be careful though: pulling out word tokens is trickier than just splitting on whitespace. You don't want to fail to correct an error simply because you didn't find 'instence,' with a comma in your hash.
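
A sketch of how those two ideas could be combined (this is not Rob's actual code, and the correction data is made up): dict lookups over regex-extracted word tokens, so the trailing comma in 'instence,' no longer hides the word, plus one extra regex pass for the presumably small set of multi-word corrections.

```python
# Sketch only: single-word fixes via token lookup, multi-word fixes via
# one combined whole-phrase regex pass.
import re

single = {"instence": "instance", "recieve": "receive"}
multi = {"alot of": "a lot of"}

def fix_single_words(text: str) -> str:
    # \w+ pulls out the word itself, leaving punctuation like the comma alone
    return re.sub(r"\w+", lambda m: single.get(m.group(0), m.group(0)), text)

def fix_multi_words(text: str) -> str:
    if not multi:
        return text
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, multi)) + r")\b")
    return pattern.sub(lambda m: multi[m.group(0)], text)

text = "I recieve alot of mail, for instence,"
print(fix_multi_words(fix_single_words(text)))
# I receive a lot of mail, for instance,
```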

笑忘罢 2024-08-23 17:21:52


I agree with Rob's suggestion of using a character-based trie, because I programmed a spelling correction algorithm ages ago based on a dictionary of valid words stored as a trie. By using branch-and-bound I was able to suggest possibly correct spellings of misspelled words (by Levenshtein distance). In addition, since a trie is just a big finite-state machine, it is fairly easy to add common prefixes and suffixes, so it could handle "words" like "postnationalizationalism's".
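
For illustration only, a bare-bones character trie in Python along the lines described above; the branch-and-bound Levenshtein search is omitted and the names/data are made up:

```python
# Sketch only: each node is a dict of child characters; the correction (or a
# "valid word" flag) is stored under a sentinel key at the end of a word.
END = "$"

def trie_insert(root: dict, wrong: str, right: str) -> None:
    node = root
    for ch in wrong:
        node = node.setdefault(ch, {})
    node[END] = right

def trie_lookup(root: dict, word: str):
    node = root
    for ch in word:
        node = node.get(ch)
        if node is None:
            return None
    return node.get(END)

root = {}
trie_insert(root, "instence", "instance")
print(trie_lookup(root, "instence"))  # instance
print(trie_lookup(root, "instant"))   # None
```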
