Automatically checking the spelling of words in a text
[EDIT]In short: How would you write an automatic spell checker? The idea is that the checker builds a list of words from a known good source (a dictionary) and automatically adds new words when they are used often enough. Words which haven't been used in a while should be phased out. So if I delete part of a scene which contains "Mungrohyperiofier", the checker should remember it for a while, and when I type "Mung<Ctrl+Space>" in another scene, it should offer it again. If I don't use the word for, say, a few days, it should forget about it.
At the same time, I'd like to avoid adding typos to the dictionary.[/EDIT]
I want to write a text editor for SciFi stories. The editor should offer word completion for any word used anywhere in the current story. It will only offer a single scene of the story for editing (so you can easily move scenes around).
This means I have three sets:
- The set of all words in all other scenes
- The set of word in the current scene before I started editing it
- The set of words in the current editor
I need to store the sets somewhere as it would be too expensive to build the list from scratch every time. I think a simple plain text file with one-word-per-line is enough for that.
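The one-word-per-line storage described above is simple enough to sketch directly. This is a minimal illustration (the function names `load_word_set` and `save_word_set` are my own, not from the question):

```python
from pathlib import Path

def load_word_set(path):
    """Load a one-word-per-line file into a set; a missing file yields an empty set."""
    p = Path(path)
    if not p.exists():
        return set()
    return set(p.read_text(encoding="utf-8").split())

def save_word_set(path, words):
    """Write the set back, one word per line, sorted for stable diffs."""
    Path(path).write_text("\n".join(sorted(words)) + "\n", encoding="utf-8")
```

With one such file per scene, set #1 is just the union of the loaded sets of all other scenes.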
As the user edits the scene, we have these situations:
- She deletes a word. This word is not used anywhere else in the current scene.
- She types a word which is new.
- She types a word which already exists.
- She types a word which already exists, but makes a typo.
- She corrects a typo in a word which is in set #2.
- She corrects a typo in a word which is in set #1 (i.e. the typo is elsewhere, too).
- She deletes a word which she plans to use again. After the deletion, though, the word is no longer in sets #1 and #3.
The obvious strategy would be to rebuild the word sets when a scene is saved, and to build set #1 from a word-list file per scene.
So my question is: Is there a clever strategy to keep words which aren't used anywhere anymore, but still be able to phase out typos? If possible, this strategy should work in the background without the user even noticing what is going on (i.e. I want to avoid having to grab the mouse to select "add word to dictionary" from a menu).
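One way to make both requirements (remember deleted words for a while, let typos fade) fall out of a single mechanism is to attach a use count and a last-seen timestamp to every word. This is a minimal sketch of that idea, not something from the question itself; the class name, the seven-day window, and the `min_count=2` threshold are all my own assumptions:

```python
import time

PHASE_OUT_DAYS = 7  # assumed window; "a few days" in the question

class AgingDictionary:
    """Each word maps to (use_count, last_seen). Words not used within
    PHASE_OUT_DAYS are dropped at the next prune, so corrected typos
    silently age out without any 'add word to dictionary' interaction."""

    def __init__(self):
        self.words = {}  # word -> (use_count, last_seen_epoch_seconds)

    def touch(self, word, now=None):
        """Record one use of a word (called whenever the scene is saved)."""
        now = time.time() if now is None else now
        count, _ = self.words.get(word, (0, now))
        self.words[word] = (count + 1, now)

    def prune(self, now=None):
        """Drop words that have not been seen within the phase-out window."""
        now = time.time() if now is None else now
        cutoff = now - PHASE_OUT_DAYS * 86400
        self.words = {w: (c, t) for w, (c, t) in self.words.items() if t >= cutoff}

    def completions(self, prefix, min_count=2):
        """Only offer words seen at least min_count times, so a one-off
        typo never shows up as a completion candidate."""
        return sorted(w for w, (c, _) in self.words.items()
                      if w.startswith(prefix) and c >= min_count)
```

A deleted word like "Mungrohyperiofier" survives in the dictionary until the prune window expires, so "Mung<Ctrl+Space>" keeps working for a few days; a typo that is never re-typed never reaches `min_count` and ages out on its own.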
[EDIT] Based on a comment from grieve
So you want to write a spelling checker. Here's Peter Norvig's essay about writing a spelling corrector. It describes a simple and robust spelling corrector. You can use the code he has already written, plus a reference word list (say, from a free dictionary) for the language model.
I would also go to existing open-source spelling checkers, such as aspell and hunspell, to get some ideas.
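The core of Norvig's corrector is generating every string within edit distance 1 of a word and keeping the candidates that are known words. A condensed sketch of that construction (the `correct` helper and its frequency-based tie-break follow his essay in spirit, but the exact signatures here are my own):

```python
def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings at edit distance 1 from word: deletions,
    transpositions, replacements, and insertions."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in alphabet]
    inserts = [L + c + R for L, R in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def correct(word, known_words, counts):
    """Pick the most frequent known candidate; fall back to the word itself."""
    candidates = ({word} & known_words) or (edits1(word) & known_words) or {word}
    return max(candidates, key=lambda w: counts.get(w, 0))
```

Here `known_words` would be the union of the dictionary and the story's word sets, and `counts` the per-word use counts.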
The structure you should use is a trie. Tail/suffix compression will help with memory. You can use a pseudo reference counting GC for keeping track of usage.
For the actual nodes, you would probably need no more than a 32-bit integer: 21 bits for the Unicode code point, and the rest for various other tags and information.
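The 21-bits-plus-flags layout this answer suggests can be illustrated with plain bit operations. One possible packing (the exact field assignment below, an end-of-word flag and a 10-bit saturating reference count, is my own reading of "various other tags and information"):

```python
# One possible 32-bit trie-node payload:
#   bits 0-20  : Unicode code point of this node's character (21 bits)
#   bit  21    : end-of-word flag
#   bits 22-31 : pseudo reference count (10 bits, saturating)

CP_MASK = (1 << 21) - 1
EOW_BIT = 1 << 21
RC_SHIFT = 22
RC_MAX = (1 << 10) - 1

def pack(codepoint, end_of_word, refcount):
    """Pack the three fields into one 32-bit integer; the reference
    count saturates at RC_MAX instead of overflowing into nothing."""
    refcount = min(refcount, RC_MAX)
    return (refcount << RC_SHIFT) | (EOW_BIT if end_of_word else 0) | (codepoint & CP_MASK)

def unpack(node):
    """Return (codepoint, end_of_word, refcount)."""
    return node & CP_MASK, bool(node & EOW_BIT), node >> RC_SHIFT
```

The reference count is what lets a pseudo-GC pass walk the trie and reap words whose count has decayed to zero.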
Reminds me of what I have been told about garbage collection in modern LISP implementations:
- Data, when created, is put in "pool 1".
- When there is a need to garbage collect, the collector looks in pool 1 for unused entries and removes them.
- Any remaining entries are then moved to pool 2.
- Pool 2 is examined only when more memory is needed than pool 1 can release.
- Data from pool 2 that survives a garbage collection is put in pool 3, and so on.
The idea is to dynamically place data in a pool corresponding to its lifetime...
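Applied to the word sets, that generational scheme might look like the sketch below. The class and its details (three generations, collection triggered on scene save) are my own transposition of the GC analogy, not something this answer spells out:

```python
class GenerationalWordPool:
    """Generational pools for words, analogous to a generational GC:
    new words enter pool 0; at each collection, words not used since
    the previous collection are dropped and survivors are promoted one
    pool up. Higher pools would be collected less often, so established
    story words persist while one-off typos die young in pool 0."""

    def __init__(self, generations=3):
        self.pools = [set() for _ in range(generations)]
        self.used = set()  # words used since the last collection

    def add(self, word):
        """A newly typed word enters the youngest pool."""
        self.pools[0].add(word)

    def use(self, word):
        """Mark a word as used (typed or completed) this cycle."""
        self.used.add(word)

    def collect(self, pool_index=0):
        """Drop unused words from the given pool; promote survivors."""
        pool = self.pools[pool_index]
        survivors = pool & self.used
        if pool_index + 1 < len(self.pools):
            self.pools[pool_index + 1] |= survivors
            self.pools[pool_index] = set()
        else:
            self.pools[pool_index] = survivors  # oldest pool keeps its survivors
        self.used -= pool
        return survivors
```

Running `collect(0)` on every scene save, and the older pools only every few days, would give roughly the "remember for a while, then forget" behaviour the question asks for.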