Spelling correction for data normalization in Java

Posted 2024-08-23 00:26:20

I am looking for a Java library to do some initial spell checking / data normalization on user generated text content, imagine the interests entered in a Facebook profile.

This text will be tokenized at some point (before or after spell correction, whatever works better) and some of it used as keys to search for (exact match). It would be nice to cut down misspellings and the like to produce more matches. It would be even better if the correction would perform well on tokens longer than just one word, e.g. "trinking coffee" would become "drinking coffee" and not "thinking coffee".

I found the following Java libraries for doing spelling correction:

  1. JAZZY does not seem to be under active development. Also, the dictionary-distance based approach seems inadequate because of the use of non-standard language in social network profiles and multi-word tokens.
  2. APACHE LUCENE seems to have a statistical spell checker that should be much better suited. The question here would be how to create a good dictionary? (We are not using Lucene otherwise, so there is no existing index.)

Any suggestions are welcome!

只有一腔孤勇 2024-08-30 00:26:20

What you want to implement is not a spelling corrector but a fuzzy search. Peter Norvig's essay is a good starting point for building a fuzzy search from candidates checked against a dictionary.
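As a rough sketch of that candidate-generation step, the Java snippet below produces every string one edit away from an input token; the class and method names are made up for illustration. In a full corrector you would keep only the candidates that actually occur in your dictionary and pick the most frequent survivor.

```java
import java.util.HashSet;
import java.util.Set;

/** Norvig-style candidate generation: every string one edit away from the input. */
public class Edits1 {
    private static final String LETTERS = "abcdefghijklmnopqrstuvwxyz";

    public static Set<String> edits1(String word) {
        Set<String> edits = new HashSet<>();
        for (int i = 0; i <= word.length(); i++) {
            String left = word.substring(0, i);
            String right = word.substring(i);
            if (!right.isEmpty()) {
                edits.add(left + right.substring(1));                                     // deletion
            }
            if (right.length() > 1) {
                edits.add(left + right.charAt(1) + right.charAt(0) + right.substring(2)); // transposition
            }
            for (char c : LETTERS.toCharArray()) {
                if (!right.isEmpty()) {
                    edits.add(left + c + right.substring(1));                             // substitution
                }
                edits.add(left + c + right);                                              // insertion
            }
        }
        return edits;
    }
}
```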

Alternatively have a look at BK-Trees.
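A minimal BK-tree sketch over Levenshtein distance, with all names invented for illustration rather than taken from any library:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal BK-tree keyed by Levenshtein distance between a word and its parent. */
public class BKTree {
    private Node root;

    private static final class Node {
        final String word;
        final Map<Integer, Node> children = new HashMap<>();
        Node(String word) { this.word = word; }
    }

    public void add(String word) {
        if (root == null) { root = new Node(word); return; }
        Node node = root;
        while (true) {
            int d = levenshtein(word, node.word);
            if (d == 0) return;                                  // already present
            Node child = node.children.get(d);
            if (child == null) { node.children.put(d, new Node(word)); return; }
            node = child;
        }
    }

    /** All stored words within maxDist edits of the query. */
    public List<String> search(String query, int maxDist) {
        List<String> result = new ArrayList<>();
        if (root != null) search(root, query, maxDist, result);
        return result;
    }

    private void search(Node node, String query, int maxDist, List<String> result) {
        int d = levenshtein(query, node.word);
        if (d <= maxDist) result.add(node.word);
        // Triangle inequality: only edges labelled in [d - maxDist, d + maxDist] can lead to matches.
        for (int i = Math.max(1, d - maxDist); i <= d + maxDist; i++) {
            Node child = node.children.get(i);
            if (child != null) search(child, query, maxDist, result);
        }
    }

    private static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }
}
```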

An n-gram index (as used by Lucene) produces better results for longer words. The approach of producing candidates up to a given edit distance will probably work well enough for words found in normal text, but not well enough for names, addresses and scientific texts. It will increase your index size, though.
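To make the n-gram idea concrete, here is a toy character-trigram index (illustrative names, not Lucene's API): words that share many trigrams with the query are returned as candidates, which you would then re-rank by edit distance.

```java
import java.util.*;

/** Toy character-trigram index: words sharing many trigrams with the query are likely close in edit distance. */
public class TrigramIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    public void add(String word) {
        for (String g : trigrams(word)) {
            postings.computeIfAbsent(g, k -> new HashSet<>()).add(word);
        }
    }

    /** Candidate words ranked by the number of trigrams they share with the query. */
    public List<String> candidates(String query) {
        Map<String, Integer> overlap = new HashMap<>();
        for (String g : trigrams(query)) {
            for (String w : postings.getOrDefault(g, Set.of())) {
                overlap.merge(w, 1, Integer::sum);
            }
        }
        List<String> ranked = new ArrayList<>(overlap.keySet());
        ranked.sort((a, b) -> overlap.get(b) - overlap.get(a));
        return ranked;
    }

    private static List<String> trigrams(String word) {
        String padded = "$$" + word.toLowerCase() + "$$";   // pad so prefixes and suffixes get their own grams
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= padded.length(); i++) {
            grams.add(padded.substring(i, i + 3));
        }
        return grams;
    }
}
```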

If you have the texts indexed you have your text corpus (your dictionary). Only what is in your data can be found anyway. You need not use an external dictionary.

A good resource is Introduction to Information Retrieval - Dictionaries and tolerant retrieval. There is a short description of context-sensitive spelling correction.
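For the "trinking coffee" example, a context-sensitive ranker can score each candidate by how often it forms a bigram with the neighbouring token in your corpus. The class name and the counts below are hypothetical.

```java
import java.util.List;
import java.util.Map;

/** Toy context-sensitive ranking: pick the candidate that forms the most frequent bigram with the next token. */
public class BigramRanker {
    // Bigram counts would come from your own corpus; the numbers used below are made up.
    private final Map<String, Long> bigramCounts;

    public BigramRanker(Map<String, Long> bigramCounts) {
        this.bigramCounts = bigramCounts;
    }

    /** Assumes a non-empty candidate list; returns the candidate with the highest bigram count. */
    public String best(List<String> candidates, String nextToken) {
        String best = candidates.get(0);
        long bestCount = -1;
        for (String c : candidates) {
            long count = bigramCounts.getOrDefault(c + " " + nextToken, 0L);
            if (count > bestCount) {
                bestCount = count;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = Map.of(
                "drinking coffee", 5000L,   // hypothetical counts
                "thinking coffee", 3L);
        BigramRanker ranker = new BigramRanker(counts);
        // Both "drinking" and "thinking" are one edit from "trinking"; the context decides.
        System.out.println(ranker.best(List.of("drinking", "thinking"), "coffee")); // prints "drinking"
    }
}
```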

晒暮凉 2024-08-30 00:26:20

With regard to populating a Lucene index as the basis of a spell checker, this is a good way to solve the problem. Lucene has an out-of-the-box SpellChecker you can use.

There are plenty of word dictionaries available on the net that you can download and use as the basis for your Lucene index. I would suggest supplementing these with a number of domain-specific texts as well, e.g. if your users are medics, then supplement the dictionary with source texts from medical theses and publications.
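A minimal sketch of using Lucene's out-of-the-box SpellChecker against a plain word list, assuming a Lucene 5.x-style API (constructors changed between versions) and hypothetical paths for the spell-check index and the word list:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.spell.PlainTextDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class LuceneSpellDemo {
    public static void main(String[] args) throws Exception {
        // Directory that will hold the spell-check index (hypothetical path).
        try (SpellChecker spellChecker = new SpellChecker(FSDirectory.open(Paths.get("spell-index")))) {
            // Build the index from a plain word list, one word per line (hypothetical file).
            spellChecker.indexDictionary(
                    new PlainTextDictionary(Paths.get("wordlist.txt")),
                    new IndexWriterConfig(new StandardAnalyzer()),
                    true);
            // Ask for the 5 closest suggestions for a misspelled token.
            String[] suggestions = spellChecker.suggestSimilar("drinkng", 5);
            for (String s : suggestions) {
                System.out.println(s);
            }
        }
    }
}
```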

眼睛会笑 2024-08-30 00:26:20

You can hit the Gutenberg Project or the Internet Archive for lots and lots of corpus material.

Also, I think that the Wiktionary could help you. You can even make a direct download.

情释 2024-08-30 00:26:20

http://code.google.com/p/google-api-spelling-java is a good Java spell-checking library, but I agree with Thomas Jung that it may not be the answer to your problem.
