Identifying meaningless components in an English sentence

Published 2025-01-07 11:51:15

I'm wondering whether there is an algorithm or a library that can help me identify the components in an English sentence that have no meaning, e.g., very serious grammar errors. If so, could you explain how it works? I would really like to implement it or use it in my own projects.

Here's a random example:

In the sentence: "I closed so etc page hello the door."

As humans, we can quickly identify that [so etc page hello] does not make any sense. Is it possible for a machine to point out that this string does not make any sense and also contains grammar errors?

If there is such a solution, how precise can it be? Is it possible, for example, given a clip of an English sentence, for the algorithm to return a measure indicating how meaningful or correct that clip is? Thank you very much!

PS: I've looked at CMU's link grammar as well as the NLTK library. But I'm still not sure how to use, for example, the link grammar parser to do what I would like to do: if the parser doesn't accept a sentence, I don't know how to tweak it to tell me which part is not right, and I'm not sure whether NLTK supports that.

Another thought I had towards solving the problem is to look at the frequencies of word combinations, since I'm currently interested in correcting only very serious errors. I define a "serious error" as a case where the words in a clip of a sentence are rarely used together, i.e., the frequency of the combination is much lower than those of the other combinations in the sentence.

For instance, in the above example, the four words [so etc page hello] really do seldom occur together. One intuition behind my idea comes from Google: when I type such a combination into Google, no related results come up. So is there any library that provides me with such frequency information, the way Google does? Such frequencies may give a good hint about the correctness of a word combination.
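This frequency idea can be sketched with plain bigram counts. The toy corpus below is a hypothetical stand-in for the Google-scale data the question asks about; any real application would need a much larger corpus:

```python
from collections import Counter

# Hypothetical toy corpus standing in for large-scale frequency data.
corpus = (
    "i closed the door . "
    "she closed the window . "
    "he opened the door . "
    "i closed the book ."
).split()

# Count how often each adjacent word pair (bigram) occurs in the corpus.
bigram_counts = Counter(zip(corpus, corpus[1:]))

def combo_score(sentence):
    """Return the corpus count of each bigram in the sentence.

    Runs of zero counts hint at a "serious error" in the sense defined above.
    """
    words = sentence.lower().split()
    return [bigram_counts[pair] for pair in zip(words, words[1:])]

# Bigrams inside "so etc page hello" never occur in the corpus, so they
# score 0, while "i closed" and "the door" score higher.
print(combo_score("i closed so etc page hello the door"))
```

A stretch of zeros in the middle of otherwise frequent bigrams is exactly the signal described above: the inserted words are rarely used together.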

Comments (3)

凡尘雨 2025-01-14 11:51:15

I think that what you are looking for is a language model. A language model assigns a probability to each sentence of k words appearing in your language. The simplest kind of language model is the n-gram model: given the first i words of your sentence, the probability of observing the (i+1)-th word depends only on the n-1 previous words.

For example, for a bigram model (n=2), the probability of the sentence w1 w2 ... wk is equal to

P(w1 ... wk) = P(w1) P(w2 | w1) ... P(wk | w(k-1)).

To compute the probabilities P(wi | w(i-1)), you just have to count the number of occurrences of the bigram w(i-1) wi and of the word w(i-1) in a large corpus: P(wi | w(i-1)) = count(w(i-1) wi) / count(w(i-1)).
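A minimal sketch of this counting approach in Python. The tiny corpus is illustrative only; a real model needs a large corpus and smoothing for unseen bigrams (here an unseen bigram simply drives the probability to 0), and the P(w1) term is omitted for brevity:

```python
from collections import Counter

# Tiny illustrative training corpus.
tokens = "the cat sat on the mat the cat ate the fish".split()

# Bigram and unigram counts over the corpus.
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)

def bigram_prob(prev, word):
    """P(word | prev) = count(prev word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

def sentence_prob(words):
    """Approximate P(w1 ... wk) as the product of P(wi | w(i-1))."""
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)
    return p

# A fluent sequence gets a nonzero score; an unseen combination scores 0.
print(sentence_prob("the cat sat".split()))   # 0.25 on this corpus
print(sentence_prob("the cat fish".split()))  # 0.0
```

In practice one would compare log-probabilities normalized by sentence length, and use smoothing (e.g., add-one or Kneser-Ney) so that unseen but plausible bigrams do not zero out the whole sentence.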

Here is a good tutorial paper on the subject: A Bit of Progress in Language Modeling, by Joshua Goodman.

·深蓝 2025-01-14 11:51:15

Yes, such things exist.

You can read about it on Wikipedia.

You can also read about some of the precision issues here.

As far as determining which part is not right after determining the sentence has a grammar issue, that is largely impossible without knowing the author's intended meaning. Take, for example, "Over their, dead bodies" and "Over there dead bodies". Both are incorrect, and could be fixed either by adding/removing the comma or swapping their/there. However, these result in very different meanings (yes, the second one would not be a complete sentence, but it would be acceptable/understandable in context).

Spell checking works because there are a limited number of words against which you can check a word to determine if it is valid (spelled correctly). However, there are infinite sentences that can be constructed, with infinite meanings, so there is no way to correct a poorly written sentence without knowing what the meaning behind it is.
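The finite-dictionary point can be illustrated with a simple set lookup. The word list here is a hypothetical stand-in for a real dictionary file; note that both example sentences pass a word-level check even though at most one can be what the author meant:

```python
# Spell checking is a membership test against a finite set of valid words.
# This tiny word list stands in for a real dictionary file.
valid_words = {"over", "their", "there", "dead", "bodies"}

def misspelled(sentence):
    """Return the words in the sentence that are not in the dictionary."""
    return [w for w in sentence.lower().replace(",", "").split()
            if w not in valid_words]

# Both variants pass the spell check, yet they mean very different things:
# word-level checks cannot recover the author's intent.
print(misspelled("Over their, dead bodies"))  # []
print(misspelled("Over there dead bodies"))   # []
```

This is exactly why sentence-level correction is so much harder: every word can be valid while the sentence as a whole is still wrong in a way that depends on intent.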

喜爱纠缠 2025-01-14 11:51:15

I think what you are looking for is a well-established library that can process natural language and extract the meanings.

Unfortunately, there's no such library. Natural language processing, as you probably can imagine, is not an easy task. It is still a very active research field. There are many algorithms and methods in understanding natural language, but to my knowledge, most of them only work well for specific applications or words of specific types.

And those libraries, such as the CMU one, still seem to be quite rudimentary. They can't do what you want to do (like identifying errors in an English sentence) out of the box. You have to develop an algorithm to do that yourself, using the tools they provide (such as a sentence parser).

If you want to learn more about it, check out ai-class.com. They have some sections that talk about processing language and words.
