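Can you programmatically detect the plural form of an English word and derive its singular form?
Given some (English) word that we shall assume is a plural, is it possible to derive the singular form? I'd like to avoid lookup/dictionary tables if possible.
Some examples:
Examples -> Example: a simple 's' suffix
Glitches -> Glitch: 'es' suffix, as opposed to the above
Countries -> Country: 'ies' suffix
Sheep -> Sheep: no change; a possible fallback for indeterminate values
Libraries in language x are fine, as long as they are open source (i.e., so that someone can examine them to determine how to do it in language y).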
Comments (6)
It really depends on what you mean by 'programmatically'. Part of English works on easy-to-understand rules, and part doesn't. It has to do mainly with frequency. For a brief overview, you can read Pinker's "Words and Rules", but do yourself a favor and don't take the whole generative theory of linguistics entirely to heart. There's a lot more empiricism involved than that school of thought really brings to the pursuit.
A lot of English can be statistically lemmatized. By the way, stemming or lemmatization is the term you're looking for. One of the most effective lemmatizers that works from statistical rules bootstrapped with frequency-based exceptions is the Morpha Lemmatizer. You can give it a shot if you have a project that requires this kind of simplification of strings representing specific English terms.
There are even more naive approaches that accomplish much with respect to normalizing related terms. Take a look at the Porter Stemmer, which is effective enough to cluster together most terms in English.
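To give a feel for how naive such a stemmer is, here is a minimal sketch of only the first step (step 1a) of the Porter algorithm, which is the part that deals with plural suffixes. The function name `porter_step_1a` is my own, and the real algorithm applies several more steps after this one, so treat this as an illustration rather than a working stemmer. Note that it produces stems, not dictionary words.

```python
def porter_step_1a(word: str) -> str:
    """Apply only step 1a of the Porter stemmer (plural suffix stripping)."""
    w = word.lower()
    if w.endswith("sses"):
        return w[:-2]  # caresses -> caress
    if w.endswith("ies"):
        return w[:-2]  # ponies -> poni (a stem, not a real word)
    if w.endswith("ss"):
        return w       # caress -> caress
    if w.endswith("s"):
        return w[:-1]  # cats -> cat
    return w           # no plural-looking suffix: leave unchanged


for w in ["Examples", "Glitches", "Countries", "Sheep"]:
    print(w, "->", porter_step_1a(w))
```

That is good enough for clustering related terms together, but "countri" and "glitche" show why it does not answer the question of recovering the actual singular.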
Going from singular to plural, the English plural form is actually pretty regular compared to some other European languages I have a passing familiarity with. In German, for example, working out the plural form is really complicated (e.g. Land -> Länder). I think there are roughly 20-30 exceptions and the rest follow a fairly simple ruleset:
That being said, going from plural to singular is that much harder because the reverse cases have ambiguities. For example:
So it can be done, but you're going to have a much larger list of exceptions and you're going to have to store a lot of false positives (i.e. things that appear plural but aren't).
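A rough sketch of what that could look like: a few suffix rules, an exception list for irregular plurals, and a list of false positives that must be left alone. The names `IRREGULAR`, `NOT_PLURAL` and `singularize` are hypothetical, and the tables are deliberately tiny; a usable version would need far more entries than shown here.

```python
# Hypothetical, deliberately tiny tables; real lists would be much longer.
IRREGULAR = {"children": "child", "men": "man", "mice": "mouse", "feet": "foot"}
# Words that look plural but aren't, or whose plural equals the singular.
NOT_PLURAL = {"bus", "news", "lens", "sheep", "series"}


def singularize(word: str) -> str:
    """Best-effort singular form; returns the lowercased input if unsure."""
    w = word.lower()
    if w in NOT_PLURAL:
        return w
    if w in IRREGULAR:
        return IRREGULAR[w]
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"  # countries -> country
    if w.endswith(("ches", "shes", "sses", "xes")):
        return w[:-2]        # glitches -> glitch
    if w.endswith("s") and not w.endswith(("ss", "us", "is")):
        return w[:-1]        # examples -> example
    return w                 # fallback: no change


for w in ["Examples", "Glitches", "Countries", "Sheep", "Bus"]:
    print(w, "->", singularize(w))
```

Even this small version misfires on ordinary words ("movies" comes out as "movy", "caches" as "cach"), which is exactly why the exception lists keep growing.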
Is "axes" the plural of "ax" or of "axis"? Even a human cannot tell without context.
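One way to be honest about that ambiguity in code is to return every plausible singular instead of picking one. The sketch below is only an illustration under that assumption; `singular_candidates` is a made-up name, and the rules simply reverse a few common plural suffixes without checking whether the results are real words.

```python
def singular_candidates(plural: str) -> set[str]:
    """Return every singular that a few reversed suffix rules could produce."""
    w = plural.lower()
    candidates = set()
    if w.endswith("ies"):
        candidates.add(w[:-3] + "y")   # countries -> country
    if w.endswith("es"):
        candidates.add(w[:-2])         # axes -> ax
        candidates.add(w[:-2] + "is")  # axes -> axis
    if w.endswith("s"):
        candidates.add(w[:-1])         # axes -> axe
    return candidates or {w}           # nothing matched: assume no change


print(sorted(singular_candidates("axes")))
```

For "axes" this yields "ax", "axe" and "axis", all of which are genuine English words. Choosing among them, or discarding nonsense candidates for other inputs, needs either a word list or the surrounding context.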
You can take a look at Inflector.net - my port of Rails' inflection class.
No - English isn't a language which sticks to many rules.
I think your best bet is either:
It is not possible, as nickf has already said. It would be simple for the classes of words you have described, but what about all the words that naturally end in s? My name, Marius, for example, is not the plural of Mariu. The same goes for Bus, I guess. Pluralization of English words is a one-way function (like a hash function), and you usually need the rest of the sentence or paragraph for context.