Can you programmatically detect the plural form of an English word and derive the singular form?

Posted 2024-08-03 15:48:12

Given some (English) word that we shall assume is a plural, is it possible to derive the singular form? I'd like to avoid lookup/dictionary tables if possible.

Some examples:

Examples  -> Example    a simple 's' suffix
Glitches  -> Glitch     'es' suffix, as opposed to above
Countries -> Country    'ies' suffix.
Sheep     -> Sheep      no change: possible fallback for indeterminate values

Or, this seems to be a fairly exhaustive list.

Suggestions of libraries in language x are fine, as long as they are open-source (i.e., so that someone can examine them to determine how to do it in language y).

Comments (6)

白况 2024-08-10 15:48:12

It really depends on what you mean by 'programmatically'. Part of English works on easy to understand rules, and part doesn't. It has to do mainly with frequency. For a brief overview, you can read Pinker's "Words and Rules", but do yourself a favor and don't take the whole generative theory of linguistics entirely to heart. There's a lot more empiricism there than that school of thought really lends to the pursuit.

A lot of English can be statistically lemmatized. By the way, stemming or lemmatization is the term you're looking for. One of the most effective lemmatizers that works off statistical rules bootstrapped with frequency-based exceptions is the Morpha Lemmatizer. You can give this a shot if you have a project that requires this type of simplification of strings which represent specific terms in English.

There are even more naive approaches that accomplish much with respect to normalizing related terms. Take a look at the Porter Stemmer, which is effective enough to cluster together most terms in English.
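To give a feel for how naive such a stemmer is, here is a minimal sketch of what I understand to be step 1a of the original Porter algorithm (the plural-handling step only). The function name `porter_step_1a` is made up for this illustration, and the full algorithm has several more steps. Note that it produces stems, not dictionary words.

```python
def porter_step_1a(word: str) -> str:
    """Plural-stripping step of the original Porter stemmer (sketch).

    Rules, tried in order (longest suffix first):
      SSES -> SS, IES -> I, SS -> SS, S -> (nothing)
    """
    w = word.lower()
    if w.endswith("sses"):
        return w[:-2]      # caresses -> caress
    if w.endswith("ies"):
        return w[:-2]      # ponies -> poni (a stem, not a word)
    if w.endswith("ss"):
        return w           # caress -> caress
    if w.endswith("s"):
        return w[:-1]      # cats -> cat
    return w


if __name__ == "__main__":
    for plural in ("examples", "glitches", "countries", "sheep"):
        print(plural, "->", porter_step_1a(plural))
```

The output for "countries" is the stem "countri" rather than "country", which is fine for clustering related terms but not for recovering a readable singular.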

小…楫夜泊 2024-08-10 15:48:12

Going from singular to plural, the English plural form is actually pretty regular compared to some other European languages I have a passing familiarity with. In German, for example, working out the plural form is really complicated (e.g. Land -> Länder). I think there are roughly 20-30 exceptions and the rest follow a fairly simple ruleset:

  • -y -> -ies (family -> families)
  • -us -> -i (cactus -> cacti)
  • -s -> -ses (loss -> losses)
  • otherwise add -s
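
The ruleset above can be sketched roughly as follows. This is only an illustration: the name `pluralize`, the tiny `EXCEPTIONS` table, and the "consonant before -y" check are my own assumptions, not part of the rules as listed.

```python
# Hypothetical exception table; a real one would hold the 20-30 irregulars.
EXCEPTIONS = {"sheep": "sheep", "child": "children", "bus": "buses"}
VOWELS = "aeiou"


def pluralize(word: str) -> str:
    """Apply the four listed rules, after checking the exception table."""
    w = word.lower()
    if w in EXCEPTIONS:
        return EXCEPTIONS[w]
    # -y -> -ies, assumed to apply only after a consonant (day -> days)
    if w.endswith("y") and len(w) > 1 and w[-2] not in VOWELS:
        return w[:-1] + "ies"
    # -us -> -i must be tested before the plain -s rule
    if w.endswith("us"):
        return w[:-2] + "i"
    # -s -> -ses
    if w.endswith("s"):
        return w + "es"
    # otherwise add -s
    return w + "s"


if __name__ == "__main__":
    for singular in ("family", "cactus", "loss", "example", "sheep"):
        print(singular, "->", pluralize(singular))
```

The four rules as listed do not cover endings such as -ch ("glitch"), so a fuller version would need a few more suffix rules or a longer exception table.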

That being said, plural to singular form becomes that much harder because the reverse cases have ambiguities. For example:

  • pies: is it py or pie?
  • ski: is it singular or plural for 'skus'?
  • molasses: is it singular or plural for 'molasse' or 'molass'?

So it can be done, but you're going to have a much larger list of exceptions and you're going to have to store a lot of false positives (i.e. things that appear plural but aren't).
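
One way to make the ambiguity concrete is to reverse the rules and return every candidate instead of a single answer. This is a sketch under my own assumptions: `singular_candidates` is a made-up name, and choosing among the candidates would still need an exception list or a word list.

```python
from typing import List


def singular_candidates(word: str) -> List[str]:
    """Reverse the simple plural rules and list every possible singular.

    The word itself is always included last, because it may already be
    singular (or unchanged in the plural).
    """
    w = word.lower()
    candidates: List[str] = []
    if w.endswith("ies"):
        candidates += [w[:-3] + "y", w[:-1]]   # pies -> py, pie
    elif w.endswith("ses"):
        candidates += [w[:-2], w[:-1]]         # molasses -> molass, molasse
    elif w.endswith("i"):
        candidates.append(w[:-1] + "us")       # ski -> skus
    elif w.endswith("s"):
        candidates.append(w[:-1])              # examples -> example
    candidates.append(w)
    return candidates


if __name__ == "__main__":
    for plural in ("pies", "ski", "molasses"):
        print(plural, "->", singular_candidates(plural))
```

Every word in the examples above yields more than one candidate, which is exactly why the reverse direction needs extra knowledge to pick the right one.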

春夜浅 2024-08-10 15:48:12

Is "axes" the plural of "ax" or of "axis"? Even a human cannot tell without context.

半边脸i 2024-08-10 15:48:12

You can take a look at Inflector.net - my port of Rails' inflection class.

回梦 2024-08-10 15:48:12

No - English isn't a language which sticks to many rules.

I think your best bet is either:

  • use a dictionary of common words and their plurals (or group them by their plural rule, eg: group words where you just add an S, words where you add ES, words where you drop a Y and add IES...)
  • rethink your application
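
The first option could look something like the sketch below: words are grouped by plural rule, and a reverse lookup table is built from the groups. The group names, the sample words, and the function `singularize` are all hypothetical and only for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical sample data: singular words grouped by the rule that
# forms their plural.
RULE_GROUPS: Dict[str, List[str]] = {
    "add_s": ["example", "cat"],
    "add_es": ["glitch", "bus"],
    "y_to_ies": ["country", "family"],
    "unchanged": ["sheep"],
}


def apply_rule(rule: str, word: str) -> str:
    """Form the plural of `word` according to its rule group."""
    if rule == "add_s":
        return word + "s"
    if rule == "add_es":
        return word + "es"
    if rule == "y_to_ies":
        return word[:-1] + "ies"
    return word  # "unchanged"


# Reverse lookup: plural form -> singular form.
PLURAL_TO_SINGULAR: Dict[str, str] = {
    apply_rule(rule, word): word
    for rule, words in RULE_GROUPS.items()
    for word in words
}


def singularize(word: str) -> Optional[str]:
    """Return the known singular, or None if the word is not in the table."""
    return PLURAL_TO_SINGULAR.get(word.lower())


if __name__ == "__main__":
    for plural in ("Examples", "Glitches", "Countries", "Sheep", "Marius"):
        print(plural, "->", singularize(plural))
```

Returning None for unknown words keeps names such as "Marius" from being mangled, at the cost of maintaining the word list.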

嗫嚅 2024-08-10 15:48:12

It is not possible, as nickf has already said. It would be simple for the classes of words you have described, but what about all the words that end with s naturally? My name, Marius, for example, is not the plural of Mariu. Same with Bus, I guess. Pluralization of words in English is a one-way function (like a hash function), and you usually need the rest of the sentence or paragraph for context.
