Latin inflection:
I have a database of words (including nouns and verbs). Now I would like to generate all the different (inflected) forms of those nouns and verbs. What would be the best strategy to do this?
As Latin is a highly inflected language, there is:
a) the declension of nouns
b) the conjugation of verbs (http://en.wikipedia.org/wiki/Latin_conjugation)
See this translated page for an example of a verb's conjugation ("mandare"): conjugation
I don't want to type in all those forms for all the words manually.
How can I generate them automatically? What is the best approach?
- a list of complex rules describing how to inflect all the words
- Bayesian methods
- ...
There's a program called "William Whitaker's Words". It creates inflections for Latin words as well, so it does exactly what I want to do.
Wikipedia says that the program works like this:
Words uses a set of rules based on natural pre-, in-, and suffixation, declension, and conjugation to determine the possibility of an entry. As a consequence of this approach of analysing the structure of words, there is no guarantee that these words were ever used in Latin literature or speech, even if the program finds a possible meaning to a given word.
The program's source is also available here. But I don't really understand how it works. Can you help me? Maybe this would be the solution to my question ...
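To make the quoted description a bit more concrete: one very rough way to picture rule-based analysis is suffix stripping against a list of stems. The Java sketch below is only an illustration of that idea and is not taken from the Words source. The class name SuffixAnalyzer, the two stems and the short list of endings (first conjugation, present active indicative) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class SuffixAnalyzer {
    // Hypothetical mini-lexicon: two verb stems and six endings.
    static final Set<String> STEMS = Set.of("mand", "am");
    static final List<String> ENDINGS =
        List.of("o", "as", "at", "amus", "atis", "ant");

    // Returns every stem+ending split of the word that the tables allow.
    static List<String> analyze(String word) {
        List<String> analyses = new ArrayList<>();
        for (String ending : ENDINGS) {
            if (word.endsWith(ending)) {
                String stem = word.substring(0, word.length() - ending.length());
                if (STEMS.contains(stem)) {
                    analyses.add(stem + "+" + ending);
                }
            }
        }
        return analyses;
    }

    public static void main(String[] args) {
        System.out.println(analyze("mandamus")); // [mand+amus]
        System.out.println(analyze("rosa"));     // []
    }
}
```

Because the check only asks whether the tables permit a split, any stem and ending that fit together are accepted, which matches the caveat in the quote that a form may be analysable without ever having been used.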
Comments (3)
You could do something similar to the hunspell dictionary format (see http://www.manpagez.com/man/4/hunspell/).
You define two tables. One contains the roots of the words (the part that never changes), and the other contains the modifications for a given class. For a given class, for each declension (or conjugation) form, it tells you which characters to add at the end (or the beginning) of the root. It can even specify that a given number of characters should be replaced. Now, to get a word in a specific form, you take the root, apply the transformation from the class it belongs to, and voilà!
For example, for mandare, the root would be mand, and the class would contain suffixes like o, as, at, amus, atis, ant for the present active indicative.
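A minimal Java sketch of the two-table idea, under some assumptions: the class names SuffixRule and InflectionTables, the class label CONJ1 and the form labels are invented for illustration, and this is not hunspell's actual file format. Each rule carries a strip count so that a rule can replace trailing characters of the root instead of only appending to it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// One affix rule (hypothetical): drop `strip` characters from the end of the
// root, then append `add`.
class SuffixRule {
    final String label;
    final int strip;
    final String add;

    SuffixRule(String label, int strip, String add) {
        this.label = label;
        this.strip = strip;
        this.add = add;
    }

    String apply(String root) {
        return root.substring(0, root.length() - strip) + add;
    }
}

class InflectionTables {
    // Table 1: root -> inflection class. "CONJ1" is an invented label.
    static final Map<String, String> ROOTS = Map.of("mand", "CONJ1");

    // Table 2: inflection class -> rules. Only the present active indicative
    // is listed; a real table would cover every tense, mood and voice.
    static final Map<String, List<SuffixRule>> CLASSES = Map.of(
        "CONJ1", List.of(
            new SuffixRule("1sg", 0, "o"),
            new SuffixRule("2sg", 0, "as"),
            new SuffixRule("3sg", 0, "at"),
            new SuffixRule("1pl", 0, "amus"),
            new SuffixRule("2pl", 0, "atis"),
            new SuffixRule("3pl", 0, "ant")));

    // Look up the root's class and apply every rule of that class.
    static List<String> inflect(String root) {
        String cls = ROOTS.get(root);
        if (cls == null) {
            throw new IllegalArgumentException("unknown root: " + root);
        }
        List<String> forms = new ArrayList<>();
        for (SuffixRule rule : CLASSES.get(cls)) {
            forms.add(rule.apply(root));
        }
        return forms;
    }

    public static void main(String[] args) {
        // Prints [mando, mandas, mandat, mandamus, mandatis, mandant]
        System.out.println(inflect("mand"));
    }
}
```

Adding another first-conjugation verb then only needs one more entry in the root table; the rules are shared by the whole class.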
I'll use nouns as the example, but the same applies to verbs.
First, I would create two classes: Regular and Irregular. For the Regular nouns, I would make one class for each declension (Latin has five) and have them all implement a Declensable (or whatever the word is in English :) interface (FirstDeclension extends Regular implements Declensable). The interface would define two static enums (NOMINATIVE, VOCATIVE, etc., and SINGULAR, PLURAL).
Each class would have a string for the root and a static hashmap of suffixes. The method FirstDeclension#get(case, number) would then append the right suffix based on the hashmap.
The Irregular class would have to define a local hashmap for each word and then implement the same Declensable interface.
Does that make sense?
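The design described above could be sketched in Java roughly as follows. This is one possible reading, not the answerer's own code. The enum name Num, the constructor signatures, the suffixes() hook and the demo class are assumptions, and the map is keyed by case with a {singular, plural} pair rather than by a combined key. Vowel length is ignored.

```java
import java.util.EnumMap;
import java.util.Map;

interface Declensable {
    enum Case { NOMINATIVE, GENITIVE, DATIVE, ACCUSATIVE, ABLATIVE, VOCATIVE }
    enum Num { SINGULAR, PLURAL }

    String get(Case c, Num n);
}

// Regular nouns: a root plus a suffix table shared by the whole declension.
abstract class Regular implements Declensable {
    private final String root;

    Regular(String root) { this.root = root; }

    // Each declension supplies its table: case -> {singular, plural} suffix.
    protected abstract Map<Case, String[]> suffixes();

    @Override
    public String get(Case c, Num n) {
        return root + suffixes().get(c)[n.ordinal()];
    }
}

class FirstDeclension extends Regular {
    private static final Map<Case, String[]> SUFFIXES = new EnumMap<>(Case.class);
    static {
        SUFFIXES.put(Case.NOMINATIVE, new String[] {"a", "ae"});
        SUFFIXES.put(Case.GENITIVE, new String[] {"ae", "arum"});
        SUFFIXES.put(Case.DATIVE, new String[] {"ae", "is"});
        SUFFIXES.put(Case.ACCUSATIVE, new String[] {"am", "as"});
        SUFFIXES.put(Case.ABLATIVE, new String[] {"a", "is"});
        SUFFIXES.put(Case.VOCATIVE, new String[] {"a", "ae"});
    }

    FirstDeclension(String root) { super(root); }

    @Override
    protected Map<Case, String[]> suffixes() { return SUFFIXES; }
}

// Irregular nouns: every word carries its own table of complete forms.
class Irregular implements Declensable {
    private final Map<Case, String[]> forms;

    Irregular(Map<Case, String[]> forms) { this.forms = forms; }

    @Override
    public String get(Case c, Num n) {
        return forms.get(c)[n.ordinal()];
    }
}

class DeclensionDemo {
    public static void main(String[] args) {
        Declensable rosa = new FirstDeclension("ros");
        // Prints rosarum
        System.out.println(rosa.get(Declensable.Case.GENITIVE, Declensable.Num.PLURAL));

        Map<Declensable.Case, String[]> bosForms = new EnumMap<>(Declensable.Case.class);
        bosForms.put(Declensable.Case.NOMINATIVE, new String[] {"bos", "boves"});
        bosForms.put(Declensable.Case.GENITIVE, new String[] {"bovis", "boum"});
        bosForms.put(Declensable.Case.DATIVE, new String[] {"bovi", "bobus"});
        bosForms.put(Declensable.Case.ACCUSATIVE, new String[] {"bovem", "boves"});
        bosForms.put(Declensable.Case.ABLATIVE, new String[] {"bove", "bobus"});
        bosForms.put(Declensable.Case.VOCATIVE, new String[] {"bos", "boves"});
        Declensable bos = new Irregular(bosForms);
        // Prints boum
        System.out.println(bos.get(Declensable.Case.GENITIVE, Declensable.Num.PLURAL));
    }
}
```

Callers only ever see Declensable, so regular and irregular nouns can sit in the same collection and be queried the same way.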
Addendum: To clarify, the constructor of class Regular would be
Perhaps you could follow the approach of AOT in your implementation. (It's under the LGPL.)
There's no Latin morphology in AOT, only Russian, German, and English, but Russian is of course an example of an inflectional morphology as complex as Latin's, so AOT should be usable as a framework for implementing it.
Still, I believe one needs an elaborate, precise formal description of the morphology, already clearly defined, before starting to program. As for Russian, I guess most of the working computer morphology systems are based on the thorough analysis of Russian morphology done by Andrey Zalizniak in the Grammatical Dictionary of Russian and related works.