Is there a free library for German morphological analysis?
I'm looking for a library which can perform morphological analysis on German words, i.e. convert any word into its root form and provide meta information about the analysed word.
For example:
gegessen -> essen
wurde [...] gefasst -> fassen
Häuser -> Haus
Hunde -> Hund
My wishlist:
- It has to work with both nouns and verbs.
- I'm aware that this is a very hard task given the complexity of the German language, so I'm also looking for libraries which provide only approximations or may only be 80% accurate.
- I'd prefer libraries which don't work with dictionaries, but again I'm open to compromise given the circumstances.
- I'd also prefer C/C++/Delphi Windows libraries, because that would make them easier to integrate, but .NET, Java, ... will also do.
- It has to be a free library. (L)GPL, MPL, ...
EDIT: I'm aware that there is no way to perform a morphological analysis without any dictionary at all, because of irregular words.
When I say I prefer a library without a dictionary, I mean those full-blown dictionaries which map each and every word form:
arbeite -> arbeiten
arbeitest -> arbeiten
arbeitet -> arbeiten
arbeitete -> arbeiten
arbeitetest -> arbeiten
arbeiteten -> arbeiten
arbeitetet -> arbeiten
gearbeitet -> arbeiten
...
Those dictionaries have several drawbacks, including the huge size and the inability to process unknown words.
Of course all exceptions can only be handled with a dictionary:
esse -> essen
isst -> essen
eßt -> essen
aß -> essen
aßt -> essen
aßen -> essen
...
(My mind is spinning right now :) )
8 Answers
I think you are looking for a "stemming algorithm".
Martin Porter's approach is well known among linguists. The Porter stemmer is basically an affix stripping algorithm, combined with a few substitution rules for those special cases.
Most stemmers deliver stems that are linguistically "incorrect". For example: both "beautiful" and "beauty" can result in the stem "beauti", which, of course, is not a real word. This doesn't matter, though, if you're using those stems to improve search results in information retrieval systems. Lucene comes with support for the Porter stemmer, for instance.
Porter also devised a simple programming language for developing stemmers, called Snowball.
There are also stemmers for German available in Snowball. A C version, generated from the Snowball source, is also available on the website, along with a plain text explanation of the algorithm.
Here's the German stemmer in Snowball: http://snowball.tartarus.org/algorithms/german/stemmer.html
If you're looking for the corresponding stem of a word as you would find it in a dictionary, along with information on the part of speech, you should Google for "lemmatization".
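To make the over-stemming point concrete, here is a deliberately naive suffix-stripping sketch in Python. The suffix list is my own toy assumption, not the actual Porter/Snowball rules; it only illustrates why pure affix stripping yields stems that are sometimes right and sometimes linguistically "incorrect":

```python
# Naive suffix stripping: try the longest matching suffix first and
# keep a minimum stem length. This is NOT the real Snowball algorithm.
GERMAN_SUFFIXES = ["ern", "em", "en", "er", "es", "e", "s"]  # toy list

def naive_stem(word: str) -> str:
    """Strip the longest matching suffix, keeping at least 3 characters."""
    w = word.lower()
    for suffix in sorted(GERMAN_SUFFIXES, key=len, reverse=True):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

print(naive_stem("Hunde"))     # "hund"   -- happens to be correct
print(naive_stem("Häuser"))    # "häus"   -- close, but the umlaut stays
print(naive_stem("gegessen"))  # "gegess" -- wrong: no prefix handling
```

The last case shows exactly why a stemmer cannot deliver `gegessen -> essen`: irregular ablaut and the `ge-` prefix are beyond plain suffix rules.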
(Disclaimer: I'm linking my own Open Source projects here)
This data in form of a word list is available at http://www.danielnaber.de/morphologie/. It could be combined with a word splitter library (like jwordsplitter) to cover compound nouns not in the list.
Or just use LanguageTool from Java, which has the word list embedded in form of a compact finite state machine (plus it also includes compound splitting).
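As a conceptual illustration of what a compound splitter like jwordsplitter does, here is a greedy decomposition sketch over a toy lexicon. The lexicon and the list of linking elements (Fugenelemente) are made up for the example; the real library ships its own word data:

```python
# Toy lexicon of known lemmas; a real splitter uses a large word list.
LEXICON = {"hund", "haus", "tür", "schloss"}
LINKING_ELEMENTS = ["es", "s", "e", "n", ""]  # common German linking elements

def split_compound(word, lexicon=LEXICON):
    """Return the compound's parts, or [word] if no split is found."""
    w = word.lower()
    if w in lexicon:
        return [w]
    # Try every split point, preferring a longer first part.
    for i in range(len(w) - 1, 2, -1):
        head = w[:i]
        for link in LINKING_ELEMENTS:
            if head.endswith(link) and head[: len(head) - len(link)] in lexicon:
                rest = split_compound(w[i:], lexicon)
                # Accept only if the remainder itself resolves to known parts.
                if rest != [w[i:]] or w[i:] in lexicon:
                    return [head[: len(head) - len(link)]] + rest
    return [w]

print(split_compound("Haustür"))    # ['haus', 'tür']
print(split_compound("Hundehaus"))  # ['hund', 'haus'] ("e" linking element)
```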
You asked this a while ago, but you might still give it a try with morphisto.
Here's an example on how to do it in Ubuntu:
Install the Stuttgart finite-state transducer tools
$ sudo apt-get install sfst
Download the morphisto morphology, e.g. morphisto-02022011.a
Compact it, e.g.
$ fst-compact morphisto-02022011.a morphisto-02022011.ac
Use it! Here are some examples:
$ echo Hochzeit | fst-proc morphisto-02022011.ac
^Hochzeit/hohZeit<+NN>/hohZeit<+NN>/hohZeit<+NN>/hohZeit<+NN>/HochZeit<+NN>/HochZeit<+NN>/HochZeit<+NN>/HochZeit<+NN>/Hochzeit<+NN>/Hochzeit<+NN>/Hochzeit<+NN>/Hochzeit<+NN>$
$ echo gearbeitet | fst-proc morphisto-02022011.ac
^gearbeitet/arbeiten<+ADJ>/arbeiten<+ADJ>/arbeiten<+V>$
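The fst-proc output shown above is easy to post-process. Here is a hypothetical helper that extracts the distinct lemma/POS readings, assuming the simplified `^surface/lemma<+POS>/...$` shape of these examples; real morphisto analyses carry additional feature tags that this sketch ignores:

```python
import re

def parse_fst_proc(line):
    """Split one fst-proc line into (surface form, unique (lemma, POS) readings)."""
    line = line.strip().lstrip("^").rstrip("$")
    surface, *analyses = line.split("/")
    readings = []
    for a in analyses:
        # Match "lemma<+POS>" at the start of each analysis.
        m = re.match(r"([^<]+)<\+([A-Za-z]+)>", a)
        if m:
            reading = (m.group(1), m.group(2))
            if reading not in readings:  # drop duplicate readings
                readings.append(reading)
    return surface, readings

line = "^gearbeitet/arbeiten<+ADJ>/arbeiten<+ADJ>/arbeiten<+V>$"
print(parse_fst_proc(line))
# ('gearbeitet', [('arbeiten', 'ADJ'), ('arbeiten', 'V')])
```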
Have a look at LemmaGen (http://lemmatise.ijs.si/), a project that aims at providing a standardized open-source multilingual platform for lemmatisation. It does exactly what you want.
I don't think that this can be done without a dictionary.
Rule-based approaches will invariably trip over things like
(note to people who don't speak German: the correct solution in the second case is "gehen").
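The hybrid approach this answer implies (irregular forms in a small exception dictionary, regular forms by rule) can be sketched like this; the exception table and the two participle rules are illustrative, not complete:

```python
# Minimal exception table for irregular (strong) past participles.
IRREGULAR = {
    "gegessen": "essen",
    "gegangen": "gehen",
}

def lemmatize_participle(word):
    """Look up irregular participles, otherwise apply a crude weak-verb rule."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    if w.startswith("ge") and w.endswith("et"):
        return w[2:-2] + "en"   # ge-arbeit-et -> arbeiten
    if w.startswith("ge") and w.endswith("t"):
        return w[2:-1] + "en"   # ge-fass-t -> fassen
    return w

print(lemmatize_participle("gegangen"))    # "gehen"    (dictionary)
print(lemmatize_participle("gearbeitet"))  # "arbeiten" (rule)
print(lemmatize_participle("gefasst"))     # "fassen"   (rule)
```

The dictionary stays tiny because only the irregular cases need it, which is the compromise the question's EDIT asks for.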
Have a look at Leo.
They offer the data you are after; maybe it will give you some ideas.
One can use morphisto with ParZu (https://github.com/rsennrich/parzu). ParZu is a dependency parser for German.
This means that ParZu also disambiguates the output from morphisto.
There are some tools out there which you could use, like the morphology components in the Matetools, Morphisto etc. But the pain is integrating them into your tool chain. A very good wrapper around quite a lot of these linguistic tools is DKPro (https://dkpro.github.io/dkpro-core/), a framework using UIMA. It allows you to write your own preprocessing pipeline with linguistic tools from different resources, which are all downloaded to your computer automatically and talk to each other. You can use it from Java, Groovy or even Jython. DKPro gives you easy access to two morphological analyzers, MateMorphTagger and SfstAnnotator.
You don't want to use a stemmer like Porter's: it reduces word forms in a way which makes no sense linguistically and does not have the behaviour you describe. If you only want to find the base form (the infinitive for a verb, the nominative singular for a noun), you should use a lemmatizer. You can find a list of German lemmatizers here. Treetagger is widely used. You can also use the more complex analysis provided by a morphological analyzer like SMORS. It will give you something like this (example from the SMORS website):