Java natural language parser
I am looking for a parser (or a generated parser) in Java that can do the following:
- I will provide sentences that are already part-of-speech tagged, using my own tag set.
- I don't have any statistical data, so if the parser is statistical, I want to be able to use it without that feature.
- It should be easy to adapt to other languages and have a low learning curve.
Comments (2)
The Stanford Parser (which was listed on that other SO question) will do everything you list.
You can provide your own POS tags, but if they are not already in the Penn Treebank tag set you will need to translate them into it. Parsers are either statistical or they're not. If they're not, you need a set of grammar rules. Nobody really builds parsers that way anymore, except as toys, because they are really Bad™. So you can rely on the statistical data the Stanford Parser already uses (with no additional work from you). This does mean, however, that statistics about your own tags (if they don't map directly to Penn Treebank tags) will be ignored. But since you don't have statistics for your tags anyway, that should be expected.
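To make the tag translation step concrete, here is a minimal sketch in plain Java. It assumes tokens arrive as `word/TAG` strings, and the custom tag names (`DET`, `N_SG`, `N_PL`, `V_3SG`, `ADJ`) are made up for illustration; your own tag set and mapping table will differ. It does not call the Stanford Parser itself, it only prepares the input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class TagMapper {
    // Hypothetical custom tag set mapped onto Penn Treebank tags.
    // These custom names are assumptions; replace them with your own.
    static final Map<String, String> CUSTOM_TO_PENN = Map.of(
            "DET", "DT",
            "N_SG", "NN",
            "N_PL", "NNS",
            "V_3SG", "VBZ",
            "ADJ", "JJ");

    // Rewrites each "word/TAG" token with its Penn Treebank tag.
    // Tags with no entry in the table are passed through unchanged.
    static List<String> toPenn(List<String> taggedTokens) {
        List<String> out = new ArrayList<>();
        for (String token : taggedTokens) {
            int slash = token.lastIndexOf('/');
            if (slash < 0) {          // no tag attached, keep the token as-is
                out.add(token);
                continue;
            }
            String word = token.substring(0, slash);
            String tag = token.substring(slash + 1);
            out.add(word + "/" + CUSTOM_TO_PENN.getOrDefault(tag, tag));
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [the/DT, dog/NN, barks/VBZ]
        System.out.println(toPenn(List.of("the/DET", "dog/N_SG", "barks/V_3SG")));
    }
}
```

Passing unmapped tags through unchanged is only one possible choice; you may prefer to fail loudly so that tags the parser has no statistics for don't slip in unnoticed.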
They have parsers trained for several other languages too, but you will need your own tagged data if you want to go to a language they don't have available. There's no getting around that, no matter which parser you use.
If you know Java (and I assume you do), the Stanford Parser is very straightforward and easy to get going with. Their mailing list is also a great resource and is fairly active.
I'm not entirely clear on what you want, but the first thing I thought of was Mallet:
http://mallet.cs.umass.edu/index.php