Topic modelling in MALLET vs NLTK
I just read a fascinating article about how MALLET could be used for topic modelling, but I couldn't find anything online comparing MALLET to NLTK, which I've already had some experience with.
What are the main differences between them? Is MALLET a more 'complete' resource (e.g. has more tools and algorithms under the hood)? Or where are some good articles answering these first two questions?
It's not that one is more complete than the other; it is more a question of each having some things the other doesn't. It is also a question of intended audience and purpose.
Mallet is a Java-based machine learning toolkit that aims to provide robust and fast implementations for various natural language processing tasks.
NLTK is built using Python and comes with a lot of extras, including corpora such as WordNet. NLTK is aimed more at people learning NLP, and as such is used more as a learning platform and perhaps less as an engineering solution.
In my opinion the main difference between the two is that NLTK is better positioned as a learning resource for people interested in machine learning and NLP, as it comes with a large amount of documentation, examples, corpora and so on.
Mallet is aimed more at researchers and practitioners who work in the field and already know what they want to do. It comes with less documentation (although it has good examples and the API is well documented) compared to NLTK's extensive collection of general NLP material.
UPDATE:
Good articles describing these would be the Mallet docs and examples at http://mallet.cs.umass.edu/ (the sidebar has links to sequence tagging, topic modelling etc.), and for NLTK, the NLTK book Natural Language Processing with Python, which is a good introduction both to NLTK and to NLP.
UPDATE
I've recently found the sklearn (scikit-learn) Python library. It is aimed at machine learning more generally rather than at NLP specifically, but it can be used for NLP as well. It comes with a very large selection of modelling tools, and most of it seems to rely on NumPy, so it should be pretty fast. I've used it quite a bit and can say that it is very well written and documented and has an active developer community pushing it forward (as of May 2013 at least).
UPDATE 2
I've now also been using mallet for some time (specifically the mallet API) and can say that if you're planning on integrating mallet into another project, you should be very familiar with Java and ready to spend a lot of time debugging an almost completely undocumented code base.
If all you want to do is use the mallet command line tools, that's fine. Using the API, however, requires a lot of digging through the mallet code itself and usually fixing some bugs as well. Be warned: mallet comes with minimal documentation for the API.
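If the command line tools are enough for you, one way to drive them from Python is simply to shell out. The sketch below is only an illustration under a few assumptions: a `mallet` launcher is on your `PATH`, your documents are plain-text files in a directory, and all file names are placeholders. The helper names (`mallet_import_cmd`, `mallet_train_cmd`, `run_mallet`) are hypothetical; the `import-dir` and `train-topics` subcommands and their options are the ones described in the Mallet topic modelling docs.

```python
import shutil
import subprocess


def mallet_import_cmd(mallet_bin, corpus_dir, out_file):
    """Build the command that converts a directory of text files to Mallet's format."""
    return [
        mallet_bin, "import-dir",
        "--input", corpus_dir,
        "--output", out_file,
        "--keep-sequence",      # topic models need token sequences, not just counts
        "--remove-stopwords",
    ]


def mallet_train_cmd(mallet_bin, in_file, num_topics, keys_file, doc_topics_file):
    """Build the command that trains a topic model and writes its outputs."""
    return [
        mallet_bin, "train-topics",
        "--input", in_file,
        "--num-topics", str(num_topics),
        "--output-topic-keys", keys_file,         # top words per topic
        "--output-doc-topics", doc_topics_file,   # topic mixture per document
    ]


def run_mallet(mallet_bin="mallet", corpus_dir="corpus", num_topics=20):
    """Run import and training; all paths here are placeholder assumptions."""
    if shutil.which(mallet_bin) is None:
        raise FileNotFoundError(f"could not find '{mallet_bin}' on PATH")
    data_file = "corpus.mallet"
    subprocess.run(mallet_import_cmd(mallet_bin, corpus_dir, data_file), check=True)
    subprocess.run(
        mallet_train_cmd(mallet_bin, data_file, num_topics,
                         "topic_keys.txt", "doc_topics.txt"),
        check=True,
    )
```

Calling `run_mallet(corpus_dir="my_texts", num_topics=10)` would then leave the topic keys and per-document topic proportions in the two text files, which are easy to parse from any language. This avoids touching the Java API at all.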
The question is whether you're working in Python or Java (or neither). Mallet is a good fit for Java (and therefore Clojure and Scala), since you can easily access its API from Java. Mallet also has a nice command-line interface, so you can use it outside of an application.
By the same reasoning, NLTK is great if you're working in Python, and you won't have to do any Jython craziness to get the two to play well together. If you're using Python, Gensim just added a Mallet wrapper that is worth checking out. Right now it's basically a bare-bones alpha feature, but it may do what you need.
I'm not familiar with NLTK's topic modeling toolkit, so I won't try to compare it.
The Mallet sources in Github contain several algorithms (some of which are not available in the 'released' version). To my knowledge, there are
It also has
All in all, it is a fine toolkit for experimenting with topic models, with an approachable open-source license (CPL).